The overloaded_functor class template combines multiple lambdas accepting
different types into a single callable object that can be invoked with
any of those types.
One application is visitors for std::variant where different handling is
required for different types.
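As an illustration, this is the classic C++17 "overloaded" idiom that such a class template is typically built on; a minimal sketch, where everything except the overloaded_functor name itself (e.g. the describe() helper) is illustrative:

```cpp
#include <cassert>
#include <string>
#include <variant>

// Inherit from all lambdas and pull in all their call operators,
// so one object is callable with any of the handled types.
template <typename... Fs>
struct overloaded_functor : Fs... {
    using Fs::operator()...;
};
// Deduction guide so template arguments are inferred from the lambdas.
template <typename... Fs>
overloaded_functor(Fs...) -> overloaded_functor<Fs...>;

// Usage: a std::variant visitor assembled from per-type lambdas.
inline std::string describe(const std::variant<int, std::string>& v) {
    return std::visit(overloaded_functor{
        [](int i) { return "int: " + std::to_string(i); },
        [](const std::string& s) { return "string: " + s; }
    }, v);
}
```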
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
This facilitates position_in_partition creation when parsing range tombstone bounds from SSTable files.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
This code will be re-used in promoted_index_blocks_parser to parse
clustering key prefixes from SSTables 3.x format.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
The previous name of the file was also confusing: we have several
sstable_assertions classes throughout the tests, but this header only
contains a class for index reader assertions.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
With this patch, index_reader is capable of reading index_entries from
both 'ka'/'la' and 'mc' formats.
Signed-off-by: Vladimir Krivopalov <vladimir@scylladb.com>
"
Queries that use a secondary index and have a full partition key restriction
or a full primary key restriction should not require filtering - it's
sufficient to add these restrictions to the index query.
This also adds secondary index tests to cover this case.
Tests: unit (release)
"
* 'si_and_pk_restrictions_2' of https://github.com/psarna/scylla:
tests: add index + partition key test
cql3: make index+primary key restrictions filtering-independent
cql3: use primary key restrictions in filtering index queries
cql3: add is_all_eq to primary key restrictions
cql3: add explicit conversion between key restrictions
cql3: add apply_to() method to single column restriction
cql3: make primary key restrictions' values unambiguous
Now that verb categorizations also affect scheduling, getting them
correct is more important. The first three patches in this series
improve the infrastructure a little, and the fourth fixes some
categorization errors with respect to repair/streaming verbs.
* https://github.com/avikivity/scylla msg-idx-sanity/v1:
messaging: choose connection index via a look-up table
messaging: convert do_get_rpc_client_idx into a switch
messaging: remove default when computing rpc client index
messaging: categorize more streaming/repair verbs as streaming
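The look-up-table approach named in the series can be sketched as follows; the verb names, index values, and table name here are hypothetical stand-ins, not Scylla's actual messaging verbs:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical subset of messaging verbs; "count" is a sentinel.
enum class messaging_verb : size_t {
    mutation,
    read_data,
    stream_mutation,
    repair_checksum_range,
    count,
};

// One connection-index entry per verb, filled in at compile time.
// The array size is tied to the sentinel, so adding a verb without
// categorizing it fails to compile, unlike a switch with a default.
constexpr std::array<unsigned, static_cast<size_t>(messaging_verb::count)>
s_rpc_client_idx = {
    0, // mutation              -> general RPC connection
    0, // read_data             -> general RPC connection
    1, // stream_mutation       -> streaming connection
    1, // repair_checksum_range -> streaming connection
};

constexpr unsigned do_get_rpc_client_idx(messaging_verb verb) {
    return s_rpc_client_idx[static_cast<size_t>(verb)];
}
```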
For example, to bootstrap a 50th node in a cluster:
[shard 0] range_streamer - Bootstrap with
[127.0.0.8, 127.0.0.2, 127.0.0.24, 127.0.0.21, 127.0.0.49, 127.0.0.44,
127.0.0.9, 127.0.0.7, 127.0.0.47, 127.0.0.15, 127.0.0.5, 127.0.0.30,
127.0.0.14, 127.0.0.12, 127.0.0.36, 127.0.0.11, 127.0.0.48, 127.0.0.28,
127.0.0.33, 127.0.0.10, 127.0.0.41, 127.0.0.4, 127.0.0.40, 127.0.0.3,
127.0.0.6, 127.0.0.43, 127.0.0.22, 127.0.0.26, 127.0.0.42, 127.0.0.25,
127.0.0.17, 127.0.0.37, 127.0.0.23, 127.0.0.13, 127.0.0.38, 127.0.0.1,
127.0.0.18, 127.0.0.20, 127.0.0.39, 127.0.0.27, 127.0.0.34, 127.0.0.32,
127.0.0.19, 127.0.0.16, 127.0.0.31, 127.0.0.45, 127.0.0.29, 127.0.0.35,
127.0.0.46]
for keyspace=keyspace1 started, nodes_to_stream=49, nodes_in_parallel=49
the new node will get data from 49 existing nodes.
Currently, it will stream from all 49 existing nodes at the same
time. It is not a good idea to stream from all the nodes in parallel,
since this can overwhelm the bootstrap node: 49 nodes sending, 1 node
receiving.
To fix this, limit the number of nodes to stream from in parallel. We
should have better control over memory usage and parallelism eventually,
but for now, limit the number of nodes to a maximum of 16 as a starting
point. With this limit, each shard can work with as many as 16 remote
nodes in parallel, which should provide enough parallelism for streaming
performance.
This change affects the bootstrap/decommission/removenode operations
and has no effect on repair.
Refs #2782
Message-Id: <980610dc97490d4f16281a0c3203b9bee73e04e4.1531989557.git.asias@scylladb.com>
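The batching idea can be sketched like this; make_stream_batches() and the node representation are illustrative assumptions, only the limit of 16 comes from the patch:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Stream from at most 16 source nodes at a time instead of all at once.
constexpr size_t max_nodes_in_parallel = 16;

// Split the source-node list into batches of at most 16 nodes. Each batch
// would be streamed from concurrently; the next batch starts only after
// the current one completes, bounding the load on the receiving node.
std::vector<std::vector<std::string>>
make_stream_batches(const std::vector<std::string>& sources) {
    std::vector<std::vector<std::string>> batches;
    for (size_t i = 0; i < sources.size(); i += max_nodes_in_parallel) {
        size_t end = std::min(i + max_nodes_in_parallel, sources.size());
        batches.emplace_back(sources.begin() + i, sources.begin() + end);
    }
    return batches;
}
```

With 49 source nodes, as in the bootstrap example above, this yields batches of 16, 16, 16, and 1 node.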
"
Use chunked_vector instead of vector. It won't introduce compatibility
issues because chunked_vector and vector have the same on-wire format.
Refs #278
"
* 'asias/gossip_memory_v2' of github.com:scylladb/seastar-dev:
gossip: Reduce continuous memory usage
to_string: Add std::list and utils::chunked_vector support
serializer: Add chunked_vector support
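Why this is wire-compatible can be illustrated with a small sketch, where std::deque stands in for utils::chunked_vector (both store elements in fixed-size chunks rather than one contiguous allocation) and the [count][elem0][elem1]... framing is a simplified assumption:

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <deque>
#include <vector>

// Serialization only iterates over elements, so the output bytes do not
// depend on the container's in-memory layout.
template <typename Container>
std::vector<uint8_t> serialize_u32s(const Container& c) {
    std::vector<uint8_t> out;
    auto put_u32 = [&out](uint32_t v) {
        uint8_t buf[4];
        std::memcpy(buf, &v, sizeof(v));
        out.insert(out.end(), buf, buf + sizeof(buf));
    };
    put_u32(static_cast<uint32_t>(c.size()));
    for (uint32_t v : c) {
        put_u32(v);
    }
    return out;
}
```

A contiguous vector and a chunked container holding the same elements then serialize to identical bytes.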
If the full partition key (or the full primary key) is used in an indexed
query, it should not require filtering, because such queries
can be efficiently narrowed down with stricter index restrictions.
If both an index and the partition key are used in a query, it should not
require filtering, because the indexed query can be narrowed down
with the partition key information. This commit appends partition key
restrictions to the index query.
"
The problem happens under the following circumstances:
- we have a partially populated partition in cache, with a gap in the middle
- a read with no clustering restrictions trying to populate that gap
- eviction of the entry for the lower bound of the gap concurrent with population
The population may incorrectly mark the range before the gap as continuous.
This may result in temporary loss of writes in that clustering range. The
problem heals by clearing cache.
Caught by row_cache_test::test_concurrent_reads_and_eviction, which has been
failing sporadically.
The problem is in ensure_population_lower_bound(), which returns true if
the current clustering range covers all rows, meaning that the populator has
the right to set the continuity flag to true on the row it inserts. This is
correct only if the current population range actually starts before all
clustering rows. Otherwise, we're populating from _last_row and should
consult it.
Fixes #3608.
"
* 'tgrabiec/fix-violation-of-continuity-on-concurrent-read-and-eviction' of github.com:tgrabiec/scylla:
row_cache: Fix violation of continuity on concurrent eviction and population
position_in_partition: Introduce is_before_all_clustered_rows()
This series contains a couple of fixes to the bookkeeping of the view
build process, which could cause data to be left behind in the system
tables.
* git@github.com:duarten/scylla.git materialized-views/view-build-fixes/v1:
Duarte Nunes (3):
db/system_keyspace: Add function to remove view build status of a
shard
db/view: Don't have shard 0 clear other shard's status on drop
db/view: Restrict writes to the distributed system keyspace to shard 0
This series contains a couple of fixes to the adjusting of clustering
keys in the build_progress_virtual_reader, some of which could
potentially cause heap overflows when querying the legacy system table.
* git@github.com:duarten/scylla.git materialized-views/build-progress-virtual-reader-fixes/v1:
Duarte Nunes (3):
db/view/build_progress_virtual_reader: Use correct schema to adjust ck
db/view/build_progress_virtual_reader: Fix full ck detection
db/view/build_progress_virtual_reader: Also adjust end RT bound
"
This series fixes two issues related to bad_allocs and keys which require
linearization (larger than 12.8 KiB). With such keys, comparators may throw if
memory allocation fails. This may cause lookups in partition and rows trees to
fail with bad_alloc.
The first issue (#3583) was that partition version merging
(mutation_partition::apply_monotonically()) was not taking into account that
lookups may fail. On failure, the partition which is being applied may be
incorrectly left with the clustering range from the beginning of the range up
to the current row marked as continuous, if the current row has the continuity
flag set, because we've moved all of the preceding rows into the target, and
the correct lower bound row is no longer there in the source. This may mark
some discontinuous ranges as continuous. Merging is retried by
allocating_section, and there will be no problem if it eventually succeeds:
the original continuity will be reflected in the sum. The problem will persist
if it doesn't eventually succeed, when we're really out of memory.
The user-perceivable effect of this would be a temporary loss of writes in the
clustering range which was marked as continuous but should not have been.
Introduced in 2.2-rc1.
The second issue (#3585) is that the code which inserts partitions into
memtable and cache will leak the entry if boost::intrusive_set::insert()
throws. This will also cause a SIGSEGV when the cache tries to evict such
a leaked entry.
"
* tag 'tgrabiec/fix-bad-continuity-on-oom-in-apply-v2' of github.com:tgrabiec/scylla:
managed_bytes: Mark read_linearize() as an allocation point
tests: Relax expectation about continuity after failed merging
tests: mutation_partition: Verify continuity is consistent on bad_alloc on merging
tests: Switch to seastar's allocation failure injector
mutation_partition: Introduce set_continuity()
clustering_interval_set: Introduce contained_in()
clustering_interval_set: Introduce add() overload accepting another interval set
mutation_partition: Fix merging to not leave the source with broader continuity on bad_alloc
mutation_partition: Preserve continuity in case row merging with no tracker throws
memtable, cache: Fix exception safety of partition entry insertions
ensure_population_lower_bound() returned true if the current clustering
range covers all rows, which means that the populator has the right to
set the continuity flag to true on the row it inserts. This is correct
only if the current population range actually starts before all
clustering rows. Otherwise we're populating from _last_row, and
should consult it.
The fix introduces a new flag, set when starting to populate, which
indicates whether we're populating from the beginning of the range or
not. We cannot simply check whether _last_row is set in
ensure_population_lower_bound(), because _last_row can be set and then
become empty again.
Fixes #3608
Currently we check that the sum of continuities on failure is exactly the
same as expected. Relax this to require only that continuity is not
broader, since in some bad_alloc or preemption scenarios we will
have to mark some ranges as discontinuous.
When clustering keys are larger than 12.8 KiB they may get fragmented,
and the key comparator will need to linearize them on comparison. This may
cause lookups in the rows tree to fail with bad_alloc. Partition
version merging (mutation_partition::apply_monotonically()) was not
taking this into account. If we fail on a lookup, the partition which is
being applied may be incorrectly left with the clustering range from
the beginning up to the current row marked as continuous, if the current
row has the continuity flag set, because we've moved all of the
preceding rows into the target, and the correct lower bound row is no
longer there in the source. This may mark some discontinuous ranges as
continuous.
Merging is retried by allocating_section, and there will be no problem
if it eventually succeeds: the original continuity will be reflected in
the sum. The problem will persist if it doesn't eventually succeed, when
we're really out of memory.
To protect against this, we could reset the continuity flag of the
current row in the source when exiting on exception.
Fixes #3583
Example:
p: row{key=A, cont=0} row{key=C, cont=1}
this: row{key=C, cont=0}
When we get to processing key=C, key=A has already been moved to this, so p
has stale continuity on key=C, which marks (-inf, C) as continuous,
whereas it should mark only (A, C). That's not a problem if merging
succeeds, but if an exception happens at this point, we will violate the
invariant which says that the sum of p and this should yield the same
logical partition. It wouldn't, because the continuity of the sum is
calculated as a set union, and (-inf, A) would be incorrectly turned
into a continuous range.
This is not a problem currently, because continuity is always full when
there is no tracker (memtables), so it won't change anyway, and when
there is a tracker (cache) we never merge but overwrite instead, so
there is no memory allocation and thus no possibility of failure. But
better to be safe.
boost::intrusive::set::insert() may throw if keys require
linearization and that fails, in which case we will leak the entry.
When this happens in the cache, we will also violate the invariant for
entry eviction, which assumes all tracked entries are linked, and
cause a SEGFAULT.
Use the non-throwing and faster insert_before() instead. Where we
can't use insert_before(), use alloc_strategy_unique_ptr<> to ensure
that the entry is deallocated on insert failure.
Fixes #3585.
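The ownership pattern behind the alloc_strategy_unique_ptr<> part of the fix can be sketched with standard containers; all names below are illustrative, with std::set of raw pointers standing in for the intrusive tree:

```cpp
#include <cassert>
#include <memory>
#include <set>
#include <string>

struct cache_entry {
    std::string key;
};

struct entry_compare {
    // A real comparator may need to allocate in order to linearize a
    // fragmented key, and can therefore throw bad_alloc.
    bool operator()(const cache_entry* a, const cache_entry* b) const {
        return a->key < b->key;
    }
};

using entry_set = std::set<cache_entry*, entry_compare>;

// The entry stays owned by the unique_ptr until insert() has succeeded,
// so a throwing comparator cannot leak it; ownership is released to the
// container only on success.
bool insert_entry(entry_set& s, std::unique_ptr<cache_entry> e) {
    auto res = s.insert(e.get()); // if this throws, e still frees the entry
    if (res.second) {
        e.release();
    }
    return res.second;
}
```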
"
Partition keys are currently stored in serialized form in the
system.large_partitions table. This is an obstacle to operators
who usually can't deserialize partition keys in their heads.
Improve the situation by deserializing the partition key for them.
"
* tag 'pkey-print/v1' of https://github.com/avikivity/scylla:
large_partition_handler: output friendly partition key
keys: schema-aware printing of a partition_key
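The schema-aware printing idea can be sketched as follows; the component representation and to_friendly_string() are illustrative assumptions, not Scylla's actual API:

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <variant>
#include <vector>

// A composite partition key modeled as typed components (two example
// types only); a real key would carry one value per partition column.
using component = std::variant<int, std::string>;

// Render the key as "(v0, v1, ...)" using each component's real type,
// instead of dumping the serialized byte form.
std::string to_friendly_string(const std::vector<component>& key) {
    std::string out = "(";
    for (size_t i = 0; i < key.size(); ++i) {
        if (i) {
            out += ", ";
        }
        std::visit([&out](const auto& v) {
            if constexpr (std::is_same_v<std::decay_t<decltype(v)>, int>) {
                out += std::to_string(v);
            } else {
                out += "'" + v + "'";
            }
        }, key[i]);
    }
    return out + ")";
}
```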
* seastar aac6cf1...6b97e00 (5):
> Merge "changes to fix travis CI builds" from Kefu
> tls.cc: Make "close" timeout delay exception proof
> core/sharded: mark foreign_ptr::get_owner_shard() const
> core/memory: Expose counter of large allocations
> tests: add test for multi-fragmented net::packet
Fixes #3461.
Ref scylladb/seastar#474.