Commit Graph

27009 Commits

Author SHA1 Message Date
Avi Kivity
98cdeaf0f2 schema_tables: make the_merge_lock thread_local
the_merge_lock is global, which is fine now because it is only used
in shard 0. However, if we run multiple nodes in a single process,
there will be multiple shard 0s, and the_merge_lock will be accessed
from multiple threads. This won't work.

To fix, make it thread_local. It would be better to make it a member
of some controlling object, but there isn't one.

Closes #8858
2021-06-17 13:41:11 +03:00
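
A minimal sketch of the kind of declaration change described above; the lock type and initializer are assumptions, not the actual schema_tables code:

```
#include <seastar/core/semaphore.hh>

// Before: one process-wide lock, fine while there is only one shard 0.
// static seastar::semaphore the_merge_lock{1};

// After: every OS thread (and hence every "shard 0" when several nodes run
// in a single process) gets its own lock instance.
static thread_local seastar::semaphore the_merge_lock{1};
```
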
Avi Kivity
00ff3c1366 Merge 'treewide: add support for snapshot skip-flush option' from Benny Halevy
The option is provided by nodetool snapshot
https://docs.scylladb.com/operating-scylla/nodetool-commands/snapshot/
```
nodetool [(-h <host> | --host <host>)] [(-p <port> | --port <port>)]
         [(-pp | --print-port)] [(-pw <password> | --password <password>)]
         [(-pwf <passwordFilePath> | --password-file <passwordFilePath>)]
         [(-u <username> | --username <username>)] snapshot
         [(-cf <table> | --column-family <table> | --table <table>)]
         [(-kc <kclist> | --kc.list <kclist>)]
         [(-sf | --skip-flush)] [(-t <tag> | --tag <tag>)] [--] [<keyspaces...>]

-sf / --skip-flush    Do not flush memtables before snapshotting (snapshot will not contain unflushed data)
```

But it is currently ignored by scylla-jmx (scylladb/scylla-jmx#167)
and not supported at the api level.

This patch adds support for the option in advance
from the api service level down via snapshot_ctl
to the table class and snapshot implementation.

In addition, a corresponding unit test was added to verify
that taking a snapshot with `skip_flush` does not flush the memtable
(at the table::snapshot level).

Refs #8725

Closes #8726

* github.com:scylladb/scylla:
  test: database_test: add snapshot_skip_flush_works
  api: storage_service/snapshots: support skip-flush option
  snapshot: support skip_flush option
  table: snapshot: add skip_flush option
  api: storage_service/snapshots: add sf (skip_flush) option
2021-06-17 13:32:23 +03:00
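
A hedged sketch of how such a flag can be threaded down to the snapshot path; the class layout and function names below are illustrative, not the actual snapshot_ctl/table code:

```
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/sstring.hh>

// Hypothetical shape of the code path; the real Scylla classes differ in detail.
class table {
    seastar::future<> flush() { return seastar::make_ready_future<>(); }
    seastar::future<> link_sstables_into_snapshot(seastar::sstring) { return seastar::make_ready_future<>(); }
public:
    // skip_flush is threaded down from the REST API via snapshot_ctl.
    seastar::future<> snapshot(seastar::sstring name, bool skip_flush) {
        if (!skip_flush) {
            // default: flush memtables so the snapshot contains all acknowledged writes
            co_await flush();
        }
        // with skip_flush the snapshot only hard-links the sstables that already
        // exist on disk, so unflushed data is not included
        co_await link_sstables_into_snapshot(std::move(name));
    }
};
```
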
Nadav Har'El
7fd7e90213 cql-pytest: translate Cassandra's tests for static columns
This is a translation of Cassandra's CQL unit test source file
validation/entities/StaticColumnsTest.java into our cql-pytest framework.

This test file checks various features of static columns. All these tests
pass on Cassandra, and all but one pass on Scylla. The xfailing test,
testStaticColumnsWithSecondaryIndex, exposes a query that Cassandra
allows but we don't. The new issue about that is:

Refs #8869.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210616141633.114325-1-nyh@scylladb.com>
2021-06-17 11:08:28 +02:00
Nadav Har'El
b6b4df9a47 heat-weighted load balancing: improve handling of near-perfect cache
Consider two nodes with almost-100% cache hit ratio, but not exactly
100%: one has 99.9% cache hits, the second 99.8%. Normally in HWLB we
want to equalize the miss rate in both nodes. So we send the first node
twice the number of requests we send to the second. But unless the disks
are extremely limited, this doesn't make sense: As a numeric example,
consider that we send 2000 requests to the first node and 1000 to the
second, just so the number of misses will be the same - 2 (0.1% and 0.2%
misses, respectively). At such low miss numbers, the assumption that the
disk reads are the slowest part of the operation is wrong, so trying to
equalize only this part is wrong.

So above some threshold hit rate, we should treat all hit rates as
equivalent. In the code we already had such a threshold - max_hit_rate,
but it was set to the incredibly high 0.999. We saw in actual user
runs (see issue #8815) that this threshold was too high - one node
received twice the amount of requests that another did - although both
had near-100% cache hit rates.

So in this patch we lower the max_hit_rate to 0.95. This will have two
consequences:

1. Two nodes with hit rates above 0.95 will be considered to have the
   same hit rate, so they will get equal amount of work - even if one
   has hit rate 0.98 and the other 0.99.

2. A cold node with hit rate 0.0 will get 5% of the work of a node whose
   near-perfect hit rate is capped at 0.95. This will allow the cold node to
   slowly warm up its cache. Before this patch, if the hot node happened
   to have a hit rate of 0.999 (the previous maximum), the cold node would
   get just 0.1% of the work and remain almost idle and fill its cache
   extremely slowly - which is a waste.

Fixes #8815.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210616180732.125295-1-nyh@scylladb.com>
2021-06-17 11:02:08 +02:00
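
A worked example of the new threshold, assuming requests are shared in inverse proportion to the (clamped) miss rate; the real heat-weighted load balancing code is more involved:

```
#include <algorithm>
#include <cstdio>

constexpr float max_hit_rate = 0.95f;

static float weight(float hit_rate) {
    // clamp near-perfect caches so they are treated as equivalent
    float miss_rate = 1.0f - std::min(hit_rate, max_hit_rate);
    return 1.0f / miss_rate;    // requests are shared proportionally to this
}

int main() {
    float hot = weight(0.99f);   // clamped to 0.95 -> weight 20
    float cold = weight(0.0f);   // weight 1
    // the cold node now gets ~5% of the hot node's share, enough to warm up
    std::printf("cold/hot request ratio: %.1f%%\n", 100.0f * cold / hot);
}
```
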
Avi Kivity
3c21833aac cql3: expr: make column_value (and similar) a first-class expression
Currently, column names can only appear in a boolean binary expression,
but not on their own. This means that in the statement

   SELECT a FROM tab WHERE a > 3;

we can represent the WHERE clause as an expression, but not the selector.

To pave the way for using expressions in selector contexts, we promote
the elements of binary_operator::lhs (column_value, column_value_tuple,
token) to be expressions in their own right. binary_operator::lhs
becomes an expression (wrapped in unique_ptr, because variants can't
contain themselves).

Note that all three new possibilities make sense in a selector:

  SELECT column FROM tab
  SELECT token(pk) FROM tab
  SELECT function_that_accepts_a_tuple((col1, col2)) FROM tab

There is some fallout from this:

 - because binary_operator contains a unique_ptr, it is no longer
   copyable. We add a copy constructor and assignment operator to
   compensate.
 - often, the new elements don't make sense when evaluating a boolean
   expression, which is the only context we had before. We call
    on_internal_error in these cases. The parser right now prevents such
   cases from being constructed in the first place (this is equivalent to
   if (some_struct_value) in C).
 - in statement_restrictions.cc, we need to evaluate the lhs in the context
   of the full binary operator. I introduced with_current_binary_operator()
   for this; an alternative approach is to create a new sub-visitor.

Closes #8797
2021-06-17 10:08:58 +03:00
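
A rough sketch of the resulting type shape, using simplified stand-in types rather than the actual cql3::expr classes, showing why the unique_ptr wrapper and the hand-written copy operations are needed:

```
#include <memory>
#include <string>
#include <variant>
#include <vector>

struct binary_operator;                                          // forward declaration
struct column_value { std::string name; };
struct column_value_tuple { std::vector<column_value> elements; };
struct token { std::vector<column_value> args; };

// column_value, column_value_tuple and token are now expressions in their own right.
using expression = std::variant<column_value, column_value_tuple, token, binary_operator>;

struct binary_operator {
    // wrapped in unique_ptr because a variant cannot contain itself by value
    std::unique_ptr<expression> lhs;
    std::string op;
    std::string rhs;

    binary_operator(expression e, std::string o, std::string r)
        : lhs(std::make_unique<expression>(std::move(e))), op(std::move(o)), rhs(std::move(r)) {}

    // the unique_ptr makes the type move-only by default, so copyability is
    // restored by deep-copying the wrapped expression
    binary_operator(const binary_operator& other)
        : lhs(std::make_unique<expression>(*other.lhs)), op(other.op), rhs(other.rhs) {}
    binary_operator& operator=(const binary_operator& other) { return *this = binary_operator(other); }
    binary_operator(binary_operator&&) = default;
    binary_operator& operator=(binary_operator&&) = default;
};
```
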
Tomasz Grabiec
6bdf8c4c46 Merge "raft: second series of preparatory patches for group 0 discovery" from Kostja
Miscellaneous preparatory patches for group 0 discovery.

* scylla-dev/raft-group-0-part-2-v4:
  raft: (service) servers map is gid -> server, not sid -> server
  system_keyspace: raft.group_id and raft_snapshots.group_id are TIMEUUID
  raft: (server) wait for configuration transition to complete
  raft: (server) implement raft::server::get_configuration()
  raft: (service) don't throw from schema state machine
  raft: (service) permit some scylla.raft cells to be empty
  raft: (service) properly handle failure to add a server
  raft: implement is_transient_error()
2021-06-17 00:15:40 +02:00
Asias He
7a32cab524 gossip: Fix use-after-free in real_mark_alive and mark_dead
In commit 11a8912093 (gossiper:
get_gossip_status: return string_view and make noexcept)
get_gossip_status returns a pointer to an endpoint_state in
endpoint_state_map.

After commit 425e3b1182 (gossip: Introduce
direct failure detector), gossiper::mark_dead and gossiper::real_mark_alive
can yield in the middle of the function. It is possible that the
endpoint_state is removed in the meantime, causing a use-after-free when it is accessed.

To fix, make a copy before we yield.

Fixes #8859

Closes #8862
2021-06-16 21:16:26 +02:00
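
A generic illustration of the bug class and the fix (not the actual gossiper code): copying the state out of the map before the first yield keeps it valid even if the map entry is erased while the coroutine is suspended:

```
#include <seastar/core/coroutine.hh>
#include <seastar/core/sleep.hh>
#include <chrono>
#include <string>
#include <unordered_map>

static std::unordered_map<std::string, std::string> endpoint_state_map;

seastar::future<> mark_alive(std::string ep) {
    // Unsafe: const std::string& state = endpoint_state_map.at(ep);
    // Safe: copy the state before the first yield point.
    std::string state = endpoint_state_map.at(ep);
    co_await seastar::sleep(std::chrono::milliseconds(1));   // yield point
    // 'state' is still valid here even if the map entry was erased meanwhile.
    co_return;
}
```
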
Konstantin Osipov
18e3fcdbf1 raft: (service) servers map is gid -> server, not sid -> server
Raft Group registry should map Raft Group Id to Raft Server,
not Raft Server ID (which is identical for all groups) to Raft server.

Raft Group 0 ID works as a cluster identifier, so is generated when a
new cluster is created and is shared by all nodes of the same cluster.

Implement a helper to get raft::server by group id.

Consistently throw a new raft_group_not_found exception
if there is no server or rpc for the specified group id.
2021-06-16 19:05:50 +03:00
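
A sketch of the mapping and lookup helper described above; the key, value and exception types here are simplified stand-ins for raft::group_id, raft::server and Scylla's actual exception type:

```
#include <cstdint>
#include <stdexcept>
#include <unordered_map>

struct raft_server {};

struct raft_group_not_found : std::runtime_error {
    using std::runtime_error::runtime_error;
};

class raft_group_registry {
    // keyed by group id, not by server id (the server id is the same for
    // every group hosted on a given node)
    std::unordered_map<uint64_t, raft_server*> _servers;
public:
    raft_server& get_server(uint64_t gid) {
        auto it = _servers.find(gid);
        if (it == _servers.end()) {
            throw raft_group_not_found("no raft server for the given group id");
        }
        return *it->second;
    }
};
```
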
Avi Kivity
f05ddf0967 Merge "Improve LSA descriptor encoding" from Pavel
"
The LSA small objects allocation latency is greatly affected by
the way this allocator encodes the object descriptor in front of
each allocated slot.

Nowadays it's one of the VLE (variable-length encoding) variants,
implemented with the help of a loop. Re-implementing this piece with
fewer instructions and without a loop greatly reduces the allocation latency.

The speed-up mostly comes from loop-less code that doesn't confuse
branch predictor. Also the express encoder seems to benefit from
writing 8 bytes of the encoded value in one go, rather than
byte-by-byte.

Perf measurements:

1. (new) logallog test shows ~40% smaller times

2. perf_mutation in release mode shows ~2% increase in tps

3. the encoder itself is 2 - 4 times faster on x86_64 and
   1.05 - 3 times faster on aarch64. The speed-up depends on
   the 'encoded length': the old encoder runs in linear time, the
   new one in constant time

tests: unit(dev), perf(release), just encoder on Aarch64
"

* 'br-lsa-alloc-latency-4' of https://github.com/xemul/scylla:
  lsa: Use express encoder
  uleb64: Add express encoding
  lsa: Extract uleb64 code into header
  test: LSA allocation perf test
2021-06-16 18:07:13 +03:00
Pavel Emelyanov
8d0780fb92 lsa: Use express encoder
To make it possible to use the express encoder, lsa needs to
make sure that the value is below the express encoder's maximum value and
provide the size of the gap after the encoded value.

Both requirements can be satisfied when encoding the migrator
index on object allocation.

On free the encoded value can be larger; the extended
express encoder would need more instructions and would not be
as efficient, so the old encoder is used there.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-16 17:47:12 +03:00
Pavel Emelyanov
1782b0c6b9 uleb64: Add express encoding
Standard encoding is compiled into a loop that puts values
into memory byte-by-byte. This works slowly, but reliably.
When allocating an object LSA uses the uleb64 encoder with 2
features that allow optimizing the encoder:

1. the value is migrator.index(), which is small enough
   to fit in 2 bytes when encoded
2. After the descriptor there usually comes an object
   which is 8+ bytes in size

Feature #1 makes it possible to encode the value with just
a few instructions. In O3 level clang makes it like

  mov    %esi,%ecx
  and    $0xfc0,%ecx
  and    $0x3f,%esi
  lea    (%rsi,%rcx,4),%ecx
  add    $0x40,%ecx

Next, the encoder needs to put the value into a gap whose
size depends on the alignment of previous and current objects,
so the classical algo loops through this size. Feature #2
makes it possible to put the encoded value and the needed
amount of zeros by using 2 64-bit movs. In this case the
encoded value gets off the needed size and overwrites some
memory after. That's OK, as this overwritten memory is where
the allocated object _will_ be, so the contents there is not
of any interest.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-16 17:47:10 +03:00
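
For illustration, a C++ rendering of the compiled sequence quoted above together with the single 8-byte store over the gap; this is a sketch of the idea, not the actual uleb64 header:

```
#include <cstdint>
#include <cstring>

inline void encode_express(char* pos, uint32_t val) {
    // val is a migrator index, small enough to fit in two encoded bytes:
    // the low 6 bits go into the first byte together with a length tag (0x40),
    // the next 6 bits into the second byte.
    uint64_t encoded = (val & 0x3f) + ((val & 0xfc0u) << 2) + 0x40;
    // A single 64-bit store writes the encoded bytes and zero-fills the rest.
    // Bytes past the gap are overwritten too, but they belong to the object
    // about to be constructed right after the descriptor, so their current
    // contents don't matter.
    std::memcpy(pos, &encoded, sizeof(encoded));
}
```
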
Pavel Emelyanov
d8dea48248 lsa: Extract uleb64 code into header
The LSA code encodes an object descriptor before the object
itself. The descriptor is a 32-bit value and, to store it in an
efficient manner, it's encoded into an unsigned little-endian
base-64 sequence.

The encoding code is going to be optimized, so put it into a
dedicated header in advance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-16 17:46:44 +03:00
Avi Kivity
0948908502 Merge "mutation_reader: multishard_combining_reader clean-up close path" from Botond
"
The close path of the multishard combining reader is riddled with
workarounds for the fact that the flat mutation reader couldn't wait on
futures when destroyed. Now that we have a close() method that can do
just that, all these workarounds can be removed.
Even more workarounds can be found in tests, where resources like the
reader concurrency semaphore are created separately for each tested
multishard reader and then destroyed once it no longer needs them, so we had
to come up with all sorts of creative and ugly workarounds to keep
these alive until background cleanup is finished.
This series fixes all this. Now, after calling close on the multishard
reader, all resources it used, including the life-cycle policy and the
semaphores created by it, can be safely destroyed. This greatly
simplifies the handling of the multishard reader, and makes it much
easier to reason about life-cycle dependencies.

Tests: unit(dev, release:v2, debug:v2,
    mutation_reader_test:debug -t test_multishard,
    multishard_mutation_query_test:debug,
    multishard_combining_reader_as_mutation_source:debug)
"

* 'multishard-combining-reader-close-cleanup/v3' of https://github.com/denesb/scylla:
  mutation_reader: reader_lifecycle_policy: remove convenience methods
  mutation_reader: multishard_combining_reader: store shard_reader via unique ptr
  test/lib/reader_lifecycle_policy: destroy_reader: cleanup context
  test/lib/reader_lifecycle_policy: get rid of lifecycle workarounds
  test/lib/reader_lifecycle_policy: destroy_reader(): stop the semaphore
  test/lib/reader_lifecycle_policy: use a more robust eviction mechanism
  reader_concurrency_semaphore: wait for all permits to be destroyed in stop()
  test/lib/reader_lifcecycle_policy: fix indentation
  mutation_reader: reader_lifecycle_policy::destroy_reader(): require to be called on native shard
  reader_lifecycle_policy implementations: fix indentation
  mutation_reader: reader_lifecycle_policy::destroy_reader(): de-futurize reader parameter
  mutation_reader: shard_reader::close(): wait on the remote reader
  multishard_mutation_query: destroy remote parts in the foreground
  mutation_reader: shard_reader::close(): close _reader
  mutation_reader: reader_lifcecycle_policy::destroy_reader(): remove out-of-date comment
2021-06-16 17:25:50 +03:00
Konstantin Osipov
9c93d77e74 system_keyspace: raft.group_id and raft_snapshots.group_id are TIMEUUID
Fix a bug in the definitions of system.raft and system.raft_snapshots:
group_id is TIMEUUID, not long.
2021-06-16 16:52:43 +03:00
Konstantin Osipov
c67c77ed03 raft: (server) wait for configuration transition to complete
By default, wait for the server to leave the joint configuration
when making a configuration change.

When assembling a fresh cluster Scylla may run a series of
configuration changes. These changes would all go through the same
leader and serialize in the critical section around server::cas().

Unless this critical section protects the complete transition from
C_old configuration to C_new, after the first configuration
is committed, the second may fail with an exception that a configuration
change is in progress. The topology changes layer should handle
this exception, however, this may introduce either unpleasant
delays into cluster assembly (i.e. if we sleep before retry), or
a busy-wait/thundering herd situation, when all nodes are
retrying their configuration changes.

So let's be nice and wait for a full transition in
server::set_configuration().
2021-06-16 16:52:43 +03:00
Konstantin Osipov
631c89e1a6 raft: (server) implement raft::server::get_configuration()
raft::server::set_configuration() is useless at the
application level if we can't query the previous configuration.
2021-06-16 16:52:43 +03:00
Konstantin Osipov
867440f080 raft: (service) don't throw from schema state machine
It's now started as Scylla starts, and state machine failure
leads to panic at start.
2021-06-16 16:52:43 +03:00
Konstantin Osipov
845ff9f344 raft: (service) permit some scylla.raft cells to be empty
When loading raft state from scylla.raft, permit some cells
to be empty. Indeed, the server is not obliged to persist
all of the vote, term and snapshot as soon as it starts. And the log can be
empty.
2021-06-16 16:52:43 +03:00
Konstantin Osipov
b8fa6c6e9c raft: (service) properly handle failure to add a server
future.get() is not available outside a thread context
and co_await is not available inside a catch (...) block.
2021-06-16 16:47:11 +03:00
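
The usual workaround, sketched with hypothetical add_server()/rollback() steps: capture the exception inside the catch block and perform the asynchronous cleanup after it, where co_await is allowed again:

```
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <exception>

seastar::future<> add_server() { return seastar::make_ready_future<>(); }   // hypothetical failing step
seastar::future<> rollback()   { return seastar::make_ready_future<>(); }   // hypothetical async cleanup

seastar::future<> add_server_with_cleanup() {
    std::exception_ptr ex;
    try {
        co_await add_server();
    } catch (...) {
        ex = std::current_exception();   // co_await is not allowed in here
    }
    if (ex) {
        co_await rollback();             // asynchronous cleanup happens outside the catch
        std::rethrow_exception(ex);
    }
}
```
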
Konstantin Osipov
73c59865f7 raft: implement is_transient_error()
Add a helper to classify Raft exceptions as transient.
2021-06-16 16:26:31 +03:00
Pavel Emelyanov
1e67361267 test: LSA allocation perf test
The test measures the time it takes to allocate a bunch
of small objects on LSA inside a single segment.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2021-06-16 13:40:44 +03:00
Botond Dénes
b4e69cf63d test/lib/test_utils: require(): also log failed conditions
Currently `require()` throws an exception when the condition fails. The
problem with this is that the error is only printed at the end of the
test, with no trace in the logs of where exactly it happened relative
to other logged events. This patch also adds an error-level log line to
address this.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210616065711.46224-1-bdenes@scylladb.com>
2021-06-16 12:05:25 +03:00
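
A sketch of the idea (the actual test/lib/test_utils interface differs): log the failure at error level so it appears in-line with other test log messages, and still throw so the framework records the failure:

```
#include <seastar/util/log.hh>
#include <stdexcept>

static seastar::logger testlog("test");

inline void require(bool condition, const char* what,
                    const char* file = __builtin_FILE(), int line = __builtin_LINE()) {
    if (!condition) {
        // error-level log line, interleaved with the rest of the test's output
        testlog.error("require() failed at {}:{}: {}", file, line, what);
        // still throw so the test framework records the failure
        throw std::runtime_error(what);
    }
}
```
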
Botond Dénes
28c2b54875 mutation_reader: reader_lifecycle_policy: remove convenience methods
These convenience methods are not used much anymore and they are not
even really necessary: the register/unregister inactive read API got
streamlined a lot, to the point where all of these "convenience methods"
are just one-liners, which we can inline into their few callers
without losing readability.
2021-06-16 11:29:37 +03:00
Botond Dénes
63f0839164 mutation_reader: multishard_combining_reader: store shard_reader via unique ptr
No need for a shared pointer anymore: we don't have to potentially
keep the shard reader alive after the multishard reader is destroyed, since we
now do proper cleanup in close().
We still need a pointer as the shard reader is un-movable but is stored
in a vector which requires movable values.
2021-06-16 11:29:37 +03:00
Botond Dénes
a69db31b5c test/lib/reader_lifecycle_policy: destroy_reader: cleanup context
Now that we don't rely on any external machinery to keep the relevant
parts of the context alive until needed (its life-cycle is effectively
enclosed in that of the life-cycle policy itself), we can clean up the
context in `destroy_reader()` itself, avoiding a background trip back to
this shard.
2021-06-16 11:29:36 +03:00
Botond Dénes
d2ddaced4e test/lib/reader_lifecycle_policy: get rid of lifecycle workarounds
The lifecycle of the reader lifecycle policy and all the resources the
reads use is now enclosed in that of the multishard reader thanks to its
close() method. We can now remove all the workarounds we had in place to
keep various resources alive until background reader cleanup finishes.
2021-06-16 11:29:36 +03:00
Botond Dénes
5a271e42a5 test/lib/reader_lifecycle_policy: destroy_reader(): stop the semaphore
So that when this method returns the semaphore is safe to destroy. This
in turn will enable us to get rid of all the machinery we have in place
to deal with the semaphore having to out-live the lifecycle policy
without a clear point at which it becomes safe to destroy.
2021-06-16 11:29:36 +03:00
Botond Dénes
c09c62a0fb test/lib/reader_lifecycle_policy: use a more robust eviction mechanism
The test reader lifecycle policy has a mode in which it wants to ensure
all inactive readers are evicted, so tests can stress reader recreation
logic. For this it currently employs a trick of creating a waiter on the
semaphore. I don't even know how this works (or if it even does),
but it sure complicates the lifecycle policy code a lot.
So switch to the much more reliable and simple method of creating the
semaphore with a single count and no memory. This ensures that all
inactive reads are immediately evicted, while still allowing a single read
to be admitted at all times.
2021-06-16 11:29:36 +03:00
Botond Dénes
578a092e4a reader_concurrency_semaphore: wait for all permits to be destroyed in stop()
To prevent use-after-free resulting from any permit out-living the
semaphore.
2021-06-16 11:29:36 +03:00
Botond Dénes
a10a6e253e test/lib/reader_lifcecycle_policy: fix indentation
Left broken from the previous patch.
2021-06-16 11:29:36 +03:00
Botond Dénes
8c7447effd mutation_reader: reader_lifecycle_policy::destroy_reader(): require to be called on native shard
Currently shard_reader::close() (its caller) goes to the remote shard,
copies back all fragments left there to the local shard, then calls
`destroy_reader()`, which in the case of the multishard mutation query
copies it all back to the native shard. This was required before because
`shard_reader::stop()` (`close()`'s predecessor) couldn't wait on
`smp::submit_to()`. But close can, so we can get rid of all this
back-and-forth and just call `destroy_reader()` on the shard the reader
lives on, just like we do with `create_reader()`.
2021-06-16 11:29:35 +03:00
Avi Kivity
c3838cbc3b Merge 'Make calculating affected ranges yieldable' from Piotr Sarna
This series partially addresses #8852 and its problems caused by deleting large partitions from tables with materialized views. The issue in question is not fixed by this series, because a full fix requires a more complex rewrite of the view update mechanism.
This series makes calculating affected clustering ranges for materialized view updates more resilient to large allocations and stalls. It does so by futurizing the function, which can potentially involve large computations, and by making it use non-contiguous storage instead of std::vector to avoid large allocations.

Tests: unit(release)

Closes #8853

* github.com:scylladb/scylla:
  db,view,table: futurize calculating affected ranges
  table: coroutinize do_push_view_replica_updates
  db,view: use chunked vector for view affected ranges
  interval: generalize deoverlap()
2021-06-16 11:26:49 +03:00
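
Simplified shape of the change, assuming Seastar's coroutine::maybe_yield helper and Scylla's utils::chunked_vector; the element type is a stand-in for the real clustering ranges:

```
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include "utils/chunked_vector.hh"

// The computation becomes a coroutine that can yield between elements, and
// results are collected in a chunked_vector, which allocates in fixed-size
// chunks instead of one large contiguous buffer.
seastar::future<utils::chunked_vector<int>> collect(const utils::chunked_vector<int>& input) {
    utils::chunked_vector<int> out;
    for (int v : input) {
        out.push_back(v);                             // no large contiguous allocation
        co_await seastar::coroutine::maybe_yield();   // avoid reactor stalls on large inputs
    }
    co_return out;
}
```
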
Botond Dénes
4ecf061c90 reader_lifecycle_policy implementations: fix indentation
Left broken from the previous patch.
2021-06-16 11:21:38 +03:00
Botond Dénes
a7e59d3e2c mutation_reader: reader_lifecycle_policy::destroy_reader(): de-futurize reader parameter
The shard reader is now able to wait on the stopped reader and pass the
already stopped reader to `destroy_reader()`, so we can de-futurize the
reader parameter of said method. The shard reader was already patched to
pass a ready future so adjusting the call-site is trivial.
The most prominent implementation, the multishard mutation query, can
now also drop its `_dismantling_gate` which was put in place so it can
wait on the background stopping of readers.

A consequence of this move is that errors that might happen
during the stopping of the reader are now handled in the shard reader,
not in each lifecycle policy implementation.
2021-06-16 11:21:38 +03:00
Botond Dénes
13d7806b62 mutation_reader: shard_reader::close(): wait on the remote reader
We now have a future<> returning close() method so we don't need to
do the cleanup of the remote reader in the background, detaching it
from the shard-reader under destruction. We can now wait for the
cleanup properly before the shard reader is destroyed and just pass the
stopped reader to reader_lifecycle_policy::destroy_reader(). This patch
does the first part -- moving the cleanup to the foreground; the API
change of said method will come in the next patch.
2021-06-16 11:21:38 +03:00
Botond Dénes
ab8d2a04a5 multishard_mutation_query: destroy remote parts in the foreground
Currently the foreign fields of the reader meta are destroyed in the
background via the foreign pointer's destructor (with one exception).
This makes the already complicated life-cycle of these parts and their
dependencies even harder to reason about, especially in tests, where
even things like semaphores live only within the test.
This patch makes sure to destroy all these remote fields in the
foreground in either `save_reader()` or `stop()`, ensuring that once
`stop()` returns, everything is cleaned up.
2021-06-16 11:21:38 +03:00
Botond Dénes
7552cc73cf mutation_reader: shard_reader::close(): close _reader
The reason we got away without closing _reader so far is that it is an
`std::unique_ptr<evictable_reader>` which is a
`flat_mutation_reader::impl` instance, without the
`flat_mutation_reader` wrapper, which contains the validations for
close.
2021-06-16 11:21:33 +03:00
Avi Kivity
fce124bd90 Merge "Introduce flat_mutation_reader_v2" from Tomasz
"
This series introduces a new version of the mutation fragment stream (called v2)
which aims at improving range tombstone handling in the system.

When compacting a mutation fragment stream (e.g. for sstable compaction, data query, repair),
the compactor needs to accumulate range tombstones which are relevant for the yet-to-be-processed range.
See range_tombstone_accumulator. One problem is that it has an unbounded memory footprint because the
accumulator needs to keep track of all the tombstoned ranges which are still active.

Another, although more benign, problem is computational complexity needed to maintain that data structure.

The fix is to get rid of the overlap of range tombstones in the mutation fragment stream. In v2 of the
stream, there is no longer a range_tombstone fragment. Deletions of ranges of rows within a given
partition are represented with range_tombstone_change fragments. At any point in the stream there
is a single active clustered tombstone. It is initially equal to the neutral tombstone when the
stream of each partition starts. The range_tombstone_change fragment type signifies changes of the
active clustered tombstone. All fragments emitted while a given clustered tombstone is active are
affected by that tombstone. Like with the old range_tombstone fragments, the clustered tombstone
is independent from the partition tombstone carried in partition_start.

The memory needed to compact a stream is now constant, because the compactor needs to only track the
current tombstone. Also, there is no need to expire ranges on each fragment because the stream emits
a fragment when the range ends.

This series doesn't convert all readers to v2. It introduces adaptors which can convert
between v1 and v2 streams. Each mutation source can be constructed with either a v1 or a v2 stream factory,
but it can be asked for either version, performing conversion under the hood if necessary.

In order to guarantee that v1 to v2 conversion produces a well-formed stream, this series needs to
impose a constraint on v1 streams to trim range tombstones to clustering restrictions. Otherwise,
the v1->v2 converter could produce range tombstone changes which lie outside query restrictions, making
the stream non-canonical.

The v2 stream is strict about range tombstone trimming. It emits range tombstone changes which reflect
range tombstones trimmed to query restrictions and to fast-forwarding ranges. This makes the stream
more canonical, meaning that for a given set of writes, querying the database should produce the
same stream of fragments for given restrictions. There is less ambiguity in how the writes
are represented in the fragment stream. That wasn't the case with v1: for example, a given set
of deletions could be produced either as one range_tombstone or as many, split and/or deoverlapped
with other fragments. A canonical stream is easier to use for diff calculation.

The mc sstable reader was converted to v2 because it seemed like a comparable effort to do that
versus implementing range tombstone trimming in v1.

The classes related to mutation fragment streams were cloned:
flat_mutation_reader_v2, mutation_fragment_v2, related concepts.

Refs #8625. To fully fix #8625 we need to finish the transition and get rid of the converters.
Converters accumulate range tombstones.

Tests:

 - unit [dev]
"

* tag 'flat_mutation_reader_range_tombstone_split-v3.2' of github.com:tgrabiec/scylla: (26 commits)
  tests: mutation_source_test: Run tests with conversions inserted in the middle
  tests: mutation_source_tests: Unroll run_flat_mutation_reader_tests()
  tests: Add tests for flat_mutation_reader_v2
  flat_mutation_reader: Update the doc to reflect range tombstone trimming
  sstables: Switch the mx reader to flat_mutation_reader_v2
  row_cache: Emit range tombstone adjacent to upper bound of population range
  tests: sstables: Fix test assertions to not expect more than they should
  flat_mutation_reader: Trim range tombstones in make_flat_mutation_reader_from_fragments()
  clustering_ranges_walker: Emit range tombstone changes while walking
  tests: flat_mutation_reader_assertions_v2: Adapt to the v2 stream
  Clone flat_reader_assertions into flat_reader_assertions_v2
  test: lib: simple_schema: Reuse new_tombstone()
  test: lib: simple_schema: Accept tombstone in delete_range()
  mutation_source: Introduce make_reader_v2()
  partition_snapshot_flat_reader: Trim range tombstones to query ranges
  mutation_partition: Trim range tombstones to query ranges
  sstables: reader: Inline specialization of sstable_mutation_reader
  sstables: k_l: reader: Trim range tombstones to query ranges
  clustering_ranges_walker: Introduce split_tombstone()
  position_range: Introduce contains() check for ranges
  ...
2021-06-16 11:10:54 +03:00
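
A self-contained illustration of the core v1-to-v2 idea, with positions and tombstones simplified to integers: overlapping range tombstones become a sequence of range-tombstone-change events, each setting the single active tombstone from its position onward.

```
#include <cstdint>
#include <map>
#include <set>
#include <utility>
#include <vector>

struct range_tombstone { int start, end; int64_t timestamp; };   // v1: covers [start, end), start < end
struct rt_change { int position; int64_t active_timestamp; };     // v2: 0 means the neutral tombstone

std::vector<rt_change> to_changes(const std::vector<range_tombstone>& rts) {
    // boundary sweep: position -> (timestamp, is_start)
    std::multimap<int, std::pair<int64_t, bool>> events;
    for (const auto& rt : rts) {
        events.emplace(rt.start, std::make_pair(rt.timestamp, true));
        events.emplace(rt.end, std::make_pair(rt.timestamp, false));
    }
    std::multiset<int64_t> open;    // timestamps of tombstones covering the current position
    std::vector<rt_change> out;
    int64_t active = 0;             // neutral tombstone at the start of the partition
    for (auto it = events.begin(); it != events.end(); ) {
        int pos = it->first;
        for (; it != events.end() && it->first == pos; ++it) {
            if (it->second.second) open.insert(it->second.first);
            else open.erase(open.find(it->second.first));
        }
        // the strongest (newest) open tombstone is the active one
        int64_t now = open.empty() ? 0 : *open.rbegin();
        if (now != active) {
            // emit a change only when the active tombstone changes, so at any
            // point in the stream exactly one tombstone is in effect
            out.push_back({pos, now});
            active = now;
        }
    }
    return out;
}
// e.g. {[10,40)@5, [20,50)@3} -> change(10, 5), change(40, 3), change(50, neutral)
```
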
Piotr Sarna
f832a30388 db,view,table: futurize calculating affected ranges
In order to avoid stalls on large inputs, calculating
affected ranges is now able to yield.
2021-06-16 09:51:31 +02:00
Piotr Sarna
e3fa0246a1 table: coroutinize do_push_view_replica_updates
Makes the code cleaner, but more importantly it will make it easier
to futurize calculate_affected_clustering_ranges in the near future.
2021-06-16 09:51:30 +02:00
Avi Kivity
44f3ad836b main: use correct max-io-requests option spelling
We check for the existence of the option using one spelling,
then read it using another, so if the option is present we crash with
bad_lexical_cast when casting the empty string to unsigned.

Fix by using the correct spelling.

Closes #8866
2021-06-16 09:35:05 +02:00
Tomasz Grabiec
605a6e0166 Merge "Remove int_or_strong_ordering concept" from Pavel
The concept was added to smoothly switch tri-comparing code from int
to strong_ordering. As of today only tests still need it and the
conversion is pretty simple, plus an operator<<(ostream&) for the
std::strong_ordering type.

* xemul/br-remove-int-or-strong-ordering-2:
  util: Drop int_or_strong_ordering concept
  tests: Switch total-order-check onto strong_ordering
  to_string: Add formatter for strong_ordering
  tests: Return strong-ordering from tri-comparators
2021-06-16 09:34:49 +02:00
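
A minimal formatter of the kind mentioned above (the actual to_string.hh helper may print different text):

```
#include <compare>
#include <ostream>

inline std::ostream& operator<<(std::ostream& os, std::strong_ordering o) {
    // std::strong_ordering compares against the literal 0
    return os << (o < 0 ? "lt" : o > 0 ? "gt" : "eq");
}
```
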
Botond Dénes
114459684b mutation_reader: foreign_reader::close() use on_internal_error_noexcept()
Instead of the throwing on_internal_error(). `close()` is noexcept so we
can't throw exceptions here.

Signed-off-by: Botond Dénes <bdenes@scylladb.com>
Message-Id: <20210615133130.786048-1-bdenes@scylladb.com>
2021-06-16 09:34:49 +02:00
Asias He
11959173a4 storage_service: Add node_ops_cmd_heartbeat_updater helper
Multiple node operations use similar heartbeat update logic. Add a
helper to reduce the code duplication.

Fixes #8825

Closes #8826
2021-06-16 09:34:49 +02:00
Gleb Natapov
580edcef27 raft: register metrics only after fsm is created
Metrics access the _fsm pointer, so we should register them only after the
pointer is populated.

Fixes: #8824

Message-Id: <YMilsCslLAeEnbaw@scylladb.com>
2021-06-16 09:34:49 +02:00
Asias He
c2cfdcd345 gossiper: Set minimum value for quarantine_delay
When a new node bootstraps to join the cluster, it will be set in
bootstrap gossip status. If the node goes away in the middle, it
will be removed by gossip once it fails to update gossip within
fat_client_timeout, which reverts the new node to a pending node.

However, the new node may be slow to update gossip and finish
bootstrapping only after the existing nodes have already removed it
after fat_client_timeout. In that case, in the handle_state_normal
handler, the existing nodes will fail to find the host id for the new
node and throw, which in turn terminates the scylla process.

To mitigate the problem, we enforce a minimum value for quarantine_delay
(and therefore for fat_client_timeout, which is half of it) if users set
a small ring_delay value.

Refs #8702
Refs #8859

Closes #8860
2021-06-16 09:34:49 +02:00
Tomasz Grabiec
3fcd1f43ba tests: mutation_source_test: Run tests with conversions inserted in the middle 2021-06-16 00:23:49 +02:00
Tomasz Grabiec
cddcba27de tests: mutation_source_tests: Unroll run_flat_mutation_reader_tests()
All readers are now flat so there is no need for this grouping.

This will be needed for the next patch, which needs a single function with
all test cases.
2021-06-16 00:23:49 +02:00
Tomasz Grabiec
ffb616fef6 tests: Add tests for flat_mutation_reader_v2 2021-06-16 00:23:49 +02:00
Tomasz Grabiec
3deaa15751 flat_mutation_reader: Update the doc to reflect range tombstone trimming 2021-06-16 00:23:49 +02:00