Commit Graph

2309 Commits

Author SHA1 Message Date
Nadav Har'El
414b672e22 test/alternator: verify that empty-string keys are NOT allowed
Since May 2020, empty strings are allowed in DynamoDB as attribute values
(see announcement in [1]). However, they are still not allowed as keys.

We had tests that they are not allowed in keys of an LSI or GSI, but were
missing tests that they are not allowed as keys (partition or sort key) of
base tables. This patch adds these missing tests.

These tests pass - we already had code that checked for empty keys and
generated an appropriate error.

Note that for compatibility with DynamoDB, Alternator will forbid empty
strings as keys even though Scylla *does* support this possibility
(Scylla always supported empty strings as clustering key, and empty
partition keys will become possible with issue #9352).

[1] https://aws.amazon.com/about-aws/whats-new/2020/05/amazon-dynamodb-now-supports-empty-values-for-non-key-string-and-binary-attributes-in-dynamodb-tables/

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20211003122842.471001-1-nyh@scylladb.com>
2021-10-04 08:40:43 +02:00
Michał Radwański
0d5a2067ad test/lib/failure_injecting_allocation_strategy: remove UB...
by setting _alloc_count initially to 0.

The _alloc_count member was not explicitly initialized. As the allocator
was usually an automatic variable, _alloc_count initially held unspecified
contents. This probably means that cases where the first few allocations
passed and a later one failed may never have been tested. The good news is
that most users have been migrated to the Seastar failure injector, which
(by accident) has been correct.
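
A minimal self-contained sketch of the fix (member and method names are
illustrative, not the exact test-library API):

```c++
#include <cstddef>
#include <cstdint>
#include <limits>
#include <new>

// Failure-injecting allocator sketch. Without the `= 0` initializer an
// automatic instance reads an indeterminate _alloc_count -- undefined
// behavior that makes the injected failure point effectively random.
class failure_injecting_allocation_strategy {
    uint64_t _alloc_count = 0; // the fix: start counting from zero
    uint64_t _fail_at = std::numeric_limits<uint64_t>::max();
public:
    void* allocate(size_t size) {
        if (_alloc_count++ == _fail_at) {
            stop_failing();
            throw std::bad_alloc();
        }
        return ::operator new(size);
    }
    void free(void* ptr) { ::operator delete(ptr); }
    // Make the n-th allocation from now throw std::bad_alloc.
    void fail_after(uint64_t n) { _fail_at = _alloc_count + n; }
    void stop_failing() { _fail_at = std::numeric_limits<uint64_t>::max(); }
};
```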

Closes #9420
2021-10-01 13:25:05 +02:00
Eliran Sinvani
c38ceafdcf Service Level Controller: Add an extension point to the API (#9374)
In order to ease future extensions to the information being sent
by the service level configuration change API, we pack the additional
parameters (other than the service level options) passed to the interface
into a structure. This will allow easy expansion in the future if more
parameters need to be sent to the observer.
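
A hedged illustration of the pattern (all names below are hypothetical,
not the actual Scylla interface): packing the extra parameters into a
struct lets new fields be added without touching every observer's
signature.

```c++
#include <string>

// Stand-in for the existing per-level options type.
struct service_level_options {
    std::string workload_type;
};

// The extension point: everything besides the options travels in one
// struct, so adding a field is not an interface-breaking change.
struct service_level_info {
    std::string name; // hypothetical field; more can be added over time
};

struct service_level_observer {
    virtual ~service_level_observer() = default;
    virtual void on_service_level_change(const service_level_options& opts,
                                         const service_level_info& info) = 0;
};
```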

Signed-off-by: Eliran Sinvani <eliransin@scylladb.com>
2021-10-01 10:20:28 +03:00
Nadav Har'El
1edcc3a218 test/alternator: add test for reverse queries
This patch adds a reproducer for issue #7586 - that Alternator queries
(Query) operating in reverse order (ScanIndexForward = false) are
artificially limited to 100 MB partitions because of their memory use.

This test generates a partition over 100 MB in size and then tries various
reverse queries on it - with or without Limit, starting at the end or
the middle of the partition. The test currently fails when a reverse query
refuses to operate on such a large partition - the log reports this:

  ERROR ... Memory usage of reversed read exceeds hard limit of 104857600
  (configured via max_memory_for_unlimited_query_hard_limit), while reading
  partition K1H6ON3A1C

With yet-uncommitted reverse-scan improvements, the test proceeds further,
but still fails where we test that a reverse query with Limit not
explicitly specified should still be limited to a certain size (e.g. 1MB)
and cannot return the entire 100 MB partition in one response.

Please note that this is not a comprehensive test for Scylla's reverse
scan implementation: In particular we do not have separate tests for
reverse scan's implementation on different sources - memtables, sstables,
or the cache. Nor do we check all sorts of edge cases. We assume that
Scylla's reverse scan implementation will have its own unit tests
elsewhere that will check these things - and this test can focus on the
Alternator use case.

This test is marked "xfail" because it still fails on Alternator. It is
marked "veryslow" because it's a (relatively) slow test, taking multiple
seconds to set up the 100 MB partition. So run the test with the
pytest options "--runxfail --runveryslow" to see how it fails.

Refs #7586

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210930063700.407511-1-nyh@scylladb.com>
2021-09-30 09:34:39 +02:00
Pavel Emelyanov
e6b920017a main: Replace cql_config_updater with updateable_value
The cql_config_updater is a sharded<> service that exists in main and
whose goal is to make sure some of db::config's values are propagated into
cql_config. There's a more handy updateable_value<> glue for that.
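
The real glue lives in Scylla's utils/updateable_value.hh; the sketch
below, with hypothetical types, only illustrates the idea of consumers
observing a source so config changes propagate without a dedicated
updater service.

```c++
#include <functional>
#include <utility>
#include <vector>

// Illustrative only -- the actual Scylla types differ in detail.
template <typename T>
class updateable_value_source {
    T _value;
    std::vector<std::function<void(const T&)>> _observers;
public:
    explicit updateable_value_source(T v) : _value(std::move(v)) {}
    const T& get() const { return _value; }
    // Setting a new value notifies every registered consumer, e.g. a
    // cql_config field fed from db::config.
    void set(T v) {
        _value = std::move(v);
        for (auto& f : _observers) {
            f(_value);
        }
    }
    void observe(std::function<void(const T&)> f) {
        _observers.push_back(std::move(f));
    }
};
```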

tests: unit(dev)
refs: #2795

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210927090402.25980-1-xemul@scylladb.com>
2021-09-30 07:23:43 +03:00
Botond Dénes
970fe9a339 mutation_writer: partition_based_splitting_writer: limit number of max buckets
Recently we observed an OOM caused by the partition based splitting
writer going crazy, creating 1.7K buckets while scrubbing an especially
broken sstable. To avoid situations like that in the future, this patch
provides a max limit for the number of live buckets. When the number of
buckets reaches this limit, the largest bucket is closed and replaced by
a new one. This will end up creating more output sstables during scrub
overall, but they will no longer all be written at the same time, which
is what caused the insane memory pressure and possibly OOM.
Scrub compaction sets this limit to 100, the same limit that TWCS's
timestamp-based splitting writer uses (implemented through the
classifier -
time_window_compaction_strategy::max_data_segregation_window_count).
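
A simplified, self-contained sketch of the cap (the real writer's
bookkeeping differs): before opening a new bucket, close the largest one
once the limit is reached.

```c++
#include <algorithm>
#include <cstddef>
#include <list>

struct bucket {
    size_t data_size = 0;
    void close() { /* seal the bucket's output sstable */ }
};

// Called before opening a new bucket: if we are at the limit, close the
// largest live bucket so memory use stays bounded. Its partitions will
// simply start a fresh output sstable later, at the cost of producing
// more sstables overall.
void make_room_for_new_bucket(std::list<bucket>& buckets, size_t max_buckets) {
    if (buckets.size() < max_buckets) {
        return;
    }
    auto largest = std::max_element(buckets.begin(), buckets.end(),
            [] (const bucket& a, const bucket& b) {
                return a.data_size < b.data_size;
            });
    largest->close();
    buckets.erase(largest);
}
```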

Fixes: #9400

Tests: unit(dev)

Closes #9401
2021-09-29 16:31:29 +03:00
Avi Kivity
b3c95a1fc6 commitlog: reduce inclusions of commitlog.hh due to db::commitlog::force_sync (#9379)
There are now 231 translation units that indirectly include commitlog.hh
due to the need to have access to db::commitlog::force_sync.

Move that type to a new file commitlog_types.hh and make it available
without access to the commitlog class.

This reduces the number of translation units that depend on commitlog.hh
to 84, improving compile time.
2021-09-29 16:13:44 +03:00
Nadav Har'El
5cbe9178fd alternator: add missing BatchGetItem metric
Unfortunately, defining metrics in Scylla requires some code
duplication, with the metrics declared in one place but exported in a
different place in the code. When we duplicated this code in Alternator,
we accidentally dropped the first metric - for BatchGetItem. The metric
was accounted in the code, but not exported to Prometheus.

In addition to fixing the missing metric, this patch also adds a test
that confirms that the BatchGetItem metric increases when the
BatchGetItem operation is used. This test failed before this patch, and
passes with it. The test only currently tests this for BatchGetItem
(and BatchWriteItem) but it can be later expanded to cover all the other
operations as well.

Fixes #9406

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210929121611.373074-1-nyh@scylladb.com>
2021-09-29 14:16:54 +02:00
Tomasz Grabiec
11a3b411c5 Merge 'mutation_source_test: test reverse reads' from Botond Dénes
Currently no mutation-source supports reading in reverse natively but
we are working on changing that, adding native reverse read support to
memtable, cache and sstable readers. To ensure that all mutation
sources work in a correct and uniform manner when reading in reverse,
we add a reverse test to the mutation source test suite. This test
reverses the data that it passes to `populate()`, then reads in
forward order (in reverse compared to the data order). For this we use
the currently established reverse read API: reverse schema (schema
order == query order) and half-reversed (legacy) slice.  All mutation
sources are prepared to work with reversed reads, using the
`make_reversing_reader()` adapter. As we progress with our native
reverse support, we will replace these adapters with native reversing
support. As part of this, we push down the reversing reader adapter
currently existing on the `query::consume_page()` level, to the
individual mutation sources.

Closes #9384

* github.com:scylladb/scylla:
  test: mutation_reader_test: reversed version of test_clustering_order_merger_sstable_set
  querier: consume_page(): remove now unused max_size parameter
  test/lib: mutation_source_test: test reading in reverse
  test: mutation_reader_test: clustering_combined_reader_mutation_source_test: prepare for reading in reverse
  test: flat_mutation_reader_test: test_reverse_reader_is_mutation_source: prepare for reading in reverse
  test: mutation_reader_test: test_manual_paused_evictable_reader_is_mutation_source: use query schema instead of table schema
  treewide: move reversing to the mutation sources
  mutation_query: reconcilable_result_builder: document reverse query preconditions
  sstable_set: time_series_sstable_set: reverse mode
  mutlishard_mutation_query: set max result size on used permits
  db/virtual_table: streaming_virtual_table::as_mutation_source(): use query schema instead of table schema
  flat_mutation_reader: make_reversing_reader(): add convenience stored slice
  mutation_reader: evictable_reader: add reverse read support
  flat_mutation_reader: make_flat_mutation_reader_from_fragments(): add reverse read support
  flat_mutation_reader: flat_mutation_reader_from_mutations(): add reverse read support
  flat_mutation_reader: flat_mutation_reader_from_mutations(): document preconditions
  query-request: introduce `half_reverse_slice`
  flat_mutation_reader_assertions: log what's expected
2021-09-29 12:57:57 +02:00
Avi Kivity
d4aa6c2746 Merge "compaction: Update backlog tracker correctly when schema is updated" from Raphael
"
Backlog tracker isn't updated correctly when facing a schema change, and
may leak an SSTable if the compaction strategy is changed, which causes
the backlog to be computed incorrectly. Most of these problems happen
because the sstable set and the tracker are updated independently, so it
could happen that the tracker loses track (pun intended) of changes
applied to the set.

The first patch will fix the leak when the strategy is changed, and the
third patch will make sure that the tracker is updated atomically with
the sstable set, so these kinds of problems will not happen anymore.

Fixes #9157
"

* 'fixes_to_backlog_tracker_v4' of github.com:raphaelsc/scylla:
  compaction: Update backlog tracker correctly when schema is updated
  compaction: Don't leak backlog of input sstable when compaction strategy is changed
  compaction: introduce compaction_read_monitor_generator::remove_exhausted_sstables()
  compaction: simplify removal of monitors
2021-09-29 13:55:37 +03:00
Kamil Braun
075a894a89 test: mutation_reader_test: reversed version of test_clustering_order_merger_sstable_set 2021-09-29 12:15:48 +03:00
Botond Dénes
42b677ef6f querier: consume_page(): remove now unused max_size parameter 2021-09-29 12:15:48 +03:00
Botond Dénes
bc49c27a06 test/lib: mutation_source_test: test reading in reverse
To ensure all mutation sources uniformly support the current API of
reverse reading: reversed schema and half-reversed slice. This test will
also ensure that once we switch to native-reverse slice, all
mutation-sources will keep on working.
2021-09-29 12:15:48 +03:00
Kamil Braun
7d5273b044 test: mutation_reader_test: clustering_combined_reader_mutation_source_test: prepare for reading in reverse
For reversed reads we must adjust the lower/upper bounds used by the
`position_reader_queue` and `clustering_combined_reader`. The bounds are
calculated using the mutation schema, but we need bounds calculated
using the query schema which is reversed.
2021-09-29 12:15:48 +03:00
Botond Dénes
9399f379ec test: flat_mutation_reader_test: test_reverse_reader_is_mutation_source: prepare for reading in reverse
The mutation source test suite will soon test reads in reverse. Prepare
for this by checking the reversed flag on the slice and not reversing
the data when set. The test will have two modes effectively:
* Forward mode: data is reversed before read, then reversed again during
  read.
* Reverse mode: data is already reversed and it is reversed back during
  read.
2021-09-29 12:15:48 +03:00
Botond Dénes
c048d854d9 test: mutation_reader_test: test_manual_paused_evictable_reader_is_mutation_source: use query schema instead of table schema
The two might not be the same in case the schema was upgraded or if we
are reading in reverse. It is important to use the passed-in query
schema consistently during a read.
2021-09-29 12:15:48 +03:00
Botond Dénes
41facb3270 treewide: move reversing to the mutation sources
Push down reversing to the mutation-sources proper, instead of doing it
on the querier level. This will allow us to test reverse reads on the
mutation source level.
The `max_size` parameter of `consume_page()` is now unused, but it is not
removed in this patch; it will be removed in a follow-up to reduce
churn.
2021-09-29 12:15:45 +03:00
Nadav Har'El
88177d7be7 test/alternator: add test for too many items in BatchWriteItem
DynamoDB limits the number of items that a BatchWriteItem call can write
to 25. As noted in issue #5057, in Alternator we don't have this limit
or any limit on the number of items in a BatchWriteItem - which probably
isn't wise.

This patch adds a simple xfailing test for this.

Refs #5057

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210912140736.76995-1-nyh@scylladb.com>
2021-09-29 10:48:58 +02:00
Kamil Braun
7dc4ee35c9 sstable_set: time_series_sstable_set: reverse mode
`time_series_sstable_set` uses `clustering_combined_reader` to implement
efficient single-partition reads. It provides a `position_reader_queue`
to the reader. This queue returns readers to the sstables from the set
in order of the sstables' lower bounds, and with each reader it provides
an upper bound for the positions-in-partition returned by the reader.

Until now we would assume non-reversed queries only. Reversed queries
were implemented by performing a forward query in the lower layers
and reversing the results at the upper-most layer of the reader stack.
Before pushing the reversing down to the sources (in particular,
to sstable readers), we need to support the reverse mode in
`time_series_sstable_set` and the queue it provides to
`clustering_combined_reader`.

This requires using different lower and upper bounds in the queue.
For non-reversed reads we used `sstable::min_position()` as the lower
bound and `sstable::max_position()` as the upper bound. For reversed
reads all comparisons performed by `clustering_combined_reader` will be
reversed, as it will use a reversed schema. We can then use
`sstable::max_position().reversed()` for the lower bound and
`sstable::min_position().reversed()` for the upper bound.
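
A self-contained illustration of this bound flip, with integers standing
in for position_in_partition (negation mimics reversed(), since all
comparisons flip under the reversed schema):

```c++
#include <utility>

// Stand-in for an sstable's clustering bounds.
struct sst_bounds {
    int min_pos;
    int max_pos;
};

// Bounds handed to the position_reader_queue. Under a reversed schema
// every comparison flips, so max_position (negated here, playing the
// role of position_in_partition::reversed()) becomes the lower bound
// and min_position the upper bound.
std::pair<int, int> queue_bounds(const sst_bounds& s, bool reversed) {
    if (!reversed) {
        return {s.min_pos, s.max_pos};   // {lower, upper}
    }
    return {-s.max_pos, -s.min_pos};     // {lower, upper}, reversed order
}
```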
2021-09-28 17:03:57 +03:00
Kamil Braun
270093b251 flat_mutation_reader_assertions: log what's expected 2021-09-28 17:03:57 +03:00
Tomasz Grabiec
c4328ffc4d tests: mutation_test: Add test for position_in_partition::reversed()
Message-Id: <20210927154942.44236-1-tgrabiec@scylladb.com>
2021-09-28 13:09:39 +02:00
Raphael S. Carvalho
afd45b9f49 compaction: Don't leak backlog of input sstable when compaction strategy is changed
The generic backlog formula is: ALL + PARTIAL - COMPACTING

With transfer_ongoing_charges() we already ignore the effect of
ongoing compactions on COMPACTING as we judge them to be pointless.

But ongoing compactions will run to completion, meaning that output
sstables will be added to ALL anyway, in the formula above.

With stop_tracking_ongoing_compactions(), input sstables are never
removed from the tracker, but output sstables are added, which means
we end up with duplicate backlog in the tracker.
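
A toy numeric illustration of that double counting (simplified model,
not the real tracker):

```c++
#include <cassert>
#include <numeric>
#include <vector>

int main() {
    // Backlog = ALL + PARTIAL - COMPACTING; here PARTIAL and COMPACTING
    // are already zeroed for the pointless compaction, so backlog == ALL.
    std::vector<long> all = {100, 100}; // two inputs of a running compaction
    // Strategy changes; with stop_tracking_ongoing_compactions() the
    // inputs stay in ALL, and when the compaction finishes its output
    // is added on top:
    all.push_back(180);
    long backlog = std::accumulate(all.begin(), all.end(), 0L);
    assert(backlog == 380); // the inputs' 200 is leaked: counted twice
    return 0;
}
```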

By removing this tracking mechanism, pointless ongoing compactions
will be ignored as expected and the leaks will be fixed.

Later, the intention is to force a stop on ongoing compactions if the
strategy has changed, as they're pointless anyway.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-27 14:03:28 -03:00
Asias He
1657e7be14 gossiper: Send generation number with shutdown message
Consider:
- n1, n2 in the cluster
- n2 shutdown
- n2 sends gossip shutdown message to n1
- n1 delays processing of the handler of shutdown message
- n2 restarts
- n1 learns new gossip state of n2
- n1 resumes handling the shutdown message
- n1 incorrectly marks n2 as shutdown until n2 restarts again

To prevent this, we can send the gossip generation number along with the
shutdown message. If the generation number does not match the local
generation number for the remote node, the shutdown message will be
ignored.

Since we use rpc::optional to send the generation number, this works
in a mixed cluster.
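
A self-contained sketch of the guard (the map and names are stand-ins
for the gossiper's endpoint state; the optional mirrors rpc::optional,
so messages from old nodes that carry no generation are still honored):

```c++
#include <cstdint>
#include <optional>
#include <string>
#include <unordered_map>

struct endpoint_state {
    int64_t generation; // bumped every time the node restarts
    bool shutdown = false;
};

std::unordered_map<std::string, endpoint_state> endpoints;

void handle_shutdown_msg(const std::string& from, std::optional<int64_t> gen) {
    auto it = endpoints.find(from);
    if (it == endpoints.end()) {
        return;
    }
    // A generation mismatch means the message was sent by a previous
    // incarnation of the node, so marking it as shutdown now would be
    // wrong -- ignore the stale message.
    if (gen && *gen != it->second.generation) {
        return;
    }
    it->second.shutdown = true;
}
```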

Fixes #8597

Closes #9381
2021-09-27 11:08:43 +03:00
Avi Kivity
d7ac699a55 Revert "Merge "compaction: Update backlog tracker correctly when schema is updated" from Raphael"
This reverts commit b5cf0b4489, reversing
changes made to e8493e20cb. It causes
segmentation faults when sstable readers are closed.

Fixes #9388.
2021-09-26 18:31:49 +03:00
Avi Kivity
bf94c06fc7 Revert "Merge "simplifications and layer violation fix for compaction manager" from Raphael"
This reverts commit 7127c92acc, reversing
changes made to 88480ac504. We need to
revert b5cf0b4489 to fix #9388, and this stands
in the way.

Ref #9388.
2021-09-26 18:30:36 +03:00
Avi Kivity
936de92876 Merge 'cql3: Add evaluate(expression) and use instead of term::bind()' from Jan Ciołek
This PR adds the function:
```c++
constant evaluate(const expression&, const query_options&);
```
which evaluates the given expression to a constant value.
It binds all the bound values, calls functions, and reduces the whole expression to just raw bytes and `data_type`, just like `bind()` and `get()` did for `term`.

The code is often similar to the original `bind()` implementation in `lists.cc`, `sets.cc`, etc.

* For some reason in the original code, when a collection contains `unset_value`, then the whole collection is evaluated to `unset_value`. I'm not sure why this is the case, considering it's impossible to have `unset_value` inside a collection, because we forbid bind markers inside collections. For example here: cc8fc73761/cql3/lists.cc (L134)
This seems to have been introduced by Pekka Enberg in 50ec81ee67, but he has left the company.
I didn't change the behaviour, maybe there is a reason behind it, although maybe it would be better to just throw `invalid_request_exception`.
* There was a strange limitation on map key size, which seems incorrect: cc8fc73761/cql3/maps.cc (L150), but I left it in.
* When evaluating a `user_type` value, the old code tolerated `unset_value` in a field, but it was later converted to NULL. This means that `unset_value` doesn't work inside a `user_type`, I didn't change it, will do in another PR.
* We can't fully get rid of `bind()` yet, because it's used in `prepare_term` to return a `terminal`. It will be removed in the next PR, where we finally get rid of `term`.

Closes #9353

* github.com:scylladb/scylla:
  cql3: types: Optimize abstract_type::contains_collection
  cql3: expr: Convert evaluate_IN_list to use evaluate(expression)
  cql3: expr: Use only evaluate(expression) to evaluate term
  cql3: expr: Implement evaluate(expr::function_call)
  cql3: expr: Implement evaluate(expr::usertype_constructor)
  cql3: expr: Implement evaluate(expr::collection_constructor)
  cql3: expr: Implement evaluate(expr::tuple_constructor)
  cql3: expr: Implement evaluate(expr::bind_variable)
  cql3: Add contains_collection/set_or_map to abstract_type
  cql3: expr: Add evaluate(expression, query_options)
  cql3: Implement term::to_expression for function_call
  cql3: Implement term::to_expression for user_type
  cql3: Implement term::to_expression for collections
  cql3: Implement term::to_expression for tuples
  cql3: Implement term::to_expression for marker classes
  cql3: expr: Add data_type to *_constructor structs
  cql3: Add term::to_expression method
  cql3: Reorganize term and expression includes
2021-09-26 12:58:11 +03:00
Pavel Emelyanov
88e5b7c547 database: Shutdown in tests
There's a circular dependency:

  query processor needs database
  database owns large_data_handler and compaction_manager
  those two need qctx
  qctx owns a query_processor

Accordingly, the latter hidden dependency is not "tracked" by
constructor arguments -- the query processor is started after
the database and is deferred to be stopped before it. This works
in scylla, because the query processor doesn't really stop there,
but in cql_test_env it's problematic, as it stops everything,
including the qctx.

Recent database start-stop sanitation revealed this problem --
on database stop either l.d.h. or the compaction manager tries to
start (or continue) messing with the query processor. One problem
was hit immediately and plugged with the 75e1d7ea safety check
inside l.d.h., but cql_test_env tests still suffer
from use-after-free on the stopped query processor.

The fix is to partially revert 4b7846da by making the tests
stop some pieces of the database (including l.d.h. and the compaction
manager) as they used to before. In scylla this is probably not
needed, at least for now -- the database shutdown code was and still
is run right before the stopping one.

tests: unit(debug)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Message-Id: <20210924080248.11764-1-xemul@scylladb.com>
2021-09-26 11:09:01 +03:00
Kamil Braun
bf823e34a4 raft: disable sticky leadership rule
The Raft PhD presents the following scenario.
When we remove a server from the cluster configuration, it does not
receive the configuration entry which removes it (because the leader
appending this entry uses that entry's configuration to decide to which
servers to send the entry to, and the entry does not contain the removed
server). Therefore the server keeps believing it is a member but does
not receive heartbeats from leaders in the new configuration. Therefore
it will keep becoming a candidate, causing existing leaders to step
down, harming availability. With many such candidates the cluster may
even stop being able to proceed at all. We call such servers
"disruptive".

More concretely, consider the following example, adapted from the PhD for
joint configuration changes (the original PhD considered a different
algorithm which can only add/remove one server at once):
Let C_old = {A, B, C, D}, C_new = {B, C, D}, and C_joint be the joint
configuration (C_old, C_new). D is the leader. D managed to append
C_joint to every server and commit it. D appends C_new. At this point, D
stops sending heartbeats to A because C_new does not contain A, but A's
last entry is still C_joint, so it still has the ability to become a
candidate. A can now become a candidate and cause D, or any other leader
in C_new, to step down. Even if D manages to commit C_new, A can keep
disrupting the cluster until it is shut down.

Prevoting changes the situation, which the authors admit. The "even if"
above no longer applies: if D manages to commit C_new, or just append it
to a majority of C_new, then A won't be able to succeed in the prevote
phase because a majority of servers in C_new has a longer log than A
(and A must obtain a prevote from a majority of servers in C_new because
A is in C_joint which contains C_new). But the authors continue to argue
that disruptions can still occur during the small period where C_new is
only appended on D but not yet on a majority of C_new. As they say:
"we also did not want to assume that a leader will reliably replicate
entries fast enough to move past the scenario (...) quickly; that might
have worked in practice, but it depends on stronger assumptions that we
prefer to avoid about the performance (...) of replicating log entries".
One could probably try debunking this by saying that if entries take
longer to replicate than the election timeout we're in much bigger
trouble, but nevermind.

In any case, the authors propose a solution which we call "sticky
leadership". A server will not grant a vote to a candidate if it has
recently received a heartbeat from the currently known leader, even if
the candidate's term is higher. In the above example, servers in C_new
would not grant votes to A as long as D keeps sending them heartbeats,
thus A is no longer disruptive.
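
For reference, a minimal sketch of the rule being disabled (names
assumed, not taken from Scylla's raft code):

```c++
#include <chrono>

using raft_clock = std::chrono::steady_clock;

struct follower_state {
    raft_clock::time_point last_leader_contact;
    raft_clock::duration election_timeout;
};

// The "sticky leadership" check: refuse the vote if a leader was heard
// from within the election timeout, even for a higher-term candidate.
// With a shared failure detector this can wedge elections (see the
// C1/C2/C3 scenario below), which is why the rule is removed.
bool may_grant_vote(const follower_state& f, raft_clock::time_point now) {
    return now - f.last_leader_contact >= f.election_timeout;
}
```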

In our case the situation is a bit
different: in original Raft, "heartbeats" have a very specific meaning
- they are append_entries requests (possibly empty) sent by leaders.
Thus if a node stops being a leader it stops sending heartbeats;
similarly, if a node leaves the configuration, it stops receiving
heartbeats from others still in the configuration. We instead use a
"shared failure detector" interface, where nodes may still consider
other nodes alive regardless of their configuration/leadership
situation, as part of the general "MultiRaft" framework.

This pretty much invalidates the original argument, as seen in
the above example: A will still consider D alive, thus it won't become
a candidate.

Shared failure detector combined with sticky leadership actually makes
the situation worse - it may cause cluster unavailability in certain
scenarios (fortunately not a permanent one; it can be resolved with server
restarts, for example). Randomized nemesis testing with reconfigurations
found the following scenario:
Let C1 = {A, B, C}, C2 = {A}, C3 = {B, C}. We start from configuration
C1, B is the leader. B commits joint (C1, C2), then new C2
configuration. Note that C does not learn about the last entry
(since it's not part of C2) but it keeps believing that B is alive,
so it keeps believing that B is the leader.
We then partition {A} from {B, C}. A appends (C2, C3) joint
configuration to its log. It's not able to append it to B or C due to
the partition. The partition holds long enough for A to revert to
candidate state (or we may restart A at this point). Eventually the
partition resolves. The only node which can become a candidate now is A:
C does not become a candidate because it keeps believing that B is the
leader, and B does not become a candidate because it saw the C2
non-joint entry being committed. However, A won't become a leader
because C won't grant it a vote due to the sticky leadership rule.
The cluster will remain unavailable until e.g. C is restarted.

Note that this scenario requires allowing configuration changes which
remove and then readd the same servers to the configuration. One may
wonder if such reconfigurations should be allowed, but there doesn't
seem to be any example of them breaking safety of Raft (and the PhD
doesn't seem to mention them at all; perhaps it implicitly accepts
them). It is unknown whether a similar scenario may be produced without
such reconfigurations.

In any case, disabling sticky leadership resolves the problem, and it is
the last currently known availability problem found in randomized
nemesis testing. There is no reason to keep this extension, both because
the original Raft authors' argument does not apply for shared failure
detector, and because one may even argue with the authors in vanilla
Raft given that prevoting is enabled (see end of third paragraph of this
commit message).
Message-Id: <20210921153741.65084-1-kbraun@scylladb.com>
2021-09-26 11:09:01 +03:00
Jan Ciolek
5589f348e7 cql3: expr: Implement evaluate(expr::bind_variable)
Implement evaluating a bind_variable.
To be able to evaluate a bind_variable we need to know the type of the bound value.
This is why a data_type has been added to the bind_variable struct.

There are some quirks when evaluating a bind_variable.
The first problem occurs when the variable has been sent with an older cql serialization format and contains collections.
In that case the value has to be reserialized to use the newest cql serialization format.

The second problem occurs when there is a set or a map in the value.
The set value sent by the driver might not have its elements in the correct order, might contain duplicates, etc.
When a set or map is detected in the value, it is reserialized as well.

collection_type_impl::reserialize doesn't work for this purpose, because it uses data_value, which does not perform sorting or duplicate removal.
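
An illustrative, self-contained sketch of that reserialization step (raw
bytes modeled as std::string; in the real code the comparator and
equality come from the element type):

```c++
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

using bytes = std::string; // stand-in for serialized element bytes

// A driver may send set elements unsorted and with duplicates; storage
// expects the canonical form, so reserialize: sort by the element
// type's comparator, then remove adjacent duplicates.
std::vector<bytes> canonicalize_set_elements(
        std::vector<bytes> elements,
        const std::function<bool(const bytes&, const bytes&)>& type_less) {
    std::sort(elements.begin(), elements.end(), type_less);
    elements.erase(std::unique(elements.begin(), elements.end()),
                   elements.end());
    return elements;
}
```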

New code corresponds to old bind() of lists::marker in cql3/lists.cc, sets::marker etc.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-24 11:05:53 +02:00
Jan Ciolek
499c9235fc cql3: expr: Add data_type to *_constructor structs
It is useful to have a data_type in *_constructor structs when evaluating.
The resulting constant has a data_type, so we have to find it somehow.

For tuple_constructor we don't have to create a separate tuple_type_impl instance.
For collection_constructor we know what the type is even in case of an empty collection.
For usertype_constructor we know the name, type and order of fields in the user type.

Additionally without a data_type we wouldn't know whether the type is reversed or not.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2021-09-24 11:05:53 +02:00
Tomasz Grabiec
f582bfd453 Merge "test: raft: randomized_nemesis_test: generator test with linearizability checking" from Kamil
The AppendReg state machine stores a sequence of integers. It supports
`append` inputs which append a single integer to the sequence and return
the previous state (before appending).

The implementation uses the `append_seq` data structure
representing an immutable sequence that uses a vector underneath
which may be shared by multiple instances of `append_seq`.
Appending to the sequence appends to the underlying vector,
but there is no observable effect on the other instances since
they use only the prefix of the sequence that wasn't changed.
If two instances sharing the same vector try to append,
the later one must perform a copy.

This allows efficient appends if only one instance is appending, which
is useful in the following context:
- a Raft server stores a copy in the underlying state machine replica
  and appends to it,
- clients send append operations to the server; the server returns the
  state of the sequence before it was appended to,
- thanks to the sharing, we don't need to copy all elements when
  returning the sequence to the client, and only one instance (the
  server) is appending to the shared vector,
- summarizing, all operations have amortized O(1) complexity.
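
A minimal self-contained sketch of `append_seq` under the description
above, simplified to int elements:

```c++
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

class append_seq {
    std::shared_ptr<std::vector<int>> _vec;
    size_t _end; // this instance sees only the first _end elements
public:
    append_seq()
        : _vec(std::make_shared<std::vector<int>>()), _end(0) {}
    append_seq(std::shared_ptr<std::vector<int>> v, size_t end)
        : _vec(std::move(v)), _end(end) {}

    // Value-semantics append: returns the extended sequence, leaving
    // this instance observably unchanged.
    append_seq append(int x) const {
        if (_vec->size() == _end) {
            // Our prefix is the whole vector: extend it in place,
            // giving amortized O(1) appends for the single writer.
            _vec->push_back(x);
            return append_seq(_vec, _end + 1);
        }
        // Another instance already appended past our prefix: copy it.
        auto v = std::make_shared<std::vector<int>>(
                _vec->begin(), _vec->begin() + _end);
        v->push_back(x);
        return append_seq(std::move(v), _end + 1);
    }

    size_t size() const { return _end; }
};
```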

We use AppendReg instead of ExReg in `basic_generator_test`
with a generator which generates a sequence of append operations with
unique integers.

This implies that the result of every operation uniquely identifies the
operation (since it contains the appended integer, and different
operations use different integers) and all operations that must have
happened before it (since it contains the previous state of the append
register), which allows us to reconstruct the "current state" of the
register according to the results of operations coming from Raft calls,
giving us an on-line serializability checker with O(1) amortized
complexity on each operation completion.
We also enforce linearizability by checking that every
completed operation was previously invoked.

We also perform a simple liveness check at the end of the test by
ensuring that a leader becomes eventually elected and that we can
successfully execute a call.

* kbr/linearizability-v2:
  test: raft: randomized_nemesis_test: check consistency and liveness in basic_generator_test
  test: raft: randomized_nemesis_test: introduce append register
2021-09-23 23:55:13 +02:00
Avi Kivity
7127c92acc Merge "simplifications and layer violation fix for compaction manager" from Raphael
"This series removes layer violation in compaction, and also
simplifies compaction manager and how it interacts with compaction
procedure."

* 'compaction_manager_layer_violation_fix/v3' of github.com:raphaelsc/scylla:
  compaction: split compaction info and data for control
  compaction_manager: use task when stopping a given compaction type
  compaction: remove start_size and end_size from compaction_info
  compaction_manager: introduce helpers for task
  compaction_manager: introduce explicit ctor for task
  compaction: kill sstables field in compaction_info
  compaction: kill table pointer in compaction_info
  compaction: simplify procedure to stop ongoing compactions
  compaction: move management of compaction_info to compaction_manager
  compaction: move output run id from compaction_info into task
2021-09-23 17:29:19 +03:00
Raphael S. Carvalho
5bf51ced14 compaction: split compaction info and data for control
compaction_info must contain only information to be exported to the
outside world, whereas compaction_data will contain data for
controlling compaction behavior and stats which change as
compaction progresses.
This separation makes the interface clearer, and also allows for
future improvements like removing direct references to the table
in compaction.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-23 10:56:18 -03:00
Raphael S. Carvalho
6820fbf460 compaction_manager: introduce explicit ctor for task
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-23 10:38:36 -03:00
Raphael S. Carvalho
b6b4042faf compaction: kill table pointer in compaction_info
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-23 10:38:11 -03:00
Raphael S. Carvalho
98f8673d4e compaction: simplify procedure to stop ongoing compactions
Today, compactions are tracked by both _compactions and _tasks,
where _compactions refers to the actual ongoing compactions,
whereas _tasks refers to manager tasks, which are responsible for
spawning new compactions, retrying them on failure, etc.
As each task can only have one ongoing compaction at a time,
let's move the compaction into the task, so that the manager won't
have to look at both when deciding to do something like stopping a
task.

So stopping a task becomes simpler, and the duplication is naturally
gone.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-23 10:25:51 -03:00
Raphael S. Carvalho
0885376a85 compaction: move management of compaction_info to compaction_manager
Today, compaction calls the compaction manager to register / deregister
the compaction_info it creates.

This is a layer violation because the manager sits one layer above
compaction, so the manager should be responsible for managing the
compaction info.

From now on, compaction_info will be created and managed by
compaction_manager. compaction will only have a reference to info,
which it can use to update the world about compaction progress.

This will allow compaction_manager to be simplified as info can be
coupled with its respective task, allowing duplication to be removed
and layer violation to be fixed.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-23 10:00:49 -03:00
Raphael S. Carvalho
7688d0432c compaction: move output run id from compaction_info into task
This run id is used to track partial runs that are being written to.
Let's move it from info into task, as this is not external info,
but rather state that belongs to compaction_manager.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2021-09-23 09:56:01 -03:00
Piotr Sarna
88480ac504 cql-pytest: relax another condition for a failed wasm execution
The previous commit already relaxed the condition for test_fib,
but the same should be done for test_fib_called_on_null
for an identical reason - more than one error can be expected
when calling a heavily recursive function, and either
fuel exhaustion, hitting the stack limit, or any other
InvalidRequest exception should be accepted.

Closes #9363
2021-09-23 14:11:02 +03:00
Piotr Sarna
62948b7404 Merge 'cql3: Add expr::constant to replace terminal' from Jan Ciołek
Add new struct to the `expression` variant:
```c++
// A value serialized with the internal (latest) cql_serialization_format
struct constant {
    cql3::raw_value value;
    data_type type; // Never nullptr, for NULL and UNSET might be empty_type
};
```
and use it where possible instead of `terminal`.

This struct will eventually replace all classes deriving from
`terminal`, but for now `terminal` can't be removed completely.

We can't get rid of terminal yet, because sometimes `terminal` is
converted back to `term`, which `constant` can't do. This won't be a
problem once we replace term with expression.

`bool` is removed from `expression`; now `constant` is used instead.

This is a redesign of PR #9203, there is some discussion about the
chosen representation there.

Closes #9371

* github.com:scylladb/scylla:
  cql3: term: Remove get_elements and multi_item_terminal from terminals
  cql3: Replace most uses of terminal with expr::constant
  cql3: expr: Remove repetition from expr::get_elements
  cql3: expr: Add expr::get_elements(constant)
  cql3: term: remove term::bind_and_get
  cql3: Replace all uses of bind_and_get with evaluate_to_raw_view
  cql3: expr: Add evaluate_IN_list
  cql3: tuples: Implement tuples::in_value::get
  cql3: Move data_type to terminal, make get_value_type non-virtual
  cql3: user_types: Implement get_value_type in user_types.hh
  cql3: tuples: Implement get_value_type in tuples.hh
  cql3: maps: Implement get_value_type in maps.hh
  cql3: sets: Implement get_value_type in sets.hh
  cql3: lists: Implement get_value_type in lists.hh
  cql3: constants: Implement get_value_type in constants.hh
  cql3: expr: Add expr::evaluate
  cql3: Make collection term get() use the internal serialization format
  cql3: values: Add unset value to raw_value_view::make_temporary
  cql3: expr: Add constant to expression
2021-09-23 13:02:29 +02:00
Avi Kivity
369afe3124 treewide: use coroutine::maybe_yield() instead of co_await make_ready_future()
The dedicated API shows the intent, and may be a tiny bit faster.
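
A sketch of the replacement pattern; seastar::coroutine::maybe_yield()
comes from seastar/coroutine/maybe_yield.hh:

```c++
#include <vector>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/maybe_yield.hh>

seastar::future<long> sum_all(std::vector<long> items) {
    long sum = 0;
    for (auto item : items) {
        sum += item; // stand-in for the real per-item work
        // Preempts only if the task quota is exhausted, instead of
        // unconditionally suspending via co_await make_ready_future<>().
        co_await seastar::coroutine::maybe_yield();
    }
    co_return sum;
}
```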

Closes #9382
2021-09-23 12:28:56 +02:00
Avi Kivity
6702711d9c Merge "Gossiper start-stop sanitation (+ bonus track)" from Pavel E
"
The main challenge here is to move the messaging_service.start_listen()
call out of gossiper and into main. Other changes are pretty minor
compared to that and include

- patch gossiper API towards a standard start-shutdown-stop form
- gossiping "sharder info" in initial state
- configure cluster name and seeds via gossip_config

tests: unit(dev)
       dtest.bootstrap_test.start_stop_test_node(dev)
       manual(dev): start+stop, nodetool enable-/disablegossip

refs: #2737
refs: #2795
refs: #5489

"

* 'br-gossiper-dont-start-messaging-listen-2' of https://github.com/xemul/scylla:
  code: Expel gossiper.hh from other headers
  storage_service: Gossip "sharder" in initial states
  gossiper: Relax set_seeds()
  gossiper, main: Turn init_gossiper into get_seeds_from_config
  storage_service: Eliminate the do-bind argument from everywhere
  gossiper: Drop ms-registered manipulations
  messaging, main, gossiper: Move listening start into main
  gossiper: Do handlers reg/unreg from start/stop
  gossiper: Split (un)init_messaging_handler()
  gossiper: Relocate stop_gossiping() into .stop()
  gossiper: Introduce .shutdown() and use where appropriate
  gossiper: Set cluster_name via gossip_config
  gossiper, main: Straighten start/stop
  tests/cql_test_env: Open-code tst_init_ms_fd_gossiper
  tests/cql_test_env: De-global most of gossiper
  gossiper: Merge start_gossiping() overloads into one
  gossiper: Use is_... helpers
  gossiper: Fix do_shadow_round comment
  gossiper: Dispose dead code
2021-09-23 12:18:38 +03:00
Kamil Braun
ea172fe531 test: raft: randomized_nemesis_test: check consistency and liveness in basic_generator_test
Use AppendReg instead of ExReg for the state machine.
Use a generator which generates a sequence of append operations with
unique integers.

This implies that the result of every operation uniquely identifies the
operation (since it contains the appended integer, and different
operations use different integers) and all operations that must have
happened before it (since it contains the previous state of the append
register), which allows us to reconstruct the "current state" of the
register according to the results of operations coming from Raft calls,
giving us an on-line linearizability checker with O(1) amortized
complexity on each operation completion.

We also perform a simple liveness check at the end of the test by
ensuring that a leader becomes eventually elected and that we can
successfully execute a call.
2021-09-22 17:56:23 +02:00
Nadav Har'El
92570ea7d9 cql-pytest: add tests on behavior of empty-string keys
We know (verified by existing tests) that null keys are not allowed -
neither as partition keys nor clustering keys.
In issue #9352 a question was raised of whether an *empty string* is
allowed as a key on a base table (not a materialized view or index).
The following tests confirm that the current situation is as follows:

1. An empty string is perfectly legal as a clustering key.
2. An empty string is NOT ALLOWED as a partition key - the error
   "Key may not be empty" is reported if this is attempted.
3. If the partition key is compound (multiple partition-key columns)
   then any or all of them may be empty strings.

These tests pass the same on both Cassandra and Scylla, showing that
this bizarre (and undocumented) behavior is identical in both.

Refs #9352.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210922131310.293846-1-nyh@scylladb.com>
2021-09-22 18:55:25 +03:00
Avi Kivity
083279d9ab Merge "Generalize sstable creation for tests" from Pavel E
"
There's a whole lot of places that create an sstable for tests
like this

    auto sst = env.make_sstable(...);
    sst->write_components(...);
    sst->load();

Some of them are already generalized with the make_sstable_easy
helper, but several open-coded instances remain.

Found while hunting down the places that use default IO sched
class behind the scenes.

tests: unit(dev)
"

* 'br-sst-tests-make-sstable-easy' of https://github.com/xemul/scylla:
  test: Generalize make_sstable() and make_sstable_easy()
  test: Use now existing helpers elsewhere
  test: Generalize all make_sstable_easy()-s
  test: Set test change estimation to 1
  test: Generalize make_sstable_easy in mutation tests
  test: Generalize make_sstable_easy in set tests
  test: Reuse make_sstable_easy in datafile tests
  test: Relax make_sstable_easy in compaction tests
2021-09-22 18:55:25 +03:00
Nadav Har'El
a99a774731 cql-pytest: test for secondary-index on empty-string value
When a string column is indexed with a secondary index, the empty value
for this column (an empty string '') is perfectly legal, and should be
indexed as well. This is not the same as an unset (null) value which
isn't indexed.

The following test demonstrates that this case works in Cassandra, but
does not in Scylla (so the test is marked "xfail"). In Scylla, a query
that returns the expected results with ALLOW FILTERING suddenly returns
a different (and wrong) result when an index is added on the table.

This test reproduces issue #9364.

Refs #9364.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210922121510.291826-1-nyh@scylladb.com>
2021-09-22 18:55:25 +03:00
Avi Kivity
b5cf0b4489 Merge "compaction: Update backlog tracker correctly when schema is updated" from Raphael
"
Backlog tracker isn't updated correctly when facing a schema change, and
may leak an SSTable if the compaction strategy is changed, which causes
the backlog to be computed incorrectly. Most of these problems happen
because the sstable set and the tracker are updated independently, so it
could happen that the tracker loses track (pun intended) of changes
applied to the set.

The first patch will fix the leak when the strategy is changed, and the
third patch will make sure that the tracker is updated atomically with
the sstable set, so these kinds of problems will not happen anymore.

Fixes #9157

test: mode(debug)
"

* 'fixes_to_backlog_tracker_v3' of https://github.com/raphaelsc/scylla:
  compaction: Update backlog tracker correctly when schema is updated
  compaction: Don't leak backlog of input sstable when compaction strategy is changed
  compaction: introduce compaction_read_monitor_generator::remove_exhausted_sstables()
  compaction: simplify removal of monitors
2021-09-22 18:55:25 +03:00
Nadav Har'El
e8493e20cb cql-pytest: test for empty-string as partition key in materialized view
Scylla and Cassandra do not allow an empty string as a partition key,
but a materialized view might "convert" a regular string column into a
partition key, and an empty string is a perfectly valid value for this
column. This can result in a view row which has an empty string as a
partition key. This case works in Cassandra, but doesn't in Scylla (the
row with the empty string as a partition key doesn't appear). The
following test demonstrates this difference between Scylla and Cassandra
(it passes on Cassandra, fails on Scylla, and accordingly marked
"xfail").

Refs #9375.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Message-Id: <20210922115000.290387-1-nyh@scylladb.com>
2021-09-22 18:55:25 +03:00
Botond Dénes
3f4f408bcf schema: add get_reversed()
A variant of make_reversed() which goes through the schema registry,
teaching the schema to the registry if necessary. This effectively
caches the result of the reversing and, as an added bonus, double
reversing yields the very same schema C++ object that was the starting
point.
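
A hypothetical miniature of the memoization (the real implementation
goes through the schema registry; a weak pointer here just models the
cache):

```c++
#include <memory>

struct schema;
using schema_ptr = std::shared_ptr<schema>;

struct schema {
    // ... column definitions elided ...
    std::weak_ptr<schema> reversed; // memoized reversed variant
};

schema_ptr get_reversed(const schema_ptr& s) {
    if (auto cached = s->reversed.lock()) {
        return cached; // reversing twice hands back the original object
    }
    auto rev = std::make_shared<schema>(); // reverse clustering order here
    rev->reversed = s;
    s->reversed = rev;
    return rev;
}
```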

Closes #9365
2021-09-22 18:55:25 +03:00
Kamil Braun
81b7ed23bb test: raft: randomized_nemesis_test: introduce append register
The AppendReg state machine stores a sequence of integers. It supports
`append` inputs which append a single integer to the sequence and return
the previous state (before appending).

The implementation uses the `append_seq` data structure
representing an immutable sequence that uses a vector underneath
which may be shared by multiple instances of `append_seq`.
Appending to the sequence appends to the underlying vector,
but there is no observable effect on the other instances since
they use only the prefix of the sequence that wasn't changed.
If two instances sharing the same vector try to append,
the later one must perform a copy.

This allows efficient appends if only one instance is appending, which
is useful in the following context:
- a Raft server stores a copy in the underlying state machine replica
  and appends to it,
- clients send append operations to the server; the server returns the
  state of the sequence before it was appended to,
- thanks to the sharing, we don't need to copy all elements when
  returning the sequence to the client, and only one instance (the
  server) is appending to the shared vector,
- summarizing, all operations have amortized O(1) complexity.
2021-09-22 17:54:07 +02:00