scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-06-05 22:43:15 +00:00

Author	SHA1	Message	Date
Benny Halevy	798aee6747	dht: boot_strapper: coroutinize bootstrap Prepare for futurizing get_pending_address_ranges. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210923144206.1690576-1-bhalevy@scylladb.com>	2021-09-26 11:09:01 +03:00
Kamil Braun	bf823e34a4	raft: disable sticky leadership rule The Raft PhD presents the following scenario. When we remove a server from the cluster configuration, it does not receive the configuration entry which removes it (because the leader appending this entry uses that entry's configuration to decide to which servers to send the entry to, and the entry does not contain the removed server). Therefore the server keeps believing it is a member but does not receive heartbeats from leaders in the new configuration. Therefore it will keep becoming a candidate, causing existing leaders to step down, harming availability. With many such candidates the cluster may even stop being able to proceed at all. We call such servers "disruptive". More concretely, consider the following example, adapted from the PhD for joint configuration changes (the original PhD considered a different algorithm which can only add/remove one server at once): Let C_old = {A, B, C, D}, C_new = {B, C, D}, and C_joint be the joint configuration (C_old, C_new). D is the leader. D managed to append C_joint to every server and commit it. D appends C_new. At this point, D stops sending heartbeats to A because C_new does not contain A, but A's last entry is still C_joint, so it still has the ability to become a candidate. A can now become a candidate and cause D, or any other leader in C_new, to step down. Even if D manages to commit C_new, A can keep disrupting the cluster until it is shut down. Prevoting changes the situation, which the authors admit. The "even if" above no longer applies: if D manages to commit C_new, or just append it to a majority of C_new, then A won't be able to succeed in the prevote phase because a majority of servers in C_new has a longer log than A (and A must obtain a prevote from a majority of servers in C_new because A is in C_joint which contains C_new). 
But the authors continue to argue that disruptions can still occur during the small period where C_new is only appended on D but not yet on a majority of C_new. As they say: "we also did not want to assume that a leader will reliably replicate entries fast enough to move past the scenario (...) quickly; that might have worked in practice, but it depends on stronger assumptions that we prefer to avoid about the performance (...) of replicating log entries". One could probably try debunking this by saying that if entries take longer to replicate than the election timeout we're in much bigger trouble, but nevermind. In any case, the authors propose a solution which we call "sticky leadership". A server will not grant a vote to a candidate if it has recently received a heartbeat from the currently known leader, even if the candidate's term is higher. In the above example, servers in C_new would not grant votes to A as long as D keeps sending them heartbeats, thus A is no longer disruptive. In our case the situation is a bit different: in original Raft, "heartbeats" have a very specific meaning - they are append_entries requests (possibly empty) sent by leaders. Thus if a node stops being a leader it stops sending heartbeats; similarly, if a node leaves the configuration, it stops receiving heartbeats from others still in the configuration. We instead use a "shared failure detector" interface, where nodes may still consider other nodes alive regardless of their configuration/leadership situation, as part of the general "MultiRaft" framework. This pretty much invalidates the original argument, as seen on the above example: A will still consider D alive, thus it won't become a candidate. Shared failure detector combined with sticky leadership actually makes the situation worse - it may cause cluster unavailability in certain scenarios (fortunately not a permanent one, it can be solved with server restarts, for example). 
Randomized nemesis testing with reconfigurations found the following scenario: Let C1 = {A, B, C}, C2 = {A}, C3 = {B, C}. We start from configuration C1, B is the leader. B commits joint (C1, C2), then new C2 configuration. Note that C does not learn about the last entry (since it's not part of C2) but it keeps believing that B is alive, so it keeps believing that B is the leader. We then partition {A} from {B, C}. A appends (C2, C3) joint configuration to its log. It's not able to append it to B or C due to the partition. The partition holds long enough for A to revert to candidate state (or we may restart A at this point). Eventually the partition resolves. The only node which can become a candidate now is A: C does not become a candidate because it keeps believing that B is the leader, and B does not become a candidate because it saw the C2 non-joint entry being committed. However, A won't become a leader because C won't grant it a vote due to the sticky leadership rule. The cluster will remain unavailable until e.g. C is restarted. Note that this scenario requires allowing configuration changes which remove and then re-add the same servers to the configuration. One may wonder if such reconfigurations should be allowed, but there doesn't seem to be any example of them breaking safety of Raft (and the PhD doesn't seem to mention them at all; perhaps it implicitly accepts them). It is unknown whether a similar scenario may be produced without such reconfigurations. In any case, disabling sticky leadership resolves the problem, and it is the last currently known availability problem found in randomized nemesis testing. There is no reason to keep this extension, both because the original Raft authors' argument does not apply for shared failure detector, and because one may even argue with the authors in vanilla Raft given that prevoting is enabled (see end of third paragraph of this commit message). 
Message-Id: <20210921153741.65084-1-kbraun@scylladb.com>	2021-09-26 11:09:01 +03:00
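The effect of the sticky leadership rule on vote granting, as described in the commit message above, can be sketched roughly as follows. This is a simplified illustration and not the code from the raft module: the types `vote_request` and `follower_state`, the function `grant_vote`, and the `sticky_leadership` flag are all hypothetical names, and bookkeeping such as `voted_for` is deliberately left out.

```cpp
#include <cstdint>

// Hypothetical, simplified vote request sent by a candidate.
struct vote_request {
    uint64_t term;            // candidate's term
    uint64_t last_log_term;   // term of the candidate's last log entry
    uint64_t last_log_index;  // index of the candidate's last log entry
};

// Hypothetical, simplified state of the server asked for a vote.
struct follower_state {
    uint64_t current_term;
    uint64_t last_log_term;
    uint64_t last_log_index;
    bool leader_seen_alive;   // the failure detector reports the known leader as alive
};

// Standard Raft rule: the candidate's log must be at least as up to date as ours.
inline bool log_up_to_date(const follower_state& f, const vote_request& r) {
    if (r.last_log_term != f.last_log_term) {
        return r.last_log_term > f.last_log_term;
    }
    return r.last_log_index >= f.last_log_index;
}

// With sticky leadership enabled, a server refuses to vote while it believes
// its leader is alive, even if the candidate's term is higher.
inline bool grant_vote(const follower_state& f, const vote_request& r, bool sticky_leadership) {
    if (r.term < f.current_term) {
        return false;
    }
    if (sticky_leadership && f.leader_seen_alive) {
        return false;
    }
    return log_up_to_date(f, r);
}
```

In the nemesis scenario, C plays the role of `follower_state` with `leader_seen_alive == true` (the shared failure detector still reports B as alive), so with the rule enabled A's request is refused regardless of its term or log; with the rule disabled the decision falls back to the ordinary term and log comparison.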
Tomasz Grabiec	f582bfd453	Merge "test: raft: randomized_nemesis_test: generator test with linearizability checking" from Kamil The AppendReg state machine stores a sequence of integers. It supports `append` inputs which append a single integer to the sequence and return the previous state (before appending). The implementation uses the `append_seq` data structure representing an immutable sequence that uses a vector underneath which may be shared by multiple instances of `append_seq`. Appending to the sequence appends to the underlying vector, but there is no observable effect on the other instances since they use only the prefix of the sequence that wasn't changed. If two instances sharing the same vector try to append, the later one must perform a copy. This allows efficient appends if only one instance is appending, which is useful in the following context: - a Raft server stores a copy in the underlying state machine replica and appends to it, - clients send append operations to the server; the server returns the state of the sequence before it was appended to, - thanks to the sharing, we don't need to copy all elements when returning the sequence to the client, and only one instance (the server) is appending to the shared vector, - summarizing, all operations have amortized O(1) complexity. We use AppendReg instead of ExReg in `basic_generator_test` with a generator which generates a sequence of append operations with unique integers. This implies that the result of every operation uniquely identifies the operation (since it contains the appended integer, and different operations use different integers) and all operations that must have happened before it (since it contains the previous state of the append register), which allows us to reconstruct the "current state" of the register according to the results of operations coming from Raft calls, giving us an on-line serializability checker with O(1) amortized complexity on each operation completion. 
We also enforce linearizability by checking that every completed operation was previously invoked. We also perform a simple liveness check at the end of the test by ensuring that a leader becomes eventually elected and that we can successfully execute a call. * kbr/linearizability-v2: test: raft: randomized_nemesis_test: check consistency and liveness in basic_generator_test test: raft: randomized_nemesis_test: introduce append register	2021-09-23 23:55:13 +02:00
Benny Halevy	7e9ca101ae	storage_service: fixup indentation Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210923093200.1559734-31-bhalevy@scylladb.com>	2021-09-23 17:36:43 +03:00
Benny Halevy	ecbe9f1ef6	storage_service: coroutinize rebuild Prepare for futurizing get_ranges_for_endpoint. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210923093200.1559734-30-bhalevy@scylladb.com>	2021-09-23 17:36:42 +03:00
Benny Halevy	c8b12afe1b	storage_service: effective_ownership: fixup indentation Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210923093200.1559734-29-bhalevy@scylladb.com>	2021-09-23 17:35:32 +03:00
Benny Halevy	add78a8cc0	storage_service: coroutinize effective_ownership Prepare for futurizing get_ranges_for_endpoint. Dtest: nodetool_additional_test:TestNodetool.status_test Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210923093200.1559734-28-bhalevy@scylladb.com>	2021-09-23 17:34:56 +03:00
Avi Kivity	7127c92acc	Merge "simplifications and layer violation fix for compaction manager" from Raphael "This series removes layer violation in compaction, and also simplifies compaction manager and how it interacts with compaction procedure." * 'compaction_manager_layer_violation_fix/v3' of github.com:raphaelsc/scylla: compaction: split compaction info and data for control compaction_manager: use task when stopping a given compaction type compaction: remove start_size and end_size from compaction_info compaction_manager: introduce helpers for task compaction_manager: introduce explicit ctor for task compaction: kill sstables field in compaction_info compaction: kill table pointer in compaction_info compaction: simplify procedure to stop ongoing compactions compaction: move management of compaction_info to compaction_manager compaction: move output run id from compaction_info into task	2021-09-23 17:29:19 +03:00
Raphael S. Carvalho	5bf51ced14	compaction: split compaction info and data for control compaction_info must only contain info data to be exported to the outside world, whereas compaction_data will contain data for controlling compaction behavior and stats which change as compaction progresses. This separation makes the interface clearer, also allowing for future improvements like removing direct references to table in compaction. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 10:56:18 -03:00
Raphael S. Carvalho	6e7729fa21	compaction_manager: use task when stopping a given compaction type compaction_info will eventually only be used for exporting data about ongoing compactions, so task must be used instead. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 10:53:53 -03:00
Raphael S. Carvalho	6d1170ac94	compaction: remove start_size and end_size from compaction_info those stats aren't used in compaction stats API and therefore they can be removed. end_size is added to compaction_result (needed for updating history) and start_size can be calculated in advance. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 10:41:13 -03:00
Raphael S. Carvalho	2353f40f63	compaction_manager: introduce helpers for task Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 10:38:39 -03:00
Raphael S. Carvalho	6820fbf460	compaction_manager: introduce explicit ctor for task Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 10:38:36 -03:00
Raphael S. Carvalho	d73a241a4e	compaction: kill sstables field in compaction_info Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 10:38:32 -03:00
Raphael S. Carvalho	b6b4042faf	compaction: kill table pointer in compaction_info Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 10:38:11 -03:00
Raphael S. Carvalho	98f8673d4e	compaction: simplify procedure to stop ongoing compactions Today, compactions are tracked by both _compactions and _tasks, where _compactions refers to actual ongoing compaction tasks, whereas _tasks refers to manager tasks which are responsible for spawning new compactions, retrying them on failure, etc. As each task can only have one ongoing compaction at a time, let's move compaction into task, such that manager won't have to look at both when deciding to do something like stopping a task. So stopping a task becomes simpler, and duplication is naturally gone. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 10:25:51 -03:00
Raphael S. Carvalho	0885376a85	compaction: move management of compaction_info to compaction_manager Today, compaction is calling compaction manager to register / deregister the compaction_info created by it. This is a layer violation because manager sits one layer above compaction, so manager should be responsible for managing compaction info. From now on, compaction_info will be created and managed by compaction_manager. compaction will only have a reference to info, which it can use to update the world about compaction progress. This will allow compaction_manager to be simplified as info can be coupled with its respective task, allowing duplication to be removed and layer violation to be fixed. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 10:00:49 -03:00
Raphael S. Carvalho	7688d0432c	compaction: move output run id from compaction_info into task this run id is used to track partial runs that are being written to. let's move it from info into task, as this is not an external info, but rather one that belongs to compaction_manager. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2021-09-23 09:56:01 -03:00
Piotr Sarna	88480ac504	cql-pytest: relax another condition for a failed wasm execution The previous commit already relaxed the condition for test_fib, but the same should be done for test_fib_called_on_null for an identical reason - more than 1 error can be expected in the case of calling heavily recursive function, and either fuel exhaustion, or hitting the stack limit, or any other InvalidRequest exception should be accepted. Closes #9363	2021-09-23 14:11:02 +03:00
Benny Halevy	ad46ff8e5e	database: coroutinize create_keyspace Prepare for futurizing on create_in_memory_keyspace. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210923093200.1559734-10-bhalevy@scylladb.com>	2021-09-23 14:05:44 +03:00
Benny Halevy	91091e9d89	database: update_keyspace: fixup indentation Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210923093200.1559734-9-bhalevy@scylladb.com>	2021-09-23 14:05:18 +03:00
Benny Halevy	c71cd2bed3	database: coroutinize update_keyspace Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Message-Id: <20210923093200.1559734-8-bhalevy@scylladb.com>	2021-09-23 14:05:18 +03:00
Piotr Sarna	62948b7404	Merge 'cql3: Add expr::constant to replace terminal' from Jan Ciołek Add new struct to the `expression` variant: ```c++ // A value serialized with the internal (latest) cql_serialization_format struct constant { cql3::raw_value value; data_type type; // Never nullptr, for NULL and UNSET might be empty_type }; ``` and use it where possible instead of `terminal`. This struct will eventually replace all classes deriving from `terminal`, but for now `terminal` can't be removed completely. We can't get rid of terminal yet, because sometimes `terminal` is converted back to `term`, which `constant` can't do. This won't be a problem once we replace term with expression. `bool` is removed from `expression`, now `constant` is used instead. This is a redesign of PR #9203, there is some discussion about the chosen representation there. Closes #9371 * github.com:scylladb/scylla: cql3: term: Remove get_elements and multi_item_terminal from terminals cql3: Replace most uses of terminal with expr::constant cql3: expr: Remove repetition from expr::get_elements cql3: expr: Add expr::get_elements(constant) cql3: term: remove term::bind_and_get cql3: Replace all uses of bind_and_get with evaluate_to_raw_view cql3: expr: Add evaluate_IN_list cql3: tuples: Implement tuples::in_value::get cql3: Move data_type to terminal, make get_value_type non-virtual cql3: user_types: Implement get_value_type in user_types.hh cql3: tuples: Implement get_value_type in tuples.hh cql3: maps: Implement get_value_type in maps.hh cql3: sets: Implement get_value_type in sets.hh cql3: lists: Implement get_value_type in lists.hh cql3: constants: Implement get_value_type in constants.hh cql3: expr: Add expr::evaluate cql3: Make collection term get() use the internal serialization format cql3: values: Add unset value to raw_value_view::make_temporary cql3: expr: Add constant to expression	2021-09-23 13:02:29 +02:00
Avi Kivity	369afe3124	treewide: use coroutine::maybe_yield() instead of co_await make_ready_future() The dedicated API shows the intent, and may be a tiny bit faster. Closes #9382	2021-09-23 12:28:56 +02:00
Avi Kivity	6702711d9c	Merge "Gossiper start-stop sanitation (+ bonus track)" from Pavel E " The main challenge here is to move messaging_service.start_listen() call from out of gossiper into main. Other changes are pretty minor compared to that and include - patch gossiper API towards a standard start-shutdown-stop form - gossiping "sharder info" in initial state - configure cluster name and seeds via gossip_config tests: unit(dev) dtest.bootstrap_test.start_stop_test_node(dev) manual(dev): start+stop, nodetool enable-/disablegossip refs: #2737 refs: #2795 refs: #5489 " * 'br-gossiper-dont-start-messaging-listen-2' of https://github.com/xemul/scylla: code: Expell gossiper.hh from other headers storage_service: Gossip "sharder" in initial states gossiper: Relax set_seeds() gossiper, main: Turn init_gossiper into get_seeds_from_config storage_service: Eliminate the do-bind argument from everywhere gossiper: Drop ms-registered manipulations messaging, main, gossiper: Move listening start into main gossiper: Do handlers reg/unreg from start/stop gossiper: Split (un)init_messaging_handler() gossiper: Relocate stop_gossiping() into .stop() gossiper: Introduce .shutdown() and use where appropriate gossiper: Set cluster_name via gossip_config gossiper, main: Straighten start/stop tests/cql_test_env: Open-code tst_init_ms_fd_gossiper tests/cql_test_env: De-global most of gossiper gossiper: Merge start_gossiping() overloads into one gossiper: Use is_... helpers gossiper: Fix do_shadow_round comment gossiper: Dispose dead code	2021-09-23 12:18:38 +03:00
Avi Kivity	bae9c042c2	Merge 'Add compaction stats to tracing data' from Botond Dénes Too many tombstones (row or range) are a common source of query performance problems, yet currently we have no visibility into the amount of tombstones a query has to process while constructing the results. This series addresses this by collecting stats about the compacted data in `compact_mutation_state`. This contains the number of partitions, static rows (live and dead), clustering rows (live and dead) and range tombstones. This data is then added to tracing on each query path. Example trace: ``` activity \| timestamp \| source \| source_elapsed \| client ---------------------------------------------------------------------------------------------------------------------------------------+----------------------------+-----------+----------------+----------- Execute CQL3 query \| 2021-09-22 12:06:24.089000 \| 127.0.0.1 \| 0 \| 127.0.0.1 Parsing a statement [shard 0] \| 2021-09-22 12:06:24.089552 \| 127.0.0.1 \| 1 \| 127.0.0.1 Processing a statement [shard 0] \| 2021-09-22 12:06:24.089674 \| 127.0.0.1 \| 122 \| 127.0.0.1 Creating read executor for token -4069959284402364209 with all: {127.0.0.1} targets: {127.0.0.1} repair decision: NONE [shard 0] \| 2021-09-22 12:06:24.089724 \| 127.0.0.1 \| 173 \| 127.0.0.1 read_data: querying locally [shard 0] \| 2021-09-22 12:06:24.089727 \| 127.0.0.1 \| 175 \| 127.0.0.1 Start querying singular range {{-4069959284402364209, pk{000400000001}}} [shard 0] \| 2021-09-22 12:06:24.089732 \| 127.0.0.1 \| 181 \| 127.0.0.1 Querying cache for range {{-4069959284402364209, pk{000400000001}}} and slice {(-inf, +inf)} [shard 0] \| 2021-09-22 12:06:24.089751 \| 127.0.0.1 \| 199 \| 127.0.0.1 Page stats: 1 partition(s), 0 static row(s) (0 live, 0 dead), 4 clustering row(s) (3 live, 1 dead) and 1 range tombstone(s) [shard 0] \| 2021-09-22 12:06:24.089838 \| 127.0.0.1 \| 286 \| 127.0.0.1 Querying is done [shard 0] \| 2021-09-22 12:06:24.089847 \| 127.0.0.1 \| 
295 \| 127.0.0.1 Done processing - preparing a result [shard 0] \| 2021-09-22 12:06:24.089862 \| 127.0.0.1 \| 311 \| 127.0.0.1 Request complete \| 2021-09-22 12:06:24.089326 \| 127.0.0.1 \| 326 \| 127.0.0.1 ``` Tests: unit(dev) Fixes: https://github.com/scylladb/scylla/issues/5471 Closes #9372 * github.com:scylladb/scylla: multishard_mutation_query: add tracepoint with compaction stats querier: add tracepoint with compaction stats mutation_compactor: collect stats about compacted data	2021-09-22 19:24:19 +03:00
Kamil Braun	ea172fe531	test: raft: randomized_nemesis_test: check consistency and liveness in basic_generator_test Use AppendReg instead of ExReg for the state machine. Use a generator which generates a sequence of append operations with unique integers. This implies that the result of every operation uniquely identifies the operation (since it contains the appended integer, and different operations use different integers) and all operations that must have happened before it (since it contains the previous state of the append register), which allows us to reconstruct the "current state" of the register according to the results of operations coming from Raft calls, giving us an on-line linearizability checker with O(1) amortized complexity on each operation completion. We also perform a simple liveness check at the end of the test by ensuring that a leader becomes eventually elected and that we can successfully execute a call.	2021-09-22 17:56:23 +02:00
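The consistency check described in the commit above can be illustrated with a small sketch. It is not the test's implementation: `append_checker` and its members are hypothetical names, plain `std::vector<int>` stands in for the shared `append_seq`, and each completion here costs time linear in the sequence length rather than the amortized O(1) the commit achieves. It also checks only that results agree with one another and that completed operations were invoked, not real-time ordering.

```cpp
#include <algorithm>
#include <cstddef>
#include <unordered_set>
#include <utility>
#include <vector>

// Hypothetical on-line checker for an append register whose operations
// append unique integers and return the previous state.
class append_checker {
    std::vector<int> _state;           // longest history reconstructed so far
    std::unordered_set<int> _invoked;  // integers whose append was invoked
public:
    void invoke(int x) { _invoked.insert(x); }

    // Called when append(x) completes with result `prev`.
    // Returns false if the result contradicts earlier observations.
    bool complete(int x, const std::vector<int>& prev) {
        if (_invoked.count(x) == 0) {
            return false;  // completed without ever being invoked
        }
        // The result identifies the operation and everything before it.
        std::vector<int> observed = prev;
        observed.push_back(x);
        // All observed histories must be prefixes of one another.
        std::size_t common = std::min(observed.size(), _state.size());
        if (!std::equal(observed.begin(),
                        observed.begin() + static_cast<std::ptrdiff_t>(common),
                        _state.begin())) {
            return false;
        }
        if (observed.size() > _state.size()) {
            _state = std::move(observed);
        }
        return true;
    }

    const std::vector<int>& state() const { return _state; }
};
```

Because every operation appends a distinct integer, two results that disagree on any position of their common prefix cannot both come from a single sequential history, which is what the prefix comparison detects.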
Avi Kivity	c0afdf3f15	Update seastar submodule * seastar c04a12edbd...e6db0cd587 (13): > Merge "Add kernel stack trace reporting for stalls" from Avi Ref #8828 > Merge "Keep XFS' dioattr cached" from Pavel E > coroutines: de-template maybe_yield() > sharded: Add const versions of map_reduce's > apps/io_tester: remove unused lambda capture > doc: exclude seastar::coroutine::internal namespace > deprecate unaligned_cast<> from unaligned.hh > reactor: adjust max_networking_aio_io_control_blocks to lower size when fs.aio-max-nr is small > build: clarify choice of C++ dialect, and change default to C++20 > coding_style: update concepts style to snake_case > Merge "Teach io_tester to submit requests-per-second flow" from Pavel E > cmake: find and link against Boost::filesystem > coroutine: add maybe_yield	2021-09-22 18:55:25 +03:00
Nadav Har'El	92570ea7d9	cql-pytest: add tests on behavior of empty-string keys We know (verified by existing tests) that null keys are not allowed - neither as partition keys nor clustering keys. In issue #9352 a question was raised of whether an empty string is allowed as a key on a base table (not a materialized view or index). The following tests confirm that the current situation is as follows: 1. An empty string is perfectly legal as a clustering key. 2. An empty string is NOT ALLOWED as a partition key - the error "Key may not be empty" is reported if this is attempted. 3. If the partition key is compound (multiple partition-key columns) then any or all of them may be empty strings. These tests pass the same on both Cassandra and Scylla, showing that this bizarre (and undocumented) behavior is identical in both. Refs #9352. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210922131310.293846-1-nyh@scylladb.com>	2021-09-22 18:55:25 +03:00
Avi Kivity	083279d9ab	Merge "Generalize sstable creation for tests" from Pavel E " There's a whole lot of places that create an sstable for tests like this auto sst = env.make_sstable(...); sst->write_components(...); sst->load(); Some of them are already generalized with the make_sstable_easy helper, but there are several instances of them. Found while hunting down the places that use default IO sched class behind the scenes. tests: unit(dev) " * 'br-sst-tests-make-sstable-easy' of https://github.com/xemul/scylla: test: Generalize make_sstable() and make_sstable_easy() test: Use now existing helpers elsewhere test: Generalize all make_sstable_easy()-s test: Set test change estimation to 1 test: Generalize make_sstable_easy in mutation tests test: Generalize make_sstable_easy in set tests test: Reuse make_sstable_easy in datafile tests test: Relax make_sstable_easy in compaction tests	2021-09-22 18:55:25 +03:00
Nadav Har'El	a99a774731	cql-pytest: test for secondary-index on empty-string value When a string column is indexed with a secondary index, the empty value for this column (an empty string '') is perfectly legal, and should be indexed as well. This is not the same as an unset (null) value which isn't indexed. The following test demonstrates that this case works in Cassandra, but does not in Scylla (so the test is marked "xfail"). In Scylla, a query that returns the expected results with ALLOW FILTERING suddenly returns a different (and wrong) result when an index is added on the table. This test reproduces issue #9364. Refs #9364. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210922121510.291826-1-nyh@scylladb.com>	2021-09-22 18:55:25 +03:00
Avi Kivity	b5cf0b4489	Merge "compaction: Update backlog tracker correctly when schema is updated" from Raphael " Backlog tracker isn't updated correctly when facing a schema change, and may leak a SSTable if compaction strategy is changed, which causes backlog to be computed incorrectly. Most of these problems happen because sstable set and tracker are updated independently, so it could happen that tracker lose track (pun intended) of changes applied to set. The first patch will fix the leak when strategy is changed, and the third patch will make sure that tracker is updated atomically with sstable set, so these kind of problems will not happen anymore. Fixes #9157 test: mode(debug) " * 'fixes_to_backlog_tracker_v3' of https://github.com/raphaelsc/scylla: compaction: Update backlog tracker correctly when schema is updated compaction: Don't leak backlog of input sstable when compaction strategy is changed compaction: introduce compaction_read_monitor_generator::remove_exhausted_sstables() compaction: simplify removal of monitors	2021-09-22 18:55:25 +03:00
Nadav Har'El	e8493e20cb	cql-pytest: test for empty-string as partition key in materialized view Scylla and Cassandra do not allow an empty string as a partition key, but a materialized view might "convert" a regular string column into a partition key, and an empty string is a perfectly valid value for this column. This can result in a view row which has an empty string as a partition key. This case works in Cassandra, but doesn't in Scylla (the row with the empty string as a partition key doesn't appear). The following test demonstrates this difference between Scylla and Cassandra (it passes on Cassandra, fails on Scylla, and accordingly marked "xfail"). Refs #9375. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Message-Id: <20210922115000.290387-1-nyh@scylladb.com>	2021-09-22 18:55:25 +03:00
Piotr Jastrzebski	56888c8954	docs: clean up codeowners Recently we had to say goodbye to our dear friend Pekka. He orphaned a few subsystems that can't call for his help in code reviews anymore. This patch makes sure no one will bother Pekka in his afterlife. It also cleans up HACKING.md a little bit by removing Pekka and Duarte from the maintainer/reviewer lists. Signed-off-by: Piotr Jastrzebski <piotr@scylladb.com> Message-Id: <98ba1aed9ee8a87b9037b5032b82abc5bfddbd66.1632301309.git.piotr@scylladb.com>	2021-09-22 18:55:25 +03:00
Botond Dénes	3f4f408bcf	schema: add get_reversed() A variant of make_reversed() which goes through the schema registry, teaching the schema to the registry if necessary. This effectively caches the result of the reversing and as an added bonus double reversing yields the very same schema C++ object that was the starting point. Closes #9365	2021-09-22 18:55:25 +03:00
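The caching effect described for `get_reversed()` can be pictured with a toy registry. Everything below is an assumption made for illustration: the real schema and schema registry are far richer, and the `schema` struct, the `learn()` method and the (name, reversed) key are hypothetical stand-ins rather than the actual interfaces.

```cpp
#include <map>
#include <memory>
#include <string>
#include <utility>

// Hypothetical, heavily simplified schema: just a name and an orientation.
struct schema {
    std::string name;
    bool reversed;
};
using schema_ptr = std::shared_ptr<const schema>;

// Toy registry: keeps one canonical object per (name, reversed) pair.
class schema_registry {
    std::map<std::pair<std::string, bool>, schema_ptr> _entries;
public:
    // Teach the schema to the registry if needed and return the canonical object.
    schema_ptr learn(schema s) {
        auto key = std::make_pair(s.name, s.reversed);
        auto it = _entries.find(key);
        if (it == _entries.end()) {
            it = _entries.emplace(key, std::make_shared<const schema>(std::move(s))).first;
        }
        return it->second;
    }

    // Going through the registry caches the reversed schema, so reversing
    // twice yields the very same object that was the starting point.
    schema_ptr get_reversed(const schema& s) {
        return learn(schema{s.name, !s.reversed});
    }
};
```

With this shape, `reg.get_reversed(*reg.get_reversed(*s))` compares equal to `s` as a pointer, provided `s` itself was obtained from the registry.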
Kamil Braun	81b7ed23bb	test: raft: randomized_nemesis_test: introduce append register The AppendReg state machine stores a sequence of integers. It supports `append` inputs which append a single integer to the sequence and return the previous state (before appending). The implementation uses the `append_seq` data structure representing an immutable sequence that uses a vector underneath which may be shared by multiple instances of `append_seq`. Appending to the sequence appends to the underlying vector, but there is no observable effect on the other instances since they use only the prefix of the sequence that wasn't changed. If two instances sharing the same vector try to append, the later one must perform a copy. This allows efficient appends if only one instance is appending, which is useful in the following context: - a Raft server stores a copy in the underlying state machine replica and appends to it, - clients send append operations to the server; the server returns the state of the sequence before it was appended to, - thanks to the sharing, we don't need to copy all elements when returning the sequence to the client, and only one instance (the server) is appending to the shared vector, - summarizing, all operations have amortized O(1) complexity.	2021-09-22 17:54:07 +02:00
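A minimal sketch of the sharing scheme described in the commit above, assuming a single-threaded setting and `int` elements. The class below is written for illustration and is not the test's `append_seq`; the member names and the `shares_storage_with()` helper are hypothetical.

```cpp
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Immutable sequence view over a vector that may be shared between instances.
class append_seq {
    std::shared_ptr<std::vector<int>> _data;
    std::size_t _size = 0;  // this instance only sees the first _size elements

    append_seq(std::shared_ptr<std::vector<int>> data, std::size_t size)
        : _data(std::move(data)), _size(size) {}
public:
    append_seq() : _data(std::make_shared<std::vector<int>>()) {}

    // Returns a new sequence with x appended; *this is observably unchanged.
    append_seq append(int x) const {
        std::shared_ptr<std::vector<int>> data = _data;
        if (data->size() != _size) {
            // Another instance already extended the shared vector past our
            // prefix, so the later appender must copy its own prefix.
            data = std::make_shared<std::vector<int>>(
                _data->begin(), _data->begin() + static_cast<std::ptrdiff_t>(_size));
        }
        data->push_back(x);  // amortized O(1) when no copy was needed
        return append_seq(std::move(data), _size + 1);
    }

    std::size_t size() const { return _size; }
    int operator[](std::size_t i) const { return (*_data)[i]; }
    bool shares_storage_with(const append_seq& o) const { return _data == o._data; }
};
```

When only one instance (the server's replica) keeps appending, every append takes the in-place branch, and handing an older instance to a client copies only a pointer and a length.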
Botond Dénes	922295dd8e	multishard_mutation_query: add tracepoint with compaction stats Add the content of the compaction stats introduced in the previous patch to the tracing data. This will help diagnose query performance related problems caused by tombstones.	2021-09-22 14:00:24 +03:00
Botond Dénes	eba46e353d	querier: add tracepoint with compaction stats Add the content of the compaction stats introduced in the previous patch to the tracing data. This will help diagnose query performance related problems caused by tombstones.	2021-09-22 14:00:05 +03:00
Botond Dénes	f0ead81250	mutation_compactor: collect stats about compacted data Stats contain the number of partitions, static rows, clustering rows and range tombstones. For rows dead/live are counted separately.	2021-09-22 13:59:19 +03:00
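The kind of counters this commit describes, and the "Page stats" trace line shown in the merge message for this series, could look roughly like the sketch below. The struct and function names (`row_stats`, `compaction_stats`, `format_stats`) are assumptions made for illustration and may differ from what `compact_mutation_state` actually uses.

```cpp
#include <cstdint>
#include <string>

// Hypothetical counters; live and dead rows are counted separately.
struct row_stats {
    uint64_t live = 0;
    uint64_t dead = 0;
    uint64_t total() const { return live + dead; }
};

struct compaction_stats {
    uint64_t partitions = 0;
    row_stats static_rows;
    row_stats clustering_rows;
    uint64_t range_tombstones = 0;
};

// Renders the counters in the same shape as the "Page stats" trace message.
inline std::string format_stats(const compaction_stats& s) {
    auto rows = [] (const row_stats& r, const char* kind) {
        return std::to_string(r.total()) + " " + kind + " row(s) ("
             + std::to_string(r.live) + " live, "
             + std::to_string(r.dead) + " dead)";
    };
    return std::to_string(s.partitions) + " partition(s), "
         + rows(s.static_rows, "static") + ", "
         + rows(s.clustering_rows, "clustering") + " and "
         + std::to_string(s.range_tombstones) + " range tombstone(s)";
}
```

Keeping dead rows and range tombstones as separate counters is what lets a trace reader tell a tombstone-heavy page apart from one that is merely large.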
Pavel Emelyanov	598841a5dd	code: Expell gossiper.hh from other headers This needs to add forward declarations of the gossiper class and re-include some other headers here and there. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	6875a4b292	storage_service: Gossip "sharder" in initial states Right now the number of shards and ignore-msb-bits are gossiped with a separate call. It's simpler to include this data into the initial gossiping state. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	968e117315	gossiper: Relax set_seeds() It's much shorter and simpler to pass the seeds, obtained from the config, into gossiper via gossip_config rather than with the help of a special call. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	2b63c4c16f	gossiper, main: Turn init_gossiper into get_seeds_from_config Looking into the init_gossiper() helper makes it clear that what it does is get the seeds, provider and listen_address from config and generate a set of seeds for the gossiper. It then calls gossiper.set_seeds(). This patch renames the helper into get_seeds_from_config(), removes all but the db::config& argument from it and moves the call to set_seeds() into main. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	7680274e02	storage_service: Eliminate the do-bind argument from everywhere The same as in previous patch -- the gossiper doesn't need to know if it should call messaging.start_listen() or not, neither should do the storage_service. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	0607a2b84f	gossiper: Drop ms-registered manipulations Now it's no-op and can be removed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	ca316f32f0	messaging, main, gossiper: Move listening start into main Before preparing the cluster join process the messaging should be put into listening state. Right now it's done "on-demand" by the call to the do_shadow_round(), also there's a safety call in the start_gossiping(). Tests, however, should not start listening, so the do_bind boolean exists and is passed all the way around. Make the main() code explicitly call the messaging.start_listen() and leave tests without it. This change makes messaging start listening a bit earlier, but in between these old and new places there's nothing that needs messaging to stay deaf. As the do_bind becomes useless, the wait_for_gossip_to_settle() is also moved into main. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	f644eb1cf7	gossiper: Do handlers reg/unreg from start/stop On start, handlers can be registered any time before the messaging starts to listen. On stop, handlers can remain registered for any length of time, since the messaging service stops early in drain_on_shutdown(). One tricky place is API start_/stop_gossiping(). The latter calls gossiper::stop() thus unregistering the handlers. So to make the start_gossiping() work it must call gossiper::start() in advance. Overall the gossiper start/stop becomes this: gossiper.start() `- registers handlers gossiper.start_gossiping() `- // starts gossiping gossiper.shutdown() `- // stops gossiping gossiper.stop() `- calls shutdown() // re-entrable `- unregisters handlers Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	9aba3e6f9f	gossiper: Split (un)init_messaging_handler() As a preparation for the next patch. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	dfe54207cb	gossiper: Relocate stop_gossiping() into .stop() The helper in question is called in two places: 1. In main() as a fuse against early exception before creating the drain_on_shutdown() defer 2. In the stop_gossiping() API call Both can be replaced with the stop_gossiping() call from the .stop() method, here's why: 1. In main the gossiper::stop() call is already deferred right after the gossiper is started. So this change moves it above. It may happen that an exception pops up before the old fuse was deferred, but that's OK -- the stop_gossiping() is safe against early- and re- entrances 2. The stop_gossiping() change is effectively a rename -- it calls the stop_gossiping() as it did before, but with the help of the .stop() method Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00
Pavel Emelyanov	e24c5034b5	gossiper: Introduce .shutdown() and use where appropriate The start/stop sequence we're moving towards assumes a shutdown (or drain) method that will be called early on stop to notify the service that the system is going down so it could prepare. For gossiper it already means calling stop_gossiping() on the shard-0 instance. So by and large this patch renames a few stop_gossiping() calls into .shutdown() ones. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2021-09-22 13:13:06 +03:00

28384 Commits