scylladb

Author	SHA1	Message	Date
Botond Dénes	b247f29881	Merge 'De-static system_keyspace::get_{saved\|local}_tokens()' from Pavel Emelyanov Yet another user of global qctx object. Making the method(s) non-static requires pushing the system_keyspace all the way down to size_estimate_virtual_reader and a small update of the cql_test_env Closes #11738 * github.com:scylladb/scylladb: system_keyspace: Make get_{local\|saved}_tokens non static size_estimates_virtual_reader: Pass sys_ks argument to get_local_ranges() cql_test_env: Keep sharded<system_keyspace> reference size_estimate_virtual_reader: Keep system_keyspace reference system_keyspace: Pass sys_ks argument to install_virtual_readers() system_keyspace: Make make() non-static distributed_loader: Pass sys_ks argument to init_system_keyspace() system_keyspace: Remove dangling forward declaration	2022-10-07 11:28:32 +03:00
Avi Kivity	20bad62562	Merge 'Detect and record large collections' from Benny Halevy This series adds support for detecting collections that have too many items and recording them in `system.large_cells`. A configuration variable was added to db/config: `compaction_collection_items_count_warning_threshold`, set by default to 10000. Collections that have more items than this threshold will be warned about and will be recorded as a large cell in the `system.large_cells` table. Documentation has been updated accordingly. A new column was added to system.large_cells: `collection_items`. Similar to the `rows` column in system.large_partitions, `collection_items` holds the number of items in a collection when the large cell is a collection, or 0 if it isn't. Note that the collection may be recorded in system.large_cells either due to its size, like any other cell, and/or due to the number of items in it, if it crosses the said threshold. Note that #11449 called for a new system.large_collections table, but extending system.large_cells follows the logic of system.large_partitions and is a smaller change overall, hence it was preferred. Since the system keyspace schema is hard coded, the schema version of system.large_cells was bumped, and since the change is not backward compatible, we added a cluster feature - `LARGE_COLLECTION_DETECTION` - to enable using it. The large_data_handler large cell detection record function will populate the new column only when the new cluster feature is enabled. In addition, unit tests were added in sstable_3_x_test for testing large cells detection by cell size, and large_collection detection by the number of items. 
Closes #11449 Closes #11674 * github.com:scylladb/scylladb: sstables: mx/writer: optimize large data stats members order sstables: mx/writer: keep large data stats entry as members db: large_data_handler: dynamically update config thresholds utils/updateable_value: add transforming_value_updater db/large_data_handler: cql_table_large_data_handler: record large_collections db/large_data_handler: pass ref to feature_service to cql_table_large_data_handler db/large_data_handler: cql_table_large_data_handler: move ctor out of line docs: large-rows-large-cells-tables: fix typos db/system_keyspace: add collection_elements column to system.large_cells gms/feature_service: add large_collection_detection cluster feature test: sstable_3_x_test: add test_sstable_too_many_collection_elements test: lib: simple_schema: add support for optional collection column test: lib: simple_schema: build schema in ctor body test: lib: simple_schema: cql: define s1 as static only if built this way db/large_data_handler: maybe_record_large_cells: consider collection_elements db/large_data_handler: debug cql_table_large_data_handler::delete_large_data_entries sstables: mx/writer: pass collection_elements to writer::maybe_record_large_cells sstables: mx/writer: add large_data_type::elements_in_collection db/large_data_handler: get the collection_elements_count_threshold db/config: add compaction_collection_elements_count_warning_threshold test: sstable_3_x_test: add test_sstable_write_large_cell test: sstable_3_x_test: pass cell_threshold_bytes to large_data_handler test: sstable_3_x_test: large_data_handler: prepare callback for testing large_cells test: sstable_3_x_test: large_data tests: use BOOST_REQUIRE_[GL]T test: sstable_3_x_test: test_sstable_log_too_many_rows: use tests::random	2022-10-06 18:28:21 +03:00
Avi Kivity	62a4d2d92b	Merge 'Preliminary changes for multiple Compaction Groups' from Raphael "Raph" Carvalho What's contained in this series: - Refactored compaction tests (and utilities) for integration with multiple groups - The idea is to write a new class of tests that will stress multiple groups, whereas the existing ones will still stress a single group. - Fixed a problem when cloning compound sstable set (cannot be triggered today so I didn't open a GH issue) - Many changes in replica::table for allowing integration with multiple groups Next: - Introduce for_each_compaction_group() for iterating over groups wherever needed. - Use for_each_compaction_group() in replica::table operations spanning all groups (API, readers, etc). - Decouple backlog tracker from compaction strategy, to allow for backlog isolation across groups - Introduce static option for defining number of compaction groups and implement function to map a token to its respective group. - Testing infrastructure for multiple compaction groups (helpful when testing the dynamic behavior: i.e. merging / splitting). 
Closes #11592 * github.com:scylladb/scylladb: sstable_resharding_test: Switch to table_for_tests replica: Move compacted_undeleted_sstables into compaction group replica: Use correct compaction_group in try_flush_memtable_to_sstable() replica: Make move_sstables_from_staging() robust and compaction group friendly test: Rename column_family_for_tests to table_for_tests sstable_compaction_test: Use column_family_for_tests::as_table_state() instead test: Don't expose compound set in column_family_for_tests test: Implement column_family_for_tests::table_state::is_auto_compaction_disabled_by_user() sstable_compaction_test: Merge table_state_for_test into column_family_for_tests sstable_compaction_test: use table_state_for_test itself in fully_expired_sstables() sstable_compaction_test: Switch to table_state in compact_sstables() sstable_compaction_test: Reduce boilerplate by switching to column_family_for_tests	2022-10-06 18:23:47 +03:00
Pavel Emelyanov	4c099bb3ed	cql_test_env: Keep sharded<system_keyspace> reference There's a test_get_local_ranges() call in size-estimate reader which will need system keyspace reference. There's no other place for tests to get it from but the cql_test_env thing Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-06 17:59:21 +03:00
Pavel Emelyanov	9f79525f8e	distributed_loader: Pass sys_ks argument to init_system_keyspace() Its final destination is the virtual tables registration code, eventually called from init_system_keyspace() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-06 17:55:03 +03:00
Kamil Braun	962ee9ba7b	Merge 'Make raft_group0 -> system_keyspace dependency explicit' from Pavel Emelyanov The raft_group0 code needs system_keyspace and now it gets one from gossiper. This gossiper->system_keyspace dependency is in fact artificial, gossiper doesn't need system ks, it's there only to let raft and snitch call gossiper.get_system_keyspace(). This makes raft use system ks directly, snitch is patched by another branch Closes #11729 * github.com:scylladb/scylladb: raft_group0: Use local reference raft_group0: Add system keyspace reference	2022-10-06 13:49:26 +02:00
Tomasz Grabiec	023f78d6ae	test: lib: random_mutation_generator: Introduce a switch for generating simpler mutations for easier debugging Closes #11731	2022-10-06 13:49:26 +02:00
Raphael S. Carvalho	7d82373e3a	test: Rename column_family_for_tests to table_for_tests To avoid confusion, as replica::column_family was already renamed to replica::table. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-10-05 21:37:19 -03:00
Raphael S. Carvalho	5a028ca4dc	test: Don't expose compound set in column_family_for_tests The compound set shouldn't be exposed in main_sstables() because once we complete the switch to column_family_for_tests::table_state, it can happen that compaction will try to remove or add elements to its set snapshot, and the compound set allows neither operation. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-10-05 21:37:19 -03:00
Raphael S. Carvalho	b16d6c55b1	test: Implement column_family_for_tests::table_state::is_auto_compaction_disabled_by_user() Needed once we switch to column_family_for_tests::table_state, so unit tests relying on correct value will still work Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-10-05 21:37:19 -03:00
Raphael S. Carvalho	a6d24a763a	sstable_compaction_test: Merge table_state_for_test into column_family_for_tests This change will make table_state_for_test the table_state of column_family_for_tests. Today, a unit test has to keep a reference to them both and logically couple them, but that's error prone. This change is also important when replica::table supports multiple compaction groups, so unit tests won't have to directly reference the table_state of table, but rather use the one managed by column_family_for_tests. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-10-05 21:37:19 -03:00
Raphael S. Carvalho	a6affea008	sstable_compaction_test: Switch to table_state in compact_sstables() The switch is important once we have multiple compaction groups, as a single table may own several groups. There will no longer be a replica::table::as_table_state(). Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-10-05 21:37:19 -03:00
Raphael S. Carvalho	2aa6518486	sstable_compaction_test: Reduce boilerplate by switching to column_family_for_tests Lots of boilerplate is reduced, and will also help to complete the switch from replica::table to compaction::table_state in the unit tests. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-10-05 21:37:18 -03:00
Pavel Emelyanov	8570fe3c30	raft_group0: Add system keyspace reference The sharded<system_keyspace> is already started by the time raft_group0 is created Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-10-05 17:35:13 +03:00
Botond Dénes	4c13328788	Merge 'Return all sstables in table::get_sstable_set()' from Raphael "Raph" Carvalho This fixes a regression introduced by `1e7a444`, where table::get_sstable_set() isn't exposing all sstables, but rather only the ones in the main set. That causes users of the interface, such as get_sstables_by_partition_key() (used by the API to return the list of sstable names which contain a particular key), to miss files in the maintenance set. Fixes https://github.com/scylladb/scylladb/issues/11681. Closes #11682 * github.com:scylladb/scylladb: replica: Return all sstables in table::get_sstable_set() sstables: Fix cloning of compound_sstable_set	2022-10-05 06:55:50 +03:00
Raphael S. Carvalho	827750c142	replica: Return all sstables in table::get_sstable_set() get_sstable_set() as its name implies is not confined to the main or maintenance set, nor to a specific compaction group, so let's make it return the compound set which spans all groups, meaning all sstables tracked by a table will be returned. This is a regression introduced in `1e7a444`. It affects the API to return sstable list containing a partition key, as sstables in maintenance would be missed, fooling users of the API like tools that could trust the output. Each compaction group is returning the main and maintenance set in table_state's main_sstable_set() and maintenance_sstable_set(), respectively. Fixes #11681. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-10-04 10:43:27 -03:00
Benny Halevy	3c11937b00	test: lib: simple_schema: add support for optional collection column Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:42:06 +03:00
Benny Halevy	7b5f2d2e53	test: lib: simple_schema: build schema in ctor body Rather than when initializing _s. Prepare for adding an optional collection column. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:42:06 +03:00
Benny Halevy	db01641a44	test: lib: simple_schema: cql: define s1 as static only if built this way Keep the with_static ctor parameter as private member to be used by the cql() method to define s1 either as static or not. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-10-04 08:42:05 +03:00
Tomasz Grabiec	9dae2b9c02	Merge 'mutation_fragment_stream_validator: various API improvements' from Botond Dénes The low-level `mutation_fragment_stream_validator` gets `reset()` methods that until now only the high-level `mutation_fragment_stream_validating_filter` had. Active tombstone validation is pushed down to the low level validator. The low level validator, which was a pain to use until now due to being very fussy on which subset of its API one used, is made much more robust, not requiring the user to stick to a subset of its API anymore. Closes #11614 * github.com:scylladb/scylladb: mutation_fragment_stream_validator: make interface more robust mutation_fragment_stream_validator: add reset() to validating filter mutation_fragment_stream_validator: move active tomsbtone validation into low level validator	2022-10-03 16:23:46 +02:00
Botond Dénes	060dda8e00	Merge 'Reduce dependencies on large data handler header' from Benny Halevy Reduce the false dependencies on db/large_data_handler.hh by not including it from commonly used header files, and rather including it only in the source files that actually need it. This is in preparation for https://github.com/scylladb/scylladb/issues/11449 Closes #11654 * github.com:scylladb/scylladb: test: lib: do not include db/large_data_handler.hh in test_service.hh test: lib: move sstable test_env::impl ctor out of line sstables: do not include db/large_data_handler.hh in sstables.hh api/column_family: add include db/system_keyspace.hh	2022-09-30 13:27:38 +03:00
Tomasz Grabiec	5268f0f837	test: lib: random_mutation_generator: Don't generate mutations with marker uncompacted with shadowable tombstone The generator was first setting the marker, then applying tombstones. The marker was set like this: row.marker() = random_row_marker(); Later, when shadowable tombstones were applied, they were compacted with the marker as expected. However, the key for the row was chosen randomly in each iteration and there are multiple keys set, so there was a possibility of a key clash with an earlier row. This could override the marker without applying any tombstones, which is conditional on random choice. This could generate rows with markers uncompacted with shadowable tombstones. This broke row_cache_test::test_concurrent_reads_and_eviction on comparison between expected and read mutations. The latter was compacted because it went through an extra merge path, which compacts the row. Fix by making sure there are no key clashes. Closes #11663	2022-09-30 11:27:01 +03:00
Benny Halevy	776b009c0f	test: lib: do not include db/large_data_handler.hh in test_service.hh It was needed for defining and referencing nop_lp_handler and in sstable_3_x_test for testing the large_data_handler. Remove the include from the commonly used header file to reduce the false dependencies on large_data_handler.hh Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-29 18:36:16 +03:00
Benny Halevy	678d88576b	test: lib: move sstable test_env::impl ctor out of line To prepare for removing the include of db/large_data_handler.hh from test/lib/test_services.hh Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-29 18:35:12 +03:00
Botond Dénes	a8cbf66573	mutation_fragment_stream_validator: move active tomsbtone validation into low level validator Currently the active range tombstone change is validated in the high level `mutation_fragment_stream_validating_stream`, meaning that users of the low-level `mutation_fragment_stream_validator` don't benefit from checking that tombstones are properly closed. This patch moves the validation down to the low-level validator (which is what the high-level one uses under the hood too), and requires all users to pass information about changes to the active tombstone for each fragment.	2022-09-26 10:17:27 +03:00
Raphael S. Carvalho	2f52698a26	test: Make fake sstables implicitly belong to current shard Fake SSTables will be implicitly owned by the shard that created them, allowing them to be called on procedures that assert the SSTables are owned by the current shard, like the table's one that rebuilds the sstable set. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-19 12:05:24 -03:00
Raphael S. Carvalho	697f200319	test: Make it clearer that sstables::test::set_values() modify data size By adding a param with default value, we make it clear in the interface that the procedure modifies sstable data size. It can happen one calls this function without noticing it overrides the data size previously set using a different function. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-19 12:01:24 -03:00
Botond Dénes	05ef13a627	Merge 'Add support to split large partitions across SSTables' from Raphael "Raph" Carvalho Introduces support to split large partitions during compaction. Today, compaction can only split input data at partition boundary, so a large partition is stored in a single file. But that can cause many problems, like memory pressure (e.g.: https://github.com/scylladb/scylladb/issues/4217), and incremental compaction can also not fulfill its promise as the file storing the large partition can only be released once exhausted. The first step was to add clustering range metadata for first and last partition keys (retrieved from promoted index), which is crucial to determine disjointness at clustering level, and also the order at which the disjoint files should be opened for incremental reading. The second step was to extend sstable_run to look at clustering dimension, so a set of files storing disjoint ranges for the same partition can live in the same sstable run. The final step was to introduce the option for compaction to split large partition being written if it has exceeded the size threshold. What's next? Following this series, a reader will be implemented for sstable_run that will incrementally open the readers. It can be safely built on the assumption of the disjoint invariant after the second step aforementioned. 
Closes #11233 * github.com:scylladb/scylladb: test: Add test for large partition splitting on compaction compaction: Add support to split large partitions sstable: Extend sstable_run to allow disjointness on the clustering level sstables: simplify will_introduce_overlapping() test: move sstable_run_disjoint_invariant_test into sstable_datafile_test test: lib: Fix inefficient merging of mutations in make_sstable_containing() sstables: Keep track of first partition's first pos and last partition's last pos sstables: Rename min/max position_range to a descriptive name sstables_manager: Add sstable metadata reader concurrency semaphore sstables: Add ability to find first or last position in a partition	2022-09-15 16:08:56 +03:00
Kamil Braun	728161003a	Merge 'raft server, abort on background errors' from Gusev Petr Halted background fibers render raft server effectively unusable, so report this explicitly to the clients. Fix: #11352 Closes #11370 * github.com:scylladb/scylladb: raft server, status metric raft server, abort group0 server on background errors raft server, provide a callback to handle background errors raft server, check aborted state on public server public api's	2022-09-15 14:12:11 +02:00
Raphael S. Carvalho	20a6483678	test: Add test for large partition splitting on compaction Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-14 13:23:19 -03:00
Raphael S. Carvalho	e1560c6b7f	test: lib: Fix inefficient merging of mutations in make_sstable_containing() make_sstable_containing() was absurdly slow when merging thousands of mutations belonging to the same key, as it was unnecessarily copying the mutation for every merge, producing bad complexity. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-14 13:09:51 -03:00
Raphael S. Carvalho	e099a9bf3b	sstables_manager: Add sstable metadata reader concurrency semaphore Let's introduce a reader_concurrency_semaphore for reading sstable metadata, to avoid an OOM due to unlimited concurrency. The concurrency on startup is not controlled, so it's important to enforce a limit on the amount of memory used by the parallel readers. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-14 13:09:51 -03:00
Petr Gusev	4ff0807cd0	raft server, status metric	2022-09-13 19:34:22 +04:00
Kamil Braun	2fe3e67a47	gms: feature_service: don't distinguish between 'known' and 'supported' features `feature_service` provided two sets of features: `known_feature_set` and `supported_feature_set`. The purpose of both and the distinction between them was unclear and undocumented. The 'supported' features were gossiped by every node. Once a feature is supported by every node in the cluster, it becomes 'enabled'. This means that whatever piece of functionality is covered by the feature, it can be used by the cluster from now on. The 'known' set was used to perform feature checks on node start; if the node saw that a feature is enabled in the cluster, but the node does not 'know' the feature, it would refuse to start. However, if the feature was 'known', but wasn't 'supported', the node would not complain. This means that we could in theory allow the following scenario: 1. all nodes support feature X. 2. X becomes enabled in the cluster. 3. the user changes the configuration of some node so feature X will become unsupported but still known. 4. The node restarts without error. So now we have a feature X which is enabled in the cluster, but not every node supports it. That does not make sense. It is not clear whether it was accidental or purposeful that we used the 'known' set instead of the 'supported' set to perform the feature check. What I think is clear is that having two sets makes the entire thing unnecessarily complicated and hard to think about. Fortunately, at the base to which this patch is applied, the sets are always the same. So we can easily get rid of one of them. I decided that the name which should stay is 'supported'; I think it's more specific than 'known' and it matches the name of the corresponding gossiper application state. Closes #11512	2022-09-12 13:09:12 +03:00
Raphael S. Carvalho	dfa7273127	test: sstable_utils: Set data size fields for fake SSTable So methods that look at data size and require it to be higher than 0 will work on fake SSTables created using set_values(). Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-09-11 14:26:59 -03:00
Kamil Braun	dba595d347	Merge 'Minimal implementation of Broadcast Tables' from Mikołaj Grzebieluch Broadcast tables are tables for which all statements are strongly consistent (linearizable), replicated to every node in the cluster and available as long as a majority of the cluster is available. If a user wants to store a “small” volume of metadata that is not modified “too often” but provides high resiliency against failures and strong consistency of operations, they can use broadcast tables. The main goal of the broadcast tables project is to solve problems which need to be solved when we eventually implement general-purpose strongly consistent tables: designing the data structure for the Raft command, ensuring that the commands are idempotent, handling snapshots correctly, and so on. In this MVP (Minimum Viable Product), statements are limited to simple SELECT and UPDATE operations on the built-in table. In the future, other statements and data types will be available but with this PR we can already work on features like idempotent commands or snapshotting. Snapshotting is not handled yet which means that restarting a node or performing too many operations (which would cause a snapshot to be created) will give incorrect results. In a follow-up, we plan to add end-to-end Jepsen tests (https://jepsen.io/). With this PR we can already simulate operations on lists and test linearizability in linear complexity. This can also test Scylla's implementation of persistent storage, failure detector, RPC, etc. 
Design doc: https://docs.google.com/document/d/1m1IW320hXtsGulzSTSHXkfcBKaG5UlsxOpm6LN7vWOc/edit?usp=sharing Closes #11164 * github.com:scylladb/scylladb: raft: broadcast_tables: add broadcast_kv_store test raft: broadcast_tables: add returning query result raft: broadcast_tables: add execution of intermediate language raft: broadcast_tables: add compilation of cql to intermediate language raft: broadcast_tables: add definition of intermediate language db: system_keyspace: add broadcast_kv_store table db: config: add BROADCAST_TABLES feature flag	2022-09-09 18:05:37 +02:00
Mikołaj Grzebieluch	82df8a9905	raft: broadcast_tables: add compilation of cql to intermediate language We decided to extend `cql_statement` hierarchy with `strongly_consistent_modification_statement` and `strongly_consistent_select_statement`. Statements operating on system.broadcast_kv_store will be compiled to these new subclasses if BROADCAST_TABLES flag is enabled. If the query is executed on a shard other than 0, it's bounced to shard 0.	2022-09-08 15:25:36 +02:00
Benny Halevy	0627667a06	mutation_partition: compact_for_compaction: get tombstone_gc_state And pass down to `do_compact`. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-09-07 07:43:15 +03:00
Kefu Chai	a5e696fab8	storage_service, test: drop unused storage_service_config this setting was removed back in `dcdd207349`, so despite that we are still passing `storage_service_config` to the ctor of `storage_service`, `storage_service::storage_service()` just drops it on the floor. in this change, `storage_service_config` class is removed, and all places referencing it are updated accordingly. Signed-off-by: Kefu Chai <tchaikov@gmail.com> Closes #11415	2022-08-31 19:49:13 +03:00
Tomasz Grabiec	1d0264e1a9	Merge 'Implement Raft upgrade procedure' from Kamil Braun Start with a cluster with Raft disabled, end up with a cluster that performs schema operations using group 0. Design doc: https://docs.google.com/document/d/1PvZ4NzK3S0ohMhyVNZZ-kCxjkK5URmz1VP65rrkTOCQ/ (TODO: replace this with .md file - we can do it as a follow-up) The procedure, on a high level, works as follows: - join group 0 - wait until every peer joined group 0 (peers are taken from `system.peers` table) - enter `synchronize` upgrade state, in which group 0 operations are disabled - wait until all members of group 0 entered `synchronize` state or some member entered the final state - synchronize schema by comparing versions and pulling if necessary - enter the final state (`use_new_procedures`), in which group 0 is used for schema operations. With the procedure comes a recovery mode in case the upgrade procedure gets stuck (and it may if we lose a node during recovery - the procedure, to correctly establish a single group 0 cluster, requires contacting every node). This recovery mode can also be used to recover clusters with group 0 already established if they permanently lose a majority of nodes - killing two birds with one stone. Details in the last commit message. Read the design doc, then read the commits in topological order for best reviewing experience. --- I did some manual tests: upgrading a cluster, using the cluster to add nodes, remove nodes (both with `decommission` and `removenode`), replacing nodes. Performing recovery. As a follow-up, we'll need to implement tests using the new framework (after it's ready). It will be easy to test upgrades and recovery even with a single Scylla version - we start with a cluster with the RAFT flag disabled, then rolling-restart while enabling the flag (and recovery is done through simple CQL statements). 
Closes #10835 * github.com:scylladb/scylladb: service/raft: raft_group0: implement upgrade procedure service/raft: raft_group0: extract `tracker` from `persistent_discovery::run` service/raft: raft_group0: introduce local loggers for group 0 and upgrade service/raft: raft_group0: introduce GET_GROUP0_UPGRADE_STATE verb service/raft: raft_group0_client: prepare for upgrade procedure service/raft: introduce `group0_upgrade_state` db: system_keyspace: introduce `load_peers` idl-compiler: introduce cancellable verbs message: messaging_service: cancellable version of `send_schema_check`	2022-08-25 11:32:06 +03:00
Kamil Braun	e350e37605	service/raft: raft_group0: implement upgrade procedure A listener is created inside `raft_group0` for acting when the SUPPORTS_RAFT feature is enabled. The listener is established after the node enters NORMAL status (in `raft_group0::finish_setup_after_join()`, called at the end of `storage_service::join_cluster()`). The listener starts the `upgrade_to_group0` procedure. The procedure, on a high level, works as follows: - join group 0 - wait until every peer joined group 0 (peers are taken from `system.peers` table) - enter `synchronize` upgrade state, in which group 0 operations are disabled (see earlier commit which implemented this logic) - wait until all members of group 0 entered `synchronize` state or some member entered the final state - synchronize schema by comparing versions and pulling if necessary - enter the final state (`use_new_procedures`), in which group 0 is used for schema operations (only those for now). The devil lies in the details, and the implementation is ugly compared to this nice description; for example there are many retry loops for handling intermittent network failures. Read the code. `leave_group0` and `remove_group0` were adjusted to handle the upgrade procedure being run correctly; if necessary, they will wait for the procedure to finish. If the upgrade procedure gets stuck (and it may, since it requires all nodes to be available to contact them to correctly establish a single group 0 raft cluster); or if a running cluster permanently loses a majority of nodes, causing group 0 unavailability; the cluster admin is not left without help. We introduce a recovery mode, which allows the admin to completely get rid of traces of existing group 0 and restart the upgrade procedure - which will establish a new group 0. This works even in clusters that never upgraded but were bootstrapped using group 0 from scratch. 
To do that, the admin does the following on every node: - writes 'recovery' under 'group0_upgrade_state' key in `system.scylla_local` table, - truncates the `system.discovery` table, - truncates the `system.group0_history` table, - deletes group 0 ID and group 0 server ID from `system.scylla_local` (the keys are `raft_group0_id` and `raft_server_id`); then the admin performs a rolling restart of their cluster. The nodes restart in a "group 0 recovery mode", which simply means that the nodes won't try to perform any group 0 operations. Then the admin calls `removenode` to remove the nodes that are down. Finally, the admin removes the `group0_upgrade_state` key from `system.scylla_local`, rolling-restarts the cluster, and the cluster should establish group 0 anew. Note that this recovery procedure will have to be extended when new stuff is added to group 0 - like topology change state. Indeed, observe that a minority of nodes aren't able to receive committed entries from a leader, so they may end up in inconsistent group 0 states. It wouldn't be safe to simply create group 0 on those nodes without first ensuring that they have the same state from which group 0 will start. Right now the state only consists of schema tables, and the upgrade procedure makes sure to synchronize them, so even if the nodes started in inconsistent schema states, group 0 will correctly be established. (TODO: create a tracking issue? something needs to remind us of this whenever we extend group 0 with new stuff...)	2022-08-23 13:51:01 +02:00
Botond Dénes	331033adae	Merge 'Fix frozen mutation consume ordering' from Benny Halevy Currently, frozen_mutation is not consumed in position_in_partition order as all range tombstones are consumed before all rows. This violates the range_tombstone_generator invariants as its lower_bound needs to be monotonically increasing. Fix this by adding mutation_partition_view::accept_ordered and rewriting do_accept_gently to do the same, both making sure to consume the range tombstones and clustering rows in position_in_partition order, similar to the mutation consume_clustering_fragments function. Add a unit test that verifies that. Fixes #11198 Closes #11269 * github.com:scylladb/scylladb: mutation_partition_view: make mutation_partition_view_virtual_visitor stoppable frozen_mutation: consume and consume_gently in-order frozen_mutation: frozen_mutation_consumer_adaptor: rename rt to rtc frozen_mutation: frozen_mutation_consumer_adaptor: return early when flush returns stop_iteration::yes frozen_mutation: frozen_mutation_consumer_adaptor: consume static row unconditionally frozen_mutation: frozen_mutation_consumer_adaptor: flush current_row before rt_gen	2022-08-23 06:37:04 +03:00
Mikołaj Sielużycki	b5380baf8a	frozen_mutation: consume and consume_gently in-order Currently, frozen_mutation is not consumed in position_in_partition order as all range tombstones are consumed before all rows. This violates the range_tombstone_generator invariants as its lower_bound needs to be monotonically increasing. Fix this by adding mutation_partition_view::accept_ordered and rewriting do_accept_gently to do the same, both making sure to consume the range tombstones and clustering rows in position_in_partition order, similar to the mutation consume_clustering_fragments function. Add a unit test that verifies that. Fixes #11198 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-22 20:12:20 +03:00
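The ordering fix described above amounts to merging two already-sorted streams (range tombstones and clustering rows) by position instead of emitting all tombstones first. A minimal sketch of that idea, assuming positions can be represented as plain integers and using a hypothetical `consume_ordered` function (the real code works on `position_in_partition` and C++ consumers):

```python
import heapq


def consume_ordered(range_tombstone_positions, row_positions):
    """Yield (kind, position) pairs in non-decreasing position order.

    Both inputs are assumed to be sorted by position already.
    """
    tombstones = (("range_tombstone", p) for p in range_tombstone_positions)
    rows = (("clustering_row", p) for p in row_positions)
    # heapq.merge is stable: on equal positions the tombstone stream,
    # passed first, is emitted before the row stream.
    return heapq.merge(tombstones, rows, key=lambda item: item[1])
```

With tombstones at positions 1 and 5 and rows at 2, 3 and 7, the consumer sees positions 1, 2, 3, 5, 7, so a lower bound tracked by the consumer only ever increases.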
Piotr Sarna	484004e766	Merge 'Fix mutation commutativity with shadowable tombstone' from Tomasz Grabiec This series fixes lack of mutation associativity which manifests as sporadic failures in row_cache_test.cc::test_concurrent_reads_and_eviction due to differences in mutations applied and read. No known production impact. Refs https://github.com/scylladb/scylladb/issues/11307 Closes #11312 * github.com:scylladb/scylladb: test: mutation_test: Add explicit test for mutation commutativity test: random_mutation_generator: Workaround for non-associativity of mutations with shadowable tombstones db: mutation_partition: Drop unnecessary maybe_shadow() db: mutation_partition: Maintain shadowable tombstone invariant when applying a hard tombstone mutation_partition: row: make row marker shadowing symmetric	2022-08-20 16:46:32 +02:00
Kamil Braun	43687be1f1	service/raft: raft_group0_client: prepare for upgrade procedure Now, whether a 'group 0 operation' (today it means schema change) is performed using the old or new methods, doesn't depend on the local RAFT feature being enabled, but on the state of the upgrade procedure. In this commit the state of the upgrade is always `use_pre_raft_procedures` because the upgrade procedure is not implemented yet. But stay tuned. The upgrade procedure will need certain guarantees: at some point it switches from `use_pre_raft_procedures` to `synchronize` state. During `synchronize` schema changes must be disabled, so the procedure can ensure that schema is in sync across the entire cluster before establishing group 0. Thus, when the switch happens, no schema change can be in progress. To handle all this weirdness we introduce `_upgrade_lock` and `get_group0_upgrade_state` which takes this lock whenever it returns `use_pre_raft_procedures`. Creating a `group0_guard` - which happens at the start of every group 0 operation - will take this lock, and the lock holder shall be stored inside the guard (note: the holder only holds the lock if `use_pre_raft_procedures` was returned, no need to hold it for other cases). Because `group0_guard` is held for the entire duration of a group 0 operation, and because the upgrade procedure will also have to take this lock whenever it wants to change the upgrade state (it's an rwlock), this ensures that no group 0 operation that uses the old ways is happening when we change the state. We also implement `wait_until_group0_upgraded` using a condition variable. It will be used by certain methods during upgrade (later commits; stay tuned). Some additional comments were written.	2022-08-19 19:15:19 +02:00
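The locking rule above can be illustrated with a small single-threaded model. It is a sketch under loud assumptions: `UpgradeStateModel`, `Group0GuardModel`, `start_operation` and `try_set_state` are hypothetical names, the read side of the rwlock is reduced to a counter, and a state change fails immediately instead of waiting as a real lock would.

```python
class Group0GuardModel:
    """Hypothetical stand-in for a guard held during a group 0 operation."""

    def __init__(self, owner, holds_lock):
        self._owner = owner
        self.holds_lock = holds_lock

    def release(self):
        # Only guards created under the old procedures hold the read side.
        if self.holds_lock:
            self._owner._readers -= 1
            self.holds_lock = False


class UpgradeStateModel:
    """Counter-based model of the upgrade state plus its rwlock."""

    def __init__(self):
        self.state = "use_pre_raft_procedures"
        self._readers = 0

    def start_operation(self):
        # Take the read side only when the old procedures are in use.
        holds = self.state == "use_pre_raft_procedures"
        if holds:
            self._readers += 1
        return Group0GuardModel(self, holds)

    def try_set_state(self, new_state):
        # The write side is available only when no old-style operation runs.
        if self._readers > 0:
            return False
        self.state = new_state
        return True
```

In this model a switch to `synchronize` is refused while an old-style operation still holds its guard and succeeds once the guard is released, which is the guarantee the commit message asks for.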
Benny Halevy	7747b8fa33	sstables: define run_identifier as a strong tagged_uuid type Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11321	2022-08-18 19:03:10 +03:00
Tomasz Grabiec	3d9efee3bf	test: random_mutation_generator: Workaround for non-associativity of mutations with shadowable tombstones Given 3 row mutations: m1 = { marker: {row_marker: dead timestamp=-9223372036854775803}, tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775807, deletion_time=0}, {tombstone: none}} } m2 = { marker: {row_marker: timestamp=-9223372036854775805} } m3 = { tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775806, deletion_time=2}, {tombstone: none}} } We get different shadowable tombstones depending on the order of merging: (m1 + m2) + m3 = { marker: {row_marker: dead timestamp=-9223372036854775803}, tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775806, deletion_time=2}, {tombstone: none}} } m1 + (m2 + m3) = { marker: {row_marker: dead timestamp=-9223372036854775803}, tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775807, deletion_time=0}, {tombstone: none}} } The reason is that in the second case the shadowable tombstone in m3 is shadowed by the row marker in m2. In the first case, the marker in m2 is cancelled by the dead marker in m1, so shadowable tombstone in m3 is not cancelled (the marker in m1 does not cancel because it's dead). This wouldn't happen if the dead marker in m1 was accompanied by a hard tombstone of the same timestamp, which would effectively make the difference in shadowable tombstones irrelevant. Found by row_cache_test.cc::test_concurrent_reads_and_eviction. I'm not sure if this situation can be reached in practice (dead marker in mv table but no row tombstone). Work it around for tests by producing a row tombstone if there is a dead marker. Refs #11307	2022-08-17 17:39:54 +02:00
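The m1, m2, m3 example above can be reproduced with a deliberately simplified model of row merging. The rules encoded here are assumptions inferred from the commit message, not the actual `mutation_partition` code: the marker with the higher timestamp wins (a dead marker wins ties), the larger shadowable tombstone wins, and a live marker with a newer timestamp drops the shadowable tombstone. `Marker`, `Row` and `merge` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Marker:
    timestamp: int
    dead: bool = False


@dataclass(frozen=True)
class Row:
    marker: Optional[Marker] = None
    # Shadowable tombstone as (timestamp, deletion_time), or None.
    shadowable: Optional[Tuple[int, int]] = None


def merge(a, b):
    """Merge two rows under the simplified rules stated above."""
    markers = [m for m in (a.marker, b.marker) if m is not None]
    marker = max(markers, key=lambda m: (m.timestamp, m.dead)) if markers else None
    tombs = [t for t in (a.shadowable, b.shadowable) if t is not None]
    shadowable = max(tombs) if tombs else None
    # A live marker newer than the shadowable tombstone shadows it.
    if (shadowable is not None and marker is not None
            and not marker.dead and marker.timestamp > shadowable[0]):
        shadowable = None
    return Row(marker, shadowable)


m1 = Row(Marker(-9223372036854775803, dead=True), (-9223372036854775807, 0))
m2 = Row(Marker(-9223372036854775805))
m3 = Row(shadowable=(-9223372036854775806, 2))

left = merge(merge(m1, m2), m3)   # (m1 + m2) + m3
right = merge(m1, merge(m2, m3))  # m1 + (m2 + m3)
```

Under these rules `left` keeps the shadowable tombstone from m3 while `right` keeps the one from m1, matching the two results quoted in the message; both end up with the dead marker from m1.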
Botond Dénes	c8ef356859	test/lib: move convenience table config factory to sstable_test_env All users of `column_family_test_config()` get the semaphore parameter for it from `sstable_test_env`. It is clear that the latter serves as the storage space for stable objects required by the table config. This patch just enshrines this fact by moving the config factory method to `sstable_test_env`, so it can just get what it needs from members.	2022-08-15 11:23:59 +03:00
Botond Dénes	c0e017e0f7	test/lib/sstable_test_env: move members to impl struct All present members of sstable_test_env are std::unique_ptr<>:s because they require stable addresses. This makes their handling somewhat awkward. Move all of them into an internal `struct impl` and make that member a unique ptr.	2022-08-15 11:20:09 +03:00
Botond Dénes	a9f296ed47	test/lib/sstable_utils: use test_env::do_with_async() Instead of manually instantiating test_env.	2022-08-15 11:19:27 +03:00

658 Commits