scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-21 09:00:35 +00:00

Author	SHA1	Message	Date
Kefu Chai	a5e696fab8	storage_service, test: drop unused storage_service_config this setting was removed back in `dcdd207349`, so despite that we are still passing `storage_service_config` to the ctor of `storage_service`, `storage_service::storage_service()` just drops it on the floor. in this change, `storage_service_config` class is removed, and all places referencing it are updated accordingly. Signed-off-by: Kefu Chai <tchaikov@gmail.com> Closes #11415	2022-08-31 19:49:13 +03:00
Tomasz Grabiec	1d0264e1a9	Merge 'Implement Raft upgrade procedure' from Kamil Braun Start with a cluster with Raft disabled, end up with a cluster that performs schema operations using group 0. Design doc: https://docs.google.com/document/d/1PvZ4NzK3S0ohMhyVNZZ-kCxjkK5URmz1VP65rrkTOCQ/ (TODO: replace this with .md file - we can do it as a follow-up) The procedure, on a high level, works as follows: - join group 0 - wait until every peer joined group 0 (peers are taken from `system.peers` table) - enter `synchronize` upgrade state, in which group 0 operations are disabled - wait until all members of group 0 entered `synchronize` state or some member entered the final state - synchronize schema by comparing versions and pulling if necessary - enter the final state (`use_new_procedures`), in which group 0 is used for schema operations. With the procedure comes a recovery mode in case the upgrade procedure gets stuck (and it may if we lose a node during recovery - the procedure, to correctly establish a single group 0 cluster, requires contacting every node). This recovery mode can also be used to recover clusters with group 0 already established if they permanently lose a majority of nodes - killing two birds with one stone. Details in the last commit message. Read the design doc, then read the commits in topological order for best reviewing experience. --- I did some manual tests: upgrading a cluster, using the cluster to add nodes, remove nodes (both with `decommission` and `removenode`), replacing nodes. Performing recovery. As a follow-up, we'll need to implement tests using the new framework (after it's ready). It will be easy to test upgrades and recovery even with a single Scylla version - we start with a cluster with the RAFT flag disabled, then rolling-restart while enabling the flag (and recovery is done through simple CQL statements). 
Closes #10835 * github.com:scylladb/scylladb: service/raft: raft_group0: implement upgrade procedure service/raft: raft_group0: extract `tracker` from `persistent_discovery::run` service/raft: raft_group0: introduce local loggers for group 0 and upgrade service/raft: raft_group0: introduce GET_GROUP0_UPGRADE_STATE verb service/raft: raft_group0_client: prepare for upgrade procedure service/raft: introduce `group0_upgrade_state` db: system_keyspace: introduce `load_peers` idl-compiler: introduce cancellable verbs message: messaging_service: cancellable version of `send_schema_check`	2022-08-25 11:32:06 +03:00
Kamil Braun	e350e37605	service/raft: raft_group0: implement upgrade procedure A listener is created inside `raft_group0` for acting when the SUPPORTS_RAFT feature is enabled. The listener is established after the node enters NORMAL status (in `raft_group0::finish_setup_after_join()`, called at the end of `storage_service::join_cluster()`). The listener starts the `upgrade_to_group0` procedure. The procedure, on a high level, works as follows: - join group 0 - wait until every peer joined group 0 (peers are taken from `system.peers` table) - enter `synchronize` upgrade state, in which group 0 operations are disabled (see earlier commit which implemented this logic) - wait until all members of group 0 entered `synchronize` state or some member entered the final state - synchronize schema by comparing versions and pulling if necessary - enter the final state (`use_new_procedures`), in which group 0 is used for schema operations (only those for now). The devil lies in the details, and the implementation is ugly compared to this nice description; for example there are many retry loops for handling intermittent network failures. Read the code. `leave_group0` and `remove_group0` were adjusted to handle the upgrade procedure being run correctly; if necessary, they will wait for the procedure to finish. If the upgrade procedure gets stuck (and it may, since it requires all nodes to be available to contact them to correctly establish a single group 0 raft cluster); or if a running cluster permanently loses a majority of nodes, causing group 0 unavailability; the cluster admin is not left without help. We introduce a recovery mode, which allows the admin to completely get rid of traces of existing group 0 and restart the upgrade procedure - which will establish a new group 0. This works even in clusters that never upgraded but were bootstrapped using group 0 from scratch. 
To do that, the admin does the following on every node: - writes 'recovery' under the 'group0_upgrade_state' key in the `system.scylla_local` table, - truncates the `system.discovery` table, - truncates the `system.group0_history` table, - deletes group 0 ID and group 0 server ID from `system.scylla_local` (the keys are `raft_group0_id` and `raft_server_id`); then the admin performs a rolling restart of their cluster. The nodes restart in a "group 0 recovery mode", which simply means that the nodes won't try to perform any group 0 operations. Then the admin calls `removenode` to remove the nodes that are down. Finally, the admin removes the `group0_upgrade_state` key from `system.scylla_local`, rolling-restarts the cluster, and the cluster should establish group 0 anew. Note that this recovery procedure will have to be extended when new stuff is added to group 0 - like topology change state. Indeed, observe that a minority of nodes aren't able to receive committed entries from a leader, so they may end up in inconsistent group 0 states. It wouldn't be safe to simply create group 0 on those nodes without first ensuring that they have the same state from which group 0 will start. Right now the state only consists of schema tables, and the upgrade procedure makes sure to synchronize them, so even if the nodes started in inconsistent schema states, group 0 will correctly be established. (TODO: create a tracking issue? something needs to remind us of this whenever we extend group 0 with new stuff...)	2022-08-23 13:51:01 +02:00
Botond Dénes	331033adae	Merge 'Fix frozen mutation consume ordering' from Benny Halevy Currently, frozen_mutation is not consumed in position_in_partition order as all range tombstones are consumed before all rows. This violates the range_tombstone_generator invariants as its lower_bound needs to be monotonically increasing. Fix this by adding mutation_partition_view::accept_ordered and rewriting do_accept_gently to do the same, both making sure to consume the range tombstones and clustering rows in position_in_partition order, similar to the mutation consume_clustering_fragments function. Add a unit test that verifies that. Fixes #11198 Closes #11269 * github.com:scylladb/scylladb: mutation_partition_view: make mutation_partition_view_virtual_visitor stoppable frozen_mutation: consume and consume_gently in-order frozen_mutation: frozen_mutation_consumer_adaptor: rename rt to rtc frozen_mutation: frozen_mutation_consumer_adaptor: return early when flush returns stop_iteration::yes frozen_mutation: frozen_mutation_consumer_adaptor: consume static row unconditionally frozen_mutation: frozen_mutation_consumer_adaptor: flush current_row before rt_gen	2022-08-23 06:37:04 +03:00
Mikołaj Sielużycki	b5380baf8a	frozen_mutation: consume and consume_gently in-order Currently, frozen_mutation is not consumed in position_in_partition order as all range tombstones are consumed before all rows. This violates the range_tombstone_generator invariants as its lower_bound needs to be monotonically increasing. Fix this by adding mutation_partition_view::accept_ordered and rewriting do_accept_gently to do the same, both making sure to consume the range tombstones and clustering rows in position_in_partition order, similar to the mutation consume_clustering_fragments function. Add a unit test that verifies that. Fixes #11198 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-22 20:12:20 +03:00
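The ordering fix described above can be pictured as a two-way merge. The following is a minimal sketch, not the code from the commit: positions are plain integers standing in for `position_in_partition`, and `fragment` and `consume_ordered` are hypothetical names. Emitting a tombstone before a row at the same position is also an assumption made for the sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified stand-in: the real code orders fragments by
// position_in_partition, not by an integer.
struct fragment {
    int position;
    std::string kind; // "rt" for a range tombstone, "row" for a clustering row
};

// Merge two position-sorted inputs into one position-sorted stream, instead
// of emitting every range tombstone before every row.
std::vector<fragment> consume_ordered(const std::vector<int>& tombstone_starts,
                                      const std::vector<int>& row_positions) {
    std::vector<fragment> out;
    std::size_t t = 0;
    std::size_t r = 0;
    while (t < tombstone_starts.size() || r < row_positions.size()) {
        // Take the tombstone when rows are exhausted or when it does not
        // start after the next row (tie-break is an assumption).
        const bool take_tombstone = r == row_positions.size()
            || (t < tombstone_starts.size() && tombstone_starts[t] <= row_positions[r]);
        if (take_tombstone) {
            out.push_back({tombstone_starts[t++], "rt"});
        } else {
            out.push_back({row_positions[r++], "row"});
        }
        // The invariant the commit message refers to: positions never go backwards.
        assert(out.size() < 2 || out[out.size() - 2].position <= out.back().position);
    }
    return out;
}
```

With tombstones starting at 1 and 5 and rows at 2, 3 and 7, the consumer sees positions 1, 2, 3, 5, 7 in that order, so a lower bound tracked by the consumer only ever increases.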
Piotr Sarna	484004e766	Merge 'Fix mutation commutativity with shadowable tombstone' from Tomasz Grabiec This series fixes lack of mutation associativity which manifests as sporadic failures in row_cache_test.cc::test_concurrent_reads_and_eviction due to differences in mutations applied and read. No known production impact. Refs https://github.com/scylladb/scylladb/issues/11307 Closes #11312 * github.com:scylladb/scylladb: test: mutation_test: Add explicit test for mutation commutativity test: random_mutation_generator: Workaround for non-associativity of mutations with shadowable tombstones db: mutation_partition: Drop unnecessary maybe_shadow() db: mutation_partition: Maintain shadowable tombstone invariant when applying a hard tombstone mutation_partition: row: make row marker shadowing symmetric	2022-08-20 16:46:32 +02:00
Kamil Braun	43687be1f1	service/raft: raft_group0_client: prepare for upgrade procedure Now, whether a 'group 0 operation' (today it means schema change) is performed using the old or new methods, doesn't depend on the local RAFT feature being enabled, but on the state of the upgrade procedure. In this commit the state of the upgrade is always `use_pre_raft_procedures` because the upgrade procedure is not implemented yet. But stay tuned. The upgrade procedure will need certain guarantees: at some point it switches from `use_pre_raft_procedures` to `synchronize` state. During `synchronize` schema changes must be disabled, so the procedure can ensure that schema is in sync across the entire cluster before establishing group 0. Thus, when the switch happens, no schema change can be in progress. To handle all this weirdness we introduce `_upgrade_lock` and `get_group0_upgrade_state` which takes this lock whenever it returns `use_pre_raft_procedures`. Creating a `group0_guard` - which happens at the start of every group 0 operation - will take this lock, and the lock holder shall be stored inside the guard (note: the holder only holds the lock if `use_pre_raft_procedures` was returned, no need to hold it for other cases). Because `group0_guard` is held for the entire duration of a group 0 operation, and because the upgrade procedure will also have to take this lock whenever it wants to change the upgrade state (it's an rwlock), this ensures that no group 0 operation that uses the old ways is happening when we change the state. We also implement `wait_until_group0_upgraded` using a condition variable. It will be used by certain methods during upgrade (later commits; stay tuned). Some additional comments were written.	2022-08-19 19:15:19 +02:00
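The locking rule in this commit message can be illustrated with a small single-threaded model. This is a sketch under stated assumptions, not the Scylla implementation: `upgrade_tracker`, `read_holder` and `try_set_state` are hypothetical names, the "lock" is just a reader count, and the writer fails instead of waiting as a real rwlock would.

```cpp
#include <cassert>
#include <utility>

enum class upgrade_state { use_pre_raft_procedures, synchronize, use_new_procedures };

class upgrade_tracker;

// Hypothetical move-only token: while it holds the lock it counts as one reader.
class read_holder {
    upgrade_tracker* _owner = nullptr;
public:
    read_holder() = default;
    explicit read_holder(upgrade_tracker& owner);
    read_holder(read_holder&& o) noexcept : _owner(std::exchange(o._owner, nullptr)) {}
    read_holder(const read_holder&) = delete;
    read_holder& operator=(read_holder&&) = delete;
    ~read_holder();
    bool holds_lock() const { return _owner != nullptr; }
};

class upgrade_tracker {
    friend class read_holder;
    int _readers = 0; // stands in for the read side of `_upgrade_lock`
    upgrade_state _state = upgrade_state::use_pre_raft_procedures;
public:
    // The holder owns the read side only when the old procedures are in use;
    // in every other state an empty holder is returned.
    std::pair<read_holder, upgrade_state> get_group0_upgrade_state() {
        if (_state == upgrade_state::use_pre_raft_procedures) {
            return {read_holder(*this), _state};
        }
        return {read_holder(), _state};
    }

    // Write side: the state may change only while no old-style operation is
    // in flight. A real rwlock would wait here; this model just reports failure.
    bool try_set_state(upgrade_state s) {
        if (_readers > 0) {
            return false;
        }
        _state = s;
        return true;
    }
};

inline read_holder::read_holder(upgrade_tracker& owner) : _owner(&owner) {
    ++owner._readers;
}

inline read_holder::~read_holder() {
    if (_owner) {
        assert(_owner->_readers > 0);
        --_owner->_readers;
    }
}
```

While a holder returned in the `use_pre_raft_procedures` state is alive, `try_set_state` returns false; once it is destroyed, the switch to `synchronize` succeeds, and later callers receive an empty holder.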
Benny Halevy	7747b8fa33	sstables: define run_identifier as a strong tagged_uuid type Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11321	2022-08-18 19:03:10 +03:00
Tomasz Grabiec	3d9efee3bf	test: random_mutation_generator: Workaround for non-associativity of mutations with shadowable tombstones Given 3 row mutations: m1 = { marker: {row_marker: dead timestamp=-9223372036854775803}, tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775807, deletion_time=0}, {tombstone: none}} } m2 = { marker: {row_marker: timestamp=-9223372036854775805} } m3 = { tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775806, deletion_time=2}, {tombstone: none}} } We get different shadowable tombstones depending on the order of merging: (m1 + m2) + m3 = { marker: {row_marker: dead timestamp=-9223372036854775803}, tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775806, deletion_time=2}, {tombstone: none}} } m1 + (m2 + m3) = { marker: {row_marker: dead timestamp=-9223372036854775803}, tombstone: {row_tombstone: {shadowable tombstone: timestamp=-9223372036854775807, deletion_time=0}, {tombstone: none}} } The reason is that in the second case the shadowable tombstone in m3 is shadowed by the row marker in m2. In the first case, the marker in m2 is cancelled by the dead marker in m1, so the shadowable tombstone in m3 is not cancelled (the marker in m1 does not cancel because it's dead). This wouldn't happen if the dead marker in m1 was accompanied by a hard tombstone of the same timestamp, which would effectively make the difference in shadowable tombstones irrelevant. Found by row_cache_test.cc::test_concurrent_reads_and_eviction. I'm not sure if this situation can be reached in practice (dead marker in mv table but no row tombstone). Work around it for tests by producing a row tombstone if there is a dead marker. Refs #11307	2022-08-17 17:39:54 +02:00
Botond Dénes	c8ef356859	test/lib: move convenience table config factory to sstable_test_env All users of `column_family_test_config()` get the semaphore parameter for it from `sstable_test_env`. It is clear that the latter serves as the storage space for stable objects required by the table config. This patch just enshrines this fact by moving the config factory method to `sstable_test_env`, so it can just get what it needs from members.	2022-08-15 11:23:59 +03:00
Botond Dénes	c0e017e0f7	test/lib/sstable_test_env: move members to impl struct All present members of sstable_test_env are std::unique_ptr<>:s because they require stable addresses. This makes their handling somewhat awkward. Move all of them into an internal `struct impl` and make that member a unique ptr.	2022-08-15 11:20:09 +03:00
Botond Dénes	a9f296ed47	test/lib/sstable_utils: use test_env::do_with_async() Instead of manually instantiating test_env.	2022-08-15 11:19:27 +03:00
Benny Halevy	d295d8e280	everywhere: define locator::host_id as a strong tagged_uuid type So it can be distinguished from other uuid-based identifiers in the system. Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes #11276	2022-08-12 06:01:44 +03:00
Tomasz Grabiec	8ee5b69f80	test: row_cache: Use more narrow key range to stress overlapping reads more This makes catching issues related to concurrent access of same or adjacent entries more likely. For example, catches #11239. Closes #11260	2022-08-10 06:53:54 +03:00
Benny Halevy	257d74bb34	schema, everywhere: define and use table_id as a strong type Define table_id as a distinct utils::tagged_uuid modeled after raft tagged_id, so it can be differentiated from other uuid-class types, in particular from table_schema_version. Fixes #11207 Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:09:41 +03:00
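The strong-type idea behind `table_id` can be sketched with a tag-parameterised wrapper. This is a minimal illustration, not the definition in `utils::tagged_uuid`: the `uuid` struct is a simplified stand-in, and `table_schema_version` is given a tagged type here only to show that two tags produce unrelated types.

```cpp
#include <cassert>
#include <cstdint>
#include <type_traits>

// Simplified stand-in for a UUID: two 64-bit halves.
struct uuid {
    std::uint64_t msb = 0;
    std::uint64_t lsb = 0;
    bool operator==(const uuid& o) const { return msb == o.msb && lsb == o.lsb; }
};

// Each distinct Tag yields a distinct, mutually unconvertible type that
// wraps the same underlying representation.
template <typename Tag>
class tagged_uuid {
    uuid _id;
public:
    tagged_uuid() = default;
    explicit tagged_uuid(uuid id) : _id(id) {}
    const uuid& get() const { return _id; }
    bool operator==(const tagged_uuid& o) const { return _id == o._id; }
};

// Hypothetical aliases for the sketch; the tag structs are never defined.
using table_id = tagged_uuid<struct table_id_tag>;
using table_schema_version = tagged_uuid<struct table_schema_version_tag>;

static_assert(!std::is_convertible<table_id, table_schema_version>::value,
              "a table id cannot be passed where a schema version is expected");
static_assert(!std::is_convertible<uuid, table_id>::value,
              "a raw uuid must be wrapped explicitly");
```

Passing a `table_id` to a function that takes a `table_schema_version` then fails at compile time, which is the kind of mix-up the commit aims to rule out.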
Benny Halevy	813cffc2b5	counters: counter_id: use base class create_random_id Rather than defining generate_random, and use respectively in unit tests. (It was inherited from raft::internal::tagged_id.) This allows us to shorten counter_id's definition to just using utils::tagged_uuid<struct counter_id_tag>. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-08 08:02:27 +03:00
Benny Halevy	e4e92d44ae	main: start compaction_manager as a sharded service And pass a reference to it to the database rather than having the database construct its own compaction_manager. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-08-02 07:50:15 +03:00
Aleksandra Martyniuk	7d457cffb8	scrub compaction: count validation errors for specific scrub task The number of validation errors per given compaction scrub on given shard is passed up to perform_task() function.	2022-07-29 09:35:20 +02:00
Avi Kivity	e66809d051	Merge 'Memtable flush: wait for sstable count reduction if needed' from Benny Halevy Called from try_flush_memtable_to_sstable, maybe_wait_for_sstable_count_reduction will wait for compaction to catch up with memtable flush if the bucket to compact is inflated, i.e. has too many sstables. In that case we don't want to add fuel to the fire by creating yet another sstable. Fixes #4116 Closes #10954 * github.com:scylladb/scylla: table: Add test where compaction doesn't keep up with flush rate. compaction_manager: add maybe_wait_for_sstable_count_reduction time_window_compaction_strategy: get_sstables_for_compaction: clean up code time_window_compaction_strategy: make get_sstables_for_compaction idempotent time_window_compaction_strategy: get_sstables_for_compaction: improve debug messages leveled_manifest: pass compaction_counter as const&	2022-07-28 19:11:04 +03:00
Mikołaj Sielużycki	e0c6e1ef3c	table: Add test where compaction doesn't keep up with flush rate. The test simulates a situation where 2 threads issue flushes to 2 tables. Both issue small flushes, but one has injected reactor stalls. This can lead to a situation where lots of small sstables accumulate on disk, and, if compaction never has a chance to keep up, resources can be exhausted. (cherry picked from commit `b5684aa96d`) (cherry picked from commit `25407a7e41`)	2022-07-28 14:43:33 +03:00
Mikołaj Sielużycki	9c43f1266a	test: Move validating_consumer to test/lib/mutation_assertions.hh	2022-07-27 11:19:50 +02:00
Pavel Emelyanov	a246b6d3eb	streaming: Pass db::config& to manager constructor The stream_manager will keep track of the streaming bandwidth option; to subscribe to its changes it needs the config reference. It would be better if it was stream_manager::config, but currently subscription to db::config::<stuff> updates is not very shard-friendly, so we need to carry the config reference itself around. Similar trouble is there for compaction_manager. The option is passed through its own config, but the config is created on each shard by database code. Stream manager config would be created once by main code on shard 0. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2022-07-19 12:18:08 +03:00
Raphael S. Carvalho	a176022272	compaction_manager: task: switch to table_state Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	b5417096e2	compaction_manager: make propagate_replacement() switch to table_state propagate_replacement is used by incremental compaction to notify ongoing compaction about sstable list updates, such that the ongoing job won't hold reference to exhausted sstables. So it needs to switch to table_state, too. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-16 21:35:06 -03:00
Raphael S. Carvalho	f52ad722f3	compaction_manager: rename table_state's get_sstable_set to main_sstable_set With compaction_manager switching to table_state, we'll need to introduce a method in table_state to return maintenance set. So better to have a descriptive name for main set. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2022-07-13 11:12:33 -03:00
Nadav Har'El	cc69177dcc	config: fix printing of experimental feature list Recently we noticed a regression where with certain versions of the fmt library, SELECT value FROM system.config WHERE name = 'experimental_features' returns string numbers, like "5", instead of feature names like "raft". It turns out that the fmt library keeps changing its overload resolution order when there are several ways to print something. For enum_option<T> we happen to have two conflicting ways to print it: 1. We have an explicit operator<<. 2. We have an implicit converter to the type held by T. We were hoping that the operator<< always wins. But in fmt 8.1, there is special logic that if the type is convertible to an int, this is used before operator<<()! For experimental_features_t, the type held in it was an old-style enum, so it is indeed convertible to int. The solution I used in this patch is to replace the old-style enum in experimental_features_t by the newer and more recommended "enum class", which does not have an implicit conversion to int. I could have fixed it in other ways, but it wouldn't have been much prettier. For example, dropping the implicit converter would require us to change a bunch of switch() statements over enum_option (and not just experimental_features_t, but other types of enum_option). Going forward, all uses of enum_option should use "enum class", not "enum". tri_mode_restriction_t was already using an enum class, and now so does experimental_features_t. I changed the examples in the comments to also use "enum class" instead of enum. This patch also adds to the existing experimental_features test a check that the feature names are words that are not numbers. Fixes #11003. Signed-off-by: Nadav Har'El <nyh@scylladb.com> Closes #11004	2022-07-11 09:17:30 +02:00
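The language rule this fix relies on can be checked in isolation. The sketch below does not use fmt or `enum_option`; `old_style_feature`, `feature`, `print_as_int` and `print_name` are hypothetical names, and `print_as_int` merely stands in for a formatter that prefers an int conversion when one is available.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical feature lists; names and values are illustrative only.
enum old_style_feature { OLD_UNUSED, OLD_RAFT };
enum class feature { unused, raft };

static_assert(std::is_convertible<old_style_feature, int>::value,
              "an old-style enum converts to int implicitly");
static_assert(!std::is_convertible<feature, int>::value,
              "an enum class has no implicit conversion to int");

// Stand-in for a printer that takes the int conversion when it exists.
std::string print_as_int(int value) {
    return std::to_string(value);
}

// Explicit name mapping, which is the only way to print the enum class.
std::string print_name(feature f) {
    switch (f) {
    case feature::unused: return "unused";
    case feature::raft: return "raft";
    }
    return "unknown";
}
```

Here `print_as_int(OLD_RAFT)` compiles and yields a number, whereas `print_as_int(feature::raft)` is rejected by the compiler, so the scoped enum can only be printed through the explicit name mapping.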
Tomasz Grabiec	c5ad05c819	db: Allow splitting initialization of system tables We will need some system tables to be initialized earlier in the boot so that system.scylla_local can be read before schema tables are initialized.	2022-07-06 22:08:56 +02:00
Avi Kivity	419fe65259	Revert "Merge 'Block flush until compaction finishes if sstables accumulate' from Mikołaj Sielużycki" This reverts commit `aa8f135f64`, reversing changes made to `9a88bc260c`. The patch causes hangs during flush. Also reverts parts of `411231da75` that impacted the unit test. Fixes #10897.	2022-07-06 12:19:02 +03:00
Tomasz Grabiec	8f3349b407	test: lib: flat_mutation_reader_assertion: Add trace-level logging of read fragments Message-Id: <20220629153926.137824-1-tgrabiec@scylladb.com>	2022-06-30 08:43:30 +03:00
Pavel Emelyanov	85033ea6ae	Merge 'A bunch of refactors related to Raft group 0' from Kamil Braun The commits here were extracted from PR https://github.com/scylladb/scylla/pull/10835 which implements upgrade procedure for Raft group 0. They are mostly refactors which don't affect the behavior of the system, except one: the commit `4d439a16b3` causes all schema changes to be bounced to shard 0. Previously, they would only be bounced when the local Raft feature was enabled. I do that because: 1. eventually, we want this to be the default behavior 2. in the upgrade PR I remove the `is_raft_enabled()` function - the function was basically created with the mindset "Raft is either enabled or not" - which was right when we didn't support upgrade, but will be incorrect when we introduce intermediate states (when we upgrade from non-raft-based to raft-based operations); the upgrade PR introduces another mechanism to dispatch based on the upgrade state, but for the case of bouncing to shard 0, dispatching is simply not necessary. Closes #10864 * github.com:scylladb/scylla: service/raft: raft_group_registry: add assertions when fetching servers for groups service/raft: raft_group_registry: remove `_raft_support_listener` service/raft: raft_group0: log adding/removing servers to/from group 0 RPC map service/raft: raft_group0: move group 0 RPC handlers from `storage_service` service/raft: messaging: extract raft_addr/inet_addr conversion functions service: storage_service: initialize `raft_group0` in `main` and pass a reference to `join_cluster` treewide: remove unnecessary `migration_manager::is_raft_enabled()` calls test/boost: memtable_test: perform schema operations on shard 0 test/boost: cdc_test: remove test_cdc_across_shards message: rename `send_message_abortable` to `send_message_cancellable` message: change parameter order in `send_message_oneway_timeout`	2022-06-29 16:51:54 +03:00
Pavel Emelyanov	3a753068be	Merge "Make permissions cache live updateable and add an API for resetting authorization cache" from Igor Ribeiro Barbosa Duarte Currently, for users who have permissions_cache configs set to very high values (and thus can't wait for the configured times to pass) having to restart the service every time they make a change related to permissions or prepared_statements cache (e.g. adding a user and changing their permissions) can become pretty annoying. This patch series makes permissions_validity_in_ms, permissions_update_interval_in_ms and permissions_cache_max_entries live updateable so that restarting the service is not necessary anymore for these cases. It also adds an API for flushing the cache to make it easier for users who don't want to modify their permissions_cache config. branch: https://github.com/igorribeiroduarte/scylla/tree/make_permissions_cache_live_updateable CI: https://jenkins.scylladb.com/job/releng/job/Scylla-CI/1005/ dtests: https://github.com/igorribeiroduarte/scylla-dtest/tree/test_permissions_cache * https://github.com/igorribeiroduarte/scylla/make_permissions_cache_live_updateable: loading_cache_test: Test loading_cache::reset and loading_cache::update_config api: Add API for resetting authorization cache authorization_cache: Make permissions cache and authorized prepared statements cache live updateable auth_prep_statements_cache: Make aut_prep_statements_cache accept a config struct utils/loading_cache.hh: Add update_config method utils/loading_cache.hh: Rename permissions_cache_config to loading_cache_config and move it to loading_cache.hh utils/loading_cache.hh: Add reset method	2022-06-29 11:14:13 +03:00
Igor Ribeiro Barbosa Duarte	c8c48a98fa	auth_prep_statements_cache: Make aut_prep_statements_cache accept a config struct This patch makes authorized_prepared_statements_cache accept a config struct, similarly to permissions_cache. This will make it easier to make this cache live updateable in the next patch. Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>	2022-06-28 19:57:52 -03:00
Igor Ribeiro Barbosa Duarte	667840a7eb	utils/loading_cache.hh: Rename permissions_cache_config to loading_cache_config and move it to loading_cache.hh This patch renames the permissions_cache_config struct to loading_cache_config and moves it to utils/loading_cache.hh. This will make it easier to handle config updates to the authorization caches on the next patches Signed-off-by: Igor Ribeiro Barbosa Duarte <igor.duarte@scylladb.com>	2022-06-28 19:46:22 -03:00
Botond Dénes	6c818f8625	Merge 'sstables: generation_type tidy-up' from Michael Livshin - Use `sstables::generation_type` in more places - Enforce conceptual separation of `sstables::generation_type` and `int64_t` - Fix `extremum_tracker` so that `sstables::generation_type` can be non-default-constructible Fixes #10796. Closes #10844 * github.com:scylladb/scylla: sstables: make generation_type an actual separate type sstables: use generation_type more soundly extremum_tracker: do not require default-constructible value types	2022-06-28 08:50:12 +03:00
Botond Dénes	1f4f8ba773	Merge 'compaction_manager: track if off-strategy compaction was performed in run_offstrategy_compaction' from Benny Halevy This series moves the logic to not perform off-strategy compaction if the maintenance set is empty from the table layer down to the compaction_manager layer since it is the one that needs to make the decision. With that, compaction_manager::perform_offstrategy will return a future<bool> which resolves to true iff off-strategy compaction was required and performed. The sstable_compaction_test was adjusted and a new compaction_manager_for_testing class was added to make sure the compaction manager is enabled when constructed (it wasn't, so test_offstrategy_sstable_compaction didn't perform any off-strategy compactions!) and stopped before being destroyed. Closes #10848 * github.com:scylladb/scylla: table: perform_offstrategy_compaction: move off-strategy logic to compaction_manager compaction_manager: offstrategy_compaction_task: refactor log printouts test: sstable_compaction: compaction_manager_for_testing	2022-06-24 08:04:02 +03:00
Kamil Braun	bb58ee0b2e	service/raft: raft_group_registry: remove `_raft_support_listener` It did nothing. It will be readded in `raft_group0` and it will do something, stay tuned. With this we can remove the `feature_service` reference from `raft_group_registry`.	2022-06-23 16:14:41 +02:00
Kamil Braun	5da163e0b8	service: storage_service: initialize `raft_group0` in `main` and pass a reference to `join_cluster` `raft_group0` was constructed at the beginning of `join_cluster`, which required passing references to 3 additional services to `join_cluster` used only for that purpose (group 0 client, raft group registry, and query processor). Now we initialize `raft_group0` in main - like all other services - and pass a reference to `join_cluster` so `storage_service` can store a pointer to group 0. We initialize `raft_group0` before we start listening for RPCs in `messaging_service`. In a later commit we'll move the initialization of group 0 related verbs to the constructor of `raft_group0` from `storage_service`, so they will be initialized before we start listening for RPCs.	2022-06-23 16:14:41 +02:00
Benny Halevy	34e9391587	test: sstable_compaction: compaction_manager_for_testing Make the compaction manager for testing using this class. Makes sure to enable the compaction manager and to stop it before it's destroyed. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2022-06-23 08:02:44 +03:00
Piotr Dulikowski	761a037afb	config: add add_per_partition_rate_limit_extension function for testing ...and use it in cql_test_env to enable the per_partition_rate_limit extension for all tests that use it.	2022-06-22 20:16:49 +02:00
Michael Livshin	ab13127761	sstables: use generation_type more soundly `generation_type` is (supposed to be) conceptually different from `int64_t` (even if physically they are the same), but at present Scylla code still largely treats them interchangeably. In addition to using `generation_type` in more places, we provide (no-op) `generation_value()` and `generation_from_value()` operations to make the smoke-and-mirrors more believable. The churn is considerable, but all mechanical. To avoid even more (way, way more) churn, unit test code is left untreated for now, except where it uses the affected core APIs directly. Signed-off-by: Michael Livshin <michael.livshin@scylladb.com>	2022-06-20 19:37:31 +03:00
Botond Dénes	4bd4aa2e88	Merge 'memtable, cache: Eagerly compact data with tombstones' from Tomasz Grabiec When memtable receives a tombstone it can happen under some workloads that it covers data which is still in the memtable. Some workloads may insert and delete data within a short time frame. We could reduce the rate of memtable flushes if we eagerly drop tombstoned data. One workload which benefits is the raft log. It stores a row for each uncommitted raft entry. When entries are committed they are deleted. So the live set is expected to be short under normal conditions. Fixes #652. Closes #10807 * github.com:scylladb/scylla: memtable: Add counters for tombstone compaction memtable, cache: Eagerly compact data with tombstones memtable: Subtract from flushed memory when cleaning mvcc: Introduce apply_resume to hold state for partition version merging test: mutation: Compare against compacted mutations compacting_reader: Drop irrelevant tombstones mutation_partition: Extract deletable_row::compact_and_expire() mvcc: Apply mutations in memtable with preemption enabled test: memtable: Make failed_flush_prevents_writes() immune to background merging	2022-06-15 18:12:42 +03:00
Avi Kivity	aa8f135f64	Merge 'Block flush until compaction finishes if sstables accumulate' from Mikołaj Sielużycki If we reach a situation where flush rate exceeds compaction rate, we may end up with arbitrarily large number of sstables on disk. If a read is executed in such case, the amount of memory required is proportional to the number of sstables for the given shard, which in extreme cases can lead to OOM. In the wild, this was observed in 2 scenarios: - A node with >10 shards creates a keyspace with thousands of tables, drops the keyspace and shuts down before compaction finishes. Dropping keyspace drops tables, and each dropped table is smp::count writes to system.local table with flush after write, which creates tens of thousands of sstables. Bootstrap read from system.local will run OOM. - A failure to agree on table schema (due to a code bug) between nodes during repair resulted in excessive flushing of small sstables which compaction couldn't keep up with. In the unit test introduced in this patch series it can be proved that even hard setting maximum shares for compaction and minimum shares for flushing doesn't tilt the balance towards compaction enough to prevent the problem. Since it's a fast producer, slow consumer problem, the remaining solution is to block producer until the consumer catches up. If there are too many table runs originating from memtable, we block the current flush until the number of sstables is reduced (via ongoing compaction or a truncate operation). Fixes https://github.com/scylladb/scylla/issues/4116 Changelog: v5: - added a nicer way of timing the stalls caused by waiting for flush - added predicate on signal when waiting for reduction of the number of sstables to correctly handle spurious wake ups - added comment why we trigger compaction before waiting for sstable count reduction - removed unnecessary cv.signal from table::stop v4: - removed conversion of table::stop to coroutines. 
It's an orthogonal change and doesn't need to go into this patchset v3: - removed unnecessary change to scheduling groups from v2 - moved sstables_changed signalling to suggested place in table::stop - added log how long the table flush was blocked for - changed the threshold to max(schema()->max_compaction_threshold(), 32) and comparison to <= v2: - Reimplemented waiting algorithm based on reviewers' feedback. It's confined to the table class and it waits in a loop until the number of sstable runs goes below threshold. It uses condition variable which is signaled on sstable set refresh. It handles node shutdown as well. - Converted table::stop to coroutines. - Reordered commits so that test is committed after fix, so it doesn't trip up bisection. Closes #10717 * github.com:scylladb/scylla: table: Add test where compaction doesn't keep up with flush rate. random_mutation_generator: Add option to specify ks_name and cf_name table: Prevent creating unbounded number of sstables	2022-06-15 14:51:08 +03:00
Tomasz Grabiec	02c92d5ea2	test: mutation: Compare against compacted mutations Memtables and cache will compact eagerly, so tests should not expect readers to produce exact mutations written, only those which are equivalent after applying compaction.	2022-06-15 11:30:01 +02:00
Tomasz Grabiec	570b76bc5b	compacting_reader: Drop irrelevant tombstones The compacting reader created using make_compacting_reader() was not dropping range_tombstone_change fragments which were shadowed by the partition tombstones. As a result the output fragment stream was not minimal. Lack of this change would cause problems in unit tests later in the series after the change which makes memtables lazily compact partition versions. In test_reverse_reader_reads_in_native_reverse_order we compare output of two readers, and assume that compacted streams are the same. If compacting reader doesn't produce minimal output, then the streams could differ if one of them went through the compaction in the memtable (which is minimal).	2022-06-15 11:30:01 +02:00
Mikołaj Sielużycki	b5684aa96d	random_mutation_generator: Add option to specify ks_name and cf_name	2022-06-15 10:57:28 +02:00
Pavel Emelyanov	9a88bc260c	Merge 'various group0 start/stop issues' from Gleb The series fixes a couple of crashes that were found during starting and stopping Scylla with raft while doing ddl operations. Most of them related to shutdown order between different components. Also in scylla-dev gleb/group0-fixes-v1 CI https://jenkins.scylladb.com/job/releng/job/Scylla-CI/749/ * origin-dev/gleb/group0-fixes-v1: migration manager: remove unused code db/system_distributed_keyspace: do not announce empty schema main: stop raft before the migration manager storage_service: do not pass the raft group manager to storage_service constructor main: destroy the group0_client after stopping the group0	2022-06-15 11:44:03 +03:00
Avi Kivity	5129280f45	Revert "Merge 'memtable, cache: Eagerly compact data with tombstones' from Tomasz Grabiec" This reverts commit `e0670f0bb5`, reversing changes made to `605ee74c39`. It causes failures in debug mode in database_test.test_database_with_data_in_sstables_is_a_mutation_source_plain, though with low probability. Fixes #10780 Reopens #652.	2022-06-14 18:06:22 +03:00
Gleb Natapov	70b7b2b4d6	storage_service: do not pass the raft group manager to storage_service constructor Reduce the storage_service's dependency on the raft group manager. The group manager is needed only during bootstrap and in an rpc handler, so pass it to those functions directly.	2022-06-09 09:40:55 +03:00
Tomasz Grabiec	374234cf76	test: mutation: Compare against compacted mutations Memtables and cache will compact eagerly, so tests should not expect readers to produce exact mutations written, only those which are equivalent after applying compaction.	2022-06-06 19:25:40 +02:00
Tomasz Grabiec	604e720706	compacting_reader: Drop irrelevant tombstones The compacting reader created using make_compacting_reader() was not dropping range_tombstone_change fragments which were shadowed by the partition tombstones. As a result the output fragment stream was not minimal. Lack of this change would cause problems in unit tests later in the series after the change which makes memtables lazily compact partition versions. In test_reverse_reader_reads_in_native_reverse_order we compare output of two readers, and assume that compacted streams are the same. If compacting reader doesn't produce minimal output, then the streams could differ if one of them went through the compaction in the memtable (which is minimal).	2022-06-06 19:23:37 +02:00
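
The "Block flush until compaction finishes if sstables accumulate" merge above describes a fast-producer, slow-consumer fix: a flush waits in a loop until the number of sstable runs drops to the threshold, using a condition variable that is signaled when the sstable set changes, with a predicate to tolerate spurious wake-ups and a way out on shutdown. The following is a toy sketch of that idea only, not Scylla's implementation: it uses Python threads instead of Seastar, and the `Table` class, `max_runs`, `flush`, `compact`, and `stop` names are hypothetical, chosen for illustration.

```python
import threading


class Table:
    """Toy model: flushes block while too many sstable runs exist."""

    def __init__(self, max_runs):
        self.max_runs = max_runs      # hypothetical threshold on sstable runs
        self.runs = 0                 # sstable runs currently "on disk"
        self.stopped = False          # set on shutdown
        self._changed = threading.Condition()

    def flush(self):
        """Add one run, waiting first until runs <= max_runs.

        Returns False if the table was stopped while waiting.
        """
        with self._changed:
            # wait_for re-evaluates the predicate after every wake-up,
            # so a spurious wake-up cannot let the flush through early.
            self._changed.wait_for(
                lambda: self.runs <= self.max_runs or self.stopped
            )
            if self.stopped:
                return False
            self.runs += 1
            return True

    def compact(self):
        """Merge all runs into one and wake any blocked flush."""
        with self._changed:
            if self.runs > 1:
                self.runs = 1
            self._changed.notify_all()

    def stop(self):
        """Release blocked flushes on shutdown."""
        with self._changed:
            self.stopped = True
            self._changed.notify_all()
```

In this sketch a flush that finds more than `max_runs` runs sleeps until `compact()` or `stop()` notifies the condition variable; the `<=` comparison mirrors the one mentioned in the v3 changelog, while the threshold value itself is left to the caller.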

620 Commits