scylladb

Author	SHA1	Message	Date
Avi Kivity	611918056a	Merge 'repair: Add tablet incremental repair support' from Asias He The central idea of incremental repair is to allow repair participants to select and repair only a portion of the dataset to speed up the repair process. All repair participants must use an identical selection method so that they repair and synchronize the same selected dataset. There are two primary selection methods: time-based and file-based. The time-based method selects data within a specified time frame. It is versatile but less efficient, because it requires reading the entire dataset and omitting data outside the time frame. The file-based method selects data from unrepaired SSTables and is more efficient because it allows entire SSTables to be omitted. This patch implements the file-based selection method. Incremental repair will only be supported for tablet tables; it will not be supported for vnode tables. On one hand, legacy vnode support is less important. On the other hand, incremental repair for vnodes is much harder to implement. With vnodes, an SSTable could contain data for multiple vnode ranges. When a given vnode range is repaired, only a portion of the SSTable is repaired. This significantly complicates the manipulation of SSTables during both repair and compaction. With tablets, an entire tablet is repaired, so an SSTable is either fully repaired or not repaired, which is a huge simplification. This patch uses the repaired_at field from the sstables::statistics component to mark an SSTable as repaired. It uses a virtual clock as the repair timestamp, i.e., a monotonically increasing number for the repaired_at field of an SSTable and the sstables_repaired_at column in the system.tablets table. Note that when an SSTable is not repaired, its repaired_at field is set to the default value 0. The being_repaired in-memory field of an SSTable is used to explicitly mark that an SSTable is selected for repair. 
The following variables are used for incremental repair: The repaired_at on-disk field of an SSTable is used (a 64-bit number that increases sequentially). The sstables_repaired_at column is added to the system.tablets table (repaired_at <= sstables_repaired_at means the SSTable is repaired). The being_repaired in-memory field is added to the SSTable (a repair UUID that tells which SSTables have participated in the repair). Initial test results: 1) Medium dataset results. Node amount: 3. Instance type: i4i.2xlarge. Disk usage per node: ~500GB. Cluster pre-populated with ~500GB of data before starting the repair job. Results for repair timings: the regular repair run took 210 mins. Incremental repair 1st run took 183 mins; 2nd and 3rd runs took around 48s. The speedup is: 183 mins / 48s = 228X. 2) Small dataset results. Node amount: 3. Instance type: i4i.2xlarge. Disk usage per node: ~167GB. Cluster pre-populated with ~167GB of data before starting the repair job. Regular repair 1st run took 110s; 2nd and 3rd runs took 110s. Incremental repair 1st run took 110 seconds; 2nd and 3rd runs took 1.5 seconds. 
The speedup is: 110s / 1.5s = 73X. 3) Large dataset results. Node amount: 6. Instance type: i4i.2xlarge, 3 racks. 50% of base load, 50% read/write. Dataset == sum of data on each node. Non-incremental repair time (minutes) by dataset size: 1.3 TiB, 31:07; 3.5 TiB, 25:10; 5.0 TiB, 19:03; 6.3 TiB, 31:42. Incremental repair time (minutes) by dataset size: 1.3 TiB, 24:32; 3.0 TiB, 13:06; 4.0 TiB, 5:23; 4.8 TiB, 7:14; 5.6 TiB, 3:58; 6.3 TiB, 7:33; 7.0 TiB, 6:55. Fixes #22472 Closes scylladb/scylladb#24291 * github.com:scylladb/scylladb: replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair compaction: Move compaction_reenabler to compaction_reenabler.hh topology_coordinator: Make rpc::remote_verb_error to warning level repair: Add metrics for sstable bytes read and skipped from sstables test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair test.py: Add tests for tablet incremental repair repair: Add tablet incremental repair support compaction: Add tablet incremental repair support feature_service: Add TABLET_INCREMENTAL_REPAIR feature tablet_allocator: Add tablet_force_tablet_count_increase and decrease repair: Add incremental helpers sstable: Add being_repaired to sstable sstables: Add set_repaired_at to metadata_collector mutation_compactor: Introduce add operator to compaction_stats tablet: Add sstables_repaired_at to system.tablets table test: Fix drain api in task_manager_client.py	2025-08-19 13:13:22 +03:00
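The repaired/unrepaired rule described in the merge message above (repaired_at <= sstables_repaired_at, with 0 meaning "not repaired" and being_repaired marking SSTables selected by a repair) can be sketched roughly as follows. This is a minimal illustration, not the patch's code: `sstable_info`, `repair_state` and `classify` are hypothetical names, the repair id is modelled as a plain string rather than a UUID type, and both the explicit exclusion of 0 and the order of the checks are assumptions.

```cpp
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical, simplified stand-in for an SSTable's repair-related fields.
struct sstable_info {
    int64_t repaired_at = 0;                    // on-disk field; 0 means "never repaired"
    std::optional<std::string> being_repaired;  // in-memory repair id (a UUID in the commit message)
};

enum class repair_state { unrepaired, repairing, repaired };

// Classify an SSTable against the tablet's sstables_repaired_at value
// (the column added to system.tablets).
inline repair_state classify(const sstable_info& sst, int64_t sstables_repaired_at) {
    // Assumption: 0 is excluded so that a never-repaired SSTable does not
    // count as repaired while sstables_repaired_at is still 0.
    if (sst.repaired_at != 0 && sst.repaired_at <= sstables_repaired_at) {
        return repair_state::repaired;
    }
    // Assumption: an SSTable picked by an ongoing repair is kept apart
    // from both the repaired and the unrepaired sets.
    if (sst.being_repaired.has_value()) {
        return repair_state::repairing;
    }
    return repair_state::unrepaired;
}
```

With a tablet whose sstables_repaired_at is 5, an SSTable with repaired_at 3 would be classified as repaired, while one with repaired_at 6 would remain unrepaired until a later repair advances the tablet's value.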
Aleksandra Martyniuk	a10e241228	replica: lower severity of failure log Flush failure with seastar::named_gate_closed_exception is expected if a respective compaction group was already stopped. Lower the severity of a log in dirty_memory_manager::flush_one for this exception. Fixes: https://github.com/scylladb/scylladb/issues/25037. Closes scylladb/scylladb#25355	2025-08-18 13:30:42 +03:00
Asias He	082bc70a0a	replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair It helps to hide the compaction_group_views from repair subsystem.	2025-08-18 11:01:22 +08:00
Asias He	be15972006	compaction: Move compaction_reenabler to compaction_reenabler.hh So it can be used without bringing the whole compaction/compaction_manager.hh.	2025-08-18 11:01:22 +08:00
Asias He	f9021777d8	compaction: Add tablet incremental repair support This patch adds incremental repair support in compaction. - The sstables are split into repaired and unrepaired sets. - The repaired and unrepaired sets are compacted separately. - The repaired_at from the sstable and sstables_repaired_at from the system.tablets table are used to decide whether an sstable is repaired. - Different compaction tasks, e.g., minor, major, scrub, split, are serialized with tablet repair.	2025-08-18 11:01:21 +08:00
Avi Kivity	66173c06a3	Merge 'Eradicate the ability to create new sstables with numerical sstable generation' from Benny Halevy Remove support for generating numerical sstable generation for new sstables. Loading such sstables is still supported but new sstables are always created with a uuid generation. This is possible since: * All live versions (since 5.4 / `f014ccf369`) now support uuid sstable generations. * The `uuid_sstable_identifiers_enabled` config option (that is unused from version 2025.2 / `6da758d74c`) controls only the use of uuid generations when creating new sstables. SSTables with uuid generations should still be properly loaded by older versions, even if `uuid_sstable_identifiers_enabled` is set to `false`. Fixes #24248 * Enhancement, no backport needed Closes scylladb/scylladb#24512 * github.com:scylladb/scylladb: streaming: stream_blob: use the table sstable_generation_generator replica: distributed_loader: process_upload_dir: use the table sstable_generation_generator sstables: sstable_generation_generator: stop tracking highest generation replica: table: get rid of update_sstables_known_generation sstables: sstable_directory: stop tracking highest_generation replica: distributed_loader: stop tracking highest_generation sstables: sstable_generation: get rid of uuid_identifiers bool class sstables_manager: drop uuid_sstable_identifiers feature_service: move UUID_SSTABLE_IDENTIFIERS to supported_feature_set test: cql_query_test: add test_sstable_load_mixed_generation_type test: sstable_datafile_test: move copy_directory helper to test/lib/test_utils test: database_test: move table_dir helper to test/lib/test_utils	2025-08-14 11:54:33 +03:00
Botond Dénes	4e15d32151	replica/table: get_max_purgeable_fn_for_cache_underlying_reader(): use max_purgable::combine() This combines the max purgeable values, instead of just combining the timestamp values. The former way is still correct, but loses the timestamp explosion optimization, which allows the cache reader to drop timestamps from the overlap checks.	2025-08-11 17:20:12 +03:00
Botond Dénes	bd32d41cad	replica/database: memtable_list::get_max_purgeable(): set expiry-treshold Use the newly introduced expiry_treshold field of max_purgeable, to help exclude memtables from the overlap check if possible.	2025-08-11 17:20:12 +03:00
Botond Dénes	3b1f414fcf	replica/table: propagate gc_state to memtable_list	2025-08-11 07:09:19 +03:00
Botond Dénes	9d00d7e08d	replica/memtable_list: add tombstone_gc_state* member To be passed down to the memtable.	2025-08-11 07:09:19 +03:00
Botond Dénes	ef8a21b4cf	replica/memtable: add tombstone_gc_state_snapshot To be used for possibly excluding the memtable from overlap checks with the cache/sstables, in memtable_list::get_max_purgeable().	2025-08-11 07:09:19 +03:00
Botond Dénes	614d17347a	tombstone_gc: extract shared state into shared_tombstone_gc_state Instead of storing it partially in tombstone_gc and partially in an external map. Move all external parts into the new shared_tombstone_gc_state. This new class is responsible for keeping and updating the repair history. tombstone_gc_state just keeps const pointers to the shared state as before and is only responsible for querying the tombstone gc before times. This separation makes the code easier to follow and also enables further patching of tombstone_gc_state.	2025-08-11 07:09:14 +03:00
Botond Dénes	1d3a3163a3	replica/memtable_list: s/min_live_timestamp()/get_max_purgeable()/ Also change the return type to max_purgeable, instead of a raw timestamp. Prepares for further patching of this code.	2025-08-11 07:09:13 +03:00
Botond Dénes	ef7d49cd21	compaction/compaction_garbage_collector: refactor max_purgeable into a class Make members private, add getters and constructors. This struct will get more functionality soon, so class is a better fit.	2025-08-11 07:09:13 +03:00
Asias He	5377f87e5a	tablet: Add sstables_repaired_at to system.tablets table It is used to store the repaired_at for each tablet.	2025-08-11 10:10:07 +08:00
Benny Halevy	de8a199f79	replica: distributed_loader: process_upload_dir: use the table sstable_generation_generator No need to start a local sharded generator. Can just use the table's sstable generation generator to make new sstables now that it's stateless and doesn't depend on the highest generation found (including the uploaded sstables). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-08 11:46:21 +03:00
Benny Halevy	0a20834d2a	replica: table: get rid of update_sstables_known_generation It is not needed anymore. With that database::_sstable_generation_generator can be a regular member rather than optional and initialized later. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-08 11:46:21 +03:00
Benny Halevy	b01524c5a3	replica: distributed_loader: stop tracking highest_generation It is not needed anymore as we always generate uuid generations. Move highest_generation_seen(sharded<sstables::sstable_directory>& directory) to sstables/sstable_directory module. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-08 11:46:21 +03:00
Benny Halevy	6cc964ef16	sstables: sstable_generation: get rid of uuid_identifiers bool class Now that all call sites enable uuid_identifiers. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-08 11:46:21 +03:00
Benny Halevy	43ee9c0593	sstables_manager: drop uuid_sstable_identifiers It is returning constant sstables::uuid_identifiers::yes now, so let the callers just use the constant (to be dropped in a following patch). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-08 11:46:21 +03:00
Raphael S. Carvalho	beaaf00fac	test: Add test that compaction doesn't cross logical group boundary Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:58:01 +03:00
Raphael S. Carvalho	d351b0726b	replica: Introduce views in compaction_group for incremental repair Wired the unrepaired, repairing and repaired views into compaction_group. Also the repaired filter was wired, so tablet_storage_group_manager can implement the procedure to classify the sstable. Based on this classifier, we can decide which view an sstable belongs to, at any given point in time. Additionally, we made changes to compaction_group_view to return only sstables that belong to the underlying view. From this point on, repaired, repairing and unrepaired sets are connected to compaction manager through their views. And that guarantees sstables on different groups cannot be compacted together. Repairing view specifically has compaction disabled on it altogether; we can revert this later if we want, to allow repairing sstables to be compacted with one another. The benefit of this logical approach is having the classifier as the single source of truth. Otherwise, we'd need to keep the sstable location consistent with global metadata, creating complexity. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:58:00 +03:00
Raphael S. Carvalho	9d3755f276	replica: Futurize retrieval of sstable sets in compaction_group_view This will allow upcoming work to gently produce an sstable set for each compaction group view. Example: repaired and unrepaired. Locking strategy for compaction's sstable selection: Since the sstable retrieval path became futurized, tasks in compaction manager will now hold the write lock (compaction_state::lock) when retrieving the sstable list, feeding them into compaction strategy, and finally registering selected sstables as compacting. The last step prevents another concurrent task from picking the same sstable. Previously, all those steps were atomic, but we have seen stalls in that area in large installations, so futurization of that area would come sooner or later. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:58:00 +03:00
Raphael S. Carvalho	20c3301a1a	treewide: Futurize estimation of pending compaction tasks This is to allow futurization of compaction_group_view method that retrieves sstable set. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:51:29 +03:00
Raphael S. Carvalho	af3592c658	replica: Allow compaction_group to have more than one view In order to support incremental repair, we'll allow each replica::compaction_group to have two logical compaction groups (or logical sstable sets), one for repaired, another for unrepaired. That means we have to adapt a few places to work with compaction_group_view instead, such that no logical compaction group is missed when doing table or tablet wide operations. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:51:29 +03:00
Raphael S. Carvalho	e78295bff1	Move backlog tracker to replica::compaction_group Since there will be only one physical sstable set, it makes sense to move backlog tracker to replica::compaction_group. With incremental repair, it still makes sense to compute the backlog accounting for both logical sets, since the compound backlog influences the overall read amplification, and the total backlog across repaired and unrepaired sets can help drive decisions like giving up on incremental repair when the unrepaired set is almost as large as the repaired set, causing an amplification of 2. Also it's needed for correctness because an sstable can move quickly across the logical sets, and having one tracker for each logical set could cause the sstable to not be erased in the old set it belonged to. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:51:29 +03:00
Raphael S. Carvalho	2c4a9ba70c	treewide: Rename table_state to compaction_group_view Since table_state is a view to a compaction group, it makes sense to rename it accordingly. With upcoming incremental repair, each replica::compaction_group will actually be two compaction groups, so there will be two views for each replica::compaction_group. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-08-08 06:51:28 +03:00
Avi Kivity	8164f72f6e	Merge 'Separate local_effective_replication_map from vnode_effective_replication_map' from Benny Halevy Derive both vnode_effective_replication_map and local_effective_replication_map from static_effective_replication_map as both are static and per-keyspace. However, local_effective_replication_map does not need vnodes for the mapping of all tokens to the local node. Refs #22733 * No backport required Closes scylladb/scylladb#25222 * github.com:scylladb/scylladb: locator: abstract_replication_strategy: implement local_replication_strategy locator: vnode_effective_replication_map: convert clone_data_gently to clone_gently locator: abstract_replication_map: rename make_effective_replication_map locator: abstract_replication_map: rename calculate_effective_replication_map replica: database: keyspace: rename {create,update}_effective_replication_map locator: effective_replication_map_factory: rename create_effective_replication_map locator: abstract_replication_strategy: rename vnode_effective_replication_map_ptr et. al locator: abstract_replication_strategy: rename global_vnode_effective_replication_map keyspace: rename get_vnode_effective_replication_map dht: range_streamer: use naked e_r_m pointers storage_service: use naked e_r_m pointers alternator: ttl: use naked e_r_m pointers locator: abstract_replication_strategy: define is_local	2025-08-07 12:51:43 +03:00
Benny Halevy	6dbbb80aae	locator: abstract_replication_strategy: implement local_replication_strategy Derive both vnode_effective_replication_map and local_effective_replication_map from static_effective_replication_map as both are static and per-keyspace. However, local_effective_replication_map does not need vnodes for the mapping of all tokens to the local node. Note that everywhere_replication_strategy is not abstracted in a similar way, although it could be, since the plan is to get rid of it once all system keyspaces are converted to local or tablets replication (and propagated everywhere if needed using raft group0). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:05:11 +03:00
Benny Halevy	34b223f6f9	replica: database: keyspace: rename {create,update}_effective_replication_map to *_static_effective_replication_map, in preparation for separating local_effective_replication_map from vnode_effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:53 +03:00
Benny Halevy	688bd4fd43	locator: effective_replication_map_factory: rename create_effective_replication_map to create_static_effective_replication_map, in preparation for separating local_effective_replication_map from vnode_effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:53 +03:00
Benny Halevy	cbad497859	locator: abstract_replication_strategy: rename vnode_effective_replication_map_ptr et. al to static_effective_replication_map_ptr, in preparation for separating local_effective_replication_map from vnode_effective_replication_map. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 16:03:53 +03:00
Benny Halevy	bd62421c05	keyspace: rename get_vnode_effective_replication_map to get_static_effective_replication_map, in preparation for separating local_effective_replication_map from vnode_effective_replication_map (both are per-keyspace). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 13:40:43 +03:00
Benny Halevy	ec85678de1	locator: abstract_replication_strategy: define is_local Prepare for specializing the local replication strategy, local effective replication map, et al. by defining an is_local() predicate, similar to uses_tablets(). Note that is_vnode_based() still applies to local replication strategy. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-08-06 13:34:23 +03:00
Pavel Emelyanov	0616407be5	Merge 'rest_api: add endpoint which drops all quarantined sstables' from Taras Veretilnyk Added a new POST endpoint `/storage_service/drop_quarantined_sstables` to the REST API. This endpoint allows dropping all quarantined SSTables either globally or for a specific keyspace and tables. Optional query parameters `keyspace` and `tables` (comma-separated table names) can be provided to limit the scope of the operation. Fixes scylladb/scylladb#19061 Backport is not required, it is new functionality Closes scylladb/scylladb#25063 * github.com:scylladb/scylladb: docs: Add documentation for the nodetool dropquarantinedsstables command nodetool: add command for dropping quarantine sstables rest_api: add endpoint which drops all quarantined sstables	2025-08-06 11:55:15 +03:00
Ferenc Szili	268ec72dc9	truncate: change check for write during truncate into a log warning TRUNCATE TABLE performs a memtable flush and then discards the sstables of the table being truncated. It collects the highest replay position for both of these. When the highest replay position of the discarded sstables is higher than the highest replay position of the flushed memtable, that means that we have had writes during truncate which have been flushed to disk independently of the truncate process. We check for this and trigger an on_internal_error() which throws an exception, informing the user that writing data concurrently with TRUNCATE TABLE is not advised. The problem with this is that truncate is also called from DROP KEYSPACE and DROP TABLE. These are raft operations and exceptions thrown by them are caught by the (...) exception handler in the raft applier fiber, which then exits, leaving the node without the ability to execute subsequent raft commands. This commit changes the on_internal_error() into a warning log entry. It also outputs the keyspace/table names, the truncated_at timepoint, and the offending replay positions which caused the check to fail. Fixes: #25173 Fixes: #25013	2025-08-04 12:24:50 +02:00
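The check described in the truncate commit above compares two replay positions and, after the change, logs a warning instead of throwing. A rough sketch of that shape follows; `replay_position` here is a simplified hypothetical struct, `check_truncate_replay_positions` is an illustrative name, and the log line format is invented for the example rather than taken from ScyllaDB.

```cpp
#include <cstdint>
#include <iostream>
#include <string>
#include <tuple>

// Hypothetical, simplified replay position: ordered by segment, then offset.
struct replay_position {
    uint64_t segment_id = 0;
    uint32_t offset = 0;
};

inline bool operator<(const replay_position& a, const replay_position& b) {
    return std::tie(a.segment_id, a.offset) < std::tie(b.segment_id, b.offset);
}

// Returns true when there is no sign of writes racing with the truncate.
// On a failed check it only logs a warning; it deliberately does not throw,
// so a caller such as a DROP TABLE path is not aborted by an exception.
inline bool check_truncate_replay_positions(const std::string& keyspace,
                                            const std::string& table,
                                            replay_position flushed_memtable_high,
                                            replay_position discarded_sstables_high,
                                            std::ostream& log = std::cerr) {
    if (flushed_memtable_high < discarded_sstables_high) {
        log << "WARN: writes detected during truncate of " << keyspace << "." << table
            << ": discarded sstables replay position ("
            << discarded_sstables_high.segment_id << ":" << discarded_sstables_high.offset
            << ") is higher than flushed memtable replay position ("
            << flushed_memtable_high.segment_id << ":" << flushed_memtable_high.offset
            << ")\n";
        return false;
    }
    return true;
}
```

Returning a status and logging keeps the decision with the caller, which matches the motivation in the commit message: an exception thrown inside the raft applier fiber would stop the node from executing later raft commands.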
Botond Dénes	7e27157664	replica/table: add_sstables_and_update_cache(): remove error log The plural overload of this method logs an error when the sstable add fails. This is unnecessary, the caller is expected to catch and handle exceptions. Furthermore, this unconditional error log results in sporadic test failures, due to the unexpected error in the logs on shutdown. Fixes: #24850 Closes scylladb/scylladb#25235	2025-07-31 12:34:40 +03:00
Patryk Jędrzejczak	8e43856ca7	Merge 'Pass more elaborated "reasons" to stop_ongoing_compactions()' from Pavel Emelyanov When running compactions are aborted by the aforementioned helper, a line like "Compaction for ks/cf was stopped due to: user-triggered operation" appears in the logs. This message could be better, since several distinct reasons are all described by the same "user-triggered operation". With this PR the message will help tell "truncate", "cleanup", "rewrite" and "split" apart. Closes scylladb/scylladb#25136 * https://github.com/scylladb/scylladb: compaction: Pass "reason" to perform_task_on_all_files() compaction: Pass "reason" to run_with_compaction_disabled() compaction: Pass "reason" to stop_and_disable_compaction()	2025-07-29 16:06:17 +02:00
Taras Veretilnyk	fa98239ed8	rest_api: add endpoint which drops all quarantined sstables Added a new POST endpoint `/storage_service/drop_quarantined_sstables` to the REST API. This endpoint allows dropping all quarantined SSTables either globally or for a specific keyspace and tables. Optional query parameters `keyspace` and `tables` (comma-separated table names) can be provided to limit the scope of the operation. Fixes scylladb/scylladb#19061	2025-07-28 16:55:17 +02:00
Nadav Har'El	b4fc3578fc	Merge 'LWT: enable for tablet-based tables' from Petr Gusev This PR enables LWT (Lightweight Transactions) support for tablet-based tables by leveraging colocated tables. Currently, storing Paxos state in system tables causes two major issues: * Loss of Paxos state during tablet migration or base table rebuilds * When a tablet is migrated or the base table is rebuilt, system tables don't retain Paxos state. * This breaks LWT correctness in certain scenarios. * Failing test cases demonstrating this: * test_lwt_state_is_preserved_on_tablet_migration * test_lwt_state_is_preserved_on_rebuild * Shard misalignment and performance overhead * Tablets may be placed on arbitrary shards by the tablet balancer. * Accessing Paxos state in system tables could require a shard jump, degrading performance. We move Paxos state into a dedicated Paxos table, colocated with the base table: * Each base table gets its own Paxos state table. * This table is lazily created on the first LWT operation. * Its tablets are colocated with those of the base table, ensuring: * Co-migration during tablet movement * Co-rebuilding with the base table * Shard alignment for local access to Paxos state Some reasoning for why this is sufficient to preserve LWT correctness is discussed in [2]. This PR addresses two issues from the "Why doesn't it work for tablets" section in [1]: * Tablet migration vs LWT correctness * Paxos table sharding Other issues ("bounce to shard" and "locking for intranode_migration") have already been resolved in previous PRs. 
References [1] - [LWT over tablets design](https://docs.google.com/document/d/1CPm0N9XFUcZ8zILpTkfP5O4EtlwGsXg_TU4-1m7dTuM/edit?tab=t.0#heading=h.goufx7gx24yu) [2] - [LWT: Paxos state and tablet balancer](https://docs.google.com/document/d/1-xubDo612GGgguc0khCj5ukmMGgLGCLWLIeG6GtHTY4/edit?tab=t.0) [3] - [Colocated tables PR](https://github.com/scylladb/scylladb/pull/22906#issuecomment-3027123886) [4] - [Possible LWT consistency violations after a topology change](https://github.com/scylladb/scylladb/issues/5251) Backport: not needed because this is a new feature. Closes scylladb/scylladb#24819 * github.com:scylladb/scylladb: create_keyspace: fix warning for tablets docs: fix lwt.rst docs: fix tablets.rst alternator: enable LWT random_failures: enable execute_lwt_transaction test_tablets_lwt: add test_paxos_state_table_permissions test_tablets_lwt: add test_lwt_for_tablets_is_not_supported_without_raft test_tablets_lwt: test timeout creating paxos state table test_tablets_lwt: add test_lwt_concurrent_base_table_recreation test_tablets_lwt: add test_lwt_state_is_preserved_on_rebuild test_tablets_lwt: migrate test_lwt_support_with_tablets test_tablets_lwt: add test_lwt_state_is_preserved_on_tablet_migration test_tablets_lwt: add simple test for LWT check_internal_table_permissions: handle Paxos state tables client_state: extract check_internal_table_permissions paxos_store: handle base table removal database: get_base_table_for_tablet_colocation: handle paxos state table paxos_state: use node_local_only mode to access paxos state query_options: add node_local_only mode storage_proxy: handle node_local_only in query storage_proxy: handle node_local_only in mutate storage_proxy: introduce node_local_only flag abstract_replication_strategy: remove unused using storage_proxy: add coordinator_mutate_options storage_proxy: rename create_write_response_handler -> make_write_response_handler storage_proxy: simplify mutate_prepare paxos_state: lazily create paxos state table 
migration_manager: add timeout to start_group0_operation and announce paxos_store: use non-internal queries qp: make make_internal_options public paxos_store: conditional cf_id filter paxos_store: coroutinize feature_service: add LWT_WITH_TABLETS feature paxos_state: inline system_keyspace functions into paxos_store paxos_state: extract state access functions into paxos_store	2025-07-28 13:19:23 +03:00
Petr Gusev	1b70623908	database: get_base_table_for_tablet_colocation: handle paxos state table We need to mark paxos state table as colocated with the user table, so that the corresponding tablets are migrated/repaired together.	2025-07-24 19:48:08 +02:00
Pavel Emelyanov	db46da45d2	compaction: Pass "reason" to stop_and_disable_compaction() This distinguishes the "truncate" operation from other reasons. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-07-22 18:51:16 +03:00
Benny Halevy	fce6c4b41d	tablets: prevent accidental copy of tablets_map As they are wasteful in many cases, it is better to move the tablet_map if possible, or clone it gently in an async fiber. Add clone() and clone_gently() methods to allow explicit copies. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-07-22 15:07:26 +03:00
Avi Kivity	6fce817aa8	Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz This change is preparing ground for state update unification for raft bound subsystems. It introduces schema_applier which in the future will become generic interface for applying mutations in raft. Pulling database::apply() out of schema merging code will allow to batch changes to subsystems. Future generic code will first call prepare() on all implementations, then single database::apply() and then update() on all implementations, then on each shard it will call commit() for all implementations, without preemption so that the change is observed as atomic across all subsystems, and then post_commit(). Backport: no, it's a new feature Fixes: https://github.com/scylladb/scylladb/issues/19649 Fixes https://github.com/scylladb/scylladb/issues/24531 Closes scylladb/scylladb#24886 [avi: adjust for std::vector<mutations> -> utils::chunked_vector<mutations>] * github.com:scylladb/scylladb: test: add type creation to test_snapshot storage_service: always wake up load balancer on update tablet metadata db: schema_applier: call destroy also when exception occurs db: replica: simplify seeding ERM during shema change db: remove cleanup from add_column_family db: abort on exception during schema commit phase db: make user defined types changes atomic replica: db: make keyspace schema changes atomic db: atomically apply changes to tables and views replica: make truncate_table_on_all_shards get whole schema from table_shards service: split update_tablet_metadata into two phases service: pull out update_tablet_metadata from migration_listener db: service: add store_service dependency to schema_applier service: simplify load_tablet_metadata and update_tablet_metadata db: don't perform move on tablet_hint reference replica: split add_column_family_and_make_directory into steps replica: db: split drop_table into steps db: don't move map references in merge_tables_and_views() db: 
introduce commit_on_shard function db: access types during schema merge via special storage replica: make non-preemptive keyspace create/update/delete functions public replica: split update keyspace into two phases replica: split creating keyspace into two functions db: rename create_keyspace_from_schema_partition db: decouple functions and aggregates schema change notification from merging code db: store functions and aggregates change batch in schema_applier db: decouple tables and views schema change notifications from merging code db: store tables and views schema diff in schema_applier db: decouple user type schema change notifications from types merging code service: unify keyspace notification functions arguments db: replica: decouple keyspace schema change notifications to a separate function db: add class encapsulating schema merging	2025-07-13 20:47:55 +03:00
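The phase ordering that the 'Atomic in-memory schema changes application' merge describes for the future generic code (prepare() on all implementations, one database::apply(), update() on all, then commit() on all without preemption, then post_commit()) could look roughly like the sketch below. The `applier` interface, `apply_all` driver and `recording_applier` helper are hypothetical names for illustration only; the real schema_applier is asynchronous and per-shard, which this single-threaded sketch does not model.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical interface for a raft-bound subsystem taking part in a state update.
class applier {
public:
    virtual ~applier() = default;
    virtual void prepare() = 0;      // read and validate, no visible side effects
    virtual void update() = 0;       // build the new state off to the side
    virtual void commit() = 0;       // publish the new state; must not yield
    virtual void post_commit() = 0;  // notifications and cleanup
};

// Drives all subsystems through the phases in the order given in the merge message.
// database_apply stands in for the single database::apply() call.
template <typename ApplyFn>
void apply_all(const std::vector<applier*>& subsystems, ApplyFn&& database_apply) {
    for (auto* s : subsystems) { s->prepare(); }
    database_apply();
    for (auto* s : subsystems) { s->update(); }
    // All commits run back to back, so the change is observed as one atomic step.
    for (auto* s : subsystems) { s->commit(); }
    for (auto* s : subsystems) { s->post_commit(); }
}

// Small helper that records the order of calls, for demonstration only.
class recording_applier : public applier {
    std::string _name;
    std::vector<std::string>& _log;
public:
    recording_applier(std::string name, std::vector<std::string>& log)
        : _name(std::move(name)), _log(log) {}
    void prepare() override { _log.push_back(_name + ".prepare"); }
    void update() override { _log.push_back(_name + ".update"); }
    void commit() override { _log.push_back(_name + ".commit"); }
    void post_commit() override { _log.push_back(_name + ".post_commit"); }
};
```

Grouping the calls by phase rather than by subsystem is what lets the commit step be a short, non-preemptible section covering every subsystem at once.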
Benny Halevy	3feb759943	everywhere: use utils::chunked_vector for list of mutations Currently, we use std::vector<*mutation> to keep a list of mutations for processing. This can lead to large allocation, e.g. when the vector size is a function of the number of tables. Use a chunked vector instead to prevent oversized allocations. `perf-simple-query --smp 1` results obtained for fixed 400MHz frequency and PGO disabled: Before (read path): ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... 89055.97 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39417 insns/op, 18003 cycles/op, 0 errors) 103372.72 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39380 insns/op, 17300 cycles/op, 0 errors) 98942.27 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39413 insns/op, 17336 cycles/op, 0 errors) 103752.93 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39407 insns/op, 17252 cycles/op, 0 errors) 102516.77 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39403 insns/op, 17288 cycles/op, 0 errors) throughput: mean= 99528.13 standard-deviation=6155.71 median= 102516.77 median-absolute-deviation=3844.59 maximum=103752.93 minimum=89055.97 instructions_per_op: mean= 39403.99 standard-deviation=14.25 median= 39406.75 median-absolute-deviation=9.30 maximum=39416.63 minimum=39380.39 cpu_cycles_per_op: mean= 17435.81 standard-deviation=318.24 median= 17300.40 median-absolute-deviation=147.59 maximum=18002.53 minimum=17251.75 ``` After (read path) ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no} Disabling auto compaction Creating 10000 partitions... 
59755.04 tps ( 66.2 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39466 insns/op, 22834 cycles/op, 0 errors) 71854.16 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39417 insns/op, 17883 cycles/op, 0 errors) 82149.45 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 39411 insns/op, 17409 cycles/op, 0 errors) 49640.04 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.3 tasks/op, 39474 insns/op, 19975 cycles/op, 0 errors) 54963.22 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.3 tasks/op, 39474 insns/op, 18235 cycles/op, 0 errors) throughput: mean= 63672.38 standard-deviation=13195.12 median= 59755.04 median-absolute-deviation=8709.16 maximum=82149.45 minimum=49640.04 instructions_per_op: mean= 39448.38 standard-deviation=31.60 median= 39466.17 median-absolute-deviation=25.75 maximum=39474.12 minimum=39411.42 cpu_cycles_per_op: mean= 19267.01 standard-deviation=2217.03 median= 18234.80 median-absolute-deviation=1384.25 maximum=22834.26 minimum=17408.67 ``` `perf-simple-query --smp 1 --write` results obtained for fixed 400MHz frequency and PGO disabled: Before (write path): ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no} Disabling auto compaction 63736.96 tps ( 59.4 allocs/op, 16.4 logallocs/op, 14.3 tasks/op, 49667 insns/op, 19924 cycles/op, 0 errors) 64109.41 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 49992 insns/op, 20084 cycles/op, 0 errors) 56950.47 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50005 insns/op, 20501 cycles/op, 0 errors) 44858.42 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50014 insns/op, 21947 cycles/op, 0 errors) 28592.87 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50027 insns/op, 27659 cycles/op, 0 errors) throughput: mean= 51649.63 standard-deviation=15059.74 median= 56950.47 median-absolute-deviation=12087.33 maximum=64109.41 minimum=28592.87 instructions_per_op: mean= 49941.18 standard-deviation=153.76 median= 
50005.24 median-absolute-deviation=73.01 maximum=50027.07 minimum=49667.05 cpu_cycles_per_op: mean= 22023.01 standard-deviation=3249.92 median= 20500.74 median-absolute-deviation=1938.76 maximum=27658.75 minimum=19924.32 ``` After (write path) ``` enable-cache=1 Running test with config: {partitions=10000, concurrency=100, mode=write, query_single_key=no, counters=no} Disabling auto compaction 53395.93 tps ( 59.4 allocs/op, 16.5 logallocs/op, 14.3 tasks/op, 50326 insns/op, 21252 cycles/op, 0 errors) 46527.83 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50704 insns/op, 21555 cycles/op, 0 errors) 55846.30 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50731 insns/op, 21060 cycles/op, 0 errors) 55669.30 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50735 insns/op, 21521 cycles/op, 0 errors) 52130.17 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 50757 insns/op, 21334 cycles/op, 0 errors) throughput: mean= 52713.91 standard-deviation=3795.38 median= 53395.93 median-absolute-deviation=2955.40 maximum=55846.30 minimum=46527.83 instructions_per_op: mean= 50650.57 standard-deviation=182.46 median= 50731.38 median-absolute-deviation=84.09 maximum=50756.62 minimum=50325.87 cpu_cycles_per_op: mean= 21344.42 standard-deviation=202.86 median= 21334.00 median-absolute-deviation=176.37 maximum=21554.61 minimum=21060.24 ``` Fixes #24815 Improvement for rare corner cases. No backport required Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#24919	2025-07-13 19:13:11 +03:00
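The idea behind replacing a contiguous vector with a chunked one can be illustrated with a minimal sketch. This is not the actual `utils::chunked_vector` implementation; the class name `chunked_vector_sketch`, the `ChunkElems` parameter, and the `chunk_count()` helper are hypothetical and exist only for illustration. Elements are stored in fixed-size chunks, so the largest single allocation is bounded by the chunk size rather than by the total element count.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a chunked vector.
// NOT the real utils::chunked_vector; names and layout are assumptions.
template <typename T, std::size_t ChunkElems = 1024>
class chunked_vector_sketch {
    // Each chunk holds at most ChunkElems elements, so no single
    // element allocation grows with the total size of the container.
    std::vector<std::unique_ptr<std::vector<T>>> _chunks;
    std::size_t _size = 0;
public:
    void push_back(T value) {
        // Start a new chunk when the container is empty or the last chunk is full.
        if (_size % ChunkElems == 0) {
            auto chunk = std::make_unique<std::vector<T>>();
            chunk->reserve(ChunkElems);
            _chunks.push_back(std::move(chunk));
        }
        _chunks.back()->push_back(std::move(value));
        ++_size;
    }

    // Index arithmetic: chunk number, then offset within that chunk.
    T& operator[](std::size_t i) {
        return (*_chunks[i / ChunkElems])[i % ChunkElems];
    }
    const T& operator[](std::size_t i) const {
        return (*_chunks[i / ChunkElems])[i % ChunkElems];
    }

    std::size_t size() const { return _size; }
    std::size_t chunk_count() const { return _chunks.size(); }
};
```

With a chunk size of 4, pushing 10 elements produces three chunks, and indexing works as it does for a flat vector. The trade-off is an extra indirection on each access, which is the kind of cost the before/after measurements in the commit message are meant to check.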
Aleksandra Martyniuk	2ec54d4f1a	replica: hold compaction group gate during flush Destructor of database_sstable_write_monitor, which is created in table::try_flush_memtable_to_sstable, tries to get the compaction state of the processed compaction group. If at this point the compaction group is already stopped (and the compaction state is removed), e.g. due to concurrent tablet merge, an exception is thrown and a node coredumps. Add flush gate to compaction group to wait for flushes in compaction_group::stop. Hold the gate in seal function in table::make_memtable_list. seal function is turned into a coroutine to ensure it won't throw. Wait until async_gate is closed before flushing, to ensure that all data is written into sstables. Stop ongoing compactions beforehand. Remove unnecessary flush in tablet_storage_group_manager::merge_completion_fiber. Stop method already flushes the compaction group. Fixes: #23911. Closes scylladb/scylladb#24582	2025-07-13 12:35:19 +03:00
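The gate pattern described in this commit can be sketched without Seastar. The following is a single-threaded illustration only: `gate_sketch`, `holder`, and the callback-based `close()` are hypothetical names standing in for a future-based gate, and none of this is the actual compaction group code. A flush holds the gate while it runs, and stopping the group is deferred until every holder is released.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <stdexcept>
#include <utility>

// Hypothetical single-threaded gate; a real implementation would
// return a future from close() instead of taking a callback.
class gate_sketch {
    std::size_t _count = 0;            // number of outstanding holders
    bool _closed = false;              // no new holders once set
    std::function<void()> _on_drained; // runs when the last holder leaves
public:
    // RAII holder: enters the gate on construction, leaves on destruction.
    class holder {
        gate_sketch* _g;
    public:
        explicit holder(gate_sketch& g) : _g(&g) { g.enter(); }
        holder(holder&& o) noexcept : _g(std::exchange(o._g, nullptr)) {}
        holder(const holder&) = delete;
        holder& operator=(const holder&) = delete;
        holder& operator=(holder&&) = delete;
        ~holder() { if (_g) { _g->leave(); } }
    };

    void enter() {
        if (_closed) {
            throw std::runtime_error("gate closed");
        }
        ++_count;
    }

    void leave() {
        if (--_count == 0 && _closed && _on_drained) {
            auto cb = std::move(_on_drained);
            _on_drained = nullptr;
            cb();
        }
    }

    // Reject new entrants; run on_drained once all holders are gone.
    void close(std::function<void()> on_drained) {
        _closed = true;
        if (_count == 0) {
            on_drained();
        } else {
            _on_drained = std::move(on_drained);
        }
    }

    holder hold() { return holder(*this); }
};
```

In this sketch, a flush would call `hold()` for its whole duration and the stop path would call `close()`, so the state the flush depends on is only torn down after the flush has finished.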
Marcin Maliszkiewicz	b103fee5b6	db: replica: simplify seeding ERM during schema change We know that the caller is running on shard 0, so we can avoid some extra boilerplate.	2025-07-10 10:46:55 +02:00
Marcin Maliszkiewicz	44490ceb77	db: remove cleanup from add_column_family Since we now abort on failure during schema commit, there is no need for cleanup, as it only manages in-memory state. An explicit cf.stop was added to code paths outside of schema merging to avoid unnecessary regressions.	2025-07-10 10:46:55 +02:00
Marcin Maliszkiewicz	e3f92328d3	replica: db: make keyspace schema changes atomic Now all keyspace-related schema changes are observable on a given shard as if they were applied atomically. This is achieved by making the commit_on_shard() function non-preemptive (no futures, no co_awaits). In the future we'll extend this to the whole schema and also to other subsystems.	2025-07-10 10:46:55 +02:00
Marcin Maliszkiewicz	b18cc8145f	db: atomically apply changes to tables and views In this commit we make use of the split functions introduced earlier. The pattern is as follows: - in merge_tables_and_views we call some preparatory functions - in schema_applier::update we call the non-yielding step - in schema_applier::post_commit we call cleanups and other finalizing async functions Additionally, we introduce frozen_schema_diff, because converting schema_ptr to global_schema_ptr triggers schema registration, and with atomic changes registration must happen only in the commit phase. Schema freezing is the same method global_schema_ptr uses to transport a schema across shards (via the schema_registry cache).	2025-07-10 10:46:55 +02:00
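The prepare / non-yielding commit / post-commit pattern described above can be sketched in plain C++. All names here (`schema_state`, `applier_sketch`, `prepare`, `commit`, `post_commit`) are hypothetical and do not correspond to the real `schema_applier` interface; the sketch only shows the shape of the three phases. Fallible work is staged first, the visible state is swapped in a step that never yields, and notifications are produced afterwards.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical live state that readers on a shard observe.
struct schema_state {
    std::map<std::string, int> table_versions;
};

// Hypothetical three-phase applier; not the real schema_applier.
class applier_sketch {
    schema_state& _live;
    std::map<std::string, int> _staged; // changes not yet visible
public:
    explicit applier_sketch(schema_state& live) : _live(live) {}

    // Phase 1: preparatory work. May fail; touches only staged data,
    // so a failure here leaves the live state untouched.
    void prepare(const std::string& table, int version) {
        if (version <= 0) {
            throw std::invalid_argument("version must be positive");
        }
        _staged[table] = version;
    }

    // Phase 2: the non-yielding step. It is noexcept, so any failure
    // terminates the process, mirroring "abort on failure" during commit.
    void commit() noexcept {
        for (const auto& entry : _staged) {
            _live.table_versions[entry.first] = entry.second;
        }
    }

    // Phase 3: finalizing work such as notifications, run after the
    // new state is already visible.
    std::vector<std::string> post_commit() const {
        std::vector<std::string> notifications;
        for (const auto& entry : _staged) {
            notifications.push_back("updated: " + entry.first);
        }
        return notifications;
    }
};
```

Because nothing is visible until `commit()` and `commit()` cannot be interleaved with other work in this sketch, an observer sees either none or all of the staged changes.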

1624 Commits