scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-20 00:20:47 +00:00

Author	SHA1	Message	Date
Avi Kivity	0ae22a09d4	LICENSE: Update to version 1.1 Updated terms of non-commercial use (must be a never-customer).	2026-04-12 19:46:33 +03:00
Avi Kivity	22949bae52	Merge 'logstor: implement tablet split/merge and migration' from Michael Litvak implement tablet split, tablet merge and tablet migration for tables that use the experimental logstor storage engine. * tablet merge simply merges the histograms of segments of one compaction group with another. * for tablet split we take the segments from the source compaction group, read them and write all live records to separate segments according to the split classifier, and move separated segments to the target compaction groups. * for tablet migration we use stream_blob, similarly to file streaming of sstables. we add a new op type for streaming a logstor segment. on the source we take a snapshot of the segments with an input stream that reads the segment, and on the target we create a sink that allocates a new segment on the target shard and writes to it. * we also do some improvements for recovery and loading of segments. we add a segment header that contains useful information for non-mixed segments, such as the table and token range. Refs SCYLLADB-770 no backport - still a new and experimental feature Closes scylladb/scylladb#29207 * github.com:scylladb/scylladb: test: logstor: additional logstor tests docs/dev: add logstor on-disk format section logstor: add version and crc to buffer header test: logstor: tablet split/merge and migration logstor: enable tablet balancing logstor: streaming of logstor segments using stream_blob logstor: add take_logstor_snapshot logstor: segment input/output stream logstor: implement compaction_group::cleanup logstor: tablet split logstor: tablet merge logstor: add compaction reenabler logstor: add segment header logstor: serialize writes to active segment replica: extend compaction_group functions for logstor replica: add compaction_group_for_logstor_segment logstor: code cleanup	2026-04-12 16:11:12 +03:00
Avi Kivity	ca80ee8586	Merge 'Introduce maintenance scheduling supergroup and do initial population' from Pavel Emelyanov The supergroup replaces streaming (a.k.a. maintenance as well) group, inherits 200 shares from it and consists of four sub-groups (all have equal shares of 200 within the new supergroup) * maintenance_compaction. This group configures `compaction_manager::maintenance_sg()` group. User-triggered compaction runs in it * backup. This group configures `snapshot_ctl::config::backup_sched_group`. Native backup activity runs there * maintenance. It's a new "visible" name, everything that was called "maintenance" in the code ran in "streaming" group. Now it will run in "maintenance". The activities include those that don't communicate over RPC (see below why) * `tablet_allocator::balance_tablets()` * `sstables_manager::components_reclaim_reload_fiber()` * `tablet_storage_group_manager::merge_completion_fiber()` * metrics exporting http server altogether * streaming. This is purely existing streaming group that just moves under the new supergroup. Everything else that was run there, continues doing so, including * hints sender * all view building related components (update generator, builder, workers) * repair * stream_manager * messaging service (except for verb handlers that switch groups) * join_cluster() activity * REST API * ... something else I forgot The `--maintenance_io_throughput_mb_per_sec` option is introduced. It controls the IO throughput limit applied to the maintenance supergroup. If not set, the `--stream_io_throughput_mb_per_sec` option is used to preserve backward compatibility. All new sched groups inherit `request_class::maintenance` (however, "backup" seems not to make any requests yet). Moving more activities from "streaming" into "maintenance" (or its own group) is possible, but one will need to take care of RPC group switching. 
The thing is that when a client makes an RPC call, the server may switch to one of pre-negotiated scheduling groups. Verbs for existing activities that run in "streaming" group are routed through RPC index that negotiates "streaming" group on the server side. If any of that client code moves to some other group, server will still run the handlers in "streaming" which is not quite expected. That's one of the main reasons why only the selected fibers were moved to their own "maintenance" group. Similar for backup -- this code doesn't use RPC, so it can be moved. Restoring code uses load-and-stream and corresponding RPCs, so it cannot be just moved into its own new group. Fixes SCYLLADB-351 New feature, not backporting Closes scylladb/scylladb#28542 * github.com:scylladb/scylladb: code: Add maintenance/maintenance group backup: Add maintenance/backup group compaction: Add maintenance/maintenance_compaction group main: Introduce maintenance supergroup main: Move all maintenance sched group into streaming one database: Use local variable for current_scheduling_group code: Live-update IO throughputs from main	2026-04-12 00:34:48 +03:00
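The option fallback described in the merge above can be sketched as follows. This is a minimal illustration, not ScyllaDB's configuration code: the `io_throughput_options` struct and the `effective_maintenance_throughput` helper are hypothetical names, and only the two option names come from the commit message.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical holder for the two command-line options named in the commit
// message; this is not ScyllaDB's real config class.
struct io_throughput_options {
    std::optional<uint32_t> maintenance_io_throughput_mb_per_sec;
    std::optional<uint32_t> stream_io_throughput_mb_per_sec;
};

// Limit applied to the maintenance supergroup: prefer the new option and
// fall back to the streaming option for backward compatibility.
// std::nullopt means that neither option was set.
inline std::optional<uint32_t> effective_maintenance_throughput(const io_throughput_options& o) {
    if (o.maintenance_io_throughput_mb_per_sec) {
        return o.maintenance_io_throughput_mb_per_sec;
    }
    return o.stream_io_throughput_mb_per_sec;
}
```

With only the streaming option set, the helper returns the streaming value, so existing deployments keep their current limit.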
Piotr Dulikowski	3bd770d4d9	Merge 'counters: reuse counter IDs by rack' from Michael Litvak For counter updates, use a counter ID that is constructed from the node's rack instead of the node's host ID. A rack can have at most two active tablet replicas at a time: a single normal tablet replica, and during tablet migration there are two active replicas, the normal and pending replica. Therefore we can have two unique counter IDs per rack that are reused by all replicas in the rack. We construct the counter ID from the rack UUID, which is constructed from the name "dc:rack". The pending replica uses a deterministic variation of the rack's counter ID by negating it. This improves the performance and size of counter cells by having fewer unique counter IDs and fewer counter shards in a counter cell. Previously the number of counter shards was the number of different host IDs that updated the counter, which is typically the number of nodes in the cluster and continues to grow indefinitely when nodes are replaced. With the rack-based counter ID the number of counter shards will be at most twice the number of different racks (including removed racks, which should not be significant). Fixes SCYLLADB-356 backport not needed - an enhancement Closes scylladb/scylladb#28901 * github.com:scylladb/scylladb: docs/dev: add counters doc counters: reuse counter IDs by rack	2026-04-10 12:24:18 +02:00
Michael Litvak	b71762d5da	counters: reuse counter IDs by rack For counter updates, use a counter ID that is constructed from the node's rack instead of the node's host ID. A rack can have at most two active tablet replicas at a time: a single normal tablet replica, and during tablet migration there are two active replicas, the normal and pending replica. Therefore we can have two unique counter IDs per rack that are reused by all replicas in the rack. We construct the counter ID from the rack UUID, which is constructed from the name "dc:rack". The pending replica uses a deterministic variation of the rack's counter ID by negating it. This improves the performance and size of counter cells by having fewer unique counter IDs and fewer counter shards in a counter cell. Previously the number of counter shards was the number of different host IDs that updated the counter, which is typically the number of nodes in the cluster and continues to grow indefinitely when nodes are replaced. With the rack-based counter ID the number of counter shards will be at most twice the number of different racks (including removed racks, which should not be significant). Fixes SCYLLADB-356	2026-04-09 13:08:02 +02:00
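A rough sketch of the rack-based counter ID scheme described above. Everything here is an assumption made for illustration: the real code derives a UUID from the "dc:rack" name, whereas this sketch uses a 128-bit value built from two FNV-1a hashes, and it models "negating" as bitwise negation. The names `counter_id`, `rack_counter_id` and `pending_counter_id` are hypothetical.

```cpp
#include <cstdint>
#include <string>
#include <string_view>

// Hypothetical 128-bit counter ID (the real type is a UUID).
struct counter_id {
    uint64_t hi;
    uint64_t lo;
};

inline bool operator==(const counter_id& a, const counter_id& b) {
    return a.hi == b.hi && a.lo == b.lo;
}

// FNV-1a over the bytes of s, starting from the given seed.
inline uint64_t fnv1a(std::string_view s, uint64_t seed) {
    uint64_t h = seed;
    for (unsigned char c : s) {
        h ^= c;
        h *= 1099511628211ULL;  // FNV-1a 64-bit prime
    }
    return h;
}

// Deterministic ID derived from the "dc:rack" name, so every replica in the
// same rack computes the same value (stand-in for a name-based UUID).
inline counter_id rack_counter_id(std::string_view dc, std::string_view rack) {
    const std::string name = std::string(dc) + ":" + std::string(rack);
    return {fnv1a(name, 14695981039346656037ULL), fnv1a(name, 0x9e3779b97f4a7c15ULL)};
}

// Variation used by the pending replica during migration. Bitwise negation
// is an assumption; the commit only says the ID is negated.
inline counter_id pending_counter_id(counter_id normal) {
    return {~normal.hi, ~normal.lo};
}
```

Because the pending ID is a pure function of the normal ID, a rack never contributes more than two distinct IDs, which is what bounds the number of counter shards at twice the number of racks.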
Yaniv Michael Kaul	2c0076d3ef	replica: set_skip_when_empty() for rare error-path metrics Add .set_skip_when_empty() to four metrics in replica/database.cc that are only incremented on very rare error paths and are almost always zero: - database::dropped_view_updates: view updates dropped due to overload. NOTE: this metric appears to never be incremented in the current codebase and may be a candidate for removal. - database::multishard_query_failed_reader_stops: documented as a 'hard badness counter' that should always be zero. NOTE: no increment site was found in the current codebase; may be a candidate for removal. - database::multishard_query_failed_reader_saves: documented as a 'hard badness counter' that should always be zero. - database::total_writes_rejected_due_to_out_of_space_prevention: only fires when disk utilization is critical and user table writes are disabled, a very rare operational state. These metrics create unnecessary reporting overhead when they are perpetually zero. set_skip_when_empty() suppresses them from metrics output until they become non-zero. AI-Assisted: yes Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com> Closes scylladb/scylladb#29345	2026-04-09 14:07:28 +03:00
Pavel Emelyanov	78f5bab7cf	table: Add formatter for group_id argument in tablet merge exception message Fixes: SCYLLADB-1432 Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#29143	2026-04-09 11:45:57 +03:00
Raphael S. Carvalho	16e387d5f9	repair/replica: Fix race window where post-repair data is wrongly promoted to repaired During incremental repair, each tablet replica holds three SSTable views: UNREPAIRED, REPAIRING, and REPAIRED. The repair lifecycle is: 1. Replicas snapshot unrepaired SSTables and mark them REPAIRING. 2. Row-level repair streams missing rows between replicas. 3. mark_sstable_as_repaired() runs on all replicas, rewriting the SSTables with repaired_at = sstables_repaired_at + 1 (e.g. N+1). 4. The coordinator atomically commits sstables_repaired_at=N+1 and the end_repair stage to Raft, then broadcasts repair_update_compaction_ctrl which calls clear_being_repaired(). The bug lives in the window between steps 3 and 4. After step 3, each replica has on-disk SSTables with repaired_at=N+1, but sstables_repaired_at in Raft is still N. The classifier therefore sees: is_repaired(N, sst{repaired_at=N+1}) == false sst->being_repaired == null (lost on restart, or not yet set) and puts them in the UNREPAIRED view. If a new write arrives and is flushed (repaired_at=0), STCS minor compaction can fire immediately and merge the two SSTables. The output gets repaired_at = max(N+1, 0) = N+1 because compaction preserves the maximum repaired_at of its inputs. Once step 4 commits sstables_repaired_at=N+1, the compacted output is classified REPAIRED on the affected replica even though it contains data that was never part of the repair scan. Other replicas, which did not experience this compaction, classify the same rows as UNREPAIRED. This divergence is never healed by future repairs because the repaired set is considered authoritative. The result is data resurrection: deleted rows can reappear after the next compaction that merges unrepaired data with the wrongly-promoted repaired SSTable. The fix has two layers: Layer 1 (in-memory, fast path): mark_sstable_as_repaired() now also calls mark_as_being_repaired(session) on the new SSTables it writes. 
This keeps them in the REPAIRING view from the moment they are created until repair_update_compaction_ctrl clears the flag after step 4, covering the race window in the normal (no-restart) case. Layer 2 (durable, restart-safe): a new is_being_repaired() helper on tablet_storage_group_manager detects the race window even after a node restart, when being_repaired has been lost from memory. It checks: sst.repaired_at == sstables_repaired_at + 1 AND tablet transition kind == tablet_transition_kind::repair Both conditions survive restarts: repaired_at is on-disk in SSTable metadata, and the tablet transition is persisted in Raft. Once the coordinator commits sstables_repaired_at=N+1 (step 4), is_repaired() returns true and the SSTable naturally moves to the REPAIRED view. The classifier in make_repair_sstable_classifier_func() is updated to call is_being_repaired(sst, sstables_repaired_at) in place of the previous sst->being_repaired.uuid().is_null() check. A new test, test_incremental_repair_race_window_promotes_unrepaired_data, reproduces the bug by: - Running repair round 1 to establish sstables_repaired_at=1. - Injecting delay_end_repair_update to hold the race window open. - Running repair round 2 so all replicas complete mark_sstable_as_repaired (repaired_at=2) but the coordinator has not yet committed step 4. - Writing post-repair keys to all replicas and flushing servers[1] to create an SSTable with repaired_at=0 on disk. - Restarting servers[1] so being_repaired is lost from memory. - Waiting for autocompaction to merge the two SSTables on servers[1]. - Asserting that the merged SSTable contains post-repair keys (the bug) and that servers[0] and servers[2] do not see those keys as repaired. NOTE FOR MAINTAINER: Copilot initially only implemented Layer 1 (the in-memory being_repaired guard), missing the restart scenario entirely. I pointed out that being_repaired is lost on restart and guided Copilot to add the durable Layer 2 check. 
I also polished the implementation: moving is_being_repaired into tablet_storage_group_manager so it can reuse the already-held _tablet_map (avoiding an ERM lookup and try/catch), passing sstables_repaired_at in from the classifier to avoid re-reading it, and using compaction_group_for_sstable inside the function rather than threading a tablet_id parameter through the classifier. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1239. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Closes scylladb/scylladb#29244	2026-04-09 11:42:28 +03:00
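The classification rules from the repair fix above can be sketched as a small, self-contained model. It is not the ScyllaDB implementation: `sstable_info`, `classify` and `compacted_repaired_at` are hypothetical names, the body of `is_repaired` is an assumption chosen to match the stated behaviour (`repaired_at = N+1` is not repaired while Raft still holds `N`), and folding the in-memory flag into `is_being_repaired` is a simplification.

```cpp
#include <algorithm>
#include <cstdint>

enum class repair_view { unrepaired, repairing, repaired };

// Hypothetical, simplified SSTable state.
struct sstable_info {
    int64_t repaired_at = 0;      // persisted in SSTable metadata
    bool being_repaired = false;  // in-memory only, lost on restart
};

// Assumed definition: repaired once Raft's sstables_repaired_at has caught up.
inline bool is_repaired(int64_t sstables_repaired_at, const sstable_info& sst) {
    return sst.repaired_at != 0 && sst.repaired_at <= sstables_repaired_at;
}

// Layer 1: the in-memory flag. Layer 2: the durable check, which still
// holds after a restart because both inputs are persisted.
inline bool is_being_repaired(const sstable_info& sst, int64_t sstables_repaired_at,
                              bool tablet_in_repair_transition) {
    if (sst.being_repaired) {
        return true;
    }
    return tablet_in_repair_transition && sst.repaired_at == sstables_repaired_at + 1;
}

inline repair_view classify(const sstable_info& sst, int64_t sstables_repaired_at,
                            bool tablet_in_repair_transition) {
    if (is_being_repaired(sst, sstables_repaired_at, tablet_in_repair_transition)) {
        return repair_view::repairing;
    }
    if (is_repaired(sstables_repaired_at, sst)) {
        return repair_view::repaired;
    }
    return repair_view::unrepaired;
}

// Compaction keeps the maximum repaired_at of its inputs, which is why
// merging a repaired_at=N+1 SSTable with fresh data was dangerous.
inline int64_t compacted_repaired_at(int64_t a, int64_t b) {
    return std::max(a, b);
}
```

In this model, an SSTable rewritten with `repaired_at = N+1` stays in the REPAIRING view for the whole race window, so it cannot be compacted together with a freshly flushed SSTable whose `repaired_at` is 0.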
Avi Kivity	00409b61f1	Merge 'Add Vnodes to Tablets Migration Procedure' from Nikos Dragazis This PR introduces the vnodes-to-tablets migration procedure, which enables converting an existing vnode-based keyspace to tablets. The migration is implemented as a manual, operator-driven process executed in several stages. The core idea is to first create tablet maps with the same token boundaries and replica hosts as the vnodes, and then incrementally convert the storage of each node to the tablets layout. At a high level, the procedure is the following: 1. Create tablet maps for all tables in the keyspace. 2. Sequentially upgrade all nodes from vnodes to tablets: 1. Mark a node for upgrade in the topology state. 2. Restart the node. During startup, while the node is offline, it reshards the SSTables on vnode boundaries and switches to a tablet ERM. 3. Wait for the node to return online before proceeding to the next node. 4. Finalize the migration: 1. Update the keyspace schema to mark it as tablet-based. 2. Clear the group0 state related to the migration. From the client's perspective, the migration is online; the cluster can still serve requests on that keyspace, although performance may be temporarily degraded. During the migration, some nodes use vnode ERMs while others use tablet ERMs. Cluster-level algorithms such as load balancing will treat the keyspace's tables as vnode-based. Once migration is finalized, the keyspace is permanently switched to tablets and cannot be reverted back to vnodes. However, a rollback procedure is available before finalization. The patch series consists of: * Load balancer adjustments to ignore tablets belonging to a migrating keyspace. * A new vnode-based resharding mode, where SSTables are segregated on vnode boundaries rather than with the static sharder. * A new per-node `intended_storage_mode` column in `system.topology`. Represents migration intent (whether migration should occur on restart) and direction. 
* Four new REST endpoints for driving the migration (start, node upgrade/downgrade, finalize, status), along with `nodetool` wrappers. The finalization is implemented as a global topology request. * Wiring of the migration process into the startup logic: the `distributed_loader` determines a migrating table's ERM flavor from the `intended_storage_mode` and the ERM flavor determines the `table_populator`'s resharding mode. Token metadata changes have been adjusted to preserve the ERM flavor. * Cluster tests for the migration process. Fixes SCYLLADB-722. Fixes SCYLLADB-723. Fixes SCYLLADB-725. Fixes SCYLLADB-779. Fixes SCYLLADB-948. New feature, no backport is needed. Closes scylladb/scylladb#29065 * github.com:scylladb/scylladb: docs: Add ops guide for vnodes-to-tablets migration test: cluster: Add test for migration of multiple keyspaces test: cluster: Add test for error conditions test: cluster: Add vnodes->tablets migration test (rollback) test: cluster: Add vnodes->tablets migration test (1 table, 3 nodes) test: cluster: Add vnodes->tablets migration test (1 table, 1 node) scylla-nodetool: Add migrate-to-tablets subcommand api: Add REST endpoint for vnode-to-tablet migration status api: Add REST endpoint for migration finalization topology_coordinator: Add `finalize_migration` request database: Construct migrating tables with tablet ERMs api: Add REST endpoint for upgrading nodes to tablets api: Add REST endpoint for starting vnodes-to-tablets migration topology_state_machine: Add intended_storage_mode to system.topology distributed_loader: Wire vnode-based resharding into table populator replica: Pick any compaction group for resharding compaction: resharding_compaction: add vnodes_resharding option storage_service: Preserve ERM flavor of migrating tables tablet_allocator: Exclude migrating tables from load balancing feature_service: Add vnodes_to_tablets_migrations feature	2026-04-07 14:32:22 +03:00
Michael Litvak	35547bfb6e	test: logstor: additional logstor tests	2026-03-31 18:45:08 +02:00
Michael Litvak	39baa573d2	logstor: add version and crc to buffer header add basic crc and validation to the buffer header. add also a version field that indicates the version of the on-disk format.	2026-03-31 18:45:08 +02:00
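One way a versioned, checksummed buffer header can be validated is sketched below. The layout is purely hypothetical (the commit does not list the fields, their order, or which CRC variant is used); the sketch uses the standard reflected CRC-32 and serializes the covered fields in little-endian order so the checksum does not depend on struct padding.

```cpp
#include <cstddef>
#include <cstdint>

// Standard reflected CRC-32 (polynomial 0xEDB88320), computed bit by bit.
inline uint32_t crc32_ieee(const unsigned char* data, std::size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int bit = 0; bit < 8; ++bit) {
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
        }
    }
    return ~crc;
}

// Hypothetical header layout, for illustration only.
struct buffer_header {
    uint32_t version;    // on-disk format version
    uint32_t data_size;  // payload bytes following the header
    uint32_t crc;        // checksum over the fields above
};

inline void put_le32(unsigned char* out, uint32_t v) {
    for (int i = 0; i < 4; ++i) {
        out[i] = static_cast<unsigned char>(v >> (8 * i));
    }
}

// CRC over the little-endian encoding of version and data_size.
inline uint32_t header_crc(const buffer_header& h) {
    unsigned char bytes[8];
    put_le32(bytes, h.version);
    put_le32(bytes + 4, h.data_size);
    return crc32_ieee(bytes, sizeof(bytes));
}

// Reject headers written by a newer format or with a corrupted checksum.
inline bool is_valid(const buffer_header& h, uint32_t max_supported_version) {
    return h.version <= max_supported_version && h.crc == header_crc(h);
}
```

A reader would call `is_valid` before trusting `data_size`, treating a failed check as the end of valid data in the segment or as corruption, depending on context.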
Michael Litvak	78426ae31b	logstor: add take_logstor_snapshot add the function table::take_logstor_snapshot that is similar to take_storage_snapshot for sstables. given a token range, for each storage group in the range, it flushes the separator buffers and then makes a snapshot of all segments in the sg's compaction groups while disabling compaction. the segment snapshot holds a reference to the segment so that it won't be freed by compaction, and it provides an input stream for reading the segment. this will be used for tablet migration to stream the segments.	2026-03-31 18:45:08 +02:00
Michael Litvak	754c1b83bd	logstor: segment input/output stream add functions for creating segment input and output streams, that will be used for segment streaming. the segment input stream creates a file input stream that reads a given segment. the segment output stream allocates a new local segment and creates an output stream that writes to the segment, and when closed it loads the segment and adds it to the compaction group.	2026-03-31 18:45:08 +02:00
Michael Litvak	17cab4181b	logstor: implement compaction_group::cleanup implement compaction group cleanup by clearing the range in the index and discarding the segments of the compaction group. segments are discarded by overwriting the segment header to indicate the segment is empty while preserving the segment generation number in order to not resurrect old data in the segment.	2026-03-31 18:45:08 +02:00
Michael Litvak	9fd6dace72	logstor: tablet split implement tablet split for logstor. flush the separator and then perform split as a new type of compaction: take a batch of segments from the source compaction group, read them and write all live records into left/right write buffers according to the split classifier, flush them to the compaction group, and free the old segments. segments that fit in a single target compaction group are removed from the source and added to the correct target group.	2026-03-31 18:45:08 +02:00
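The core of the split step, routing live records to left or right buffers, might look roughly like this. The `record` type, the token-based classifier and the function name are assumptions for illustration; the real split classifier and write buffers are not shown in the commit message.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical record: just a token and a liveness flag.
struct record {
    int64_t token;
    bool live;
};

// Reads a segment's records and routes every live one to the left or right
// output, depending on which side of the split point its token falls.
// Dead records are dropped, as in a regular compaction.
inline std::pair<std::vector<record>, std::vector<record>>
split_live_records(const std::vector<record>& segment, int64_t last_token_of_left) {
    std::vector<record> left;
    std::vector<record> right;
    for (const record& r : segment) {
        if (!r.live) {
            continue;
        }
        (r.token <= last_token_of_left ? left : right).push_back(r);
    }
    return {std::move(left), std::move(right)};
}
```

A segment whose live records all land on one side needs no rewriting, which corresponds to the case in the commit where a segment is simply moved to the matching target group.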
Michael Litvak	5de39afc24	logstor: tablet merge implement tablet merge with logstor. disable compaction for the new compaction group, then merge the merging compaction groups by merging their logstor segments set into the new cg - simply merging the segment histogram.	2026-03-31 18:40:57 +02:00
Michael Litvak	684ce8de71	logstor: add compaction reenabler add a function that stops and disables compaction for a compaction group and returns a compaction reenabler object, similarly to the normal compaction manager. this will be useful for disabling compaction while doing operations on the compaction group's logstor segment set.	2026-03-31 18:40:56 +02:00
Michael Litvak	1d7c2e4f52	logstor: add segment header we have two types of segments. the active segment is "mixed" because we can write to it multiple write_buffers, each write buffer having records from different tables and tablets. in contrast, the separator and compaction write "full" segments - they write a single write_buffer that has records from a single tablet and storage group. for "full" segments, we add a segment header that contains additional useful metadata such as the table and token range in the segment. the write buffer header contains the type of the buffer, mixed or full. if it's full then it has a segment header placed after the write buffer header.	2026-03-31 18:40:56 +02:00
Michael Litvak	8615f68657	logstor: serialize writes to active segment previously when writing to the active segment, the allocation was serialized but multiple writes could proceed concurrently to different offsets. change it instead to serialize the entire write. we prefer to write larger buffers sequentially instead of multiple buffers concurrently. it is also better that we don't have "holes" in the segment. we also change the buffered_writer to send a single flushing buffer at a time. it has a ring of buffers, new writes are written to the head buffer, and a single consumer flushes the tail buffer.	2026-03-31 18:40:56 +02:00
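The ring-of-buffers idea in the commit above (writers fill the head buffer, a single consumer flushes the tail) can be modelled synchronously as below. This is only a sketch: the real `buffered_writer` is asynchronous, and the class name, ring size and string-based buffers here are all assumptions.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Simplified, single-threaded model of a ring of write buffers.
class buffer_ring {
    static constexpr std::size_t ring_size = 4;
    std::array<std::string, ring_size> _bufs;
    std::size_t _head = 0;  // buffer currently accepting writes
    std::size_t _tail = 0;  // oldest sealed buffer; [_tail, _head) are sealed
    std::size_t _capacity;

public:
    explicit buffer_ring(std::size_t capacity) : _capacity(capacity) {}

    // Appends to the head buffer. When the record does not fit, the head is
    // sealed and the next buffer becomes the head. Returns false when every
    // other buffer is still waiting to be flushed.
    bool write(const std::string& rec) {
        std::string* head = &_bufs[_head % ring_size];
        if (!head->empty() && head->size() + rec.size() > _capacity) {
            if (_head - _tail >= ring_size - 1) {
                return false;  // ring is full, the flusher must catch up
            }
            ++_head;
            head = &_bufs[_head % ring_size];
        }
        head->append(rec);
        return true;
    }

    // Flushes exactly one sealed buffer (the tail) to the sink, so flushes
    // reach the segment one at a time and in order.
    bool flush_one(std::vector<std::string>& sink) {
        if (_tail == _head) {
            return false;  // nothing sealed yet
        }
        std::string& tail = _bufs[_tail % ring_size];
        sink.push_back(std::move(tail));
        tail.clear();
        ++_tail;
        return true;
    }
};
```

Flushing one sealed buffer at a time means the segment receives larger sequential writes in order, which also avoids the "holes" mentioned in the commit message.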
Michael Litvak	e791823874	replica: extend compaction_group functions for logstor extend compaction_group functions such as disk size calculation and empty() to account also for the logstor segments that the compaction group owns. reuse the sstable_add_gate when there is a write in process to a compaction group, in order for the compaction group to be considered not empty.	2026-03-31 18:40:56 +02:00
Michael Litvak	d3db967802	replica: add compaction_group_for_logstor_segment add the function table::compaction_group_for_logstor_segment that we use when recovering a segment to find the compaction group for a segment based on its token range, similarly to compaction_group_for_sstable for sstables. extract the common logic from compaction_group_for_sstable to a common function compaction_group_for_token_range that finds a compaction group for a token range.	2026-03-31 18:40:56 +02:00
Michael Litvak	bf7bc5b410	logstor: code cleanup misc code cleanup and small changes	2026-03-31 18:40:56 +02:00
Nikos Dragazis	0e1e6ebdc5	database: Construct migrating tables with tablet ERMs Extend `database::add_column_family()` with a `storage_mode` argument. If the table is under vnodes-to-tablets migration and the storage mode is "tablets", create a tablet ERM. Make the distributed loader determine the storage mode from topology (`intended_storage_mode` column in system.topology). Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-03-24 13:20:39 +02:00
Nikos Dragazis	bc8109f1a4	distributed_loader: Wire vnode-based resharding into table populator Make the table populator migration-aware. If a table is migrating to tablets, switch from normal resharding to vnode-based resharding. Vnode-based resharding requires passing a vector of "owned ranges" upon which resharding will segregate the SSTables. Compute it from the tablet map. We could also compute them from the vnodes, since tablets are identical to vnodes during the migration, but in the future we may switch to a different model (multiple tablets per vnode). Let the distributed loader decide if a table is migrating or not and communicate that to the table populator. A table is migrating if the keyspace replication strategy uses vnodes but the table replication strategy uses tablets. Currently, tables cannot enter this "migrating" state; support for this will be introduced in the next patches. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-03-24 11:06:38 +02:00
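The "migrating" rule stated in the commit above reduces to a small predicate. The enums and function names below are hypothetical stand-ins for the real replication-strategy and table-populator types.

```cpp
// Hypothetical stand-ins for the real replication strategy kinds.
enum class replication_kind { vnodes, tablets };
enum class resharding_mode { normal, vnode_based };

// A table is migrating when its keyspace still uses vnodes but the table's
// own replication strategy already uses tablets.
inline bool is_migrating(replication_kind keyspace, replication_kind table) {
    return keyspace == replication_kind::vnodes && table == replication_kind::tablets;
}

// The populator switches to vnode-based resharding only for migrating tables.
inline resharding_mode pick_resharding_mode(replication_kind keyspace, replication_kind table) {
    return is_migrating(keyspace, table) ? resharding_mode::vnode_based
                                         : resharding_mode::normal;
}
```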
Nikos Dragazis	63399951df	replica: Pick any compaction group for resharding In the previous patch, reshard compaction was extended with a special operation mode where SSTables from vnode-based tables are segregated on vnode boundaries and not with the static sharder. This will later be wired into vnodes-to-tablets migration. The problem is that resharding requires a compaction group. With a vnode-based table, there is only one compaction group per shard, and this is what the current code utilizes (`try_get_compaction_group_view_with_static_sharding()`). But the new operation mode will apply to migrating tables, which use a `tablet_storage_group_manager`, which creates one compaction group for each tablet. Some compaction group needs to be selected. Pick any compaction group that is available on the current shard. Reshard compaction is an operation that happens early in the startup process; compaction groups do not own any SSTables yet, so all compaction groups are equivalent. Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>	2026-03-24 11:06:38 +02:00
Benny Halevy	d1c6141407	compaction: resharding_compaction: add vnodes_resharding option In this mode, the output sstables generated by resharding compaction are segregated by token range, based on the keyspace vnode-based owned token ranges vector. A basic unit test was also added to sstable_directory_test. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2026-03-24 11:06:38 +02:00
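Segregating output by owned token ranges comes down to finding which range a token belongs to. The sketch below assumes sorted, non-overlapping ranges that are open at the start and closed at the end; the `token_range` struct and the `owning_range` helper are hypothetical and not taken from the ScyllaDB sources.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical range (start_exclusive, end_inclusive].
struct token_range {
    int64_t start_exclusive;
    int64_t end_inclusive;
};

// Returns the index of the range that owns the token, or std::nullopt when
// the token falls outside every owned range. The ranges must be sorted and
// must not overlap.
inline std::optional<std::size_t> owning_range(const std::vector<token_range>& sorted_ranges,
                                               int64_t token) {
    // First range whose end is not before the token.
    auto it = std::lower_bound(sorted_ranges.begin(), sorted_ranges.end(), token,
                               [](const token_range& r, int64_t t) { return r.end_inclusive < t; });
    if (it == sorted_ranges.end() || token <= it->start_exclusive) {
        return std::nullopt;
    }
    return static_cast<std::size_t>(it - sorted_ranges.begin());
}
```

A resharding writer could use the returned index to decide when to close the current output SSTable and open the next one, since partitions arrive in token order.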
Botond Dénes	56c375b1f3	Merge 'table: don't close a disengaged querier in query()' from Pavel Emelyanov There's a flaw in table::query() -- calling querier_opt->close() can dereference a disengaged std::optional. The fix is pretty simple. Once fixed, there are two if-s checking for querier_opt being engaged or not that are worth merging. The problem doesn't really show itself because table::query() is not called with a null saved_querier, so the de-facto if is always correct. However, it is better to be on the safe side. The problem doesn't show itself for real, not worth backporting Closes scylladb/scylladb#29142 * github.com:scylladb/scylladb: table: merge adjacent querier_opt checks in query() table: don't close a disengaged querier in query()	2026-03-24 08:47:35 +02:00
Piotr Dulikowski	60fb5270a9	logstor: fix fmt::format use with std::filesystem::path The version of fmt installed on my machine refuses to work with `std::filesystem::path` directly. Add `.string()` calls in places that attempt to print paths directly in order to make them work. Closes scylladb/scylladb#29148	2026-03-23 15:15:52 +01:00
Pavel Emelyanov	cb329b10bf	code: Add maintenance/maintenance group And move some activities from streaming group into it, namely - tablet_allocator background group - sstables_manager-s components reclaimer - tablet storage group manager merge completion fiber - prometheus All other activity that was in streaming group remains there, but can be moved to this group (or to new maintenance subgroup) later. All but prometheus are patched here, prometheus still uses the maintenance_sched_group variable in main.cc, so it transparently moves into new group Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-23 16:00:03 +03:00
Pavel Emelyanov	de9bfe0f1d	backup: Add maintenance/backup group The snapshot_ctl::backup_task_impl runs in configured scheduling group. Now it's streaming one. This patch introduces the maintenance/backup group and re-configures backup task with it. The group gets its --backup_io_throughput_mb_per_sec option that controls bandwidth limit for this sub-group only. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-23 16:00:02 +03:00
Pavel Emelyanov	6f43e8562e	compaction: Add maintenance/maintenance_compaction group Compaction manager tells compaction_sched_group from maintenance_compaction_sched_group. The latter, however, is set to be "streaming" group. This patch adds real maintenance_compaction group under the maintenance supergroup and makes compaction manager use it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-23 16:00:02 +03:00
Pavel Emelyanov	45ecf15fff	database: Use local variable for current_scheduling_group The classify_request() helper captures current scheduling group into local variable and compares it with groups from db_config to decide which "class" it belongs to. One if uses current_scheduling_group(), while it could use the local variable. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2026-03-23 16:00:02 +03:00
Pavel Emelyanov	7dce43363e	table: merge adjacent querier_opt checks in query() After the previous fix both guarding if-s start with 'if (querier_opt &&'. Merge them into a single outer 'if (querier_opt)' block to avoid the redundant check and make the structure easier to follow. No functional change. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-20 14:48:08 +03:00
Pavel Emelyanov	9c1c41df03	table: don't close a disengaged querier in query() The condition guarding querier_opt->close() was: When saved_querier is null the short-circuit makes the whole condition true regardless of whether querier_opt is engaged. If partition_ranges is empty, query_state::done() is true before the while-loop body ever runs, so querier_opt is never created. Calling querier_opt->close() then dereferences a disengaged std::optional — undefined behaviour. Fix by checking querier_opt first: This preserves all existing semantics (close when not saving, or when saving wouldn't be useful) while making the no-querier path safe. Why this doesn't surface today: the sole production call site, database::query(), in practice. The API header documents nullptr as valid ("Pass nullptr when queriers are not saved"), so the bug is real but latent. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-03-20 12:25:13 +03:00
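The general pattern behind this fix, testing that a `std::optional` is engaged before dereferencing it, is illustrated below. The commit message does not reproduce the actual conditions from `table::query()`, so the `querier` type, the `want_to_save` flag and the `maybe_close` helper are hypothetical and only show the shape of the guard.

```cpp
#include <optional>

// Hypothetical stand-in for the real querier type.
struct querier {
    bool closed = false;
    void close() { closed = true; }
};

// Closes the querier only if one was actually created. Checking querier_opt
// first means the disengaged case never reaches operator->, which would be
// undefined behaviour. Returns true when close() was called.
inline bool maybe_close(std::optional<querier>& querier_opt, bool want_to_save) {
    if (querier_opt && !want_to_save) {
        querier_opt->close();
        return true;
    }
    return false;
}
```

Putting the engagement check first relies on the short-circuit evaluation of `&&`, so the rest of the condition is evaluated only when a querier exists.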
Botond Dénes	5573c3b18e	Merge 'tablets: Fix deadlock in background storage group merge fiber' from Tomasz Grabiec When it deadlocks, groups stop merging and the compaction group merge backlog will run away. Also, graceful shutdown will be blocked on it. Found by flaky unit test test_merge_chooses_best_replica_with_odd_count, which timed out in 1 in 100 runs. Reason for deadlock: When storage groups are merged, the main compaction group of the new storage group takes a compaction lock, which is appended to _compaction_reenablers_for_merging, and released when the merge completion fiber is done with the whole batch. If we accumulate more than 1 merge cycle for the fiber, deadlock occurs. The lock order will be as follows. Initial state: cg0: main cg1: main cg2: main cg3: main After 1st merge: cg0': main [locked], merging_groups=[cg0.main, cg1.main] cg1': main [locked], merging_groups=[cg2.main, cg3.main] After 2nd merge: cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main] The merge completion fiber will try to stop cg0'.main, which will be blocked on the compaction lock, which is held by the reenabler in _compaction_reenablers_for_merging, hence the deadlock. The fix is to wait for background merge to finish before we start the next merge. It's achieved by holding the old erm in the background merge, and doing a topology barrier from the merge finalizing transition. Background merge is supposed to be a relatively quick operation: it stops compaction groups, so it may wait for active requests. It shouldn't prolong the barrier indefinitely. Tablet tests which trigger merge need to be adjusted to call the barrier, otherwise they will be vulnerable to the deadlock. Fixes SCYLLADB-928 Backport to >= 2025.4 because it's the earliest vulnerable due to `f9021777d8`. 
Closes scylladb/scylladb#29007 * github.com:scylladb/scylladb: tablets: Fix deadlock in background storage group merge fiber replica: table: Propagate old erm to storage group merge test: boost: tablets_test: Save tablet metadata when ACKing split resize decision storage_service: Extract local_topology_barrier()	2026-03-20 09:05:52 +02:00
Avi Kivity	6b259babeb	Merge 'logstor: initial log-structured storage for key-value tables' from Michael Litvak Introduce an initial and experimental implementation of an alternative log-structured storage engine for key-value tables. Main flows and components: * The storage is composed of 32MB files, each file divided into segments of size 128k. We write records to them sequentially; each record contains a mutation and additional metadata. Records are written to a buffer first and then written to the active segment sequentially in 4k-sized blocks. * The primary index in memory maps keys to their location on disk. It is a B-tree per-table that is ordered by tokens, similar to a memtable. * On reads we calculate the key and look it up in the primary index, then read the mutation from disk with a single disk IO. * On writes we write the record to a buffer, wait for it to be written to disk, then update the index with the new location, and free the previous record. * We track the used space in each segment. When overwriting a record, we increase the free space counter for the segment of the previous record that becomes dead. We store the segments in a histogram by usage. * The compaction process takes segments with low utilization, reads them and writes the live records to new segments, and frees the old segments. * Segments are initially "mixed" - we write to the active segment records from all tables and all tablets. The "separator" process rewrites records from mixed segments into new segments that are organized by compaction groups (tablets), and frees the mixed segments. Each write is written to the active segment and to a separator buffer of the compaction group, which is eventually flushed to a new segment in the compaction group. Currently this mode is experimental and requires an experimental flag to be enabled. Some things that are not supported yet are strong consistency, tablet migration, tablet split/merge, big mutations, tombstone gc, ttl. 
to use, add to config:
```
enable_logstor: true
experimental_features:
  - logstor
```
create a table:
```
CREATE TABLE ks.t(pk int PRIMARY KEY, a int, v text) WITH storage_engine = 'logstor';
```
INSERT, SELECT, DELETE work as expected. UPDATE not supported yet. no backport - new feature Closes scylladb/scylladb#28706 * github.com:scylladb/scylladb: logstor: trigger separator flush for buffers that hold old segments docs/dev: add logstor documentation logstor: recover segments into compaction groups logstor: range read logstor: change index to btree by token per table logstor: move segments to replica::compaction_group db: update dirty mem limits dynamically logstor: track memory usage logstor: logstor stats api logstor: compaction buffer pool logstor: separator: flush buffer when full logstor: hold segment until index updates logstor: truncate table logstor: enable/disable compaction per table logstor: separator buffer pool test: logstor: add separator and compaction tests logstor: segment and separator barrier logstor: separator debt controller logstor: compaction controller logstor: recovery: recover mixed segments using separator logstor: wait for pending reads in compaction logstor: separator logstor: compaction groups logstor: cache files for read logstor: recovery: initial logstor: add segment generation logstor: reserve segments for compaction logstor: index: buckets logstor: add buffer header logstor: add group_id logstor: record generation logstor: generation utility logstor: use RIPEMD-160 for index key test: add test_logstor.py api: add logstor compaction trigger endpoint replica: add logstor to db schema: add logstor cf property logstor: initial commit db: disable tablet balancing with logstor db: add logstor experimental feature flag	2026-03-20 00:18:09 +02:00
Botond Dénes	4981e72607	Merge 'replica: avoid unnecessary computation on token lookup hot path' from Łukasz Paszkowski `storage_group_of()` sits on the replica-side token lookup hot path, yet it called `tablet_map::get_tablet_id_and_range_side()`, which always computes both the tablet id and the post-split range side — even though most callers only need the storage group id. The range-side computation is only relevant when a storage group is in tablet splitting mode, but we were paying for it unconditionally on every lookup. This series fixes that by: 1. Adding `tablet_map::get_tablet_range_side()` so the range side can be computed independently when needed. 2. Adding lazy `select_compaction_group()` overloads that defer the range-side computation until splitting mode is actually active. 3. Switching `storage_group_of()` to use the cheaper `get_tablet_id()` path, only computing the range side on demand. Improvements. No backport is required. Closes scylladb/scylladb#28963 * github.com:scylladb/scylladb: replica/table: avoid computing token range side in storage_group_of() on hot path replica/compaction_group: add lazy select_compaction_group() overloads locator/tablets: add tablet_map::get_tablet_range_side()	2026-03-19 14:27:12 +02:00
Pavel Emelyanov	f27dc12b7c	Merge 'Fix directory lister leak in table::get_snapshot_details: ' from Benny Halevy As reported in SCYLLADB-1013, the directory lister must be closed also when an exception is thrown. For example, see backtrace below: ``` seastar::on_internal_error(seastar::logger&, std::basic_string_view<char, std::char_traits<char>>) at ./build/release/seastar/./seastar/src/core/on_internal_error.cc:57 directory_lister::~directory_lister() at ./utils/lister.cc:77 replica::table::get_snapshot_details(std::filesystem::__cxx11::path, std::filesystem::__cxx11::path) (.resume) at ./replica/table.cc:4081 std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<db::snapshot_ctl::table_snapshot_details>::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/15/../../../../include/c++/15/coroutine:247 (inlined by) seastar::internal::coroutine_traits_base<db::snapshot_ctl::table_snapshot_details>::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:129 seastar::reactor::task_queue::run_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:2695 (inlined by) seastar::reactor::task_queue_group::run_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3201 seastar::reactor::task_queue_group::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3185 (inlined by) seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3353 seastar::reactor::run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3245 seastar::app_template::run_deprecated(int, char, std::function<void ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:266 seastar::app_template::run(int, char, std::function<seastar::future<int> ()>&&) at ./build/release/seastar/./seastar/src/core/app-template.cc:160 scylla_main(int, char*) at ./main.cc:756 ``` Fixes: [SCYLLADB-1013](https://scylladb.atlassian.net/browse/SCYLLADB-1013) Requires backport to 2026.1 since 
the leak exists since `004c08f525` [SCYLLADB-1013]: https://scylladb.atlassian.net/browse/SCYLLADB-1013?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ Closes scylladb/scylladb#29084 * github.com:scylladb/scylladb: test/boost/database_test: add test_snapshot_ctl_details_exception_handling table: get_snapshot_details: fix indentation inside try block table: per-snapshot get_snapshot_details: fix typo in comment table: per-snapshot get_snapshot_details: always close lister using try/catch table: get_snapshot_details: always close lister using deferred_close	2026-03-19 12:40:23 +03:00
Michael Litvak	31d339e54a	logstor: trigger separator flush for buffers that hold old segments A compaction group has a separator buffer that holds the mixed segments alive until the separator buffer is flushed. A mixed segment can be freed only after all separator buffers that hold writes from the segment are flushed. Typically a separator buffer is flushed when it becomes full. However, it's possible, for example, that one compaction group fills more slowly than the others and holds many segments alive. To fix this we trigger a separator flush periodically for separator buffers that hold old segments. We track the active segment sequence number and, for each separator buffer, the oldest sequence number it holds.	2026-03-18 19:24:28 +01:00
Michael Litvak	a0da07e5b7	logstor: recover segments into compaction groups Fix the logstor recovery to work with compaction groups. When recovering a segment, find its token range and add it to the appropriate compaction group. If it doesn't fit in a single compaction group, then write each record to its compaction group's separator buffer.	2026-03-18 19:24:28 +01:00
Michael Litvak	24379acc76	logstor: range read extend the logstor mutation reader to support range read	2026-03-18 19:24:28 +01:00
Michael Litvak	a9d0211a64	logstor: change index to btree by token per table Change the primary index to be a btree that is ordered by token, similarly to a memtable, and create an index per-table instead of a single global index.	2026-03-18 19:24:28 +01:00
Michael Litvak	e7c3942d43	logstor: move segments to replica::compaction_group Add a segment_set member to replica::compaction_group that manages the logstor segments that belong to the compaction group, similarly to how it manages sstables. Add also a separator buffer in each compaction group. When writing a mutation to a compaction group, the mutation is written to the active segment and to the separator buffer of the compaction group, and when the separator buffer is flushed the segment is added to the compaction_group's segment set.	2026-03-18 19:24:28 +01:00
Michael Litvak	d69f7eb0ee	db: update dirty mem limits dynamically When logstor is enabled, update the db dirty memory limits dynamically. Previously the threshold was set to 0.5 of the available memory, so 0.5 goes to memtables and 0.5 to others (cache). When logstor is enabled, we calculate the available memory excluding logstor, and divide it evenly between memtables and cache.	2026-03-18 19:24:27 +01:00
Michael Litvak	65cd0b5639	logstor: track memory usage add logstor::get_memory_usage() that returns an estimate of the memory usage by logstor. add tracking to how many unique keys are held in the index.	2026-03-18 19:24:27 +01:00
Michael Litvak	b7bdb1010a	logstor: logstor stats api add api to get logstor statistics about segments for a table	2026-03-18 19:24:27 +01:00
Michael Litvak	8bd3bd7e2a	logstor: compaction buffer pool pre-allocate write buffers for compaction	2026-03-18 19:24:27 +01:00
Michael Litvak	caf5aa47c2	logstor: separator: flush buffer when full flush separator buffers when they become full and switched instead of aggregating all the buffers and flushing them when the separator is switched.	2026-03-18 19:24:27 +01:00
Michael Litvak	6ddb7a4d13	logstor: hold segment until index updates add a write gate to write_buffer. when writing a record to the write buffer, the gate is held and passed back to the caller, and the caller holds the gate until the write operation is complete, including follow-up operations such as updating the index after the write. in particular, when writing a mutation in logstor::write, the write buffer is held open until the write is completed and updated in the index. when writing the write buffer to the active segment, we write the buffer and then wait for the write buffer gate to close, i.e. we wait for all index updates to complete before proceeding. the segment is held open until all the write operations and index updates are complete. this property is useful for correctness: when a segment is closed we know that all the writes to it are updated in the index. this is needed in compaction for example, where we take closed segments and check which records in them are alive by looking them up in the index. if the index is not updated yet then it will be wrong.	2026-03-18 19:24:27 +01:00
Michael Litvak	bd66edee5c	logstor: truncate table implement freeing all segments of a table for table truncate. first do barrier to flush all active and mixed segments and put all the table's data in compaction groups, then stop compaction for the table, then free the table's segments and remove the live entries from the index.	2026-03-18 19:24:27 +01:00
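
The sizes quoted in the logstor merge message (6b259babeb) imply a fixed on-disk geometry, which the sketch below works out. It is only an illustration under stated assumptions: "32MB", "128k" and "4k" are read as binary units, and every name here (`file_size`, `segment_size`, `block_size`, `location`, `locate`) is hypothetical and not taken from the ScyllaDB source.

```cpp
#include <cassert>
#include <cstdint>

// Sizes quoted in the merge message; treating them as binary units is an assumption.
constexpr std::uint64_t file_size    = 32ull * 1024 * 1024; // "32MB files"
constexpr std::uint64_t segment_size = 128ull * 1024;       // "segments of size 128k"
constexpr std::uint64_t block_size   = 4ull * 1024;         // "4k sized blocks"

// Derived geometry.
constexpr std::uint64_t segments_per_file  = file_size / segment_size;
constexpr std::uint64_t blocks_per_segment = segment_size / block_size;

static_assert(segments_per_file == 256, "32 MiB / 128 KiB");
static_assert(blocks_per_segment == 32, "128 KiB / 4 KiB");

// Hypothetical helper type: where a byte offset falls inside one file.
struct location {
    std::uint64_t segment; // index of the segment within the file
    std::uint64_t block;   // index of the 4k block within that segment
};

// Map a byte offset within a file to its segment and block indexes.
constexpr location locate(std::uint64_t file_offset) {
    assert(file_offset < file_size); // offsets past the end of the file are invalid
    return { file_offset / segment_size,
             (file_offset % segment_size) / block_size };
}
```

Under these assumptions each file holds 256 segments and each segment holds 32 blocks, so `locate(segment_size + block_size)` lands in segment 1, block 1.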

1976 Commits