The PR serves two purposes.
First, it makes the flag usage be consistent across multiple ways to load sstables components. For example, the sstable::load_metadata() doesn't set it (like .load() does) thus potentially refusing to load "corrupted" components, as the flag assumes.
Second, it removes the fanout of db.get_config().ignore_component_digest_mismatch() over the code. This thing is called pretty much everywhere to initialize the sstable_open_config, while the option in question is "scylla state" parameter, not "sstable opening" one.
Code cleanup, not backporting
Closesscylladb/scylladb#29513
* github.com:scylladb/scylladb:
sstables: Remove ignore_component_digest_mismatch from sstable_open_config
sstables: Move ignore_component_digest_mismatch initialization to constructor
sstables: Add ignore_component_digest_mismatch to sstables_manager config
`system.large_partitions`, `system.large_rows`, and `system.large_cells` store records keyed by SSTable name. When SSTables are migrated between shards or nodes (resharding, streaming, decommission), the records are lost because the destination never writes entries for the migrated SSTables.
This patch series moves the source of truth for large data records into the SSTable's scylla metadata component (new `LargeDataRecords` tag 13) and reimplements the three `system.large_*` tables as virtual tables that query live SSTables on demand. A cluster feature flag (`LARGE_DATA_VIRTUAL_TABLES`) gates the transition for safe rolling upgrades.
When the cluster feature is enabled, each node drops the old system large_* tables and starts serving the corresponding tables using virtual tables that represent the large data records now stored on the sstables.
Note that the virtual tables will be empty after upgrade until the sstables that contained large data are rewritten, therefore it is recommended to run upgrade sstables compaction or major compaction to repopulate the sstables scylla-metadata with large data records.
1. **keys: move key_to_str() to keys/keys.hh** — make the helper reusable across large_data_handler, virtual tables, and scylla-sstable
2. **sstables: add LargeDataRecords metadata type (tag 13)** — new struct with binary-serialized key fields, scylla-sstable JSON support, format documentation
3. **large_data_handler: rename partition_above_threshold to above_threshold_result** — generalize the struct for reuse
4. **large_data_handler: return above_threshold_result from maybe_record_large_cells** — separate booleans for cell size vs collection elements thresholds
5. **sstables: populate LargeDataRecords from writer** — bounded min-heaps (one per large_data_type), configurable top-N via `compaction_large_data_records_per_sstable`
6. **test: add LargeDataRecords round-trip unit tests** — verify write/read, top-N bounding, below-threshold behavior
7. **db: call initialize_virtual_tables from shard 0 only** — preparatory refactoring to enable cross-shard coordination
8. **db: implement large_data virtual tables with feature flag gating** — three virtual table classes, feature flag activation, legacy SSTable fallback, dual-threshold dedup, cross-shard collection
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1276
* Although this fixes a bug where large data entries are effectively lost when sstables are renamed or migrated, the changes are intrusive and do not warrant a backport
Closesscylladb/scylladb#29257
* github.com:scylladb/scylladb:
db: implement large_data virtual tables with feature flag gating
db: call initialize_virtual_tables from shard 0 only
test: add LargeDataRecords round-trip unit tests
sstables: populate LargeDataRecords from writer
large_data_handler: return above_threshold_result from maybe_record_large_cells
large_data_handler: rename partition_above_threshold to above_threshold_result
sstables: add LargeDataRecords metadata type (tag 13)
sstables: add fmt::formatter for large_data_type
keys: move key_to_str() to keys/keys.hh
The ignore_component_digest_mismatch flag is now initialized at sstable construction
time from sstables_manager::config (which is populated from db::config at boot time).
Remove the flag from sstable_open_config struct and all call sites that were setting
it explicitly.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copy the ignore_component_digest_mismatch flag from db::config to sstables_manager::config
during database initialization. This makes the flag available early in the boot process,
before SSTables are loaded, enabling later commits to move the flag initialization from
load-time to construction-time.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Split the `log_record` to `log_record_header` type that has the record
metadata fields and the mutation as a separate field which is the actual
record data:
struct log_record {
log_record_header header;
canonical_mutation mut;
};
Both the header and mutation have variable serialized size. When a
record is serialized in a write_buffer, we first put a small
`record_header` that has the header size and data size, then the
serialized header and data follow. The `log_location` of a record points
to the beginning of the `record_header`, and the size includes the
`record_header`.
This allows us to read a record header without reading the data when
it's not needed and avoid deserializing it:
* on recovery, when scanning all segments, we read only the record
headers.
* on compaction, we read the record header first to determine if the
record is alive, if yes then we read the data.
Closesscylladb/scylladb#29457
During compaction (SSTable writing), maintain bounded min-heaps (one per
large_data_type) that collect the top-N above-threshold records. On
stream end, drain all five heaps into a single LargeDataRecords array
and write it into the SSTable's scylla metadata component.
Five separate heaps are used:
- partition_size, row_size, cell_size: ordered by value (size bytes)
- rows_in_partition, elements_in_collection: ordered by elements_count
A new config option 'compaction_large_data_records_per_sstable' (default
10) controls the maximum number of records kept per type.
There are several reasons we want to do that.
One is that it will give us more flexibility in distributing the
load. We can subdivide tablets at any token, and achieve more
evenly-sized tablets. In particular, we can isolate large partitions
into separate tablets.
We can also split and merge incrementally individual tablets.
Currently, we do it for the whole table or nothing, which makes
splits and merges take longer and cause wide swings of the count.
This is not implemented in this PR yet, we still split/merge the whole table.
Another reason is vnode to tablets migration. We now could construct a
tablet map which matches exactly the vnode boundaries, so migration
can happen transparently from CQL-coordinator point of view.
Tablet count is still a power-of-two by default for newly created tables.
It may be different if tablet map is created by non-standard means,
or if per-table tablet option "pow2_count" is set to "false".
build/release/scylla perf-tablets:
Memory footprint for 131k tablets increased from 56 MiB to 58.1 MiB (+3.5%)
Before:
```
Generating tablet metadata
Total tablet count: 131072
Size of tablet_metadata in memory: 57456 KiB
Copied in 0.014346 [ms]
Cleared in 0.002698 [ms]
Saved in 1234.685303 [ms]
Read in 445.577881 [ms]
Read mutations in 299.596313 [ms] 128 mutations
Read required hosts in 247.482742 [ms]
Size of canonical mutations: 33.945053 [MiB]
Disk space used by system.tablets: 1.456761 [MiB]
Tablet metadata reload:
full 407.69ms
partial 2.65ms
```
After:
```
Generating tablet metadata
Total tablet count: 131072
Size of tablet_metadata in memory: 59504 KiB
Copied in 0.032475 [ms]
Cleared in 0.002965 [ms]
Saved in 1093.877441 [ms]
Read in 387.027100 [ms]
Read mutations in 255.752121 [ms] 128 mutations
Read required hosts in 211.202805 [ms]
Size of canonical mutations: 33.954453 [MiB]
Disk space used by system.tablets: 1.450162 [MiB]
Tablet metadata reload:
full 354.50ms
partial 2.19ms
```
Closesscylladb/scylladb#28459
* github.com:scylladb/scylladb:
test: boost: tablets: Add test for merge with arbitrary tablet count
tablets, database: Advertise 'arbitrary' layout in snapshot manifest
tablets: Introduce pow2_count per-table tablet option
tablets: Prepare for non-power-of-two tablet count
tablets: Implement merged tablet_map constructor on top of for_each_sibling_tablets()
tablets: Prepare resize_decision to hold data in decisions
tablets: table: Make storage_group handle arbitrary merge boundaries
tablets: Make stats update post-merge work with arbitrary merge boundaries
locator: tablets: Support arbitrary tablet boundaries
locator: tablets: Introduce tablet_map::get_split_token()
dht: Introduce get_uniform_tokens()
Three bugs fixed in segment_manager.cc:
1. write_to_separator(): captured [&index] where index was a local
coroutine-frame reference. The future is stored in
buf.pending_updates and resolved later in flush_separator_buffer(),
by which time the enclosing coroutine frame is destroyed, making
&index a dangling pointer. This is a use-after-free that manifests
as a segfault. Fix: capture index_ptr (raw pointer by value) instead.
2. add_segment_to_compaction_group(): same dangling [&index] pattern
inside the for_each_live_record lambda during recovery. Same fix
applied.
3. write(): local 'auto loc = seg->allocate(...)' shadowed the outer
'log_location loc', causing the function to always return a
zero-initialized log_location{}. Fix: remove 'auto' so the
assignment targets the outer variable.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#29451
The enable_logstor configuration option is redundant with the 'logstor'
experimental feature flag. Consolidate to a single gate: use the
experimental feature to control both whether logstor is available for
table creation and whether it is initialized at database startup.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29427
Currently, the manifest advertises "powof2", which is wrong for
arbitrary count and boundaries.
Introduce a new kind of layout called "arbitrary", and produce it if
the tablet map doesn't conform to "powof2" layout.
We should also produce tablet boundaries in this case, but that's
worked on in a different PR: https://github.com/scylladb/scylladb/pull/28525
This is a step towards more flexibility in managing tablets. A
prerequisite before we can split individual tablets, isolating hot
partitions, and evening-out tablet sizes by shifting boundaries.
After this patch, the system can handle tables with arbitrary tablet
count. Tablet allocator is still rounding up desired tablet count to
the nearest power of two when allocating tablets for a new table, so
unless the tablet map is allocated in some other way, the counts will
be still a power of two.
We plan to utilize arbitrary count when migrating from vnodes to
tablets, by creating a tablet map which matches vnode boundaries.
One of the reasons we don't give up on power-of-two by default yet is
that it creates an issue with merges. If tablet count is odd, one of
the tablets doesn't have a sibling and will not be merged. That can
obviously cause imbalance of token space and tablet sizes between
tablets. To limit the impact, this patch dynamically chooses which
tablet to isolate when initiating a merge. The largest tablet is
chosen, as that will minimize imbalance. Otherwise, if we always chose
the last tablet to isolate, its size would remain the same while other
tablets double in size with each odd-count merge, leading to
imbalance. The imbalance will still be there, but the difference in
tablet sizes is limited to 2x.
Example (3 tablets):
[0] owns 1/3 of tokens
[1] owns 1/3 of tokens
[2] owns 1/3 of tokens
After merge:
[0] owns 2/3 of tokens
[1] owns 1/3 of tokens
What we would like instead:
Step 1 (split [1]):
[0] owns 1/3 of tokens
[1] old 1.left, owns 1/6 of tokens
[2] old 1.right, owns 1/6 of tokens
[3] owns 1/3 of tokens
Step 2 (merge):
[0] owns 1/2 of tokens
[1] owns 1/2 of tokens
To do that, we need to be able to split individual tablets, but we're
not there yet.
We only assume that new tablets have boundaries which are equal
to some boundaries of old tablets.
In preparation for supporting arbitrary merge plan, where any replica
can be isolated (not merged with siblings) by the merge plan.
There are several reasons we want to do that.
One is that it will give us more flexibility in distributing the
load. We can subdivide tablets at any points, and achieve more
evenly-sized tablets. In particular, we can isolate large partitions
into separate tablets.
Another reason is vnode-to-tablet migration. We could construct a
tablet map which matches exactly the vnode boundaries, so migration
can happen transparently from the CQL-coordinator's point of view.
Implementation details:
We store a vector of tokens which represent tablet boundaries in the
tablet_id_map. tablet_id keeps its meaning, it's an index into vector
of tablets. To avoid logarithmic lookup of tablet_id from the token,
we introduce a lookup structure with power-of-two aligned buckets, and
store the tablet_id of the tablet which owns the first token in the
bucket. This way, lookup needs to consider tablet id range which
overlaps with one bucket. If boundaries are more or less aligned,
there are around 1-2 tablets overlapping with a bucket, and the lookup
is still O(1).
Amount of memory used increased, but not significantly relative to old
size (because tablet_info is currently fat):
For 131'072 tablets:
Before:
Size of tablet_metadata in memory: 57456 KiB
After:
Size of tablet_metadata in memory: 59504 KiB
Tablet split can call set_split_mode() between the point where
truncate_table_on_all_shards() disables compaction on all existing
compaction groups and the point where discard_sstables() checks that
compaction is disabled. The new split-ready compaction groups created
by set_split_mode() won't have compaction disabled, causing
discard_sstables() to fire on_internal_error.
Fix by preventing set_split_mode() from creating new compaction groups
when compaction is disabled on the main group. If truncation has
already disabled compaction, split will simply report not-ready rather
than creating groups which have compaction enabled.
This is safe because split will be retried once truncation completes
and re-enables compaction.
implement tablet split, tablet merge and tablet migration for tables that use the experimental logstor storage engine.
* tablet merge simply merges the histograms of segments of one compaction group with another.
* for tablet split we take the segments from the source compaction group, read them and write all live records to separate segments according to the split classifier, and move separated segments to the target compaction groups.
* for tablet migration we use stream_blob, similarly to file streaming of sstables. we add a new op type for streaming a logstor segment. on the source we take a snapshot of the segments with an input stream that reads the segment, and on the target we create a sink that allocates a new segment on the target shard and writes to it.
* we also do some improvements for recovery and loading of segments. we add a segment header that contains useful information for non-mixed segments, such as the table and token range.
Refs SCYLLADB-770
no backport - still a new and experimental feature
Closesscylladb/scylladb#29207
* github.com:scylladb/scylladb:
test: logstor: additional logstor tests
docs/dev: add logstor on-disk format section
logstor: add version and crc to buffer header
test: logstor: tablet split/merge and migration
logstor: enable tablet balancing
logstor: streaming of logstor segments using stream_blob
logstor: add take_logstor_snapshot
logstor: segment input/output stream
logstor: implement compaction_group::cleanup
logstor: tablet split
logstor: tablet merge
logstor: add compaction reenabler
logstor: add segment header
logstor: serialize writes to active segment
replica: extend compaction_group functions for logstor
replica: add compaction_group_for_logstor_segment
logstor: code cleanup
The supergroup replaces streaming (a.k.a. maintenance as well) group, inherits 200 shares from it and consists of four sub-groups (all have equal shares of 200 withing the new supergroup)
* maintenance_compaction. This group configures `compaction_manager::maintenance_sg()` group. User-triggered compaction runs in it
* backup. This group configures `snapshot_ctl::config::backup_sched_group`. Native backup activity runs there
* maintenance. It's a new "visible" name, everything that was called "maintenance" in the code ran in "streaming" group. Now it will run in "maintenance". The activities include those that don't communicate over RPC (see below why)
* `tablet_allocator::balance_tablets()`
* `sstables_manager::components_reclaim_reload_fiber()`
* `tablet_storage_group_manager::merge_completion_fiber()`
* metrics exporting http server altogether
* streaming. This is purely existing streaming group that just moves under the new supergroup. Everything else that was run there, continues doing so, including
* hints sender
* all view building related components (update generator, builder, workers)
* repair
* stream_manager
* messaging service (except for verb handlers that switch groups)
* join_cluster() activity
* REST API
* ... something else I forgot
The `--maintenance_io_throughput_mb_per_sec` option is introduced. It controls the IO throughput limit applied to the maintenance supergroup. If not set, the `--stream_io_throughput_mb_per_sec` option is used to preserve backward compatibility.
All new sched groups inherit `request_class::maintenance` (however, "backup" seem not to make any requests yet).
Moving more activities from "streaming" into "maintenance" (or its own group) is possible, but one will need to take care of RPC group switching. The thing is that when a client makes an RPC call, the server may switch to one of pre-negotiated scheduling groups. Verbs for existing activities that run in "streaming" group are routed through RPC index that negotiates "streaming" group on the server side. If any of that client code moves to some other group, server will still run the handlers in "streaming" which is not quite expected. That's one of the main reasons why only the selected fibers were moved to their own "maintenance" group. Similar for backup -- this code doesn't use RPC, so it can be moved. Restoring code uses load-and-stream and corresponding RPCs, so it cannot be just moved into its own new group.
Fixes SCYLLADB-351
New feature, not backporting
Closesscylladb/scylladb#28542
* github.com:scylladb/scylladb:
code: Add maintenance/maintenance group
backup: Add maintenance/backup group
compaction: Add maintenance/maintenance_compaction group
main: Introduce maintenance supergroup
main: Move all maintenance sched group into streaming one
database: Use local variable for current_scheduling_group
code: Live-update IO throughputs from main
For counter updates, use a counter ID that is constructed from the
node's rack instead of the node's host ID.
A rack can have at most two active tablet replicas at a time: a single
normal tablet replica, and during tablet migration there are two active
replicas, the normal and pending replica. Therefore we can have two
unique counter IDs per rack that are reused by all replicas in the rack.
We construct the counter ID from the rack UUID, which is constructed
from the name "dc:rack". The pending replica uses a deterministic
variation of the rack's counter ID by negating it.
This improves the performance and size of counter cells by having less
unique counter IDs and less counter shards in a counter cell.
Previously the number of counter shards was the number of different
host_id's that updated the counter, which can be typically the number of
nodes in the cluster and continue growing indefinitely when nodes are
replaced. with the rack-based counter id the number of counter shards
will be at most twice the number of different racks (including removed
racks, which should not be significant).
Fixes SCYLLADB-356
backport not needed - an enhancement
Closesscylladb/scylladb#28901
* github.com:scylladb/scylladb:
docs/dev: add counters doc
counters: reuse counter IDs by rack
For counter updates, use a counter ID that is constructed from the
node's rack instead of the node's host ID.
A rack can have at most two active tablet replicas at a time: a single
normal tablet replica, and during tablet migration there are two active
replicas, the normal and pending replica. Therefore we can have two
unique counter IDs per rack that are reused by all replicas in the rack.
We construct the counter ID from the rack UUID, which is constructed
from the name "dc:rack". The pending replica uses a deterministic
variation of the rack's counter ID by negating it.
This improves the performance and size of counter cells by having less
unique counter IDs and less counter shards in a counter cell.
Previously the number of counter shards was the number of different
host_id's that updated the counter, which can be typically the number of
nodes in the cluster and continue growing indefinitely when nodes are
replaced. with the rack-based counter id the number of counter shards
will be at most twice the number of different racks (including removed
racks, which should not be significant).
Fixes SCYLLADB-356
Add .set_skip_when_empty() to four metrics in replica/database.cc that
are only incremented on very rare error paths and are almost always zero:
- database::dropped_view_updates: view updates dropped due to overload.
NOTE: this metric appears to never be incremented in the current
codebase and may be a candidate for removal.
- database::multishard_query_failed_reader_stops: documented as a 'hard
badness counter' that should always be zero. NOTE: no increment site
was found in the current codebase; may be a candidate for removal.
- database::multishard_query_failed_reader_saves: documented as a 'hard
badness counter' that should always be zero.
- database::total_writes_rejected_due_to_out_of_space_prevention: only
fires when disk utilization is critical and user table writes are
disabled, a very rare operational state.
These metrics create unnecessary reporting overhead when they are
perpetually zero. set_skip_when_empty() suppresses them from metrics
output until they become non-zero.
AI-Assisted: yes
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Closesscylladb/scylladb#29345
During incremental repair, each tablet replica holds three SSTable views:
UNREPAIRED, REPAIRING, and REPAIRED. The repair lifecycle is:
1. Replicas snapshot unrepaired SSTables and mark them REPAIRING.
2. Row-level repair streams missing rows between replicas.
3. mark_sstable_as_repaired() runs on all replicas, rewriting the
SSTables with repaired_at = sstables_repaired_at + 1 (e.g. N+1).
4. The coordinator atomically commits sstables_repaired_at=N+1 and
the end_repair stage to Raft, then broadcasts
repair_update_compaction_ctrl which calls clear_being_repaired().
The bug lives in the window between steps 3 and 4. After step 3, each
replica has on-disk SSTables with repaired_at=N+1, but sstables_repaired_at
in Raft is still N. The classifier therefore sees:
is_repaired(N, sst{repaired_at=N+1}) == false
sst->being_repaired == null (lost on restart, or not yet set)
and puts them in the UNREPAIRED view. If a new write arrives and is
flushed (repaired_at=0), STCS minor compaction can fire immediately and
merge the two SSTables. The output gets repaired_at = max(N+1, 0) = N+1
because compaction preserves the maximum repaired_at of its inputs.
Once step 4 commits sstables_repaired_at=N+1, the compacted output is
classified REPAIRED on the affected replica even though it contains data
that was never part of the repair scan. Other replicas, which did not
experience this compaction, classify the same rows as UNREPAIRED. This
divergence is never healed by future repairs because the repaired set is
considered authoritative. The result is data resurrection: deleted rows
can reappear after the next compaction that merges unrepaired data with the
wrongly-promoted repaired SSTable.
The fix has two layers:
Layer 1 (in-memory, fast path): mark_sstable_as_repaired() now also calls
mark_as_being_repaired(session) on the new SSTables it writes. This keeps
them in the REPAIRING view from the moment they are created until
repair_update_compaction_ctrl clears the flag after step 4, covering the
race window in the normal (no-restart) case.
Layer 2 (durable, restart-safe): a new is_being_repaired() helper on
tablet_storage_group_manager detects the race window even after a node
restart, when being_repaired has been lost from memory. It checks:
sst.repaired_at == sstables_repaired_at + 1
AND tablet transition kind == tablet_transition_kind::repair
Both conditions survive restarts: repaired_at is on-disk in SSTable
metadata, and the tablet transition is persisted in Raft. Once the
coordinator commits sstables_repaired_at=N+1 (step 4), is_repaired()
returns true and the SSTable naturally moves to the REPAIRED view.
The classifier in make_repair_sstable_classifier_func() is updated to call
is_being_repaired(sst, sstables_repaired_at) in place of the previous
sst->being_repaired.uuid().is_null() check.
A new test, test_incremental_repair_race_window_promotes_unrepaired_data,
reproduces the bug by:
- Running repair round 1 to establish sstables_repaired_at=1.
- Injecting delay_end_repair_update to hold the race window open.
- Running repair round 2 so all replicas complete mark_sstable_as_repaired
(repaired_at=2) but the coordinator has not yet committed step 4.
- Writing post-repair keys to all replicas and flushing servers[1] to
create an SSTable with repaired_at=0 on disk.
- Restarting servers[1] so being_repaired is lost from memory.
- Waiting for autocompaction to merge the two SSTables on servers[1].
- Asserting that the merged SSTable contains post-repair keys (the bug)
and that servers[0] and servers[2] do not see those keys as repaired.
NOTE FOR MAINTAINER: Copilot initially only implemented Layer 1 (the
in-memory being_repaired guard), missing the restart scenario entirely.
I pointed out that being_repaired is lost on restart and guided Copilot
to add the durable Layer 2 check. I also polished the implementation:
moving is_being_repaired into tablet_storage_group_manager so it can
reuse the already-held _tablet_map (avoiding an ERM lookup and try/catch),
passing sstables_repaired_at in from the classifier to avoid re-reading it,
and using compaction_group_for_sstable inside the function rather than
threading a tablet_id parameter through the classifier.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1239.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#29244
This PR introduces the vnodes-to-tablets migration procedure, which enables converting an existing vnode-based keyspace to tablets.
The migration is implemented as a manual, operator-driven process executed in several stages. The core idea is to first create tablet maps with the same token boundaries and replica hosts as the vnodes, and then incrementally convert the storage of each node to the tablets layout. At a high level, the procedure is the following:
1. Create tablet maps for all tables in the keyspace.
2. Sequentially upgrade all nodes from vnodes to tablets:
1. Mark a node for upgrade in the topology state.
2. Restart the node. During startup, while the node is offline, it reshards the SSTables on vnode boundaries and switches to a tablet ERM.
3. Wait for the node to return online before proceeding to the next node.
4. Finalize the migration:
1. Update the keyspace schema to mark it as tablet-based.
2. Clear the group0 state related to the migration.
From the client's perspective, the migration is online; the cluster can still serve requests on that keyspace, although performance may be temporarily degraded.
During the migration, some nodes use vnode ERMs while others use tablet ERMs. Cluster-level algorithms such as load balancing will treat the keyspace's tables as vnode-based. Once migration is finalized, the keyspace is permanently switched to tablets and cannot be reverted back to vnodes. However, a rollback procedure is available before finalization.
The patch series consists of:
* Load balancer adjustments to ignore tablets belonging to a migrating keyspace.
* A new vnode-based resharding mode, where SSTables are segregated on vnode boundaries rather than with the static sharder.
* A new per-node `intended_storage_mode` column in `system.topology`. Represents migration intent (whether migration should occur on restart) and direction.
* Four new REST endpoints for driving the migration (start, node upgrade/downgrade, finalize, status), along with `nodetool` wrappers. The finalization is implemented as a global topology request.
* Wiring of the migration process into the startup logic: the `distributed_loader` determines a migrating table's ERM flavor from the `intended_storage_mode` and the ERM flavor determines the `table_populator`'s resharding mode. Token metadata changes have been adjusted to preserve the ERM flavor.
* Cluster tests for the migration process.
Fixes SCYLLADB-722.
Fixes SCYLLADB-723.
Fixes SCYLLADB-725.
Fixes SCYLLADB-779.
Fixes SCYLLADB-948.
New feature, no backport is needed.
Closesscylladb/scylladb#29065
* github.com:scylladb/scylladb:
docs: Add ops guide for vnodes-to-tablets migration
test: cluster: Add test for migration of multiple keyspaces
test: cluster: Add test for error conditions
test: cluster: Add vnodes->tablets migration test (rollback)
test: cluster: Add vnodes->tablets migration test (1 table, 3 nodes)
test: cluster: Add vnodes->tablets migration test (1 table, 1 node)
scylla-nodetool: Add migrate-to-tablets subcommand
api: Add REST endpoint for vnode-to-tablet migration status
api: Add REST endpoint for migration finalization
topology_coordinator: Add `finalize_migration` request
database: Construct migrating tables with tablet ERMs
api: Add REST endpoint for upgrading nodes to tablets
api: Add REST endpoint for starting vnodes-to-tablets migration
topology_state_machine: Add intended_storage_mode to system.topology
distributed_loader: Wire vnode-based resharding into table populator
replica: Pick any compaction group for resharding
compaction: resharding_compaction: add vnodes_resharding option
storage_service: Preserve ERM flavor of migrating tables
tablet_allocator: Exclude migrating tables from load balancing
feature_service: Add vnodes_to_tablets_migrations feature
add the function table::take_logstor_snapshot that is similar to
take_storage_snapshot for sstables.
given a token range, for each storage group in the range, it flushes the
separator buffers and then makes a snapshot of all segments in the sg's
compaction groups while disabling compaction.
the segment snapshot holds a reference to the segment so that it won't
be freed by compaction, and it provides an input stream for reading the
segment.
this will be used for tablet migration to stream the segments.
add functions for creating segment input and output streams, that will
be used for segment streaming.
the segment input stream creates a file input stream that reads a given
segment.
the segment output stream allocates a new local segment and creates an
output stream that writes to the segment, and when closed it loads the
segment and adds it to the compaction group.
implement compaction group cleanup by clearing the range in the index
and discarding the segments of the compaction group.
segments are discarded by overwriting the segment header to indicate the
segment is empty while preserving the segment generation number in order
to not resurrect old data in the segment.
implement tablet split for logstor.
flush the separator and then perform split as a new type of compaction:
take a batch of segments from the source compaction group, read them and
write all live records into left/right write buffers according to the
split classifier, flush them to the compaction group, and free the old
segments. segments that fit in a single target compaction group are
removed from the source and added to the correct target group.
implement tablet merge with logstor.
disable compaction for the new compaction group, then merge the merging
compaction groups by merging their logstor segments set into the new cg
- simply merging the segment histogram.
add a function that stops and disabled compaction for a compaction group
and returns a compaction reenabler object, similarly to the normal
compaction manager.
this will be useful for disabling compaction while doing operations on
the compaction group's logstor segment set.
we have two types of segments. the active segment is "mixed" because we
can write to it multiple write_buffers, each write buffer having records
from different tables and tablets. in constrast, the separator and
compaction write "full" segments - they write a single write_buffer that
has records from a single tablet and storage group.
for "full" segments, we add a segment header the contains additional
useful metadata such as the table and token range in the segment.
the write buffer header contains the type of the buffer, mixed or full.
if it's full then it has a segment header placed after the write buffer
header.
previously when writing to the active segment, the allocation was
serialized but multiple writes could proceed concurrently to different
offsets. change it instead to serialize the entire write.
we prefer to write larger buffers sequentially instead of multiple
buffers concurrently. it is also better that we don't have "holes" in
the segment.
we also change the buffered_writer to send a single flushing buffer at a
time. it has a ring of buffers, new writes are written to the head
buffer, and a single consumer flushes the tail buffer.
extend compaction_group functions such as disk size calculation and
empty() to account also for the logstor segments that the compaction
group owns.
reuse the sstable_add_gate when there is a write in process to a
compaction group, in order for the compaction group to be considered not
empty.
add the function table::compaction_group_for_logstor_segment that we use
when recovering a segment to find the compaction group for a segment
based on its token range, similarly to compaction_group_for_sstable for
sstables.
extract the common logic from compaction_group_for_sstable to a common
function compaction_group_for_token_range that finds a compaction group
for a token range.
Extend `database::add_column_family()` with a `storage_mode` argument.
If the table is under vnodes-to-tablets migration and the storage mode
is "tablets", create a tablet ERM.
Make the distributed loader determine the storage mode from topology
(`intended_storage_mode` column in system.topology).
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Make the table populator migration-aware. If a table is migrating to
tablets, switch from normal resharding to vnode-based resharding.
Vnode-based resharding requires passing a vector of "owned ranges" upon
which resharding will segregate the SSTables. Compute it from the tablet
map. We could also compute them from the vnodes, since tablets are
identical to vnodes during the migration, but in the future we may
switch to a different model (multiple tablets per vnode).
Let the distributed loader decide if a table is migrating or not and
communicate that to the table populator. A table is migrating if the
keyspace replication strategy uses vnodes but the table replication
strategy uses tablets.
Currently, tables cannot enter this "migrating" state; support for this
will be introduced in the next patches.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
In the previous patch, reshard compaction was extended with a special
operation mode where SSTables from vnode-based tables are segregated on
vnode boundaries and not with the static sharder. This will later be
wired into vnodes-to-tablets migration.
The problem is that resharding requires a compaction group. With a
vnode-based table, there is only one compaction group per shard, and
this is what the current code utilizes
(`try_get_compaction_group_view_with_static_sharding()`). But the new
operation mode will apply to migrating tables, which use a
`tablet_storage_group_manager`, which creates one compaction group for
each tablet. Some compaction group needs to be selected.
Pick any compaction group that is available on the current shard.
Reshard compaction is an operation that happens early in the startup
process; compaction groups do not own any SSTables yet, so all
compaction groups are equivalent.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
In this mode, the output sstables generated by resharding
compaction are segregated by token range, based on the keyspace
vnode-based owned token ranges vector.
A basic unit test was also added to sstable_directory_test.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
There's a flaw in table::query() -- calling querier_opt->close() can dereferences a disengaged std::optional. The fix pretty simple. Once fixed, there are two if-s checking for querier_opt being engaged or not that are worth being merged.
The problem doesn't really shows itself becase table::query() is not called with null saved_querier, so the de-facto if is always correct. However, better to be on safe-side.
The problem doesn't show itself for real, not worth backporting
Closesscylladb/scylladb#29142
* github.com:scylladb/scylladb:
table: merge adjacent querier_opt checks in query()
table: don't close a disengaged querier in query()
The version of fmt installed on my machine refuses to work with
`std::filesystem::path` directly. Add `.string()` calls in places that
attempt to print paths directly in order to make them work.
Closesscylladb/scylladb#29148
And move some activities from streaming group into it, namely
- tablet_allocator background group
- sstables_manager-s components reclaimer
- tablet storage group manager merge completion fiber
- prometheus
All other activity that was in streaming group remains there, but can be
moved to this group (or to new maintenance subgroup) later.
All but prometheus are patched here, prometheus still uses the
maintenance_sched_group variable in main.cc, so it transparently
moves into new group
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The snapshot_ctl::backup_task_impl runs in configured scheduling group.
Now it's streaming one. This patch introduces the maintenance/backup
group and re-configures backup task with it.
The group gets its --backup_io_throughput_mb_per_sec option that
controls bandwidth limit for this sub-group only.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Compaction manager tells compaction_sched_group from
maintenance_compaction_sched_group. The latter, however, is set to be
"streaming" group. This patch adds real maintenance_compaction group
under the maintenance supergroup and makes compaction manager use it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The classify_request() helper captures current scheduling group into
local variable and compares it with groups from db_config to decide
which "class" it belongs to.
One if uses current_scheduling_group(), while it could use the local
variable.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After the previous fix both guarding if-s start with 'if (querier_opt &&'.
Merge them into a single outer 'if (querier_opt)' block to avoid the
redundant check and make the structure easier to follow.
No functional change.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The condition guarding querier_opt->close() was:
When saved_querier is null the short-circuit makes the whole condition true
regardless of whether querier_opt is engaged. If partition_ranges is empty,
query_state::done() is true before the while-loop body ever runs, so querier_opt
is never created. Calling querier_opt->close() then dereferences a disengaged
std::optional — undefined behaviour.
Fix by checking querier_opt first:
This preserves all existing semantics (close when not saving, or when saving
wouldn't be useful) while making the no-querier path safe.
Why this doesn't surface today: the sole production call site, database::query(),
in practice. The API header documents nullptr as valid ("Pass nullptr when
queriers are not saved"), so the bug is real but latent.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When it deadlocks, groups stop merging and compaction group merge
backlog will run-away.
Also, graceful shutdown will be blocked on it.
Found by flaky unit test
test_merge_chooses_best_replica_with_odd_count, which timed-out in 1
in 100 runs.
Reason for deadlock:
When storage groups are merged, the main compaction group of the new
storage group takes a compaction lock, which is appended to
_compaction_reenablers_for_merging, and released when the merge
completion fiber is done with the whole batch.
If we accumulate more than 1 merge cycle for the fiber, deadlock
occurs. Lock order will be this
Initial state:
cg0: main
cg1: main
cg2: main
cg3: main
After 1st merge:
cg0': main [locked], merging_groups=[cg0.main, cg1.main]
cg1': main [locked], merging_groups=[cg2.main, cg3.main]
After 2nd merge:
cg0'': main [locked], merging_groups=[cg0'.main [locked], cg0.main, cg1.main, cg1'.main [locked], cg2.main, cg3.main]
merge completion fiber will try to stop cg0'.main, which will be
blocked on compaction lock. which is held by the reenabler in
_compaction_reenablers_for_merging, hence deadlock.
The fix is to wait for background merge to finish before we start the
next merge. It's achieved by holding old erm in the background merge,
and doing a topology barrier from the merge finalizing transition.
Background merge is supposed to be a relatively quick operation, it's
stopping compaction groups. So may wait for active requests. It
shouldn't prolong the barrier indefinitely.
Tablet tests which trigger merge need to be adjusted to call the
barrier, otherwise they will be vulnerable to the deadlock.
Fixes SCYLLADB-928
Backport to >= 2025.4 because it's the earliest vulnerable due to f9021777d8.
Closesscylladb/scylladb#29007
* github.com:scylladb/scylladb:
tablets: Fix deadlock in background storage group merge fiber
replica: table: Propagate old erm to storage group merge
test: boost: tablets_test: Save tablet metadata when ACKing split resize decision
storage_service: Extract local_topology_barrier()