Garbage collected sstables created during incremental compaction are
deleted only at the end of the compaction, which increases the memory
footprint. This is inefficient, especially considering that the related
input sstables are released regularly during compaction.
This commit implements incremental release of GC sstables after each
output sstable is sealed. Unlike regular input sstables, GC sstables
use a different exhaustion predicate: a GC sstable is only released
when its token range no longer overlaps with any remaining input
sstable. This is because GC sstables hold tombstones that may shadow
data in still-alive overlapping input sstables; releasing them
prematurely would cause data resurrection.
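The exhaustion predicate described above can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: `token_range`, `overlaps` and `gc_sstable_exhausted` are hypothetical names, and tokens are modelled as plain 64-bit integers with inclusive bounds.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the token range covered by one sstable
// (inclusive bounds, tokens modelled as plain integers).
struct token_range {
    int64_t first;
    int64_t last;
};

// Two inclusive ranges overlap when neither ends before the other starts.
inline bool overlaps(const token_range& a, const token_range& b) {
    return a.first <= b.last && b.first <= a.last;
}

// A GC sstable can be released only when no remaining input sstable
// overlaps its token range; otherwise its tombstones may still be needed.
inline bool gc_sstable_exhausted(const token_range& gc,
                                 const std::vector<token_range>& remaining_inputs) {
    return std::none_of(remaining_inputs.begin(), remaining_inputs.end(),
                        [&gc](const token_range& in) { return overlaps(gc, in); });
}
```

Under these assumptions, a GC sstable covering tokens 5..15 stays alive while an input covering 0..10 remains, and becomes releasable once that input is gone.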
Fixes #5563
Closes scylladb/scylladb#28984
The retry loop in `start_docker_service` passes the parse callbacks via `std::move` into `create_handler` on each iteration. After the first iteration, the moved-from `std::function` objects are empty. All subsequent retries skip output parsing entirely and immediately treat the service as successfully started. This defeats the entire purpose of the retry mechanism.
Fix by passing the callbacks by copy instead of move, so the original callbacks remain valid across retries.
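The shape of the fix can be illustrated with a small standalone sketch. It does not reproduce `start_docker_service`: `handler`, `make_handler` and `start_with_retries` are hypothetical names, and service output is simulated by one string per attempt.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Callback that inspects one line of service output and reports readiness.
using parse_callback = std::function<bool(const std::string&)>;

// Hypothetical stand-in for the handler built by create_handler():
// it takes ownership of whatever callback it is given.
struct handler {
    parse_callback on_line;
};

inline handler make_handler(parse_callback cb) {
    return handler{std::move(cb)};
}

// Tries once per entry in output_per_attempt and returns the index of the
// first attempt whose output the callback accepts, or -1 if none does.
inline int start_with_retries(const parse_callback& on_line,
                              const std::vector<std::string>& output_per_attempt) {
    for (std::size_t attempt = 0; attempt < output_per_attempt.size(); ++attempt) {
        // Pass a copy: on_line itself stays callable for the next attempt.
        handler h = make_handler(on_line);
        if (h.on_line && h.on_line(output_per_attempt[attempt])) {
            return static_cast<int>(attempt);
        }
    }
    return -1;
}
```

Because each attempt receives its own copy, the callback is invoked on every retry instead of only on the first one.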
Fixes SCYLLADB-1542
This is a CI stability issue and should be backported.
Closes scylladb/scylladb#29504
* github.com:scylladb/scylladb:
test/lib: fix typos in proc_utils, gcs_fixture, and dockerized_service
test: gcs_fixture: rename container from "local-kms" to "fake-gcs-server"
test: fix proc_utils.cc formatting from previous commit
test: lib: use unique container name per retry attempt
test: lib: fix broken retry in start_docker_service
There's a bunch of db::config options that are used by cql3/statements/ code. To reach them, the code uses data_dictionary/database as a proxy to get a db::config reference. This PR moves most of the accessed options onto cql_config.
Options migrated to cql_config:
1. select_internal_page_size
2. strict_allow_filtering
3. enable_parallelized_aggregation
4. batch_size_warn_threshold_in_kb
5. batch_size_fail_threshold_in_kb
6. 7 keyspace replication restriction options
7. 2 TWCS restriction options
8. restrict_future_timestamp
9. strict_is_not_null_in_views (with view_restrictions struct)
10. enable_create_table_with_compact_storage
Some options need special treatment and are still abused via database, namely:
1. enable_logstor
2. cluster_name
3. partitioner
4. endpoint_snitch
Fixing components inter-dependencies, not backporting
Closes scylladb/scylladb#29424
* github.com:scylladb/scylladb:
cql3: Move enable_create_table_with_compact_storage to cql_config
cql3: Move strict_is_not_null_in_views to cql_config
cql3: Move restrict_future_timestamp to cql_config
cql3: Move TWCS restriction options to cql_config
cql3: Move keyspace restriction options to cql_config
cql3: Move batch_size_fail_threshold_in_kb to cql_config
cql3: Move batch_size_warn_threshold_in_kb to cql_config
cql3: Move enable_parallelized_aggregation to cql_config
cql3: Move strict_allow_filtering to cql_config
cql3: Move select_internal_page_size to cql_config
test: Fix cql_test_env to use updateable cql_config from db::config
cql3: Add cql_config parameter to parsed_statement::prepare()
The GCS fixture's fake-gcs-server container was named "local-kms",
copy-pasted from the AWS KMS fixture. It happened when both were
refactored to use the shared start_docker_service helper (bc544eb08e).
Rename to "fake-gcs-server" to match the Python-side naming and avoid
confusion in logs.
Refs SCYLLADB-1542
The container name is generated once before the retry loop, so
all retry attempts reuse the same name. Move the name generation
inside the loop so each attempt gets a fresh name via the
incrementing counter, consistent with the comment "publish port
ephemeral, allows parallel instances".
Formatting changes (indentation) of lines 208-225 in test/lib/proc_utils.cc
will be fixed in the next commit.
Refs SCYLLADB-1542
Replace the physical system.large_partitions, system.large_rows, and
system.large_cells CQL tables with virtual tables that read from
LargeDataRecords stored in SSTable scylla metadata (tag 13).
The transition is gated by a new LARGE_DATA_VIRTUAL_TABLES cluster
feature flag:
- Before the feature is enabled: the old physical tables remain in
all_tables(), CQL writes are active, no virtual tables are registered.
This ensures safe rollback during rolling upgrades.
- After the feature is enabled: old physical tables are dropped from
disk via legacy_drop_table_on_all_shards(), virtual tables are
registered on all shards, and CQL writes are skipped via
skip_cql_writes() in cql_table_large_data_handler.
Key implementation details:
- Three virtual table classes (large_partitions_virtual_table,
large_rows_virtual_table, large_cells_virtual_table) extend
streaming_virtual_table with cross-shard record collection.
- generate_legacy_id() gains a version parameter; virtual tables
use version 1 to get different UUIDs than the old physical tables.
- compaction_time is derived from SSTable generation UUID at display
time via UUID_gen::unix_timestamp().
- Legacy SSTables without LargeDataRecords emit synthetic summary
rows based on above_threshold > 0 in LargeDataStats.
- The activation logic uses two paths: when the feature is already
enabled (test env, restart), it runs as a coroutine; when not yet
enabled, it registers a when_enabled callback that runs inside
seastar::async from feature_service::enable().
- sstable_3_x_test updated to use a simplified large_data_test_handler
and validate LargeDataRecords in SSTable metadata directly.
Move the smp::invoke_on_all dispatch from the callers into
initialize_virtual_tables() itself, so the function is called
once from shard 0 and internally distributes the per-shard
virtual table setup to all shards.
This simplifies the callers and allows a single place to add
cross-shard coordination logic (e.g. feature-gated table
registration) in future commits.
Add three new test cases to sstable_3_x_test.cc that verify the
LargeDataRecords metadata written by the SSTable writer can be read
back after open_data():
- test_large_data_records_round_trip: verifies partition_size, row_size,
and cell_size records are written with correct field semantics when
thresholds are exceeded
- test_large_data_records_top_n_bounded: verifies the bounded min-heap
keeps only the top-N largest entries per type
- test_large_data_records_none_when_below_threshold: verifies no records
are written when data is below all thresholds
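The bounded min-heap behaviour that test_large_data_records_top_n_bounded checks can be pictured with a minimal sketch. `top_n_tracker` is a hypothetical name, and the real writer tracks full records rather than bare sizes.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <queue>
#include <vector>

// Keeps the `limit` largest sizes seen so far. The heap is a min-heap, so
// its top is always the smallest retained entry and is the one evicted
// when a larger entry arrives.
class top_n_tracker {
    std::size_t _limit;
    std::priority_queue<uint64_t, std::vector<uint64_t>, std::greater<uint64_t>> _heap;
public:
    explicit top_n_tracker(std::size_t limit) : _limit(limit) {}

    void add(uint64_t size) {
        if (_limit == 0) {
            return;
        }
        if (_heap.size() < _limit) {
            _heap.push(size);
        } else if (size > _heap.top()) {
            _heap.pop();
            _heap.push(size);
        }
    }

    // Returns the retained sizes, largest first.
    std::vector<uint64_t> sorted_desc() const {
        auto copy = _heap;
        std::vector<uint64_t> out;
        while (!copy.empty()) {
            out.push_back(copy.top());
            copy.pop();
        }
        std::reverse(out.begin(), out.end());
        return out;
    }
};
```

With a limit of 3 and inputs 5, 1, 9, 3, 7, the tracker ends up holding 9, 7 and 5.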
Also wire large_data_records_per_sstable from db_config into the test
env's sstables_manager::config so that config changes propagate through
the updateable_value chain to configure_writer().
The test environment was creating cql_config with hardcoded default values that
were never updated when system.config was modified via CQL. This broke tests
that dynamically change configuration values (e.g., TWCS tests).
Fix by creating cql_config from db::config using sharded_parameter, which
ensures updateable_value fields track the actual db::config sources and reflect
changes made during test execution.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The retry loop in start_docker_service passes the parse callbacks
via std::move into create_handler on each iteration. After the
first iteration, the moved-from std::function objects are empty.
All subsequent retries skip output parsing entirely and
immediately treat the service as successfully started. This
defeats the entire purpose of the retry mechanism.
Fix by passing the callbacks by copy instead of move, so the
original callbacks remain valid across retries.
Fixes SCYLLADB-1542
This is a step towards more flexibility in managing tablets. It is a
prerequisite for splitting individual tablets, isolating hot
partitions, and evening out tablet sizes by shifting boundaries.
After this patch, the system can handle tables with arbitrary tablet
count. Tablet allocator is still rounding up desired tablet count to
the nearest power of two when allocating tablets for a new table, so
unless the tablet map is allocated in some other way, the counts will
still be a power of two.
We plan to utilize arbitrary count when migrating from vnodes to
tablets, by creating a tablet map which matches vnode boundaries.
One of the reasons we don't give up on power-of-two by default yet is
that it creates an issue with merges. If tablet count is odd, one of
the tablets doesn't have a sibling and will not be merged. That can
obviously cause imbalance of token space and tablet sizes between
tablets. To limit the impact, this patch dynamically chooses which
tablet to isolate when initiating a merge. The largest tablet is
chosen, as that will minimize imbalance. Otherwise, if we always chose
the last tablet to isolate, its size would remain the same while other
tablets double in size with each odd-count merge, leading to
imbalance. The imbalance will still be there, but the difference in
tablet sizes is limited to 2x.
Example (3 tablets):
[0] owns 1/3 of tokens
[1] owns 1/3 of tokens
[2] owns 1/3 of tokens
After merge:
[0] owns 2/3 of tokens
[1] owns 1/3 of tokens
What we would like instead:
Step 1 (split [1]):
[0] owns 1/3 of tokens
[1] old 1.left, owns 1/6 of tokens
[2] old 1.right, owns 1/6 of tokens
[3] owns 1/3 of tokens
Step 2 (merge):
[0] owns 1/2 of tokens
[1] owns 1/2 of tokens
To do that, we need to be able to split individual tablets, but we're
not there yet.
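The choice of which tablet to isolate in an odd-count merge can be sketched as below. This is only an illustration under stated assumptions: `pick_isolated` and `merge_round` are hypothetical names, sizes are abstract units, and the sketch assumes that only a tablet at an even index can be left out so that the tablets on each side still pair up with an adjacent neighbour. The patch may select candidates differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Picks the largest tablet among even positions (assumption of this sketch:
// only those can be isolated without breaking adjacent pairing).
inline std::size_t pick_isolated(const std::vector<uint64_t>& sizes) {
    std::size_t best = 0;
    for (std::size_t i = 2; i < sizes.size(); i += 2) {
        if (sizes[i] > sizes[best]) {
            best = i;
        }
    }
    return best;
}

// Performs one merge round: adjacent tablets are merged pairwise; when the
// count is odd, the isolated tablet is carried over unchanged.
inline std::vector<uint64_t> merge_round(const std::vector<uint64_t>& sizes) {
    std::vector<uint64_t> out;
    const bool odd = sizes.size() % 2 == 1;
    const std::size_t isolated = odd ? pick_isolated(sizes) : sizes.size();
    std::size_t i = 0;
    while (i < sizes.size()) {
        if (i == isolated) {
            out.push_back(sizes[i]);
            i += 1;
        } else {
            out.push_back(sizes[i] + sizes[i + 1]);
            i += 2;
        }
    }
    return out;
}
```

For sizes {2, 1, 1, 1, 3} the last tablet is isolated and the result is {3, 2, 3}, whereas always isolating a fixed position would let the merged tablets keep doubling relative to it.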
Since we no longer support upgrade from versions that do not support
v2 of the "view building status" code (where building status is managed by raft), we can remove the v1 code and the upgrade code, and make sure we do not boot with the old "builder status" version.
v2 version was introduced by 8d25a4d678 which is included in scylla-2025.1.0.
No backport needed since this is code removal.
Closes scylladb/scylladb#29105
* github.com:scylladb/scylladb:
view: drop unused v1 builder code
view: remove upgrade to raft code
The supergroup replaces the streaming (a.k.a. maintenance) group, inherits its 200 shares, and consists of four sub-groups (each with equal shares of 200 within the new supergroup):
* maintenance_compaction. This group configures `compaction_manager::maintenance_sg()` group. User-triggered compaction runs in it
* backup. This group configures `snapshot_ctl::config::backup_sched_group`. Native backup activity runs there
* maintenance. It's a new "visible" name, everything that was called "maintenance" in the code ran in "streaming" group. Now it will run in "maintenance". The activities include those that don't communicate over RPC (see below why)
* `tablet_allocator::balance_tablets()`
* `sstables_manager::components_reclaim_reload_fiber()`
* `tablet_storage_group_manager::merge_completion_fiber()`
* metrics exporting http server altogether
* streaming. This is purely existing streaming group that just moves under the new supergroup. Everything else that was run there, continues doing so, including
* hints sender
* all view building related components (update generator, builder, workers)
* repair
* stream_manager
* messaging service (except for verb handlers that switch groups)
* join_cluster() activity
* REST API
* ... something else I forgot
The `--maintenance_io_throughput_mb_per_sec` option is introduced. It controls the IO throughput limit applied to the maintenance supergroup. If not set, the `--stream_io_throughput_mb_per_sec` option is used to preserve backward compatibility.
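The fallback between the two options can be pictured with a minimal sketch. `throughput_options` and `effective_maintenance_throughput` are hypothetical names, and modelling "not set" as an empty `std::optional` is an assumption of this sketch rather than how the options are represented in the code.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical mirror of the two command-line options; an empty optional
// stands for "not set" in this sketch.
struct throughput_options {
    std::optional<uint32_t> maintenance_io_throughput_mb_per_sec;
    std::optional<uint32_t> stream_io_throughput_mb_per_sec;
};

// The maintenance limit wins when set; otherwise the older streaming
// option is used, which keeps existing configurations working.
inline std::optional<uint32_t> effective_maintenance_throughput(const throughput_options& o) {
    if (o.maintenance_io_throughput_mb_per_sec) {
        return o.maintenance_io_throughput_mb_per_sec;
    }
    return o.stream_io_throughput_mb_per_sec;
}
```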
All new sched groups inherit `request_class::maintenance` (however, "backup" does not seem to make any requests yet).
Moving more activities from "streaming" into "maintenance" (or its own group) is possible, but one will need to take care of RPC group switching. The thing is that when a client makes an RPC call, the server may switch to one of pre-negotiated scheduling groups. Verbs for existing activities that run in "streaming" group are routed through RPC index that negotiates "streaming" group on the server side. If any of that client code moves to some other group, server will still run the handlers in "streaming" which is not quite expected. That's one of the main reasons why only the selected fibers were moved to their own "maintenance" group. Similar for backup -- this code doesn't use RPC, so it can be moved. Restoring code uses load-and-stream and corresponding RPCs, so it cannot be just moved into its own new group.
Fixes SCYLLADB-351
New feature, not backporting
Closes scylladb/scylladb#28542
* github.com:scylladb/scylladb:
code: Add maintenance/maintenance group
backup: Add maintenance/backup group
compaction: Add maintenance/maintenance_compaction group
main: Introduce maintenance supergroup
main: Move all maintenance sched group into streaming one
database: Use local variable for current_scheduling_group
code: Live-update IO throughputs from main
Add a test that verifies filesystem_storage::clone preserves the sstable
state: an sstable in staging is cloned to a new generation, the clone is
re-loaded from the staging directory, and its state is asserted to still
be staging.
The change proves that https://scylladb.atlassian.net/browse/SCYLLADB-1205
is invalid, and can be closed.
* No functional change and no backport needed
Closes scylladb/scylladb#29209
* github.com:scylladb/scylladb:
test: add test_sstable_clone_preserves_staging_state
test: derive sstable state from directory in test_env::make_sstable
sstables: log debug message in filesystem_storage::clone
Add a non-const `get_storage` accessor to expose underlying storage,
and an `open_file` helper to access sstable component files directly.
These are needed so compaction tests can read and write sstable
components.
Unplugging the mock sstable_registry happened too early in the test
environment. During sstable destruction, components may still need
access to the registry, so the unplugging is moved to a later stage.
Previously, the result of when_all was discarded. when_all stores
exceptions in the returned futures rather than throwing, so the outer
catch(in_use&) could never trigger. Now we capture the when_all result
and inspect each future individually to properly detect in_use from
either stream.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1216
Closes scylladb/scylladb#29219
Instead of always passing sstable_state::normal, infer the state from
the last component of the directory path by comparing against the known
state subdirectory constants (staging_dir, upload_dir, quarantine_dir).
Any unrecognized path component (the common case for normal-state
sstables) maps to sstable_state::normal.
When a non-normal state is detected, strip the state subdirectory from
dir so that the base table directory is passed to storage.
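The inference described above can be sketched with `std::filesystem`. This is not the test_env code: `split_state` is a hypothetical helper, and the string values given to staging_dir, upload_dir and quarantine_dir are assumed here.

```cpp
#include <filesystem>
#include <string>
#include <utility>

enum class sstable_state { normal, staging, upload, quarantine };

// Assumed values for the state subdirectory constants named in the text.
inline constexpr const char* staging_dir = "staging";
inline constexpr const char* upload_dir = "upload";
inline constexpr const char* quarantine_dir = "quarantine";

// Returns the base table directory and the state inferred from the last
// path component. Unrecognized components map to sstable_state::normal
// and leave the directory untouched.
inline std::pair<std::filesystem::path, sstable_state>
split_state(const std::filesystem::path& dir) {
    const std::string last = dir.filename().string();
    sstable_state state = sstable_state::normal;
    if (last == staging_dir) {
        state = sstable_state::staging;
    } else if (last == upload_dir) {
        state = sstable_state::upload;
    } else if (last == quarantine_dir) {
        state = sstable_state::quarantine;
    }
    if (state == sstable_state::normal) {
        return {dir, state};
    }
    // Strip the state subdirectory so storage receives the base directory.
    return {dir.parent_path(), state};
}
```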
When encrypted_data_source::get() caches a trailing block in _next, the next call takes it directly — bypassing input_stream::read(), which checks _eof. It then calls input_stream::read_exactly() on the already-drained stream. Unlike read(), read_up_to(), and consume(), read_exactly() does not check _eof when the buffer is empty, so it calls _fd.get() on a source that already returned EOS.
In production this manifested as stuck encrypted SSTable component downloads during tablet restore: the underlying chunked_download_source hung forever on the post-EOS get(), causing 4 tablets to never complete. The stuck files were always block-aligned sizes (8k, 12k) where _next gets populated and the source is fully consumed in the same call.
Fix by checking _input.eof() before calling read_exactly(). When the stream already reached EOF, buf2 is known to be empty, so the call is skipped entirely.
A comprehensive test is added that uses a strict_memory_source which fails on post-EOS get(), reproducing the exact code path that caused the production deadlock.
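The failure mode and the guard can be reproduced in miniature without Seastar. Everything in this sketch is a simplified, synchronous mock: `strict_source`, `buffered_input` and `read_trailing` are hypothetical names that only imitate the behaviour described above, where read_exactly() pulls from the source without consulting the EOF flag.

```cpp
#include <cstddef>
#include <deque>
#include <stdexcept>
#include <string>
#include <utility>

// Mock source that, like the strict source in the test, refuses any get()
// issued after it has already signalled end of stream with an empty chunk.
class strict_source {
    std::deque<std::string> _chunks;
    bool _eos = false;
public:
    explicit strict_source(std::deque<std::string> chunks) : _chunks(std::move(chunks)) {}
    std::string get() {
        if (_eos) {
            throw std::logic_error("get() called after end of stream");
        }
        if (_chunks.empty()) {
            _eos = true;
            return {};
        }
        std::string chunk = std::move(_chunks.front());
        _chunks.pop_front();
        return chunk;
    }
};

// Mock input stream whose read_exactly() does not check the EOF flag
// before pulling from the source, mirroring the behaviour described above.
class buffered_input {
    strict_source& _src;
    std::string _buf;
    bool _eof = false;
public:
    explicit buffered_input(strict_source& src) : _src(src) {}
    bool eof() const { return _eof; }
    std::string read_exactly(std::size_t n) {
        while (_buf.size() < n) {
            std::string chunk = _src.get();
            if (chunk.empty()) {
                _eof = true;
                break;
            }
            _buf += chunk;
        }
        std::string out = _buf.substr(0, n);
        _buf.erase(0, out.size());
        return out;
    }
};

// The guard from the fix: skip the read when the stream already hit EOF.
inline std::string read_trailing(buffered_input& in, std::size_t n) {
    if (in.eof()) {
        return {};
    }
    return in.read_exactly(n);
}
```

In this mock the unguarded second read throws, while the real source hung instead; the guarded read returns an empty buffer without touching the source.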
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1128
Backport to 2025.3/4 and 2026.1 is needed since it fixes a bug that may bite us in production, to be on the safe side
Closes scylladb/scylladb#29110
* github.com:scylladb/scylladb:
encryption: fix deadlock in encrypted_data_source::get()
test_lib: mark `limiting_data_source_impl` as not `final`
Fix formatting after previous patch
Fix indentation after previous patch
test_lib: make limiting_data_source_impl available to tests
Compaction manager tells compaction_sched_group from
maintenance_compaction_sched_group. The latter, however, is set to be
"streaming" group. This patch adds real maintenance_compaction group
under the maintenance supergroup and makes compaction manager use it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Introduce an initial and experimental implementation of an alternative log-structured storage engine for key-value tables.
Main flows and components:
* The storage is composed of 32MB files, each file divided into segments of 128k. We sequentially write records to them; each record contains a mutation and additional metadata. Records are written to a buffer first and then written to the active segment sequentially in 4k blocks.
* The primary index in memory maps keys to their location on disk. It is a B-tree per-table that is ordered by tokens, similar to a memtable.
* On reads we calculate the key and look it up in the primary index, then read the mutation from disk with a single disk IO.
* On writes we write the record to a buffer, wait for it to be written to disk, then update the index with the new location, and free the previous record.
* We track the used space in each segment. When overwriting a record, we increase the free space counter for the segment of the previous record that becomes dead. We store the segments in a histogram by usage.
* The compaction process takes segments with low utilization, reads them and writes the live records to new segments, and frees the old segments.
* Segments are initially "mixed" - we write to the active segment records from all tables and all tablets. The "separator" process rewrites records from mixed segments into new segments that are organized by compaction groups (tablets), and frees the mixed segments. Each write is written to the active segment and to a separator buffer of the compaction group, which is eventually flushed to a new segment in the compaction group.
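The layout arithmetic and the space-tracking idea from the list above can be sketched as follows. This is an illustration only: the sizes are assumed to be binary units (MiB and KiB), `segment_usage`, `on_overwrite` and `pick_for_compaction` are hypothetical names, and a plain sort stands in for the usage histogram mentioned above.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sizes from the description, assumed to be binary units.
constexpr std::size_t file_size = 32 * 1024 * 1024;
constexpr std::size_t segment_size = 128 * 1024;
constexpr std::size_t block_size = 4 * 1024;
static_assert(file_size / segment_size == 256, "segments per file");
static_assert(segment_size / block_size == 32, "blocks per segment");

// Hypothetical per-segment accounting of bytes still referenced by the index.
struct segment_usage {
    uint32_t id;
    std::size_t live_bytes;
};

// When a record is overwritten, its bytes become dead in the old segment.
inline void on_overwrite(segment_usage& old_segment, std::size_t record_size) {
    old_segment.live_bytes -= std::min(record_size, old_segment.live_bytes);
}

// Picks up to `count` segments with the least live data whose utilization
// does not exceed max_utilization (a sort replaces the histogram here).
inline std::vector<uint32_t> pick_for_compaction(std::vector<segment_usage> segments,
                                                 double max_utilization,
                                                 std::size_t count) {
    std::stable_sort(segments.begin(), segments.end(),
                     [](const segment_usage& a, const segment_usage& b) {
                         return a.live_bytes < b.live_bytes;
                     });
    std::vector<uint32_t> picked;
    for (const segment_usage& s : segments) {
        const double utilization = static_cast<double>(s.live_bytes) / segment_size;
        if (picked.size() == count || utilization > max_utilization) {
            break;
        }
        picked.push_back(s.id);
    }
    return picked;
}
```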
Currently this mode is experimental and requires an experimental flag to be enabled.
Some things that are not supported yet are strong consistency, tablet migration, tablet split/merge, big mutations, tombstone gc, ttl.
to use, add to config:
```
enable_logstor: true
experimental_features:
- logstor
```
create a table:
```
CREATE TABLE ks.t(pk int PRIMARY KEY, a int, v text) WITH storage_engine = 'logstor';
```
INSERT, SELECT, DELETE work as expected
UPDATE not supported yet
no backport - new feature
Closes scylladb/scylladb#28706
* github.com:scylladb/scylladb:
logstor: trigger separator flush for buffers that hold old segments
docs/dev: add logstor documentation
logstor: recover segments into compaction groups
logstor: range read
logstor: change index to btree by token per table
logstor: move segments to replica::compaction_group
db: update dirty mem limits dynamically
logstor: track memory usage
logstor: logstor stats api
logstor: compaction buffer pool
logstor: separator: flush buffer when full
logstor: hold segment until index updates
logstor: truncate table
logstor: enable/disable compaction per table
logstor: separator buffer pool
test: logstor: add separator and compaction tests
logstor: segment and separator barrier
logstor: separator debt controller
logstor: compaction controller
logstor: recovery: recover mixed segments using separator
logstor: wait for pending reads in compaction
logstor: separator
logstor: compaction groups
logstor: cache files for read
logstor: recovery: initial
logstor: add segment generation
logstor: reserve segments for compaction
logstor: index: buckets
logstor: add buffer header
logstor: add group_id
logstor: record generation
logstor: generation utility
logstor: use RIPEMD-160 for index key
test: add test_logstor.py
api: add logstor compaction trigger endpoint
replica: add logstor to db
schema: add logstor cf property
logstor: initial commit
db: disable tablet balancing with logstor
db: add logstor experimental feature flag
Since we no longer support upgrade from versions that do not support
v2 of view building code, we can remove the upgrade code and make sure we do
not boot with the old builder version.
In this series we add support for forwarding strongly consistent CQL requests to suitable replicas, so that clients can issue reads/writes to any node and have the request executed on an appropriate tablet replica (and, for writes, on the Raft leader). We return the same CQL response as what the user would get while sending the request to the correct replica and we perform the same logging/stats updates on the request coordinator as if the coordinator was the appropriate replica.
The core mechanism of forwarding a strongly consistent request is sending an RPC containing the user's cql request frame to the appropriate replica and returning back a ready, serialized `cql_transport::response`. We do this in the CQL server - it is most prepared for handling these types and forwarding a request containing a CQL frame allows us to reuse near-top-level methods for CQL request handling in the new RPC handler (such as the general `process`)
For sending the RPC, the CQL server needs to know where it should forward the request. This requires knowledge about the tablet raft group members and leader. We obtain this information during the execution of a `cql3/strong_consistency` statement, and we return it to the CQL server using the generalized `bounce_to_shard` `result_message`, where we now store the information about either a shard or a specific replica to which we should forward. As with `bounce_to_shard`, we need to handle this `result_message` in a loop, because a replica may move during statement execution or the Raft leader can change. We also use it for forwarding strongly consistent writes when we're not a member of the affected tablet raft group. In that case we need to forward the statement twice: first to any replica of the affected tablet, which can find the leader and return this information to the coordinator, and then to the leader itself.
This feature also allows passing through exception messages which happened on the target replica while executing the statement. For that, many methods of the `cql_transport::cql_server::connection` for creating error responses needed to be moved to `cql_transport::cql_server`. And for final exception handling on the coordinator, we added additional error info to the RPC response, so that the handling can be performed without having the `result_message::exception` or `exception_ptr` itself.
Fixes [SCYLLADB-71](https://scylladb.atlassian.net/browse/SCYLLADB-71)
[SCYLLADB-71]: https://scylladb.atlassian.net/browse/SCYLLADB-71?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
Closes scylladb/scylladb#27517
* github.com:scylladb/scylladb:
test: add tests for CQL forwarding
transport: enable CQL forwarding for strong consistency statements
transport: add remote statement preparation for CQL forwarding
transport: handle redirect responses in CQL forwarding
transport: add exception handling for forwarded CQL requests
transport: add basic CQL request forwarding
idl: add a representation of client_state for forwarding
cql_server: handle query, execute, batch in one case
transport: inline process_on_shard in cql_server::process
transport: extract process() to cql_server
transport: add messaging_service to cql_server
transport: add response reconstruction helpers for forwarding
transport: generalize the bounce result message for bouncing to other nodes
strong consistency: redirect requests to live replicas from the same rack
transport: pass foreign_ptr into sleep_until_timeout_passes and move it to cql_server
transport: extract the error handling from process_request_one
transport: move error response helpers from connection to cql_server
Fixes #25084
Add slirp4netns and use it for nested containers. This allows port aliasing for nested containers, which helps CI stability.
Note: this contains an updated Dockerfile for the dbuild image, but because of the chicken-and-egg problem, the dbuild script will for now force-install slirp4netns before anything else.
Updates the mock server handling (boost as well as pytest) to use ephemeral ports and query them from the container, ensuring we don't get port collisions.
Includes a timeout increase and a tweak to our scylla_cluster handling, ensuring we don't deadlock when the pipe size is smaller than required for our sys notify messages.
Closes scylladb/scylladb#28727
* github.com:scylladb/scylladb:
gcs_fixture: Change to use docker helper
aws_kms_fixture: Modify to use docker helper
test/lib/proc_util: Add docker helper
pytest: use ephemeral port publish for docker mock servers
dbuild: Use container network in dbuild nested containers
scylla_cluster: Read notify sock in background to prevent deadlock
Forwarding CQL requests is not implemented yet, but we're already
prepared to return the target to forward to when trying to execute
strongly consistent requests. Currently, if we're not a replica
of the affected tablet, we redirect the request to the first replica
in the list.
This is not optimal, because this replica may be down or it may be
in another rack, making us perform cross-rack requests during forwarding.
Instead, we should forward the request to the replica from the same
rack and handle the case where the replica is down.
In this patch we change the replica selection for forwarding strongly
consistent requests, so that when the coordinator isn't a replica, it
redirects the request to the replica from the same rack.
If the replica from the same rack is down, or there is no replica in
our rack, we choose the next closest replica (preferring same-DC replicas
over other DCs). If no replica is alive, the query fails - the driver
should retry when some replica comes back up.
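The selection order described above can be sketched as a simple ranking. `replica`, `proximity_rank` and `pick_forward_target` are hypothetical names, liveness is a plain flag here, and ties within the same rank are broken by list order, which is an assumption of this sketch.

```cpp
#include <optional>
#include <string>
#include <vector>

// Hypothetical view of one tablet replica as seen by the coordinator.
struct replica {
    std::string host;
    std::string dc;
    std::string rack;
    bool alive;
};

// Lower is closer: 0 = same DC and rack, 1 = same DC, 2 = another DC.
inline int proximity_rank(const replica& r, const std::string& my_dc, const std::string& my_rack) {
    if (r.dc == my_dc && r.rack == my_rack) {
        return 0;
    }
    if (r.dc == my_dc) {
        return 1;
    }
    return 2;
}

// Picks the closest live replica; returns nothing when no replica is
// alive, in which case the query fails and the driver should retry.
inline std::optional<replica> pick_forward_target(const std::vector<replica>& replicas,
                                                  const std::string& my_dc,
                                                  const std::string& my_rack) {
    const replica* best = nullptr;
    int best_rank = 3;
    for (const replica& r : replicas) {
        if (!r.alive) {
            continue;
        }
        const int rank = proximity_rank(r, my_dc, my_rack);
        if (rank < best_rank) {
            best = &r;
            best_rank = rank;
        }
    }
    if (best == nullptr) {
        return std::nullopt;
    }
    return *best;
}
```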
This series adds a global read barrier to raft_group0_client, ensuring that Raft group0 mutations are applied on all live nodes before returning to the caller.
Currently, after a group0_batch::commit, the mutations are only guaranteed to be applied on the leader. Other nodes may still be catching up, leading to stale reads. This patch introduces a broadcast read barrier mechanism. Calling send_group0_read_barrier_to_live_members after committing causes the coordinator to send a read barrier RPC to all live nodes (discovered via gossiper) and wait for them to complete. This is a best-effort attempt to get cluster-wide visibility of the committed state before the response is returned to the user.
Auth and service levels write paths are switched to use this new mechanism.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-650
Backport: no, new feature
Closes scylladb/scylladb#28731
* https://github.com/scylladb/scylladb:
test: add tests for global group0_batch barrier feature
qos: switch service levels write paths to use global group0_batch barrier
auth: switch write paths to use global group0_batch barrier
raft: add function to broadcast read barrier request
raft: add gossiper dependency to raft_group0_client
raft: add read barrier RPC
Remove the rest of the code that assumes that either group0 does not exist yet or a cluster is still not upgraded to raft topology. Neither case is supported any more.
No need to backport since we remove functionality here.
Closes scylladb/scylladb#28841
* github.com:scylladb/scylladb:
service level: remove version 1 service level code
features: move GROUP0_SCHEMA_VERSIONING to deprecated features list
migration_manager: remove unused forward definitions
test: remove unused code
auth: drop auth_migration_listener since it does nothing now
schema: drop schema_registry_entry::maybe_sync() function
schema: drop make_table_deleting_mutations since it should not be needed with raft
schema: remove calculate_schema_digest function
schema: drop recalculate_schema_version function and its uses
migration_manager: drop check for group0_schema_versioning feature
cdc: drop usage of cdc_local table and v1 generation definition
storage_service: no need to add yourself to the topology during reboot since raft state loading already did it
storage_service: remove unused functions
group0: drop with_raft() function from group0_guard since it always returns true now
gossiper: do not gossip TOKENS and CDC_GENERATION_ID any more
gossiper: drop tokens from loaded_endpoint_state
gossiper: remove unused functions
storage_service: do not pass loaded_peer_features to join_topology()
storage_service: remove unused fields from replacement_info
gossiper: drop is_safe_for_restart() function and its use
storage_service: remove unused variables from join_topology
gossiper: remove the code that was only used in gossiper topology
storage_service: drop the check for raft mode from recovery code
cdc: remove legacy code
test: remove unused injection points
auth: remove legacy auth mode and upgrade code
treewide: remove schema pull code since we never pull schema any more
raft topology: drop upgrade_state and its type from the topology state machine since it is not used any longer
group0: hoist the checks for an illegal upgrade into main.cc
api: drop get_topology_upgrade_state and always report upgrade status as done
service_level_controller: drop service level upgrade code
test: drop run_with_raft_recovery parameter to cql_test_env
group0: get rid of group0_upgrade_state
storage_service: drop topology_change_kind as it is no longer needed
storage_service: drop check_ability_to_perform_topology_operation since no upgrades can happen any more
service_storage: remove unused functions
storage_service: remove non raft rebuild code
storage_service: set topology change kind only once
group0: drop in_recovery function and its uses
group0: rename use_raft to maintenance_mode and make it sync
This pull request adds support for calculating and storing CRC32 digests for all SSTable components.
This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure
and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream.
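The idea of a writer that accumulates a digest while writing can be sketched in standalone form. This is not crc32_digest_file_writer: `crc32_digest` and `digest_writer` are hypothetical names, output goes to an in-memory string, and the sketch assumes the standard zlib-compatible CRC-32 (reflected polynomial 0xEDB88320) computed bit by bit for clarity rather than speed.

```cpp
#include <cstdint>
#include <string>
#include <string_view>

// Incremental CRC-32 (zlib-compatible variant, assumed for this sketch).
class crc32_digest {
    uint32_t _state = 0xFFFFFFFFu;
public:
    void update(std::string_view data) {
        for (unsigned char byte : data) {
            _state ^= byte;
            for (int bit = 0; bit < 8; ++bit) {
                // Shift right and apply the polynomial when the low bit is set.
                _state = (_state >> 1) ^ (0xEDB88320u & (0u - (_state & 1u)));
            }
        }
    }
    uint32_t value() const { return _state ^ 0xFFFFFFFFu; }
};

// Hypothetical writer that updates the digest with every chunk it writes,
// so the component digest is available once the component is complete.
class digest_writer {
    std::string _out;
    crc32_digest _crc;
public:
    void write(std::string_view data) {
        _out.append(data);
        _crc.update(data);
    }
    const std::string& contents() const { return _out; }
    uint32_t digest() const { return _crc.value(); }
};
```

Writing "1234" followed by "56789" yields the well-known CRC-32 check value 0xCBF43926, the same as hashing "123456789" in one go.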
Several test cases were introduced to verify expected behaviour.
Additionally, this PR adds a new rewrite component mechanism for safe sstable component rewriting.
Previously, rewriting an sstable component (e.g., via rewrite_statistics) created a temporary file that was renamed to the final name after sealing. This allowed crash recovery by simply removing the temporary file on startup.
However, with component digests stored in scylla_metadata (#20100),
replacing a component like Statistics requires atomically updating both the component
and scylla_metadata with the new digest - impossible with POSIX rename.
The new mechanism creates a clone sstable with a fresh generation:
- Hard-links all components from the source except the component being rewritten and scylla_metadata
- Copies original sstable components pointer and recognized components from the source
- Invokes a modifier callback to adjust the new sstable before rewriting
- Writes the modified component along with updated scylla_metadata containing the new digest
- Seals the new sstable with a temporary TOC
- Replaces the old sstable atomically, the same way as it is done in compaction
This is built on the rewrite_sstables compaction framework to support batch operations (e.g., following incremental repair).
If any failure occurs during the whole process, the sstable will be automatically deleted on node startup because the
temporary TOC is still present on disk.
Backport is not required, it is a new feature
Fixes https://github.com/scylladb/scylladb/issues/20100, https://github.com/scylladb/scylladb/issues/27453
Closes scylladb/scylladb#28338
* github.com:scylladb/scylladb:
docs: document components_digests subcomponent and trailing digest in Scylla.db
sstable_compaction_test: Add tests for perform_component_rewrite
sstable_test: add verification testcases of SSTable components digests persistance
sstables: store digest of all sstable components in scylla metadata
sstables: replace rewrite_statistics with new rewrite component mechanism
sstables: add new rewrite component mechanism for safe sstable component rewriting
compaction: add compaction_group_view method to specify sstable version
sstables: add null_data_sink and serialized_checksum for checksum-only calculation
sstables: extract default write open flags into a constant
sstables: Add write_simple_with_digest for component checksumming
sstables: Extract file writer closing logic into separate methods
sstables: Implement CRC32 digest-only writer
Simplify code by getting rid of group0_upgrade_state: upgrade is no
longer supported, so there is no need to track its state. A non-upgraded
node will simply not boot; to detect that, the patch checks the state
directly from the system table.
This patch fixes 2 issues within strong consistency state machine:
- it might happen that apply is called before the schema is delivered to the node
- on the other hand, the apply may be called after the schema was changed and purged from the schema registry
The first problem is fixed by doing `group0.read_barrier()` before applying the mutations.
The second one is solved by upgrading the mutations using column mappings in case the version of the mutations' schema is older.
Fixes SCYLLADB-428
Strong consistency is in experimental phase, no need to backport.
Closes scylladb/scylladb#28546
* https://github.com/scylladb/scylladb:
test/cluster/test_strong_consistency: add reproducer for old schema during apply
test/cluster/test_strong_consistency: add reproducer for missing schema during apply
test/cluster/test_strong_consistency: extract common function
raft_group_registry: allow to drop append entries requests for specific raft group
strong_consistency/state_machine: find and hold schemas of applying mutations
strong_consistency/state_machine: pull necessary dependencies
db/schema_tables: add `get_column_mapping_if_exists()`
Both migration manager and system keyspace will be used in next commit.
The first one is needed to execute group0 read barrier and we need
system keyspace to get column mappings.