In this series we add support for forwarding strongly consistent CQL requests to suitable replicas, so that clients can issue reads/writes to any node and have the request executed on an appropriate tablet replica (and, for writes, on the Raft leader). We return the same CQL response as what the user would get while sending the request to the correct replica and we perform the same logging/stats updates on the request coordinator as if the coordinator was the appropriate replica.
The core mechanism of forwarding a strongly consistent request is sending an RPC containing the user's cql request frame to the appropriate replica and returning back a ready, serialized `cql_transport::response`. We do this in the CQL server - it is most prepared for handling these types and forwarding a request containing a CQL frame allows us to reuse near-top-level methods for CQL request handling in the new RPC handler (such as the general `process`)
For sending the RPC, the CQL server needs to obtain the information about who should it forward the request to. This requires knowledge about the tablet raft group members and leader. We obtain this information during the execution of a `cql3/strong_consistency` statement, and we return this information back to the CQL server using the generalized `bounce_to_shard` `response_message`, where we now store the information about either a shard, or a specific replica to which we should forward to. Similarly to `bounce_to_shard`, we need to handle this `result_message` in a loop - a replica may move during statement execution, or the Raft leader can change. We also use it for forwarding strongly consistent writes when we're not a member of the affected tablet raft group - in that case we need to forward the statement twice - once to any replica of the affected tablet, then that replica can find the leader and return this information to the coordinator, which allows the second request to be directed to the leader.
This feature also allows passing through exception messages which happened on the target replica while executing the statement. For that, many methods of the `cql_transport::cql_server::connection` for creating error responses needed to be moved to `cql_transport::cql_server`. And for final exception handling on the coordinator, we added additional error info to the RPC response, so that the handling can be performed without having the `result_message::exception` or `exception_ptr` itself.
Fixes [SCYLLADB-71](https://scylladb.atlassian.net/browse/SCYLLADB-71)
[SCYLLADB-71]: https://scylladb.atlassian.net/browse/SCYLLADB-71?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQClosesscylladb/scylladb#27517
* github.com:scylladb/scylladb:
test: add tests for CQL forwarding
transport: enable CQL forwarding for strong consistency statements
transport: add remote statement preparation for CQL forwarding
transport: handle redirect responses in CQL forwarding
transport: add exception handling for forwarded CQL requests
transport: add basic CQL request forwarding
idl: add a representation of client_state for forwarding
cql_server: handle query, execute, batch in one case
transport: inline process_on_shard in cql_server::process
transport: extract process() to cql_server
transport: add messaging_service to cql_server
transport: add response reconstruction helpers for forwarding
transport: generalize the bounce result message for bouncing to other nodes
strong consistency: redirect requests to live replicas from the same rack
transport: pass foreign_ptr into sleep_until_timeout_passes and move it to cql_server
transport: extract the error handling from process_request_one
transport: move error response helpers from connection to cql_server
This PR adds integrity verification for SSTable component files during loading. When component digests are present in Scylla metadata, the loader now validates each component's CRC32 digest against the stored expected value, catching silent corruption of component files. Index, Rows and Partitions components digests are also validated duriung scrub in validate mode
Added corruption tests that write an SSTable, flip a bit in a specific component file, then verify that reloading the SSTable detects the corruption and throws the expected exception.
Depends on https://github.com/scylladb/scylladb/pull/28338
Backport is not required, this is new feature
Fixes https://github.com/scylladb/scylladb/issues/20103Closesscylladb/scylladb#28761
* github.com:scylladb/scylladb:
test/cqlpy: test --ignore-component-digest-mismatch flag in scylla sstable upgrade
docs: document --ignore-component-digest-mismatch flag for scylla sstable upgrade
sstables: propagate ignore_component_digest_mismatch config to all load sites
sstables: add option to ignore component digest mismatches
sstable_compaction_test: Add scrub validate test for corrupted index
sstables: add tests for component digest validation on corrupted SSTables
sstables: validate index components digests during SSTable scrub in validate mode
sstables: verify component digests on SSTable load
sstables: add digest_file_random_access_reader for CRC32 digest computation
Fixes#25084
Add slirp4netns and use for nested containers. This will allow nested container port aliasing, helping CI stability.
Note: this contains and updated Dockerfile for dbuild image, but since chicken and eggs, right now will force install slirp4netns before anything in dbuild script.
Updates the mock server handling to use ephemeral ports and query from container, ensuring we don't get port collisions. (boost as well as pytest).
Includes a timeout up, and a tweak to our scylla_cluster handling, ensuring we don't deadlock when pipe size is less than requires for our sys notify messages.
Closesscylladb/scylladb#28727
* github.com:scylladb/scylladb:
gcs_fixture: Change to use docker helper
aws_kms_fixture: Modify to use docker helper
test/lib/proc_util: Add docker helper
pytest: use ephemeral port publish for docker mock servers
dbuild: Use container network in dbuild nested containers
scylla_cluster: Read notify sock in background to prevent deadlock
In the following patches, we'll start allowing forwarding requests to strongly
consistent tables so that they'll get executed on the suitable tablet Raft group
members. For that we'll reuse the approach that we already have for bouncing
requests to other shards - we'll try to execute a request locally, and the
result of that will be a bounce message with another replica as the target.
In this patch we generalize the former bounce_to_shard result message so that
it will be able to specify the target of the bounce as another shard or specific
replica.
We also rename it to result_message::bounce so that it stops implying that only
another shard may be its target.
Aside from the host_id and the shard, the new message also includes the timeout,
because in the service handling the forwarding we won't have the access to it,
and it's needed for specifying how long we should wait for the forwarded
requests. It also includes an information whether this is a write request
to return correct timeout response in case the deadline is exceeded.
We will return other hosts in the new bounce message when executing requests to
strongly consistent tables when we can't handle the request because we aren't
a suitable replica. We can't handle this message yet, so we don't return it
anywhere and we still assume that every bounce message is a bounce to the same
host.
nodetool cluster repair without additional params repairs all tablet
keyspaces in a cluster. Currently, if a table is dropped while
the command is running, all tables are repaired but the command finishes
with a failure.
Modify nodetool cluster repair. If a table wasn't specified
(i.e. all tables are repaired), the command finishes successfully
even if a table was dropped.
If a table was specified and it does not exist (e.g. because it was
dropped before the repair was requested), then the behavior remains
unchanged.
Fixes: SCYLLADB-568.
Closesscylladb/scylladb#28739
Remove the host network setting, ensuring we use private networks
(slirp4netns). This will allow nested container port aliasing,
helping CI stability (can use ephemeral ports and container
introspection).
This also makes the nested podman setup non-conditional,
since we only run podman containers inside dbuild, and need
the setup regardless if host container is docker or not.
Add `ignore_component_digest_mismatch` option to `sstable_open_config`
that logs a warning instead of throwing `malformed_sstable_exception`
on component digest mismatch. This is useful for recovering sstables
with corrupted non-vital components or working around bugs in digest
calculation.
Expose the option in scylla-sstable via the
`--ignore-component-digest-mismatch` flag for the upgrade operation.
This pull request adds support for calculation and storing CRC32 digests for all SSTable components.
This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure
and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream.
Several test cases where introduced to verify expected behaviour.
Additionally, this PR adds new rewrite component mechanism for safe sstable component rewriting.
Previously, rewriting an sstable component (e.g., via rewrite_statistics) created a temporary file that was renamed to the final name after sealing. This allowed crash recovery by simply removing the temporary file on startup.
However, with component digests stored in scylla_metadata (#20100),
replacing a component like Statistics requires atomically updating both the component
and scylla_metadata with the new digest - impossible with POSIX rename.
The new mechanism creates a clone sstable with a fresh generation:
- Hard-links all components from the source except the component being rewritten and scylla_metadata
- Copies original sstable components pointer and recognized components from the source
- Invokes a modifier callback to adjust the new sstable before rewriting
- Writes the modified component along with updated scylla_metadata containing the new digest
- Seals the new sstable with a temporary TOC
- Replaces the old sstable atomically, the same way as it is done in compaction
This is built on the rewrite_sstables compaction framework to support batch operations (e.g., following incremental repair).
In case of any failure durning the whole process, sstable will be automatically deleted on the node startup due to
temporary toc persistence.
Backport is not required, it is a new feature
Fixes https://github.com/scylladb/scylladb/issues/20100, https://github.com/scylladb/scylladb/issues/27453Closesscylladb/scylladb#28338
* github.com:scylladb/scylladb:
docs: document components_digests subcomponent and trailing digest in Scylla.db
sstable_compaction_test: Add tests for perform_component_rewrite
sstable_test: add verification testcases of SSTable components digests persistance
sstables: store digest of all sstable components in scylla metadata
sstables: replace rewrite_statistics with new rewrite component mechanism
sstables: add new rewrite component mechanism for safe sstable component rewriting
compaction: add compaction_group_view method to specify sstable version
sstables: add null_data_sink and serialized_checksum for checksum-only calculation
sstables: extract default write open flags into a constant
sstables: Add write_simple_with_digest for component checksumming
sstables: Extract file writer closing logic into separate methods
sstables: Implement CRC32 digest-only writer
This change replaces plain file_writer with crc32_digest_file_writer
for all SSTable components that should be checksummed. The resulting component
digests are stored in scylla metadata component.
This also extends new rewrite component mechanism,
to rewrite metadata with updated digest together with the component.
It is ambigous, use the appropriate no-gc or gc-all factories instead,
as appropriate.
A special note for mutation::compacted(): according to the comment above
it, it doesn't drop expired tombstones but as it is currently, it
actually does. Change the tombstone gc param for the underlying call to
compact_for_compaction() to uphold the comment. This is used in tests
mostly, so no fallout expected.
Tests are handled in the next commit, to reduce noise.
Two tests in mutation_test.cc have to be updated:
* test_compactor_range_tombstone_spanning_many_pages
has to be updated in this commit, as it uses
mutation_partition::compact_for_query() as well as
compact_for_query(). The test passes default constructed
tombstone_gc() to the latter while the former now uses no-gc
creating a mismatch in tombstone gc behaviour, resulting in test
failure. Update the test to also pass no-gc to compact_for_query().
* test_query_digest similarly uses mutation_partition::query_mutation()
and another compaction method, having to match the no-gc now used in
query_mutation().
Split input sstable(s) into multiple output sstables based on the provided
token boundaries. The input sstable(s) are divided according to the specified
split tokens, creating one output sstable per token range.
Fixes: SCYLLADB-10
Closesscylladb/scylladb#28741
The query (and in certain modes the write) operations uses virtual table facility inside `cql_test_env`. The schema of the sstable is created as a table in `cql_test_env`. This involves registering all UDTs with the keyspace, so they are available for lookups.
This was done with a flat loop over all column types, but this is not enough. UDTs might be nested in other types, like collections. One has to do a traversal of the type tree and register every UDT on the way.
This PR changes the flat loop to a recursive traversal of the type tree. The query operation now works with UDTs, no matter how deeply nested they are.
Backport: Implements missing functionality of a tool, no backport.
Closesscylladb/scylladb#28798
* github.com:scylladb/scylladb:
tools/scylla-sstable: create_table_in_cql_env(): register UDTs recursively
tools/scylla-sstable: generalize dump_if_user_type
tools/scylla-sstable: move dump_if_user_type() definition
3f7ee3ce5d introduced system.batchlog_v2, with a schema designed to speed up batchlog replays and make post-replay cleanups much more effective.
It did not introduce a cluster feature for the new table, because it is node local table, so the cluster can switch to the new table gradually, one node at a time.
However, https://github.com/scylladb/scylladb/issues/27886 showed that the switching causes timeouts during upgrades, in mixed clusters. Furthermore, switching to the new table unconditionally on upgrades nodes, means that on rollback, the batches saved into the v2 table are lost.
This PR introduces re-introduces v1 (`system.batchlog`) support and guards the use of the v2 table with a cluster feature, so mixed clusters keep using v1 and thus be rollback-compatible.
The re-introduced v1 support doesn't support post-replay cleanups for simplicity. The cleanup in v1 was never particularly effective anyway and we ended up disabling it for heavy batchlog users, so I don't think the lack of support for cleanup is a problem.
Fixes: https://github.com/scylladb/scylladb/issues/27886
Needs backport to 2026.1, to fix upgrades for clusters using batches
Closesscylladb/scylladb#28736
* github.com:scylladb/scylladb:
test/boost/batchlog_manager_test: add tests for v1 batchlog
test/boost/batchlog_manager_test: make prepare_batches() work with both v1 and v2
test/boost/batchlog_manager_test: fix indentation
test/boost/batchlog_manager_test: extract prepare_batches() method
test/lib/cql_assertions: is_rows(): add dump parameter
tools/scylla-sstable: extract query result printers
tools/scylla-sstable: add std::ostream& arg to query result printers
repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log
db/batchlog_manager: re-add v1 support
db/batchlog_manager: return all_replayed from process_batch()
db/batchlog_manager: process_bath() fix indentation
db/batchlog_manager: make batch() a standalone function
db/batchlog_manager: make structs stats public
db/batchlog_manager: allocate limiter on the stack
db/batchlog_manager: add feature_service dependency
gms/feature_service: add batchlog_v2 feature
Add make_sstable() overload that accepts sstable_version_types parameter
to compaction_group_view interface and all implementations.
This will be useful in rewrite component mechanism, as we
need to preserve sstable version when creating the new one for the replacement.
It is not enough to go over all column types and register the UDTs. UDTs
might be nested in other types, like collections. One has to do a
traversal of the type tree and register every UDT on the way. That is
what this patch does.
This function is used by the query and write operations, which should
now both work with nested UDTs.
Add a test which fails before and passes after this patch.
The future toolchain did not build the sanitizers, so debug
executables did not link. Fix by not disabling the sanitizers.
Closesscylladb/scylladb#28733
sccache combines the functions of ccache and distcc, and
promises to support C++20 modules in the future. Switch
to sccache in anticipation of modules support.
The documentation is adjusted since cache will be
persistent for sccache without further work.
Closesscylladb/scylladb#28524
The "--primary-replica-only" ("-pro") flag was previously ignored by
the `restore` operation. This patch ensures the argument is parsed and
applied correctly.
Closesscylladb/scylladb#28490
This reverts commit bcd1758911, reversing
changes made to b2c2a99741.
There is a design decision to not introduce additional test
orchestration tool for scylladb.git (see comments for #27499). One
commit has already been reverted in 55c7bc7. Last CI runs made validator
test flaky, so it is a time to remove all remaining validator tests.
It needs a backport to 2026.1 to remove remaining validator tests from there.
Fixes: VECTOR-497
Closesscylladb/scylladb#28568
Next Fedora will likely not have toxiproxy packaged [1]. Adapt
by installing it directly. To avoid changing the current toolchain,
add a ./install-dependencies --future option. This will allow us
to easily go back to the packages if the Fedora bug is fixed.
[1] https://bugzilla.redhat.com/show_bug.cgi?id=2426954Closesscylladb/scylladb#28444
Filter the content of sstable(s), including or excluding the specified
partitions. Partitions can be provided on the command line via
`--partition`, or in a file via `--partitions-file`.
Produces one output sstable per input sstable -- if the filter selects
at least one partition in the respective input sstable.
Output sstables are placed in the path provided via `--oputput-dir`.
Use `--merge` to filter all input sstables combined, producing one
output sstable.
This flag was added to operations which have an --output-dir
command-line arguments. These operations write sstables and need a
directory where to write them. Back in the numeric-generation world this
posed a problem: if the directory contained any sstable, generation
clash was almost guaranteed, because each scylla-sstable command
invokation would start output generations from 1. To avoid this, empty
output directory was a requirement, with the
--unsafe-accept-nonempty-output-dir allowing for a force-override.
Now in the timeuuid generation days, all this is not necessary anymore:
generations are unique, so it is not a problem if the output directory
already contains sstables: the probability of generation clash is almost
0. Even if it happens, the tool will just simply fail to write the new
sstable with the clashing generation.
Remove this historic relic of a flag and the related logic, it is just a
pointless nuissance nowadays.
In PR 5b6570be52 we introduced the config option `sstable_compression_user_table_options` to allow adjusting the default compression settings for user tables. However, the new option was hooked into the CQL layer and applied only to CQL base tables, not to the whole spectrum of user tables: CQL auxiliary tables (materialized views, secondary indexes, CDC log tables), Alternator base tables, Alternator auxiliary tables (GSIs, LSIs, Streams).
This gap also led to inconsistent default compression algorithms after we changed the option’s default algorithm from LZ4 to LZ4WithDicts (adf9c426c2).
This series introduces a general “schema initializer” mechanism in `schema_builder` and uses it to apply the default compression settings uniformly across all user tables. This ensures that all base and aux tables take their default compression settings from config.
Fixes#26914.
Backport justification: LZ4WithDicts is the new default since 2025.4, but the config option exists since 2025.2. Based on severity, I suggest we backport only to 2025.4 to maintain consistency of the defaults.
Closesscylladb/scylladb#27204
* github.com:scylladb/scylladb:
db/config: Update sstable_compression_user_table_options description
schema: Add initializer for compression defaults
schema: Generalize static configurators into schema initializers
schema: Initialize static properties eagerly
db: config: Add accessor for sstable_compression_user_table_options
test: Check that CQL and Alternator tables respect compression config
The instructions for building optimized clang neglected to mention
that the clang version to be built must be specified. Correct that.
Closesscylladb/scylladb#28135
fmt::localtime() is now deprecated, users should migrate to equivalents
from the standard libraries.
std::localtime is not thread safe, so a local wrapper is introduced,
based on the thread-safe localtime_r() (from libc).
Closesscylladb/scylladb#27821
In PR 5b6570be52 we introduced the config option
`sstable_compression_user_table_options` to allow adjusting the default
compression settings for user tables. However, the new option was hooked
into the CQL layer and applied only to CQL base tables, not to the whole
spectrum of user tables: CQL auxiliary tables (materialized views,
secondary indexes, CDC log tables), Alternator base tables, Alternator
auxiliary tables (GSIs, LSIs, Streams).
Fix this by moving the logic into the `schema_builder` via a schema
initializer. This ensures that the default compression settings are
applied uniformly regardless of how the table is created, while also
keeping the logic in a central place.
Register the initializer at startup in all executables where schemas are
being used (`scylla_main()`, `scylla_sstable_main()`, `cql_test_env`).
Finally, remove the ad-hoc logic from `create_table_statement`
(redundant as of this patch), remove the xfail markers from the relevant
tests and adjust `test_describe_cdc_log_table_create_statement` to
expect LZ4WithDicts as the default compressor.
Fixes#26914.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Unused imports, unused variables and such.
Initially, there were no functional changes, just to get rid of some standard CodeQL warnings.
I've then broken the CI, as apparently there's a install time(!?) Python script creation for the sole purpose of product
naming. I changed it - we have it in etcdir, as SCYLLA-PRODUCT-FILE.
So added (copied from a different script) a get_product() helper function in scylla_util.py and used it instead.
While at it, also fixed the too broad import from scylla_util, which 'forced' me to also fix other specific imports (such as shutil).
Improvement - no need to backport.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Closesscylladb/scylladb#27883
Allow creating materialized views and secondary indexes in a tablets keyspace only if it's RF-rack-valid, and enforce RF-rack-validity while the keyspace has views by restricting some operations:
* Altering a keyspace's RF if it would make the keyspace RF-rack-invalid
* Adding a node in a new rack
* Removing / Decommissioning the last node in a rack
Previously the config option `rf_rack_valid_keyspaces` was required for creating views. We now remove this restriction - it's not needed because we always maintain RF-rack-validity for keyspaces with views.
The restrictions are relevant only for keyspaces with numerical RF. Keyspace with rack-list-based RF are always RF-rack-valid.
Fixesscylladb/scylladb#23345
Fixes https://github.com/scylladb/scylladb/issues/26820
backport to relevant versions for materialized views with tablets since it depends on rf-rack validity
Closesscylladb/scylladb#26354
* github.com:scylladb/scylladb:
docs: update RF-rack restrictions
cql3: don't apply RF-rack restrictions on vector indexes
cql3: add warning when creating mv/index with tablets about rf-rack
service/tablet_allocator: always allow tablet merge of tables with views
locator: extend rf-rack validation for rack lists
test: test rf-rack validity when creating keyspace during node ops
locator: fix rf-rack validation during node join/remove
test: test topology restrictions for views with tablets
test: add test_topology_ops_with_rf_rack_valid
topology coordinator: restrict node join/remove to preserve RF-rack validity
topology coordinator: add validation to node remove
locator: extend rf-rack validation functions
view: change validate_view_keyspace to allow MVs if RF=Racks
db: enforce rf-rack-validity for keyspaces with views
replica/db: add enforce_rf_rack_validity_for_keyspace helper
db: remove enforce parameter from check_rf_rack_validity
test: adjust test to not break rf-rack validity
To avoid surprises when libstdc++, clang, or other components
in the toolchain introduce regressions, we introduce a "future
toolchain". This builds on the Fedora version under active
development, and the development branches of gcc and llvm.
The future toolchain is not intended to be frozen. Rather,
periodically we will build the future toolchain, then build
ScyllaDB and run its unit tests under that toolchain, then
discard it. Any problems will then have be be tracked down
by a developer and either reported to the source repository,
or fixed in ScyllaDB.
Closesscylladb/scylladb#27964
* tools/cqlsh scylladb/scylla-cqlsh@9e5a91d7...scylladb/scylla-cqlsh@5a1d7842 (9):
> fix wrong reference in copyutil.py
> Add GitHub Action workflow to create releases on new tags
> test_copyutil.py: introdcue test for ImportTask
> fix(copyutil.py): avoid situatuions file might be move withing multiple processes
> Fix Unix socket port display in show_host() method
> Merge pull request #157 from scylladb/alert-autofix-1
.github/workflows/build-push.yml: Potential fix for code scanning alert no. 1: Workflow does not contain permissions
> .github/workflows/dockerhub-description.yml: Potential fix for code scanning alert no. 9: Workflow does not contain permissions
> test_cqlsh_output: skip some cassandra 5.0 table options
> tests: template compression cql to use `class` insted of `sstable_comprission`
> Pin Cassandra version to 5.0 for reproducible builds
> Remove scylla-enterprise integration test and update Cassandra to latest
Closesscylladb/scylladb#27924
In afb96b6387, we added support for sccache. As a side effect
it changed the method of invoking ccache from transparent via PATH
(if it contains /usr/lib64/ccache) to explicit, by changing the compiler
command line from 'clang++' (which may or may not resolve the the ccache
binary) to 'ccache /usr/local/bin/clang++', which always invokes ccache.
In the default dbuild configuration, PATH does not contain /usr/lib64/ccache,
so ccache isn't invoked by default. Users can change this via the
SCYLLADB_DBUILD environment variable.
As a result of ccache being suddenly enabled for dbuild builds, ccache
will now attempt to create ~/.cache/ccache. Under docker, this does
not work, because we bind-mount ~/.cache/dbuild. Docker will create the
intermediate ~/.cache, but under the root user, not $USER. The intermediate
directory being root-owned prevents ~/.cache/ccache from being created.
Under podman, this does work, because everything runs under the container's
root user.
The fix is to bind-mount the entire ~/.ccache into the container. This
not only lets ccache create the directory, it will also find an existing
~/.cache/ccache directory and use it, enabling reuse across invocations.
Since ccache will now respect configuration changes without access to
its configuration file (notably, the maximum cache size), we also
bind-mount ~/.config.
Since ~/.ccache and ~/.config are not automatically created, we create
them explicitly so the bind mounts can work. This is for new nodes enlisted
from the cloud; developer machines will have those directories preexisting.
Note that the ccache directory used to be ~/.ccache, but was later changed.
Had the author known, we would have bind-mounted ~/.cache much earlier.
Fixes#27919.
Closesscylladb/scylladb#27920
If the table uses UDTs, include the description of these (CREATE TYPE
statement) in the schema dump. Without these the schema is not useful.
Closesscylladb/scylladb#27559