During shutdown, group0 may be torn down while
cache_table_info() has a detached setup_table() future
in flight. This causes raft_group_not_found to propagate
as an abandoned failed future.
Add .handle_exception() to log the failure at debug level
instead of leaving the future unobserved.
Fixes: SCYLLADB-2224
Backport to 2026.2 and 2026.1, because the test failed on 2026.1
Closes scylladb/scylladb#30093
* github.com:scylladb/scylladb:
test: table_helper: verify detached setup failure is consumed
table_helper: observe detached setup_table() future
Adds write-path guardrails that reject or warn on mutations targeting partitions, rows, or collections that already exceed configured size thresholds, based on SSTable `large_data_record` metadata.
ScyllaDB already detects and records large partitions/rows/cells in `system.large_data_records` after compaction, but takes no preventive action on the write path. Once a partition grows past operational limits it causes latency spikes, OOM, and repair failures. These guardrails let operators set hard and soft thresholds so that writes to already-oversized data are rejected (hard) or logged as warnings (soft) before they make the problem worse.
- **Intrusive index over SSTable metadata**: A per-table `large_data_record_index` maintains three `boost::intrusive::multiset`s (partitions, rows, cells) using `auto_unlink` hooks directly on `large_data_record`. SSTable destruction automatically removes records from the index — no explicit deregistration needed.
- **Virtual dispatch for zero-cost disabled path**: `large_data_guardrail_base` → `noop_large_data_guardrail` / `large_data_guardrail`. Tables without guardrails enabled pay only a virtual call to a no-op. No index is built or maintained for disabled tables.
- **Schema storage**: The per-table flag is stored as a scylla_tables column, following the tablets pattern: only write a live cell when enabled, omit entirely when disabled. The CQL feature gate prevents enabling until all nodes are upgraded.
- **Write-path integration**: The guardrail check runs in `do_apply` after the frozen mutation is deserialized but before it is applied to the memtable. Hint replay and Paxos learn skip the check via `skip_large_data_guardrails`.
Uses existing `large_*_warn_threshold` config options as soft limits and new `large_*_fail_threshold` options as hard limits. Checked dimensions:
- Partition size (bytes)
- Partition row count
- Row size (bytes)
- Collection element count
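The dispatch and threshold logic described above can be sketched roughly as follows. This is a minimal illustration, not the code from the series: the names `verdict`, `guardrail_base`, `noop_guardrail`, `threshold_guardrail` and `make_guardrail` are hypothetical, and the sketch assumes that "exceeds" means strictly greater than the threshold.

```cpp
#include <cassert>
#include <cstdint>
#include <memory>

// Hypothetical names throughout; this is not ScyllaDB's actual interface.
enum class verdict { allow, warn, reject };

struct guardrail_base {
    virtual ~guardrail_base() = default;
    // `recorded` is the size already recorded for the write target,
    // e.g. partition bytes taken from large-data metadata.
    virtual verdict check(uint64_t recorded) const = 0;
};

// Used when guardrails are disabled for a table: one virtual call, no lookups.
struct noop_guardrail final : guardrail_base {
    verdict check(uint64_t) const override { return verdict::allow; }
};

struct threshold_guardrail final : guardrail_base {
    uint64_t warn_threshold;  // soft limit
    uint64_t fail_threshold;  // hard limit

    threshold_guardrail(uint64_t warn, uint64_t fail)
        : warn_threshold(warn), fail_threshold(fail) {}

    verdict check(uint64_t recorded) const override {
        // Assumption: a limit is "exceeded" only when strictly above it.
        if (recorded > fail_threshold) {
            return verdict::reject;
        }
        if (recorded > warn_threshold) {
            return verdict::warn;
        }
        return verdict::allow;
    }
};

// Picks the implementation once, so the write path never branches on "enabled".
std::unique_ptr<guardrail_base> make_guardrail(bool enabled, uint64_t warn, uint64_t fail) {
    if (!enabled) {
        return std::make_unique<noop_guardrail>();
    }
    return std::make_unique<threshold_guardrail>(warn, fail);
}
```

Choosing the implementation at table setup time keeps the disabled path down to a single virtual call, which is the property the design bullet above aims for.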
Backport is not required
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-180
Closes scylladb/scylladb#29733
* github.com:scylladb/scylladb:
test/cqlpy: add per-table toggle, LWT exemption, and multi-category tests
test/cqlpy: add large collection guardrail tests
test/cqlpy: add large row guardrail tests
test/cqlpy: add large partition guardrail tests
test/boost: add large_data_guardrail unit tests
test/cluster: add large data guardrails rolling upgrade test
replica: wire large_data_guardrail into the write path
schema: add per-table large_data_guardrails_enabled flag
db: implement large_data_guardrail
db: implement large_data_record_index
sstables: add intrusive index hook to large_data_record
db: add large_collection_elements_fail_threshold config option
db: add large_row_fail_threshold_mb config option
db: add rows_count_fail_threshold config option
db: add large_partition_fail_threshold_mb config option
replica: introduce large_data_exception
In 6165124fcc, we changed analysis of expressions in the WHERE clause
to use predicates, an annotated form of an expression that constrains a column
when the expression is set to true.
Here, we exploit this work to simplify the analysis further, reusing already computed
attributes rather than re-analyzing the expression.
Not backporting, this is a refactor with no functional change and no bugs fixed.
Closes scylladb/scylladb#30049
* github.com:scylladb/scylladb:
cql3: statement_restrictions: simplify find_idx to return only the index
cql3: statement_restrictions: replace has_only_eq_binops with tracked booleans
cql3: statement_restrictions: use index-selection predicates for value_for_index_partition_key
cql3: statement_restrictions: replace find_clustering_order with predicate order field
cql3: statement_restrictions: replace has_partition_token with variant check
cql3: statement_restrictions: replace has_slice with predicate is_slice check
cql3: statement_restrictions: replace contains_multi_column_restriction filter with _has_multi_column
cql3: statement_restrictions: remove unused find_needs_filtering and has_slice_or_needs_filtering
cql3: statement_restrictions: replace has_slice_or_needs_filtering with tracked bool
cql3: statement_restrictions: replace contains_multi_column_restriction with _has_multi_column
cql3: statement_restrictions: replace find_needs_filtering with predicate op check
cql3: statement_restrictions: replace find_binop is_on_collection with tracked bool
cql3: statement_restrictions: replace find_binop column extraction with predicate on field
cql3: statement_restrictions: set op on all binary-operator-derived predicates
The expression returned as the second element of find_idx()'s pair was
stored in view_indexed_table_select_statement::_used_index_restrictions
but never read — dead code. Simplify find_idx() to return just the
optional<index>, and remove the dead member and constructor parameter
from view_indexed_table_select_statement.
The now unused _idx_restrictions is also removed.
Add 8 tests covering the record_compare template comparator,
intrusive multiset equal_range grouping with heterogeneous
lookup_key, and auto_unlink on record destruction.
The schema_registry_grace_period field on schema_ctxt was only used by
schema_registry itself for eviction timing. Move it to be a direct member
of schema_registry, passed at init() time. This removes one db::config
dependency from schema_ctxt.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closes scylladb/scylladb#30038
Add a per-table large_data_guardrails_enabled flag controlled via the CQL
table property WITH large_data_guardrails_enabled = true|false.
Store the flag as a boolean column in system_schema_ext.scylla_tables.
Only write a live cell when enabled; when disabled (the default), omit
the cell entirely so that old nodes that don't know this column can
still read the SSTable during rolling upgrade or rollback. When the
property transitions from true to false via ALTER TABLE, a tombstone is
written in make_update_table_mutations to override the previous live
cell — this is safe because the CQL feature gate ensures all nodes are
upgraded before the property can be set to true.
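The storage rule above boils down to a three-way decision per schema change. A minimal sketch, with the hypothetical names `cell_action` and `flag_cell_action` standing in for whatever the real schema code does:

```cpp
#include <cassert>

// Hypothetical model of what gets written for the flag's column.
enum class cell_action { omit, live, tombstone };

// `previous` is the flag before the schema change (false for a new table),
// `current` is the flag after it.
cell_action flag_cell_action(bool previous, bool current) {
    if (current) {
        return cell_action::live;       // enabled: write a live cell
    }
    if (previous) {
        return cell_action::tombstone;  // true -> false: shadow the old live cell
    }
    return cell_action::omit;           // never enabled: old nodes see no unknown column
}
```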
Gate the CQL property behind the LARGE_DATA_GUARDRAILS cluster feature:
attempting to set large_data_guardrails_enabled = true before all nodes
advertise the feature raises a ConfigurationException.
When two partition keys share the same token, their relative order is
determined by their raw serialized bytes (legacy_tri_compare), which
matches the physical on-disk order in SSTables. The validator was
using partition_key::tri_compare instead — a type-aware comparator
that can disagree with byte order for types like timeuuid.
The result was a false-positive "out-of-order partition key" error
for any two same-token partitions whose timeuuid (or other type-aware)
order is the reverse of their byte order. In scrub mode this caused
the second partition to be silently dropped.
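To see how the two orders can disagree, here is a small standalone sketch. It does not use ScyllaDB's comparators: `bytewise_compare`, `timestamp_compare` and `make_v1` are illustrative helpers, and the type-aware order is modelled simply as "compare the version-1 UUID timestamp", relying on the RFC 4122 layout in which the low timestamp bits come first in the byte stream.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstdint>

using uuid_bytes = std::array<uint8_t, 16>;

// Lexicographic comparison of the raw bytes: returns <0, 0 or >0.
int bytewise_compare(const uuid_bytes& a, const uuid_bytes& b) {
    if (a == b) {
        return 0;
    }
    return std::lexicographical_compare(a.begin(), a.end(), b.begin(), b.end()) ? -1 : 1;
}

// 60-bit timestamp of a version-1 UUID (RFC 4122 layout):
// bytes 0-3 hold time_low, bytes 4-5 time_mid, bytes 6-7 version + time_hi.
uint64_t v1_timestamp(const uuid_bytes& u) {
    uint64_t time_low = (uint64_t(u[0]) << 24) | (uint64_t(u[1]) << 16) | (uint64_t(u[2]) << 8) | u[3];
    uint64_t time_mid = (uint64_t(u[4]) << 8) | u[5];
    uint64_t time_hi = (uint64_t(u[6] & 0x0f) << 8) | u[7];
    return (time_hi << 48) | (time_mid << 32) | time_low;
}

// Stand-in for a type-aware comparison: order by timestamp only.
int timestamp_compare(const uuid_bytes& a, const uuid_bytes& b) {
    uint64_t ta = v1_timestamp(a);
    uint64_t tb = v1_timestamp(b);
    return ta < tb ? -1 : (ta > tb ? 1 : 0);
}

// Builds a version-1 UUID with the given low and mid timestamp fields
// and everything else zeroed (illustration only).
uuid_bytes make_v1(uint32_t time_low, uint16_t time_mid) {
    uuid_bytes u{};
    u[0] = static_cast<uint8_t>(time_low >> 24);
    u[1] = static_cast<uint8_t>(time_low >> 16);
    u[2] = static_cast<uint8_t>(time_low >> 8);
    u[3] = static_cast<uint8_t>(time_low);
    u[4] = static_cast<uint8_t>(time_mid >> 8);
    u[5] = static_cast<uint8_t>(time_mid);
    u[6] = 0x10;  // version 1, time_hi = 0
    return u;
}
```

With `a = make_v1(0xFFFFFFFF, 0)` and `b = make_v1(0, 1)`, `a` sorts after `b` byte-wise but before `b` by timestamp, which is the kind of pair that tripped the validator.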
Fixes: SCYLLADB-2304
Closes scylladb/scylladb#30120
When partition_split_builder splits a tablet metadata partition into
multiple mutations, the first mutation gets the partition tombstone
and/or static row while subsequent mutations contain only clustered
rows. The hint logic would correctly clear tokens (marking a full
partition read) upon seeing the tombstone in the first mutation, but
then re-add tokens when processing the subsequent row-only mutations.
This caused update_tablet_metadata to attempt a point update via
mutate_tablet_map_async on a tablet map that doesn't exist yet during
bootstrap, throwing no_such_tablet_map and failing the snapshot transfer.
Fix by adding a full_read flag to table_hint. Once a full partition read
is decided (due to partition tombstone, range tombstone, static row, or
row deletion), the flag prevents subsequent mutations for the same table
from re-adding tokens. Additionally, fall back to a full partition read
when the tablet map is missing locally, which happens when the joining
node receives tablet metadata for a table it has never seen before.
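A simplified model of the sticky flag, to make the ordering issue concrete. The structures below are hypothetical stand-ins (the real `table_hint` and the mutation inspection code look different); only the control flow is meant to mirror the description above.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified per-table hint.
struct table_hint {
    bool full_read = false;       // once set, the whole partition is re-read
    std::vector<int64_t> tokens;  // point updates, used only when !full_read
};

// Hypothetical summary of one mutation produced by the split.
struct mutation_summary {
    bool has_partition_tombstone = false;
    bool has_range_tombstone = false;
    bool has_static_row = false;
    bool has_row_deletion = false;
    std::vector<int64_t> row_tokens;
};

void update_hint(table_hint& hint, const mutation_summary& m, bool tablet_map_known) {
    if (hint.full_read) {
        return;  // decision is sticky: later row-only mutations must not re-add tokens
    }
    bool needs_full_read = !tablet_map_known
        || m.has_partition_tombstone
        || m.has_range_tombstone
        || m.has_static_row
        || m.has_row_deletion;
    if (needs_full_read) {
        hint.full_read = true;
        hint.tokens.clear();
        return;
    }
    hint.tokens.insert(hint.tokens.end(), m.row_tokens.begin(), m.row_tokens.end());
}
```

Feeding a tombstone-carrying mutation followed by a row-only mutation leaves `tokens` empty, whereas without the early return the second call would repopulate it.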
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-2303.
Needs backports to 2026.1+: the regression was introduced in 2026.1 by b17a36c071.
Closes scylladb/scylladb#30115
* github.com:scylladb/scylladb:
tablets: fall back to full partition read when tablet map is missing
tablets: fix hint re-adding tokens after full partition read decision
When partition_split_builder splits a tablet metadata partition into
multiple mutations, the first mutation gets the partition tombstone and/or
static row while subsequent mutations contain only clustered rows.
The tablet metadata change hint logic would correctly clear tokens (marking
a full partition read) upon seeing the tombstone in the first mutation,
but then re-add tokens when processing the subsequent row-only mutations.
This caused update_tablet_metadata to attempt a point update via
mutate_tablet_map_async on a tablet map that doesn't exist yet during
bootstrap, throwing no_such_tablet_map and failing the snapshot transfer.
Fix by adding a full_read flag to table_hint. Once a full partition read
is decided (due to partition tombstone, range tombstone, static row, or
row deletion), the flag prevents subsequent mutations for the same table
from re-adding tokens.
Replace all uses of the deprecated seastar::smp::count with this_smp_shard_count() and smp::all_cpus() with this_smp_all_shards() across the ScyllaDB codebase (seastar submodule untouched).
Both replacement functions require a reactor thread context. All call sites were verified to run on reactor threads.
Notable cases:
- dht/token-sharding.hh: this_smp_shard_count() is used as a default parameter value. This is safe since all callers are on reactor threads, but the expression is now evaluated at each call site rather than being a reference to a global variable.
- service/storage_service.hh, locator/abstract_replication_strategy.hh, ent/encryption/encryption.cc: used in default member initializers and constructor member-init-lists. Objects are always constructed on reactor threads.
- schema_builder: sometimes called from BOOST_AUTO_TEST_CASE without a reactor. Added a pre-patch that makes the implicit shard count parameter explicit and passes 1 in those cases.
Not changed:
- scylla-gdb.py: reads smp::count as a GDB symbol (no reactor context).
- Python test files: only reference smp::count in comments/strings.
No backport: the Seastar commit that deprecated these functions hasn't made (and won't make) its way into any release branch, and the warnings are cosmetic anyway.
Closes scylladb/scylladb#29990
* github.com:scylladb/scylladb:
treewide: replace deprecated smp::count and smp::all_cpus() with new APIs
scylla-gdb: read shard count from smp::_this_smp instead of smp::count
schema_builder: make shard_count an explicit constructor parameter
Replace all uses of the deprecated seastar::smp::count with
this_smp_shard_count() and smp::all_cpus() with this_smp_all_shards()
across the ScyllaDB codebase (seastar submodule untouched).
Both replacement functions require a reactor thread context. All call
sites were verified to run on reactor threads.
Notable cases:
- dht/token-sharding.hh: this_smp_shard_count() is used as a default
parameter value. This is safe since all callers are on reactor threads,
but the expression is now evaluated at each call site rather than being
a reference to a global variable.
- service/storage_service.hh, locator/abstract_replication_strategy.hh,
ent/encryption/encryption.cc: used in default member initializers and
constructor member-init-lists. Objects are always constructed on reactor
threads.
Not changed:
- scylla-gdb.py: reads smp::count as a GDB symbol (no reactor context).
- Python test files: only reference smp::count in comments/strings.
Queries are stored and passed around as sstring/std::string_view. While normally they are small enough to not cause problems, as the `test_cdc_large_values.TestLargeColumnsWithCDC.test_single_column_blob_max_size_with_cdc_preimage_full_postimage[unprepared_statements]` demonstrates, queries can be arbitrarily large, putting heavy strain on Scylla internals via large allocations, in the extreme case causing denial of service.
This PR attempts to alleviate this by using fragmented storage for queries: read the query as a fragmented string from the input stream in `transport/server.cc`, propagate it as such to `query_processor::prepare()` and also store it as such in `cql3::cql_statement::raw_cql_statement`. Also avoid linearizing raw values in the CQL expression tree: switch `cql3::expr::untyped_constant::raw_text` to fragmented storage.
For this to be possible, some infrastructure code had to be made fragmented storage friendly: ascii/utf8 validation, hashers, from_hex and importantly: `abstract_type::from_string()`.
Unfortunately, the query still has to be linearized for parsing itself: although ANTLR allows custom InputStream implementations, it does pointer arithmetic on the pointers obtained from them, so fragmented input cannot be used.
Still, this PR limits the places where the query is linearized to the
following:
* Parsing
* Audit
* Logs and error messages
So the normal query paths for queries that actually can get arbitrarily large (UPDATE and INSERT) should only linearize the query temporarily for parsing.
Fixes #10779
Improvement, no backport
Closes scylladb/scylladb#28619
* github.com:scylladb/scylladb:
tracing: add_query(): change query param to utils::chunked_string
cql3: store raw query string in utils::chunked_string
serializer: add serializer<utils::chunked_string>
utils/reusable_buffer: add get_linearized_view(managed_bytes_view)
cql3/expr: use utils::chunked_string for untyped_constant::raw_text
types: abstract_type::from_string() switch to fragmented buffers (implementation)
types: abstract_type::from_string() switch to fragmented buffers (interface)
types: use write_fragmented from utils/fragment_range.hh
types: timestamp_from_string(): don't assume std::string_view is null-terminated
types/duration: don't assume std::string_view is null-terminated
utils/hashers: add calculate(managed_bytes_view) overload
utils/ascii: add validate(managed_bytes_view) overload
utils: add managed_bytes_fwd.hh
utils: add chunked_string
utils: add managed_bytes_basic_view::byte_iterator
Add test_best_effort_setup_table_failure_is_consumed which
triggers a setup_table() failure via a missing keyspace and
asserts no abandoned future escapes. This guards against
regressions where the detached future loses its exception
handler.
Remove the test_skipped_no_error_injection placeholder, since
the new test runs unconditionally and keeps the suite non-empty
in all build modes.
A recent Seastar update deprecated smp::count and introduced
this_smp_shard_count() as a replacement. One difference is that
this_smp_shard_count() wants to run on a reactor thread.
This poses a problem for non-reactor tests (BOOST_AUTO_TEST_CASE)
that nevertheless use a schema, as the schema_builder constructor
references smp::count. If we replace it with this_smp_shard_count()
then it will crash when running without a reactor.
To fix, remove the implicit this_smp_shard_count() call from raw_schema's
constructor and require callers to pass shard_count explicitly to
schema_builder. This allows tests that don't run on a reactor thread
to construct schemas without crashing.
Production code and reactor-based tests pass this_smp_shard_count().
Non-reactor test files (expr_test, keys_test, nonwrapping_interval_test,
wrapping_interval_test, bti_key_translation_test, range_tombstone_list_test)
pass a fixed shard count of 1.
Note: sstable_test.cc is a Seastar test file (SEASTAR_THREAD_TEST_CASE)
but also contains one plain BOOST_AUTO_TEST_CASE
(test_empty_key_view_comparison) that constructs a schema_builder without
a reactor context. This test also receives a fixed shard count of 1.
Several error injection sites use the low-level get_injection_parameters() API to fetch the entire parameters map and then manually look up a single key. The inject_parameter() API is better suited for these cases — it combines the enabled check and typed single-parameter extraction in one call, returning std::optional.
Cleaning error injection usage, not backporting
Closes scylladb/scylladb#29970
* github.com:scylladb/scylladb:
test: Use inject_parameter() in row_cache_test
sstables: Use inject_parameter() for mx reader fill buffer timeout
streaming: Use inject_parameter() for order_sstables_for_streaming
The previous patch changed the interface and callers, this one updates
the implementation to actually work with fragmented buffers. Most types
just use with_linearized() to linearize the fragmented input buffer for
parsing. This is fine, as most types have a fixed or bounded-size string
representation that is small.
Importantly, the input is not linearized for the 3 types which have
unbounded values: ascii, bytes and text. The tuple type can contain any
of these types itself, so it is also converted to avoid linearization.
Change input: std::string_view -> utils::chunked_string_view.
Change return value: bytes -> managed_bytes.
This patch only changes the interface, with some to_bytes() sprinkled in
the internals to deal with recursive calls.
Internals will be updated in the next patch, to keep the churn of
updating callers separate from the actually important changes.
Uses update() for each fragment, then finalize. Yields identical hash to
calling calculate(std::string_view) with linearized buffer. This is
checked by new tests.
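The equivalence relied on here holds for any streaming hasher. A small sketch using FNV-1a as a stand-in (the hashers in utils/hashers are different; `fnv1a_hasher`, `calculate_linearized` and `calculate_fragmented` are illustrative names):

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>
#include <vector>

// Stand-in streaming hasher (64-bit FNV-1a), not one of ScyllaDB's hashers.
class fnv1a_hasher {
    uint64_t _state = 14695981039346656037ull;  // FNV offset basis
public:
    void update(std::string_view fragment) {
        for (unsigned char c : fragment) {
            _state ^= c;
            _state *= 1099511628211ull;  // FNV prime
        }
    }
    uint64_t finalize() const { return _state; }
};

// Hash a contiguous buffer in one go.
uint64_t calculate_linearized(std::string_view data) {
    fnv1a_hasher h;
    h.update(data);
    return h.finalize();
}

// Hash fragment by fragment, never materializing the whole buffer.
uint64_t calculate_fragmented(const std::vector<std::string_view>& fragments) {
    fnv1a_hasher h;
    for (std::string_view f : fragments) {
        h.update(f);
    }
    return h.finalize();
}
```

Because the hasher state only depends on the byte sequence seen so far, the fragment boundaries have no effect on the result.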
A thin facade over managed_bytes[_view], offering some extra
convenience for working with strings, as well as a strong type
communicating the purpose (storing text instead of a blob).
Also introduces utils::from_hex(chunked_string_view), a fragmented
hex-decode that operates directly on a chunked_string_view without
requiring linearization. Hex pairs straddling fragment boundaries
are handled via a carry-over nibble.
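The carry-over idea can be sketched as follows. This is an illustration over a plain vector of fragments, not the utils::from_hex implementation; `hex_digit_value` and `from_hex_fragmented` are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <stdexcept>
#include <string_view>
#include <vector>

int hex_digit_value(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    throw std::invalid_argument("not a hex digit");
}

// Decodes hex text spread over several fragments without concatenating them.
std::vector<uint8_t> from_hex_fragmented(const std::vector<std::string_view>& fragments) {
    std::vector<uint8_t> out;
    std::optional<int> carry;  // high nibble still waiting for its low nibble
    for (std::string_view frag : fragments) {
        for (char c : frag) {
            int v = hex_digit_value(c);
            if (carry) {
                out.push_back(static_cast<uint8_t>((*carry << 4) | v));
                carry.reset();
            } else {
                carry = v;  // may survive past the end of this fragment
            }
        }
    }
    if (carry) {
        throw std::invalid_argument("odd number of hex digits");
    }
    return out;
}
```

For the fragments `"0a1"` and `"b2c"`, the `1` is carried across the boundary and paired with `b`, giving the bytes 0x0a, 0x1b, 0x2c.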
A byte-wise iterator which works both as a bidirectional iterator and as
an output iterator (for mutable views). Allows using managed_bytes_view in
iterator-based algorithms.
Added unit tests covering the iterator functionality.
This patch adds a new expression type, unary_operator, analogous to
the existing binary_operator but takes just one operand instead of
two.
This patch also implements the first and only unary operator type,
unary_oper_t::NEG, implementing negation (unary minus) for all numeric
types.
For fixed-width integer types overflow or underflow results in an error.
If the operand is NULL, the result is a NULL as well.
The new operator is not yet used by the CQL syntax - our parser doesn't
parse arithmetic expressions yet. We also do not plan to use it in the
following patch which uses the separate SUB (subtraction) operation,
not the new NEG. But since I already implemented a unary minus operator,
and we'll surely need it in the future for general arithmetic operations,
I thought I might as well include this patch as well.
Refs #22918 ("Support arithmetic operators")
In this patch we add to our expressions oper_t::SUB, for subtraction,
analogous to the ADD from the previous patch.
The only reason why we need a separate SUB operation and can't just
combine ADD with a unary minus (NEG) operator is the minimum value
of fixed-width integer types. For example, 8-bit integers have the range
-128...127. A subtraction like -1 - (-128) is valid (its value is 127)
but the negation of (-128) would be invalid (128). One of the tests
we add in this patch validates this fact.
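The 8-bit case can be checked with a few lines of standalone code. The helpers `checked_sub` and `checked_neg` are illustrative only and use an empty optional to signal the overflow error:

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>

// Subtraction on 8-bit integers; an empty optional stands for "overflow error".
std::optional<int8_t> checked_sub(int8_t a, int8_t b) {
    int wide = int(a) - int(b);  // computed in a wider type, cannot overflow
    if (wide < std::numeric_limits<int8_t>::min() || wide > std::numeric_limits<int8_t>::max()) {
        return std::nullopt;
    }
    return static_cast<int8_t>(wide);
}

// Negation expressed as 0 - a, so it fails exactly when the result is out of range.
std::optional<int8_t> checked_neg(int8_t a) {
    return checked_sub(0, a);
}
```

`checked_sub(-1, -128)` yields 127, while `checked_neg(-128)` fails, so rewriting `a - b` as `a + (-b)` would reject a subtraction that is perfectly valid.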
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Extend oper_t with a new ADD operator, to represent addition between two
numeric expressions. Supports all numeric types - tinyint, smallint,
int, bigint, float, double, varint, and decimal.
For fixed-width integer types, overflow or underflow results in an error.
If one of the operands is NULL, the result is also NULL.
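These two rules can be illustrated with a small sketch for signed fixed-width integers. It is not the expression evaluator's code: `add_values` is a hypothetical helper, an empty `std::optional` models NULL, and an exception models the overflow error.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <optional>
#include <stdexcept>
#include <type_traits>

// Addition with NULL propagation and overflow detection.
template <typename T>
std::optional<T> add_values(std::optional<T> lhs, std::optional<T> rhs) {
    static_assert(std::is_integral_v<T> && std::is_signed_v<T>,
                  "sketch covers signed fixed-width integers only");
    if (!lhs || !rhs) {
        return std::nullopt;  // NULL operand => NULL result
    }
    T a = *lhs;
    T b = *rhs;
    // Range checks phrased so that they cannot overflow themselves.
    if ((b > 0 && a > std::numeric_limits<T>::max() - b) ||
        (b < 0 && a < std::numeric_limits<T>::min() - b)) {
        throw std::overflow_error("integer overflow in addition");
    }
    return static_cast<T>(a + b);
}
```

Both operands share the template parameter `T`, mirroring the restriction that ADD currently requires operands of the same type.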
The new operator is not yet used by the CQL syntax - our parser doesn't
parse arithmetic expressions yet. We plan to start using this new operator
in a following patch which implements counter syntax ("SET r = r + 1" )
for LWT, but in the future we can use it for more general cases.
At the moment, ADD requires that both operands have the same type.
This is all we need for the first use case, and this limitation can
be relaxed later.
Interestingly, ADD is our first binary operator implementation that
does not return a boolean. Until now all our binary operators have been
comparison operators, and all returned boolean. In contrast, ADD's
return type is the type of its operands.
This implementation is susceptible to the pre-existing bug SCYLLADB-1576,
where adding 1e1000000 and 1 in "decimal" or "varint" types will
happily allocate a million-digit number and run out of memory. A
reproducing test is included, and this issue will be solved in one
place for all operations that have additions (including aggregations
and arithmetic expressions) in a followup pull-request.
Refs #22918 ("Support arithmetic operators")
Collections have an age-old problem in ScyllaDB: they had to be unserialized into an intermediate representation for any access or manipulation. The intermediate representation needs effort to produce and also requires additional memory to store. Both can be significant for large collections. This intermediate representation is then either discarded immediately after use, or re-serialized again.
This problem was significant enough for us to consider the use of collections as somewhat of an anti-pattern. But our customers keep using them. Alternator is also a heavy user of collections.
This PR aims to solve this problem once and for all. The plan is as follows:
* Promote direct use of the serialized collection format:
- Add accessor methods to `collection_mutation_view` which read from the serialized format directly: `tomb()`, `size()` and `begin()`/`end()`.
- Add a `collection_mutation_writer` which provides container semantics for generating a serialized `collection_mutation` directly on the go (`push_back()`).
* Replace all usage of `collection_mutation_description`, `collection_mutation_view_description` and friends with use of the new infrastructure.
* Drop the old infrastructure, to avoid accidental regressions.
Continues the work started by https://github.com/scylladb/scylladb/pull/29033 and takes it to its conclusion.
To help focus review, here is a summary of the patches:
* [1, 2] preparatory refactoring: drop some unused abstract_type params
* [3, 6] introduce new infrastructure to write and read serialized collections directly; this is the meat of the PR
* [6, -1) replace all usage of old materializing infrastructure with usage of the new one
* [-1] drop old infrastructure
**Command:**
```
dbuild -it -- build/release/scylla perf-simple-query --collection=16 -c1 -m2G --default-log-level=error
```
| Metric | Before | After | Change |
|--------------------------|--------:|--------:|------------|
| Throughput (median tps) | 315,760 | 332,021 | **+5.1%** |
| Instructions/op (median) | 53,776 | 48,681 | **-9.5%** |
| CPU cycles/op (median) | 17,365 | 16,471 | **-5.1%** |
| Allocations/op | 85.1 | 82.1 | **-3.5%** |
**Significant improvement.** Throughput is up ~5%, and both instruction count and cycle count are meaningfully reduced.
---
**Command:**
```
dbuild -it -- build/release/scylla perf-simple-query --collection=16 -c1 -m2G --default-log-level=error --write
```
| Metric | Before | After | Change |
|--------------------------|----------:|---------:|-----------|
| Throughput (median tps) | 150,823 | 149,678 | **-0.8%** |
| Instructions/op (median) | 108,388 | 103,858 | **-4.2%** |
| CPU cycles/op (median) | 34,860 | 35,371 | **+1.5%** |
| Allocations/op | ~105–108 | ~102–103 | **-3.0%** |
**Mixed, mostly neutral.** Throughput is essentially flat (within noise). Instructions/op improved by ~4%, allocations dropped slightly, but cycles/op edged up marginally.
---
**Command:**
```
dbuild -it -- build/release/scylla perf-alternator --workload write --developer-mode=1 --alternator-port=8000 --alternator-write-isolation=unsafe -c1 -m2G --default-log-level=error
```
| Metric | Before | After | Change |
|--------------------------|--------:|-------:|-----------|
| Throughput (median tps) | 55,777 | 56,051 | **+0.5%** |
| Instructions/op (median) | 246,215 | 246,610 | **+0.2%** |
| CPU cycles/op (median) | 77,641 | 77,020 | **-0.8%** |
| Allocations/op | 340.4 | 335.4 | **-1.5%** |
**Essentially neutral.** All metrics are within noise margins. Slight reduction in allocations and cycles, negligible otherwise.
---
The change has a **clear, substantial positive effect on reads** (~5% throughput gain, ~9.5% fewer instructions per op).
The write and alternator paths are **unaffected in practice** — changes there are within measurement noise. No regressions are apparent.
This is expected: https://github.com/scylladb/scylladb/pull/29033 did the heavy lifting when it comes to the write path, this PR finishes the job, mostly improving reads.
Fixes: #3602
Improvement, no backport.
Closes scylladb/scylladb#29127
* github.com:scylladb/scylladb:
mutation/collection_mutation: make collection_mutation::_data private
mutation_collection: drop collection_mutation_description and friends
test: move away from collection_mutation_description
tree: move away from collection_mutation_description
test: move away from collection_mutation_view::with_deserialized()
tree: move away from collection_mutation_view::with_deserialized()
types: fix indendation, left broken by previous commit
types: move away from collection_mutation_view::with_deserialized()
types: serialize_for_cql(): use throwing_assert() instead of SCYLLA_ASSERT()
schema: column_computation: move away from collection_mutation_view::with_deserialized()
mutation: move away from collection_mutation_view::with_deserialized()
alternator: move away from collection_mutation_view::with_deserialized()
cdc: move away from collection_mutation_view::with_deserialized()
mutation/collection_mutation: printer: don't deserialize collections
mutation/collection_mutation: difference(): don't deserialize collections
mutation/collection_mutation: merge(): don't deserialize collections
mutation/collection_mutation: extract compact_and_expire() to free function
mutation/collection_mutation: refactor empty(), is_any_live() and last_update()
compaction_garbage_collector: pass collection_mutation to collect()
test/boost/mutation_test: add tests for collection_mutation_{view,writer}
mutation/collaction_mutation: collection_mutation_view: add methods to inspect content
mutation/collection_mutation: add collection_mutation_writer
mutation/collection_mutation: collection_mutation(): generate valid collection
mutation/collection_mutation: collection_mutation(): remove unused abstract_type param
mutation/atomic_cell: drop unused type param from from_bytes()
~2076 files used "Copyright (C) YYYY-present ScyllaDB" while
~88 files used "Copyright (C) YYYY ScyllaDB". This
inconsistency leads to unnecessary code review discussions
and gradual spread of the less common format.
Standardize all ScyllaDB copyright headers to use -present.
Fixes SCYLLADB-1984
Closes scylladb/scylladb#29876
This patch series adds `audit_rules`, a new audit configuration option for fine-grained, role-aware audit filtering with per-rule sink routing. Rules can be configured in `scylla.yaml` or updated live through `system.config` without restarting the node. Each rule specifies target sinks (`table`, `syslog`), statement categories, qualified table name patterns, and role patterns. Table and role patterns use POSIX `fnmatch` with extended glob syntax. For table-scoped categories (`DML`, `DDL`, `QUERY`), a rule matches only when the category, role, and qualified table name all match. For table-independent categories (`AUTH`, `ADMIN`, `DCL`), the table filter is ignored. Empty category or role lists match nothing; an empty table list matches nothing only for table-scoped categories. The new rules are additive with the existing `audit_categories`, `audit_keyspaces`, and `audit_tables` settings: both mechanisms are evaluated for each audit event, and the final sink set is the union of all matches.
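The matching semantics can be sketched in a few lines. This is a simplified model rather than the audit module's code: `category`, `audit_rule`, `matches_any` and `rule_matches` are hypothetical names, sinks are left out, and plain POSIX `fnmatch` with no flags is used, so the extended glob syntax mentioned above is not covered.

```cpp
#include <fnmatch.h>

#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

enum class category { dml, ddl, query, auth, admin, dcl };

// Hypothetical, trimmed-down rule: sinks omitted.
struct audit_rule {
    std::vector<category> categories;
    std::vector<std::string> qualified_table_names;  // glob patterns such as "ks.*"
    std::vector<std::string> roles;                   // glob patterns such as "*"
};

bool is_table_scoped(category c) {
    return c == category::dml || c == category::ddl || c == category::query;
}

// True if any pattern matches; an empty pattern list therefore matches nothing.
bool matches_any(const std::vector<std::string>& patterns, const std::string& value) {
    return std::any_of(patterns.begin(), patterns.end(), [&](const std::string& p) {
        return fnmatch(p.c_str(), value.c_str(), 0) == 0;
    });
}

bool rule_matches(const audit_rule& r, category c,
                  const std::string& role, const std::string& qualified_table) {
    if (std::find(r.categories.begin(), r.categories.end(), c) == r.categories.end()) {
        return false;
    }
    if (!matches_any(r.roles, role)) {
        return false;
    }
    // The table filter applies only to table-scoped categories.
    if (is_table_scoped(c) && !matches_any(r.qualified_table_names, qualified_table)) {
        return false;
    }
    return true;
}
```

The final sink set for an event would then be the union of the sinks of every matching rule, plus whatever the legacy settings select.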
To avoid evaluating glob patterns on every audit event, audit rules use a preprocessed cache of known roles and tables. The cache is kept in sync through group0 role/table snapshots, role-change notifications, and schema migration notifications. For known entities, rule matching uses precomputed role/table rule sets; unknown entities fall back to direct rule evaluation. When `audit_rules` is empty, per-event rule matching returns immediately and does not evaluate glob patterns. Audit still keeps known role/table metadata in sync while audit is enabled, so rules can be enabled later through live configuration updates without restarting the node.
**Performance**
Measured with `perf-simple-query --smp 1 --duration 100` against a null syslog socket. Results show no regression when audit is disabled, and audit-rules performance has at most 1% more instructions than legacy config for equivalent workloads:
```
===============================================================================================================================================================================
Configuration | Binary | throughput (tps) | insns/op | cpu_cycles/op | alloc/op | logal/op | task/op
===============================================================================================================================================================================
audit=none [1] | baseline | 206922.4 | 36591.6 | 15348.3 | 58.1 | 0.0 | 14.1
audit=none [1] | this PR | 207856.4 (+0.5%) | 36544.9 (-0.1%) | 15274.0 (-0.5%) | 58.1 | 0.0 | 14.1
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
audit=syslog keyspaces=ks [2] | baseline | 94871.8 | 54163.0 | 27172.4 | 72.0 | 0.0 | 24.0
audit=syslog keyspaces=ks [2] | this PR | 96138.4 (+1.3%) | 54072.3 (-0.2%) | 26699.3 (-1.7%) | 72.0 | 0.0 | 24.0
audit=syslog audit-rules=ks [3] | this PR | 95142.1 (+0.3%) | 54457.8 (+0.5%) | 26953.8 (-0.8%) | 72.0 | 0.0 | 24.0
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
audit=syslog keyspaces=ks-non-existent [4] | baseline | 213997.8 | 36735.6 | 14848.1 | 58.1 | 0.0 | 14.1
audit=syslog keyspaces=ks-non-existent [4] | this PR | 219297.2 (+2.5%) | 36667.3 (-0.2%) | 14500.1 (-2.3%) | 58.1 | 0.0 | 14.1
audit=syslog audit-rules=ks-non-existent [5] | this PR | 211038.7 (-1.4%) | 36999.7 (+0.7%) | 15048.6 (+1.4%) | 58.1 | 0.0 | 14.1
===============================================================================================================================================================================
[1] ./scylla perf-simple-query --smp 1 --duration 100 --audit "none"
[2] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-keyspaces "ks" --audit-categories "DCL,DDL,AUTH,DML,QUERY" --audit-unix-socket-path "/tmp/audit-null.sock"
[3] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-rules '[{"sinks":["syslog"],"categories":["DCL","DDL","AUTH","DML","QUERY"],"qualified_table_names":["ks.*"],"roles":["*"]}]' --audit-unix-socket-path "/tmp/audit-null.sock"
[4] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-keyspaces "ks-non-existent" --audit-categories "DCL,DDL,AUTH,DML,QUERY" --audit-unix-socket-path "/tmp/audit-null.sock"
[5] ./scylla perf-simple-query --smp 1 --duration 100 --audit "syslog" --audit-rules '[{"sinks":["syslog"],"categories":["DCL","DDL","AUTH","DML","QUERY"],"qualified_table_names":["ks-non-existent.*"],"roles":["*"]}]' --audit-unix-socket-path "/tmp/audit-null.sock"
audit-null.sock was created with `socat -u UNIX-RECV:/tmp/audit-null.sock,type=2 OPEN:/dev/null`
```
Fixes: SCYLLADB-1430
No backport: new feature
Closes scylladb/scylladb#29267
* github.com:scylladb/scylladb:
test: alternator: audit: rules filtering and batch bypass
test: perf: add --audit-rules option to perf-simple-query
docs: add audit rules section to the auditing guide
test: audit: cover role and schema cache notifications
test: audit: cover audit rules cluster behavior
audit: rebuild rule caches on group0 snapshot and role changes
audit: refresh rule caches on schema, role, and config changes
audit: route matching rules to configured sinks
test: cover preprocessed audit rule cache
audit: add preprocessed rule matching cache
audit: pass sink targets to storage helpers
test: audit: cover rule matching semantics
audit: add rule matching and sink helpers
test: audit: cover audit_rules configuration
config: add live audit_rules option
test: cover audit rule parsing and validation
audit: define audit_rule type with parsing and validation
Changed seastar::http::experimental to seastar::http to reflect
graduation of the seastar http API.
Changed calls to seastar::rename_file() (in sstables/storage.cc,
sstables/sstable_directory.cc, sstables/sstables.cc and
db/hints/internal/hint_storage.cc) to reflect new default parameter.
Updated scylla_gdb test helper get_task() to work with the updated
accept loop in Seastar. This is just test code (it attempts to find
a task to operate on), not used in real scylla-gdb.py work, but
nevertheless the adjustment keeps backward compatibility.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1798
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-2043
* seastar 485a62b2...510f3148 (43):
> reactor_backend: fix iocb double-free and shutdown hang during AIO teardown
> file: fix default DMA alignment
> http: add to_reply() to redirect_exception with extra-header support
> core: propagate syscall errors via `coroutine::exception`
> file: assert dma alignments are powers of two
> doc: Document undocumented io_tester features and fix output example
> backtrace: print the build_id along with the backtrace
> reactor: default to oneline backtraces
> Merge 'json: formatter: support types with user-defined conversion to sstring' from Benny Halevy
tests: json_formatter: test formatter::write with string types
json: formatter: support types with user-defined conversion to sstring
> httpd_test: fix build failure with Seastar_SSTRING=OFF
> net/tls: introduce ssl_call wrapper for SSL I/O
> build: disable unused command line argument error for C++ module
> coroutine/generator: fix setup of generator's waiting task
> tests/tls: set 1000-day validity for self-signed CA cert
> net: tls: openssl: disable certificate compression
> reactor: reduce steady_clock::now() calls per scheduling quantum
> fair_queue: remove notify_request_finished()
> loop: use small_vector for parallel_for_each_state incomplete futures
> dodge false sharing in spinlock
> Merge 'Handle nowait support for reads and writes independently' from Pavel Emelyanov
file: Change nowait_works mode detection
file: Introduce read-only nowait_mode
filesystem: Make nowait_works bit a enum class too
file: Make nowait_works bit a enum class
> Merge 'net/tls: improve OpenSSL error queue hygiene' from Gellért Peresztegi-Nagy
net/tls: assert clean error queue before SSL operations
net/tls: clear error queue after successful SSL operations
net/tls: clear error queue after successful SSL_CTX_new
net/tls: drain error queue on unexpected error codes
net/tls: use make_openssl_error for BIO creation failure
> vla.hh: add missing includes
> Merge 'smp: make smp::count non-static' from Avi Kivity
smp: convert all smp::count usages to instance-aware alternatives
smp: add per-instance shard_count and this_smp() infrastructure
disk_params: document pre-init smp::count access with explicit 0
reactor_backend: document pre-init smp::count access with explicit 0
tests: alien_test: pass shard count to alien thread explicitly
> build: fix cmake missing ninja on Ubuntu 26.04
> rpc: Fix uint64 wraparound of expired timeout in send_entry()
> Merge 'Generalize some RPC tests' from Pavel Emelyanov
tests: Generalize async connection-based scheduling RPC tests
tests: Generalize sync connection-based scheduling RPC tests
tests: Remove redundant variadic/nonvariadic RPC tuple tests
tests: Generalize max timeout RPC tests
> net: tls: openssl: Share BIO ptrs across shards
> http: fix compilation on clang 22 with c++26
> build: openssl tools needed for test cert generation
> reactor: support rename2
> future: fix forwarding of reference types
> Merge 'Zero-copy http chunked data sink' from Pavel Emelyanov
http: Make chunked data sink zero-copy
tests/prometheus_http: Rewrite on top of http::client
tests/httpd: Rewrite content_length_limit on top of http::client
> tests: Replace ad-hoc http_consumer with production HTTP parser
> Merge 'co_return to accept same expressions and types as return' from Alexey Bashtanov
tests/unit/{coroutines,futures}: strict types on co_return and set_value
api: introduce version 10:
core/{coroutine,future}: make `co_return` more strict with types
core/{coroutine,future}: preparations to fix `co_return` type semantics
> Merge 'Perftune.py: add special handling for mlx5 rss queues number calculation' from Vladislav Zolotarov
perftune.py: NetPerfTuner: enhance RSS (a.k.a. "Rx") queues accounting for mlx5 devices
perftune.py: update docstring of NetPerfTuner.__get_rps_cpus() method
perftune.py: add a method that parses and models the output of the 'ethtool -l' command for a given interface
> httpd: rewrite do_accepts/do_accept_one as coroutines
> file: add mmap support to file
> http: Move client code out of experimental namespace
> file: add hugetlbfs support to file system detection
> tests: Replace test_source_impl with util::as_input_stream
> tests: Replace buf_source_impl with util::as_input_stream
> Merge 'rpc_tester: expose throuput for rpc tester' from Marcin Szopa
rpc_tester: remove unused payload size variable from job_rpc_streaming class
rpc_tester: add start time tracking for throughput calculation, print throughput and msg/s for job_rpc
rpc_tester: refactor result emission to use dedicated functions for messages and throughput
> iostream: cast first argument of `std::min` to `size_t`
Closes scylladb/scylladb#29952
The rule cache is the fast path for matching, so its hit,
fallback, refresh, and category-bypass behavior needs
focused unit coverage.
Test transparent hash consistency, cached and uncached
lookup paths, incremental entity add/remove, rule
refresh, and empty-rules short circuit.
Refs SCYLLADB-1430
Rule matching is reused by both the preprocessed cache and
the fallback path -- unit-test it separately so coupling
failures do not mask matching bugs.
Cover category bitmask, glob patterns for tables and
roles, AUTH/ADMIN/DCL table bypass, empty-keyspace batch
bypass, and sink bitmask conversion.
Refs SCYLLADB-1430
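A minimal sketch of the matching semantics listed above, written in Python rather than the project's C++. The rule fields mirror the `--audit-rules` JSON used in the benchmark commands; the function name `rule_matches`, the use of `fnmatch` for globbing, and the exact bypass conditions are assumptions for illustration, not the actual implementation.

```python
from fnmatch import fnmatchcase

# Categories for which the table pattern is not consulted (assumed from the
# "AUTH/ADMIN/DCL table bypass" wording in the commit message).
TABLE_BYPASS_CATEGORIES = {"AUTH", "ADMIN", "DCL"}


def rule_matches(rule: dict, category: str, keyspace: str, table: str, role: str) -> bool:
    """Return True if a single audit rule selects the given operation."""
    if category not in rule["categories"]:
        return False
    if not any(fnmatchcase(role, pattern) for pattern in rule["roles"]):
        return False
    # Operations without a table context skip the table check:
    # AUTH/ADMIN/DCL statements and batches with an empty keyspace.
    if category in TABLE_BYPASS_CATEGORIES or keyspace == "":
        return True
    qualified = f"{keyspace}.{table}"
    return any(fnmatchcase(qualified, pattern) for pattern in rule["qualified_table_names"])


rule = {
    "sinks": ["syslog"],
    "categories": ["DML", "QUERY", "AUTH"],
    "qualified_table_names": ["ks.*"],
    "roles": ["*"],
}
```

With this rule, a DML statement on `ks.t1` matches, the same statement on `other.t1` does not, and an AUTH event matches without any table being named.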
Audit rules enter through three paths (YAML, CQL, CLI),
each with its own parsing and tracking -- cover all entry
points before routing can depend on them.
Test loading from YAML, live update via CQL and server
API, CLI parsing, invalid value rejection at each path,
and observer notification on live update.
Refs SCYLLADB-1430
Parsing and validation are the first consumer-visible
surface of audit rules -- cover them before building
higher layers.
Test JSON parsing (valid, malformed, missing fields),
rule validation (unknown sinks, invalid categories),
and JSON round-trip serialization.
Refs SCYLLADB-1430
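The cases named above (malformed JSON, missing fields, unknown sinks, invalid categories, round trip) can be illustrated with a small Python sketch. The field names follow the `--audit-rules` JSON shown in the benchmark commands; the function name `parse_audit_rules` and the sets of accepted sinks and categories are assumptions, and the real parser is C++.

```python
import json

KNOWN_SINKS = {"syslog", "table"}  # assumed set of sinks
KNOWN_CATEGORIES = {"DCL", "DDL", "AUTH", "DML", "QUERY", "ADMIN"}  # assumed
REQUIRED_FIELDS = ("sinks", "categories", "qualified_table_names", "roles")


def parse_audit_rules(text: str) -> list:
    """Parse and validate a JSON array of audit rules, raising ValueError on bad input."""
    try:
        rules = json.loads(text)
    except json.JSONDecodeError as e:
        raise ValueError(f"malformed audit rules JSON: {e}") from e
    if not isinstance(rules, list):
        raise ValueError("audit rules must be a JSON array")
    for i, rule in enumerate(rules):
        if not isinstance(rule, dict):
            raise ValueError(f"rule {i}: expected a JSON object")
        for field in REQUIRED_FIELDS:
            if field not in rule:
                raise ValueError(f"rule {i}: missing field '{field}'")
        unknown_sinks = set(rule["sinks"]) - KNOWN_SINKS
        if unknown_sinks:
            raise ValueError(f"rule {i}: unknown sinks {sorted(unknown_sinks)}")
        bad_categories = set(rule["categories"]) - KNOWN_CATEGORIES
        if bad_categories:
            raise ValueError(f"rule {i}: invalid categories {sorted(bad_categories)}")
    return rules


valid = '[{"sinks":["syslog"],"categories":["DML"],"qualified_table_names":["ks.*"],"roles":["*"]}]'
rules = parse_audit_rules(valid)
# Round trip: serializing and re-parsing yields the same rules.
round_tripped = parse_audit_rules(json.dumps(rules))
```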
Replace get_injection_parameters().contains() with inject_parameter()
for polling the "suspended" signal. The inject_parameter() API is more
appropriate for checking a single parameter and reduces the usage of the
lower-level get_injection_parameters() bulk accessor.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
value_to_json() converts CQL values to JSON for vector search filters.
For decimal and varint types, it used rjson::parse() on the JSON string,
which parses through a double and silently loses precision for values
exceeding ~15 significant digits — producing wrong filter results.
Additionally, for decimal type we need an exact string representation
that preserves the original (unscaled, scale) pair, because partition
keys use byte-level identity: different serialized representations of
the same numeric value are distinct rows, so the filter must reproduce
the exact representation stored in the key.
Add big_decimal::to_string_canonical() which follows the Java BigDecimal
toString() spec (JDK 8+), producing a bijective string representation
that uses exponential notation for extreme scales instead of expanding
trailing zeros (which could cause OOM). This could replace to_string(),
but doing so has wider consequences (e.g. hash/equality contract for
decimal_type) described in SCYLLADB-1574. Use it in value_to_json() for
decimal_type, and use rjson::from_string() for varint_type, both
bypassing the lossy double parse path.
Tests cover the new to_string_canonical() and the filter fix, as well as
existing decimal type behavior (key representation, clustering order,
toJson) that we rely on and must not break. The CQL decimal type tests
(test_type_decimal.py) also pass against Cassandra.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1583
Refs: https://scylladb.atlassian.net/browse/SCYLLADB-1574
Closes scylladb/scylladb#29505
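To illustrate both points, here is a hedged Python sketch. The first part shows an integer losing digits when routed through a double; the second follows the published Java `BigDecimal.toString()` rules for an (unscaled, scale) pair. The Python function borrows the name `to_string_canonical` for readability only; it is not the C++ code added by this patch.

```python
big = 12345678901234567890
via_double = int(float(str(big)))  # parse through a double, as the old path did
precision_lost = via_double != big


def to_string_canonical(unscaled: int, scale: int) -> str:
    """String form of unscaled * 10**(-scale), per Java BigDecimal.toString()."""
    sign = "-" if unscaled < 0 else ""
    digits = str(abs(unscaled))
    adjusted = -scale + (len(digits) - 1)  # exponent of the leading digit
    if scale >= 0 and adjusted >= -6:
        # Plain notation, no exponent.
        if scale == 0:
            return sign + digits
        if len(digits) > scale:
            return sign + digits[:-scale] + "." + digits[-scale:]
        return sign + "0." + "0" * (scale - len(digits)) + digits
    # Scientific notation: one digit before the point, explicit exponent sign.
    mantissa = digits[0] + ("." + digits[1:] if len(digits) > 1 else "")
    return f"{sign}{mantissa}E{adjusted:+d}"


print(to_string_canonical(123, 5))    # 0.00123
print(to_string_canonical(123, -1))   # 1.23E+3
print(to_string_canonical(10, 0), to_string_canonical(1, -1))  # 10 1E+1
```

The last line shows why the representation is bijective: (10, 0) and (1, -1) are numerically equal but print differently, and a scale of -1000000 prints as a short exponent rather than a million zeros.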
Auth modules (authenticators, role managers, and auth::service) access their configuration options by reaching into db::config through the query processor. This abuses the database as a proxy object for obtaining configuration.
This series introduces a dedicated auth::config struct that carries the configuration options used by auth modules. The config is populated in main.cc and delivered to each shard via sharded_parameter. This makes the auth service conform to the overall design, where db::config is split into smaller per-service configs on start, thus decoupling individual components/services from the global configuration.
Cleanup of component dependencies; no backport.
Closes scylladb/scylladb#29870
* github.com:scylladb/scylladb:
auth: Remove unused default_superuser() function
auth: Switch role managers to use auth::config
auth: Switch authenticators to use auth::config
auth: Introduce auth::config and wire it through service
Convert all role manager implementations to receive their
configuration from auth::config instead of accessing db::config
through the query processor:
- standard_role_manager: reads superuser name from config
- ldap_role_manager: reads LDAP URL template, attribute, bind
credentials, and permissions update interval from config;
passes config to inner standard_role_manager
- maintenance_socket_role_manager: keeps a const reference to
service's config and passes it directly when lazily
constructing standard_role_manager
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add test_tablet_routing_info_after_cas_shard_bounce that verifies
TABLETS_ROUTING_V1 payload is returned after an internal CAS shard
bounce.
The test simulates the transport-layer bounce: it creates a table whose
single tablet replica lands on a shard different from the test thread,
executes an LWT (which bounces), then transfers client_state via
client_state_for_another_shard (preserving _original_shard) and
re-executes on the tablet shard. The test asserts that check_locality()
correctly detects the misrouting and returns tablet routing info.
Refs SCYLLADB-2041
Drop local formatter for seastar::http::reply, which should have
been added to Seastar in the first place, and now conflicts. Also
drop local formatters for types that are aliases for Seastar types
which have gained formatters.
Disable recently-gained TLS use of OpenSSL instead of gnutls. We
don't need it, and it causes link errors with LTO.
Fix incorrect skipping in encrypted_file_test, which computed
the remaining stream length but did not account for already
consumed size_to_compare.
Change utils::gcp::storage::client::object_data_source::skip()
to match new Seastar behavior (rejecting skip-past-eof with an
exception). This is needed since 30f1075544 switched the test's
data source to a Seastar implementation. It is also more correct -
if we're asked to skip n bytes but the stream doesn't have n bytes,
this is a protocol violation.
Contains test fix from Pavel, exposed by [1]:
test: Handle premature EOF in test_gcp_storage_skip_read
The test intentionally uses file_size larger than the actual object to
exercise EOF behavior. When input_stream::skip() is called after EOF,
it throws std::runtime_error("premature end of stream"). Catch this
specific exception from both streams, verify they agree, and exit the
loop gracefully.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
[1] cbd1e17d2f, included in this Seastar submodule update
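The skip-past-EOF behavior and the test's handling of it can be modeled roughly as follows. This is a Python sketch with a hypothetical `ByteSource` class and `skip_read_loop` helper; only the exception message is taken from the description above.

```python
class ByteSource:
    """Toy in-memory stream whose skip() rejects skipping past the end."""

    def __init__(self, data: bytes):
        self._data = data
        self._pos = 0

    def read(self, n: int) -> bytes:
        chunk = self._data[self._pos:self._pos + n]
        self._pos += len(chunk)
        return chunk

    def skip(self, n: int) -> None:
        remaining = len(self._data) - self._pos
        if n > remaining:
            # Asked to skip more than the stream holds: treat as an error.
            self._pos = len(self._data)
            raise RuntimeError("premature end of stream")
        self._pos += n


def skip_read_loop(src: ByteSource, skip_n: int, read_n: int) -> bytes:
    """Alternate skip/read until the source ends; return the bytes read."""
    out = []
    while True:
        try:
            src.skip(skip_n)
        except RuntimeError as e:
            if str(e) != "premature end of stream":
                raise
            break  # expected when the declared size exceeds the real one
        chunk = src.read(read_n)
        if not chunk:
            break
        out.append(chunk)
    return b"".join(out)
```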
* seastar 4d268e0e...485a62b2 (50):
> reactor: open_directory(): honor bypass_fsync
> http: Add formatters for http::request and http::reply
> Merge 'Assorted set of io-tester cleanups' from Pavel Emelyanov
io_tester: Remove unused and internal-only accessor
io_tester: Move think-time machinery into thinker_state
io_tester: Move _file to io_class_data
io_tester: Replace class_data::_start member with a local variable
io_tester: Move _alignment from class_data to io_class_data
io_tester: Remove buffer allocation from top-level request issuing
io_tester: Cleanup context::stop() invocation
io_tester: Allocate write buffer once to fill a file
io_tester: Declare quantiles arrays as static constexpr
io_tester: Drop class_data::type_str()
io_tester: Replace != "" comparisons with .empty()
io_tester: Replace gen_class_data() if/else chain with a switch
io_tester: Deduplicate vectorized I/O classes
> io_tester: fix crash from missing metric during startup
> net: tls: adjust openssl integration to new module support
> http/client: Count and export integrated queue length
> Merge 'Introduce pipe_data_source_impl and pipe_data_sink_impl' from Pavel Emelyanov
fstream: add pipe_data_source_impl and pipe_data_sink_impl
pollable_fd: add write_some/write_all backed by writev
pollable_fd: rename write_some/write_all(iovec) to send_some/send_all
> reactor: Make pollable_fd_state helper methods private
> module: extend seastar.cppm with comprehensive public API exports
> Merge 'Add exhaustive input_stream invariant test + fixes' from Pavel Emelyanov
tests: add exhaustive input_stream read/skip invariant test
iostream: make skip() reject premature end of stream with exception
> Merge 'Allow runtime selectability of GnuTLS or OpenSSL' from Noah Watkins
net/tls: avoid potential read-past-buffer
net/tls: move credential methods to generic tls layer
net/tls: rename credentials_impl::dh_params to set_dh_params
test/tls: enable openssl tls unit test
test/tls: fix CA cert generation to use v3_ca extensions
github: disable parallel test execution in alpine workflow
crypto: support compiling seastar without gnutls
net/tcp: use crypto provider for md5 calculation
tls: fix test_peer_certificate_chain_handling for OpenSSL
net/tls: fix test for self-signed server cert opoenssl compat
net/tls: disable priority strings test for openssl provider
core/crypto: expose crypto backend name for introspection
test/tls: remove gnutls version guard
net/tls: add openssl tls backend
http: use backend agnostic tls error code
net/tls: make error codes configurable by each tls backend
net/tls: move reloadable_credentials to generic tls layer
net/tls: move build_certificate to generic tls layer
net/tls: move apply_to() to generic tls layer
net/tls: move credential methods to generic tls layer
net/tls: add OpenSSL-specific methods to public API with no-op defaults
net/tls: introduce dh_params and credentials abstraction layer
net/tls: add credentials_impl abstract base class
net/tls: dispatch tls::error_category() through crypto_provider
net/tls: dispatch wrap_client/wrap_server through crypto_provider
net/tls: add tls_backend interface to crypto_provider
net/tls: move public tls API methods to generic tls layer
net/tls: move formatting utilities to generic tls layer
net/tls: move credentials_builder blob methods to generic tls layer
net/tls: move dh_params::from_file to generic tls layer
net/tls: move abstract_credentials file methods to generic tls layer
net/tls: move tls_socket_impl to generic tls layer
net/tls: move server_session to general tls layer
net/tls: move tls_connected_socket_impl to generic tls layer
net/tls: move net::get_impl to generic tls layer
net/tls: move session_ref to generic tls layer
net/tls: add session_impl abstract interface for tls pluggability
net/tls: rename tls.cc to be gnutls specific
crypto: introduce crypto provider abstraction
http: remove unused include
> tls: test_send_two_large
> rpc: include exception type for remote errors
> GHA: increase timeout to 60 minutes
> apps/httpd: replace deprecated reply::done() with write_body()
> missing header(s)
> net: Fix missing throw for runtime_error in create_native_net_device
> tests/io_queue: account for token bucket refill granularity in bandwidth checks
> Merge 'iovec: fix iovec_trim_front infinite loop on zero-length iovecs' from Travis Downs
tests: add regression tests for zero-length iovec handling
iovec: fix iovec_trim_front infinite loop on zero-length iovecs
> util/process: graduate process management API from experimental
> cooking: don't register ready.txt as a build output
> sstring: make make_sstring not static
> Add SparkyLinux to debian list in install-dependencies.sh
> http: allow control over default response headers
> Merge 'chunked_fifo: make cached chunk retention configurable' from Brandon Allard
tests/perf: add chunked_fifo microbenchmarks
chunked_fifo: set the default free chunk retention to 0
chunked_fifo: make free chunk retention configurable
> Merge 'reactor_backend: fix pollable_fd_state_completion reuse in io_uring' from Kefu Chai
tests: add regression test for pollable_fd_state_completion reuse
reactor_backend: use reset() in AIO and epoll poll paths
reactor_backend: fix pollable_fd_state_completion reuse after co_await in io_uring
> Merge 'coroutine: Generator cleanups' from Kefu Chai
coroutine/generator: extract schedule_or_resume helper
coroutine/generator: remove unused next_awaiter classes
coroutine/generator: remove write-only _started field
coroutine/generator: assert on unreachable path in buffered await_resume
coroutine/generator: add elements_of tag and #include <ranges>
coroutine/generator: add empty() to bounded_container concept
> cmake: bump minimum Boost version to 1.79.0
> seastar_test: remove unnecessary headers
> cmake: bump minimum GnuTLS version to 3.7.4
> Merge 'reactor: add get_all_io_queues() method' from Travis Downs
tests: add unit test for reactor::get_all_io_queues()
reactor: add get_all_io_queues() method
reactor: move get_io_queue and try_get_io_queue to .cc file
> http: deprecate reply::done(), remove _response_line dead field
> core: Deprecate scattered_message
> ci: add workflow dispatch to tests workflow
> perf_tests: exit non-zero when -t pattern matches no tests
> Replace duplicate SEGV_MAPERR check in sigsegv_action() with SEGV_ACCERR.
> perf_tests: add total runtime to json output
> Merge 'Relax large allocation error originating from json_list_template' from Robert Bindar
implement move assignment operator for json_list_template
json_list_template copy assignment operator reserves capacity upfront
> perf_tests: add --no-perf-counters option
> Merge 'Fix to_human_readable_value() ability to work with large values' from Pavel Emelyanov
memory: Add compile-time test for value-to-human-readable conversion
memory: Extend list of suffixes to have peta-s
memory: Fix off-by-one in suffix calculation
memory: Mark to_human_readable_value() and others constexpr
> http: Improve writing of response_line() into the output
> Merge 'websocket: add template parameter for text/binary frame mode and implement client-side WebSocket' from wangyuwei
websocket: add template parameter for text/binary frame mode
websocket: impl client side websocket function
> file: Fix checks for file being read-only
> reactor: Make do_dump_task_queue a task_queue method
> Merge 'Implement fully mixed mode for output_stream-s' from Pavel Emelyanov
tests/output_stream: sample type patterns in sanitizer builds
tests/output_stream: extend invariant test to cover mixed write modes
iostream: allow unrestricted mixing of buffered and zero-copy writes
tests/output_stream: remove obsolete ad-hoc splitting tests
tests/output_stream: add invariant-based splitting tests
iostream: rename output_stream::_size to ::_buffer_size
> reactor_backend: replace virtual bool methods with const bool_class members
> resource: Avoid copying CPU vector to break it into groups
> perf_tests: increase overhead column precision to 3 decimal places
> Merge 'Move reactor::fdatasync() into posix_file_impl' from Pavel Emelyanov
reactor: Deprecate fdatasync() method
file: Do fdatasync() right in the posix_file_impl::flush()
file: Propagate aio_fdatasync to posix_file_impl
reactor: Move reactor::fdatasync() code to file.cc
reactor,file: Make full use of file_open_options::durable bit
file: Add file_open_options::durable boolean
file: Account io_stats::fsyncs in posix_file_impl::flush()
reactor: Move _fsyncs counter onto io_stats
> http: Remove connection::write_body()
Closes scylladb/scylladb#29553
- Fix intranode shard balancing to respect the size-based balance threshold, preventing unnecessary migrations when load difference between shards is negligible
- Add a regression test that verifies the threshold is respected for intranode balancing
The intranode shard balancing loop only stopped when the algorithm exhausted the migration candidates or when a migration would go against convergence (it would increase imbalance instead of decrease it). This caused unnecessary tablet migrations for negligible imbalances (e.g., 0.78% difference between shards).
The inter-node balancer already uses `is_balanced()` to stop when the relative load difference is within the configured `size_based_balance_threshold`, but this check was missing from the intranode path.
Apply the same `is_balanced()` threshold check that is already used for inter-node balancing to the intranode convergence loop. When the relative load difference between the most-loaded and least-loaded shards on a node is within the threshold, the balancer now stops without issuing further migrations.
The test creates a single node with 2 shards and 512 tablets:
1. **Balanced scenario** (257 vs 255 tablets, same size): relative diff = 0.78% < 1% threshold → verifies no intranode migration is emitted
2. **Unbalanced scenario** (307 vs 205 tablets, same size): relative diff = 33% >> 1% threshold → verifies intranode migration IS emitted
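The two scenarios can be checked with a few lines of Python. This sketch assumes the relative difference is measured against the most-loaded shard, which reproduces the 0.78% and 33% figures quoted above; the function below is illustrative and only shares its name with the real `is_balanced()`.

```python
def relative_diff(max_load: float, min_load: float) -> float:
    """Load difference relative to the most-loaded shard."""
    if max_load == 0:
        return 0.0
    return (max_load - min_load) / max_load


def is_balanced(max_load: float, min_load: float, threshold: float = 0.01) -> bool:
    """True when the imbalance is within the threshold (1% here)."""
    return relative_diff(max_load, min_load) < threshold


print(f"{relative_diff(257, 255):.2%}")  # 0.78%
print(f"{relative_diff(307, 205):.2%}")  # 33.22%
```

With equal tablet sizes, tablet counts stand in for load, so 257 vs 255 is balanced and 307 vs 205 is not.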
Fixes: SCYLLADB-1775
This is a performance improvement which reduces the number of intranode migrations issued, and needs to be backported to versions with size-based load balancing: 2026.1 and 2026.2
Closes scylladb/scylladb#29756
* github.com:scylladb/scylladb:
test: add test for intranode balance threshold in size-based mode
tablet_allocator: apply balance threshold to intranode shard balancing
The restore works as follows:
- A /storage_service/tablets/restore API is called with (keyspace, table, endpoint, bucket, manifests) parameters
- First, it populates the system_distributed.snapshot_sstables table with the data read from the manifests
- Then it emplaces a bunch of tablet transitions (of a new "restore" kind), one for each tablet
- The topology coordinator handles the "restore" transition by calling a new RESTORE_TABLET RPC against all the current tablet replicas
- Each replica handles the RPC verb by
  - Reading the snapshot_sstables table
  - Filtering the read sstable infos against the current node and the tablet being handled
  - Downloading and attaching the filtered sstables
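The per-replica filtering step could look roughly like the sketch below. Everything here is hypothetical: the `SstableInfo` fields, the `node` attribute, and the containment test on token ranges are assumptions chosen for illustration (containment relies on the caller having locked tablet boundaries), not the schema of snapshot_sstables or the code in sstables_loader.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SstableInfo:
    name: str
    node: str          # hypothetical: node expected to restore this sstable
    first_token: int
    last_token: int


def sstables_for_tablet(infos, this_node: str, tablet_first: int, tablet_last: int):
    """Keep sstables assigned to this node whose token range lies inside the tablet."""
    return [
        s for s in infos
        if s.node == this_node
        and tablet_first <= s.first_token
        and s.last_token <= tablet_last
    ]


infos = [
    SstableInfo("sst-1", "node-a", 0, 99),
    SstableInfo("sst-2", "node-a", 100, 199),
    SstableInfo("sst-3", "node-b", 0, 99),
]
selected = sstables_for_tablet(infos, "node-a", 0, 99)
```

Only `sst-1` is selected: `sst-2` belongs to another tablet's range and `sst-3` to another node.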
This PR includes system_distributed.snapshot_sstables table from @robertbindar and preparation work from @kreuzerkrieg that extracts raw sstables downloading and attaching from existing generic sstables loading code.
This is the first step towards SCYLLADB-197 and lacks many things. In particular:
- the API only works for a single-DC cluster
- the caller needs to "lock" tablet boundaries with min/max tablet count
- not abortable
- no progress tracking
- sub-optimal (re-kicking the API on restore will re-download everything)
- not re-attachable (if the API node dies, restoration proceeds, but the caller cannot "wait" for it to complete via another node)
- nodes download sstables in the maintenance/streaming sched group (should be moved to maintenance/backup)
Other follow-up items:
- have an actual swagger object specification for `backup_location`
Closes #28436
Closes #28657
Closes #28773
Closes scylladb/scylladb#28763
* github.com:scylladb/scylladb:
docs: Update topology_over_raft.md with `restore` transition kind
test: Add test for backup vs migration race
test: Restore resilience test
sstables_loader: Fail tablet-restore task if not all sstables were downloaded
sstables_loader: mark sstables as downloaded after attaching
sstables_loader: return shared_sstable from attach_sstable
db: add update_sstable_download_status method
db: add downloaded column to snapshot_sstables
db: extract snapshot_sstables TTL into class constant
test: Add a test for tablet-aware restore
tablets: Implement tablet-aware cluster-wide restore
messaging: Add RESTORE_TABLET RPC verb
sstables_loader: Add method to download and attach sstables for a tablet
tablets: Add restore_config to tablet_transition_info
sstables_loader: Add restore_tablets task skeleton
test: Add rest_client helper to kick newly introduced API endpoint
api: Add /storage_service/tablets/restore endpoint skeleton
sstables_loader: Add keyspace and table arguments to manfiest loading helper
sstables_loader_helpers: just reformat the code
sstables_loader_helpers: generalize argument and variable names
sstables_loader_helpers: generalize get_sstables_for_tablet
sstables_loader_helpers: add token getters for tablet filtering
sstables_loader_helpers: remove underscores from struct members
sstables_loader: move download_sstable and get_sstables_for_tablet
sstables_loader: extract single-tablet SST filtering
sstables_loader: make download_sstable static
sstables_loader: fix formating of the new `download_sstable` function
sstables_loader: extract single SST download into a function
sstables_loader: add shard_id to minimal_sst_info
sstables_loader: add function for parsing backup manifests
split utility functions for creating test data from database_test
export make_storage_options_config from lib/test_services
rjson: Add helpers for conversions to dht::token and sstable_id
Add system_distributed_keyspace.snapshot_sstables
add get_system_distributed_keyspace to cql_test_env
code: Add system_distributed_keyspace dependency to sstables_loader
storage_service: Export export handle_raft_rpc() helper
storage_service: Export do_tablet_operation()
storage_service: Split transit_tablet() into two
tablets: Add braces around tablet_transition_kind::repair switch
When `force_capacity_based_balancing` is enabled and a node is being drained/excluded, the tablet allocator incorrectly aborts balancing due to incomplete tablet stats - even though capacity-based balancing doesn't depend on tablet sizes.
The tablet allocator normally waits for complete load stats before balancing. An exception exists for drained+excluded nodes (they're unreachable and won't return stats). However, when forced capacity-based balancing is active, this exception was not being applied, causing the balancer to reject the drain plan.
Adjust the condition in `tablet_allocator.cc` so that the "ignore missing data for drained nodes" logic applies regardless of whether capacity-based balancing is forced.
Added a Boost unit test that forces capacity-based balancing and verifies a drained/excluded node gets its tablets migrated even when tablet size stats are missing.
This bug was introduced in 2026.1, so this needs to be backported to 2026.1 and 2026.2
Fixes: SCYLLADB-1803
Closes scylladb/scylladb#29791
* github.com:scylladb/scylladb:
test: boost: add drain test for forced capacity-based balancing
service: allow draining with forced capacity-based balancing
When a malformed sstable error occurs, it is usually caused by actual sstable corruption — a cosmic ray, a bad disk write, etc. However, it can also be caused by memory corruption, where a data structure in memory happens to be read as sstable data. In the latter case, having a coredump of the process at the moment of the error is invaluable for post-mortem debugging, since the exception throwing/catching machinery destroys the stack frames that would point to the corruption site.
This patch series introduces `--abort-on-malformed-sstable-error`, a new command-line option (with `LiveUpdate` support) that, when set, causes the server to call `std::abort()` instead of throwing an exception whenever any sstable parse error is detected. This covers all code paths:
- Direct `throw malformed_sstable_exception(...)` sites (migrated to `throw_malformed_sstable_exception()`)
- Direct `throw bufsize_mismatch_exception(...)` sites (migrated to `throw_bufsize_mismatch_exception()`)
- `parse_assert()` failures (via `on_parse_error()`)
- BTI parse errors (via `on_bti_parse_error()`)
The implementation places the flag and helper functions in `sstables/sstables.cc`, next to the existing `on_parse_error()` / `on_bti_parse_error()` infrastructure.
The flag defaults to `false`, preserving current behaviour. It is intended to be enabled temporarily when investigating suspected memory corruption.
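The pattern of routing every throw site through one helper that consults a live flag can be sketched as follows. This is a Python analogy with hypothetical names (`set_abort_on_malformed_sstable_error`, the injectable `abort_fn`); the actual helpers are C++ functions in `sstables/sstables.cc` and call `std::abort()`.

```python
import os


class MalformedSstableError(Exception):
    pass


_abort_on_malformed = False  # defaults to off, preserving current behaviour


def set_abort_on_malformed_sstable_error(value: bool) -> None:
    """Setter that a live-updatable config option would call."""
    global _abort_on_malformed
    _abort_on_malformed = value


def throw_malformed_sstable_exception(msg: str, abort_fn=os.abort):
    """Single choke point replacing direct `raise MalformedSstableError(...)` sites."""
    if _abort_on_malformed:
        # Abort before any unwinding so a coredump keeps the faulting stack.
        abort_fn()
    raise MalformedSstableError(msg)
```

Because the check happens before the exception is created, the stack at abort time still points at the code that detected the corruption.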
**Commit breakdown:**
1. Infrastructure: flag, getter/setter, and throw helpers in `sstables/sstables.cc`; config option wired up in `main.cc`
2. `on_parse_error()` and `on_bti_parse_error()` check the new flag
3. All ~50 `throw malformed_sstable_exception(...)` sites migrated
4. Both `throw bufsize_mismatch_exception(...)` sites migrated
Refs: SCYLLADB-1087
Backport: new feature, no backport
Closes scylladb/scylladb#29324
* github.com:scylladb/scylladb:
sstables: migrate all bufsize_mismatch_exception throw sites to throw_bufsize_mismatch_exception()
sstables: migrate all malformed_sstable_exception throw sites to throw_malformed_sstable_exception()
sstables: make on_parse_error() and on_bti_parse_error() respect --abort-on-malformed-sstable-error
sstables: disable abort-on-malformed-sstable-error in tests that corrupt sstables on purpose
sstables: introduce --abort-on-malformed-sstable-error infrastructure
sstables: refactor parse_path() to return std::expected<> instead of throwing
This series adds per-test bucket isolation to all S3 and GCS object storage tests. Previously, every test shared a single pre-created bucket, which meant tests could interfere with each other through leftover objects and could not run concurrently across multiple `test.py` processes without risking collisions.
New `create_bucket`, `delete_bucket`, and `delete_bucket_with_objects` methods on `s3::client`, following the existing `make_request` pattern. `create_bucket` handles the `BUCKET_ALREADY_OWNED_BY_YOU` error gracefully.
A new `s3_test_fixture` RAII class for C++ Boost tests that creates a uniquely-named bucket on construction (derived from the Boost test name and pid) and tears down everything — objects, bucket, client — on destruction. All S3 tests in `s3_test.cc` are migrated to use it, removing manual `deferred_delete_object` and `deferred_close` boilerplate. The minio server policy is broadened to allow dynamic bucket creation/deletion.
A `client::make` overload accepts a custom `retry_strategy`. Tests use it with a fast 1 ms retry delay instead of exponential backoff, which significantly reduces test runtime when transient errors occur during bucket lifecycle operations.
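To see why a short retry delay shortens tests that hit transient errors, compare the cumulative wait of the two policies. The interface below is hypothetical and is not the real `retry_strategy` API. The 100 ms backoff base and the constant 1 ms delay are assumptions chosen for the example.

```cpp
#include <algorithm>
#include <chrono>

using std::chrono::milliseconds;

// Hypothetical policy interface: how long to wait before retry number `attempt`.
struct retry_delay_policy {
    virtual ~retry_delay_policy() = default;
    virtual milliseconds delay_before_retry(unsigned attempt) const = 0;
};

// Doubles the delay on every attempt, starting from an assumed 100 ms base.
struct exponential_backoff final : retry_delay_policy {
    milliseconds base{100};
    milliseconds delay_before_retry(unsigned attempt) const override {
        // Cap the shift so the multiplier cannot overflow.
        return base * (1u << std::min(attempt, 10u));
    }
};

// Test-oriented policy: always waits 1 ms.
struct fast_test_delay final : retry_delay_policy {
    milliseconds delay_before_retry(unsigned) const override {
        return milliseconds{1};
    }
};

// Total time spent waiting across `retries` consecutive retries.
milliseconds total_delay(const retry_delay_policy& p, unsigned retries) {
    milliseconds sum{0};
    for (unsigned i = 0; i < retries; ++i) {
        sum += p.delay_before_retry(i);
    }
    return sum;
}
```

Under these assumptions, five consecutive retries wait 3,100 ms in total with backoff and 5 ms with the fast policy.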
Python-side (`test/cluster/object_store`): each pytest fixture (`object_storage`, `s3_storage`, `s3_server`) now creates a unique bucket per test function via `create_test_bucket()` and destroys it on teardown. Bucket names are sanitized from the pytest node name with a short UUID suffix for uniqueness.
Object storage helpers (`S3Server`, `MinioWrapper`, `GSFront`, `GSServerImpl`, factory functions, CQL helpers, `s3_server` fixture) are extracted from `test/cluster/object_store/conftest.py` into a shared `test/pylib/object_storage.py` module, eliminating duplication across test suites. The conftest becomes a thin re-export wrapper. Old class names are preserved as aliases for backward compatibility.
| Test name | Test-specific retry strategy (ms) | Original (ms) | Δ (ms) | Speedup |
|--------------------------------------------------------------|----------------:|-------------:|---------:|--------:|
| test_client_upload_file_multi_part_with_remainder_proxy | 19,261 | 61,395 | −42,134 | **3.2×** |
| test_client_upload_file_multi_part_without_remainder_proxy | 16,901 | 53,688 | −36,787 | **3.2×** |
| test_client_upload_file_single_part_proxy | 3,478 | 6,789 | −3,311 | **2.0×** |
| test_client_multipart_copy_upload_proxy | 1,303 | 1,619 | −316 | 1.2× |
| test_client_put_get_object_proxy | 150 | 365 | −215 | **2.4×** |
| test_client_readable_file_stream_proxy | 125 | 327 | −202 | **2.6×** |
| test_small_object_copy_proxy | 205 | 389 | −184 | 1.9× |
| test_client_put_get_tagging_proxy | 181 | 350 | −169 | 1.9× |
| test_client_multipart_upload_proxy | 1,252 | 1,416 | −164 | 1.1× |
| test_client_list_objects_proxy | 729 | 881 | −152 | 1.2× |
| test_chunked_download_data_source_with_delays_proxy | 830 | 960 | −130 | 1.2× |
| test_client_readable_file_proxy | 148 | 279 | −131 | 1.9× |
| test_client_upload_file_multi_part_with_remainder_minio | 3,358 | 3,170 | +188 | 0.9× |
| test_client_upload_file_multi_part_without_remainder_minio | 3,131 | 2,929 | +202 | 0.9× |
| test_client_upload_file_single_part_minio | 519 | 421 | +98 | 0.8× |
| test_download_data_source_proxy | 180 | 237 | −57 | 1.3× |
| test_client_list_objects_incomplete_proxy | 590 | 641 | −51 | 1.1× |
| test_large_object_copy_proxy | 952 | 991 | −39 | 1.0× |
| test_client_multipart_upload_fallback_proxy | 148 | 185 | −37 | 1.3× |
| test_client_multipart_copy_upload_minio | 641 | 674 | −33 | 1.1× |
No backport needed — this is a test infrastructure improvement with no production code impact beyond the new `s3::client` methods.
Closes scylladb/scylladb#29508
* github.com:scylladb/scylladb:
test: extract object storage helpers to test/pylib/object_storage.py
test: add per-test bucket isolation to object_store fixtures
s3: add client::make overload with custom retry strategy
test: add s3_test_fixture and migrate tests to per-bucket isolation
s3: add create_bucket and delete_bucket to client