Add a new scylla metadata component LargeDataRecords (tag 13) that
stores per-SSTable top-N large data records. Each record carries:
- large_data_type (partition_size, row_size, cell_size, etc.)
- binary serialized partition key and clustering key
- column name (for cell records)
- value (size in bytes)
- element count (rows or collection elements, type-dependent)
- range tombstones and dead rows (partition records only)
The struct uses disk_string<uint32_t> for key/name fields and is
serialized via the existing describe_type framework into the SSTable
Scylla metadata component.
Add JSON support in scylla-sstable and format documentation.
Add a fmt::formatter specialization for sstables::large_data_type and
use it in scylla-sstable.cc instead of the local to_string() overload,
which is removed.
Move the key_to_str() template function from a file-local static in
db/large_data_handler.cc to keys/keys.hh so it can be reused by:
- large_data_handler.cc for log messages
- virtual tables (db/virtual_tables.cc) for converting binary keys
to human-readable CQL display
- scylla-sstable for JSON output of LargeDataRecords
No functional change.
Several tests in test_select_from_mutation_fragments.py assume that all
mutations end up in a single SSTable. This assumption can be violated
by background memtable flushes triggered by commitlog disk pressure.
Since the Scylla node is taken from a pool, it may carry unflushed data
from prior tests that prevents closed segments from being recycled,
thereby increasing the commitlog disk usage. A main source of such
pressure is keyspace-level flushes from earlier tests in this module,
which rotate commitlog segments without flushing system tables (e.g.,
`system.compaction_history`), leaving closed segments dirty.
Additionally, prior tests in the same module may have left unflushed
data on the shared test table (`test_table` fixture), keeping commitlog
segments dirty on its behalf as well. When commitlog disk usage exceeds
its threshold, the system flushes the test table to reclaim those
segments, potentially splitting a running test's mutations across
multiple SSTables.
This was observed in CI, where test_paging failed because its data was
split across two SSTables, resulting in more mutation fragments than the
hardcoded expected count.
This patch fixes the affected tests in two ways:
1. Where possible, tests are reworked to not assume a single SSTable:
- test_paging
- test_slicing_rows
- test_many_partition_scan
2. Where rework is impractical, major compaction is added after writes
and before validation to ensure that only one SSTable will exist:
- test_smoke
- test_count
- test_metadata_and_value
- test_slicing_range_tombstone_changes
Fixes SCYLLADB-1375.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closesscylladb/scylladb#29389
There are several reasons we want to do that.
One is that it will give us more flexibility in distributing the
load. We can subdivide tablets at any token, and achieve more
evenly-sized tablets. In particular, we can isolate large partitions
into separate tablets.
We can also split and merge incrementally individual tablets.
Currently, we do it for the whole table or nothing, which makes
splits and merges take longer and cause wide swings of the count.
This is not implemented in this PR yet, we still split/merge the whole table.
Another reason is vnode to tablets migration. We now could construct a
tablet map which matches exactly the vnode boundaries, so migration
can happen transparently from CQL-coordinator point of view.
Tablet count is still a power-of-two by default for newly created tables.
It may be different if tablet map is created by non-standard means,
or if per-table tablet option "pow2_count" is set to "false".
build/release/scylla perf-tablets:
Memory footprint for 131k tablets increased from 56 MiB to 58.1 MiB (+3.5%)
Before:
```
Generating tablet metadata
Total tablet count: 131072
Size of tablet_metadata in memory: 57456 KiB
Copied in 0.014346 [ms]
Cleared in 0.002698 [ms]
Saved in 1234.685303 [ms]
Read in 445.577881 [ms]
Read mutations in 299.596313 [ms] 128 mutations
Read required hosts in 247.482742 [ms]
Size of canonical mutations: 33.945053 [MiB]
Disk space used by system.tablets: 1.456761 [MiB]
Tablet metadata reload:
full 407.69ms
partial 2.65ms
```
After:
```
Generating tablet metadata
Total tablet count: 131072
Size of tablet_metadata in memory: 59504 KiB
Copied in 0.032475 [ms]
Cleared in 0.002965 [ms]
Saved in 1093.877441 [ms]
Read in 387.027100 [ms]
Read mutations in 255.752121 [ms] 128 mutations
Read required hosts in 211.202805 [ms]
Size of canonical mutations: 33.954453 [MiB]
Disk space used by system.tablets: 1.450162 [MiB]
Tablet metadata reload:
full 354.50ms
partial 2.19ms
```
Closesscylladb/scylladb#28459
* github.com:scylladb/scylladb:
test: boost: tablets: Add test for merge with arbitrary tablet count
tablets, database: Advertise 'arbitrary' layout in snapshot manifest
tablets: Introduce pow2_count per-table tablet option
tablets: Prepare for non-power-of-two tablet count
tablets: Implement merged tablet_map constructor on top of for_each_sibling_tablets()
tablets: Prepare resize_decision to hold data in decisions
tablets: table: Make storage_group handle arbitrary merge boundaries
tablets: Make stats update post-merge work with arbitrary merge boundaries
locator: tablets: Support arbitrary tablet boundaries
locator: tablets: Introduce tablet_map::get_split_token()
dht: Introduce get_uniform_tokens()
Commit c17c4806a1 removed check_for_endpoint_collision() from
the fresh bootstrap path, which was the only code path that
called do_shadow_round() for new nodes. Since the gossip shadow
round is no longer executed during bootstrap, remove the
stop_during_gossip_shadow_round error injection from the test.
The entry is marked as REMOVED_ rather than deleted to preserve
the shuffle order for seed-based test reproducibility.
The injection point in gms/gossiper.cc is also removed since it
is no longer used by any test.
Fixes: SCYLLADB-1466
Closesscylladb/scylladb#29460
Use async tablet repair task flow to avoid a race where client timeout
returns while server-side repair continues after injections are
disabled.
Start repair with await_completion=false, assert it does not complete
within timeout under injection, abort/wait the task, then verify
sstables_repaired_at is unchanged.
Fixes SCYLLADB-1184
Closesscylladb/scylladb#29452
Currently cancellation is logged in get_next_task, but the function is
called by tablets code as well where we do not act upon its result, only
yield to the topology coordinator. But the topology coordinator will not
necessary do the cancellation as well since it can be busy with tablets
migration. As a result cancellation is logged, but not done which is
confusing. Fix it by logging cancellation when it is actually happens.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1409Closesscylladb/scylladb#29471
This is an attempt (mostly suggested and implemented by AI, but with a few hours of human babysitting...), to somewhat reduce compilation time by picking one template, named_value<T>, which is used in more than a hundred source files through the config.hh header, and making it use external instantiation: The different methods of named_value<T> for various T are instantiated only once (in config.cc), and the individual translation units don't need to compile them a hundred times.
The resulting saving is a little underwhelming: The total object-file size goes down about 1% (from 346,200 before the patch to 343,488 after the patch), and previous experience shows that this object-file size is proportional to the compilation time, most of which involves code generation. But I haven't been able to measure speedup of the build itself.
1% is not nothing, but not a huge saving either. Though arguably, with 50 more of these patches, we can make the build twice faster :-)
Refs #1.
Closesscylladb/scylladb#28992
* github.com:scylladb/scylladb:
config: move named_value<T> method bodies out-of-line
config: suppress named_value<T> instantiation in every source file
Tests that call grep_for_errors() directly and assert no errors
can fail spuriously due to benign RPC errors during graceful
shutdown (e.g. "connection dropped: Semaphore broken"), which
are already filtered by the after_test hook via filter_errors().
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1464
Backport: no, tests fix (we may decide to backport later if it occurs on release branches)
Closesscylladb/scylladb#29463
* github.com:scylladb/scylladb:
test: filter benign errors in tests that grep logs during shutdown
test: filter_errors: support list[list[str]] error groups
When a replica disconnects during a digest read (e.g., during
decommission), the speculating_read_executor now immediately fires
the pending speculative retry instead of waiting for the timer.
On DISCONNECT, the digest_read_resolver invokes an _on_disconnect
callback set by the executor. The callback cancels the speculate
timer and rearms it to clock_type::now() (lowres_clock::now() =
thread-local memory read, no syscall). The existing timer callback
fires on the next reactor poll with all its logic intact — checking
is_completed(), calling add_wait_targets(1), sending the request,
and incrementing speculative_digest_reads/speculative_data_reads.
The notification is fire-and-forget: on_error() does NOT absorb the
DISCONNECT. The existing error arithmetic in digest_read_resolver
already handles this correctly because _target_count_for_cl accounts
for the speculative target.
For never_speculating_read_executor (no spare target) and
always_speculating_read_executor (all requests sent upfront),
_on_disconnect is never set — no behavior change.
Fixesscylladb/scylladb#26307Closesscylladb/scylladb#29428
all_sibling_tablet_replicas_colocated was using committed ti.replicas to
decide whether sibling tablets are co-located and merge can be finalized.
This caused a false non-co-located window when a co-located pair was moved
by the load balancer: as both tablets migrate together, their del_transition
commits may land in different Raft rounds. After the first commit, ti.replicas
diverge temporarily (one tablet shows the new position, the other the old),
causing all_sibling_tablet_replicas_colocated to return false. This clears
finalize_resize, allowing the load balancer to start new cascading migrations
that delay merge finalization by tens of seconds.
Fix this by using the optimistic replica view (trinfo->next when transitioning,
ti.replicas otherwise) — the same view the load balancer uses for load
accounting — so finalize_resize stays populated throughout an in-flight
migration and no spurious cascades are triggered.
Steps that lead to the problem:
1. Merge is triggered. The load balancer generates co-location migrations
for all sibling pairs that are not yet on the same shard. Some pairs
finish co-location before others.
2. Once all pairs are co-located in committed state,
all_sibling_tablet_replicas_colocated returns true and finalize_resize
is set. Meanwhile the load balancer may have already started a regular
LB migration on one co-located pair (both tablets are stable and the
load balancer is free to move them).
3. The LB migration moves both tablets together (colocated_tablets). Their
two del_transition commits land in separate Raft rounds. After the first
commit, ti.replicas[t1] = new position but ti.replicas[t2] = old position.
4. In this window, all_sibling_tablet_replicas_colocated sees the pair as
NOT co-located, clears finalize_resize, and the load balancer generates
new migrations for other tablets to rebalance the load that the pair
move created.
5. Those new migrations can take tens of seconds to stream, keeping the
coordinator in handle_tablet_migration mode and preventing
maybe_start_tablet_resize_finalization from being called. The merge
finalization is delayed until all those cascaded migrations complete.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-821.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1459.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#29465
Currently the statement returns cluster, partitioner and snitch names by accessing global db::config via database. As the part of an effort to detach components from global db::config, this PR tweaks the statement handler to get the cluster information from some other source. Currently the needed cluster information is stored in different components, but they are all under storage_service umbrella which seems to be a good central source of this truth. Unit test included.
Cleaning components inter-dependencies, not backporting
Closesscylladb/scylladb#29429
* github.com:scylladb/scylladb:
test: Add test_describe_cluster_sanity for DESCRIBE CLUSTER validation
describe_statement: Get cluster info from storage_service
storage_service: Add describe_cluster() method
query_processor: Expose storage_service accessor
Left over from primordial times, when reader_concurrency_semaphore was
baseclass for extensions in the separate enterprise repository.
Also remove the now unneded virtual marker from the destructor.
Closesscylladb/scylladb#29399
This series allows creating multiple vector indexes on the same column so users can rebuild an index without losing query availability.
The intended flow is:
1. Create a new vector index on a column that already has one.
2. Keep serving ANN queries from the old index while the new one is being built.
3. Verify the new index is ready.
4. Automatically switch to the remaining index.
5. Drop the old index.
To make that deterministic, `index_version` is changed from the base table schema version to a real creation timeuuid. When multiple vector indexes exist on the same column, ANN query planning now picks the index according to the routing implemented in Vector Store (newest serving index). This keeps queries on the old index until it the new one is up and ready.
This patch also removes the create-time restriction that rejected a second vector index on the same column. Name collisions are still rejected as before.
Test coverage is updated accordingly:
- Scylla now verifies that two vector indexes can coexist on the same column.
- Cassandra/SAI behavior is still covered and is still expected to reject duplicate indexes on the same column.
Fixes: VECTOR-610
Closesscylladb/scylladb#29407
* github.com:scylladb/scylladb:
docs: document vector index metadata and duplicate handling
test/cqlpy: cover vector index duplicate creation rules
vector_index: allow multiple named indexes on one column
vector_index: store `index_version` as creation timeuuid
Commit 234f905 (sstables: scylla_metadata: add schema member) added a
new Schema subcomponent (tag 11) to scylla_metadata. Document it in the
sstable Scylla format reference:
- Add schema to the subcomponent grammar enumeration
- Add a summary entry describing the subcomponent (tag 11) and its purpose
- Add a detailed ## schema subcomponent section with the binary grammar,
covering table_id, table_schema_version, keyspace_name, table_name and
the column_description array (column_kind, column_name, column_type)
Fixes https://github.com/scylladb/scylladb/issues/27960
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#28983
RF change of tablet keyspace starts tablet rebuilds. Even if any of the rebuilds is rolled back (because pending replica was excluded), rf change request finishes successfully. In this case we end up with the state of the replicas that isn't compatible with the expected keyspace replication.
Modify topology coordinator so that if it were to be idle, it starts checking if there are any missing replicas. It moves to transition_state::tablet_migration and run required rebuilds.
If a new RF change request encounters invalid state of replicas it fails. The state will be fixed later and the analogical ALTER KEYSPACE statement will be allowed.
Fixes: SCYLLADB-109.
Requires backport to all versions with tablet keyspace rf change.
Closesscylladb/scylladb#28709
* github.com:scylladb/scylladb:
test: add test_failed_tablet_rebuild_is_retried_on_alter
test: add a test to ensure that failed rebuilds are retried
service: fail ALTER KEYSPACE if replicas do not satisfy the replication
service: retry failed tablet rebuilds
service: maybe_start_tablet_migration returns std::optional<group0_guard>
Related scylladb/scylladb-docs-homepage#153.
make multiversion failed under Sphinx 8+ with:
```
sphinx-build: error: argument --tag/-t: expected one argument
subprocess.CalledProcessError: Command '(..., '-m', 'sphinx', '-t', '-D', 'smv_metadata_path=...', ..., 'manual')' returned non-zero exit status 2.
make: *** [multiversion] Error 1
```
sphinx-multiversion's arg forwarding splits `-t manual`, sending `-t` into the options slot and `manual` to the trailing FILENAMES positional.
Sphinx 7 silently tolerated the dangling `-t`; Sphinx 8+'s stricter
argparse CLI rejects it. Instead, it now reads FLAGS from an env variable.
How to test:
````
make multiversion
make FLAG=opensource multiversion
````
Both complete and switch variants correctly.
chore: rm empty lines
Closesscylladb/scylladb#29472
Three bugs fixed in segment_manager.cc:
1. write_to_separator(): captured [&index] where index was a local
coroutine-frame reference. The future is stored in
buf.pending_updates and resolved later in flush_separator_buffer(),
by which time the enclosing coroutine frame is destroyed, making
&index a dangling pointer. This is a use-after-free that manifests
as a segfault. Fix: capture index_ptr (raw pointer by value) instead.
2. add_segment_to_compaction_group(): same dangling [&index] pattern
inside the for_each_live_record lambda during recovery. Same fix
applied.
3. write(): local 'auto loc = seg->allocate(...)' shadowed the outer
'log_location loc', causing the function to always return a
zero-initialized log_location{}. Fix: remove 'auto' so the
assignment targets the outer variable.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#29451
The poll loop used condition_variable::wait(timeout) to sleep between
iterations. On every normal timeout expiry, this threw a
condition_variable_timed_out exception, which incremented the C++
exception metric and triggered false alerts for support.
Replace the timed wait with a seastar::timer that broadcasts the
condition variable on expiry, combined with an untimed wait(). The
timer is cancelled automatically on scope exit when the wait is woken
early by trigger_poll() or abort.
Fixes SCYLLADB-1477
Closesscylladb/scylladb#29438
The enable_logstor configuration option is redundant with the 'logstor'
experimental feature flag. Consolidate to a single gate: use the
experimental feature to control both whether logstor is available for
table creation and whether it is initialized at database startup.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#29427
This metric is used to catch execution of scans which go via row
cache, which can have bad effect on performance.
Since f344bd0aaa, aggregate queries go
via new statement class: parallelized_select_statement. This class
inherits from select_statement directly rather than from
primary_key_select_statement. The range scan detection logic
(_range_scan, _range_scan_no_bypass_cache) was only in
primary_key_select_statement's constructor, so parallelized queries
were not counted in select_partition_range_scan and
select_partition_range_scan_no_bypass_cache metrics.
Fix by moving the range scan detection into select_statement's
constructor, so that all subclasses get it.
No backport: enhancement
Closesscylladb/scylladb#29422
* github.com:scylladb/scylladb:
cql: Include parallelized queries in the scylla_cql_select_partition_range_scan_no_bypass_cache metric
test: cluster: dtest: Fix double-counting of metrics
`LDAPRoleManager` interpolated usernames directly into `ldap_url_template`,
allowing LDAP filter injection and URL structure manipulation via crafted
usernames.
This PR adds two layers of encoding when substituting `{USER}`:
1. **RFC 4515 filter escaping** — neutralises `*`, `(`, `)`, `\`, NUL
2. **URL percent-encoding** — prevents `%`, `?`, `#` from breaking
`ldap_url_parse`'s component splitting or undoing the filter escaping
It also adds `validate_query_template()` at startup to reject templates
that place `{USER}` outside the filter component (e.g. in the host or
base DN), where filter escaping would be the wrong defense.
Fixes: SCYLLADB-1309
Compatibility note:
Templates with `{USER}` in the host, base DN, attributes, or extensions
were previously silently accepted. They are now rejected at startup with
a descriptive error. Only templates with `{USER}` in the filter component
(after the third `?`) are valid.
Fixes: SCYLLADB-1309
Due to severeness, should be backported to all maintained versions.
Closesscylladb/scylladb#29388
* github.com:scylladb/scylladb:
auth: sanitize {USER} substitution in LDAP URL templates
test/ldap: add LDAP filter-injection reproducers
Add an explicit wait_for_schema_agreement() call after CREATE KEYSPACE
in create_new_test_keyspace to ensure all nodes have applied the schema
before proceeding.
Closesscylladb/scylladb#29371
The alter_table case has a known failure where point lookups at QUORUM
return 0 rows after node2 restarts, even though:
- the schema was correctly synced (ALTER TABLE received from cluster)
- the data commitlog was replayed (21 mutations, 0 skipped)
- all 3 nodes were alive, so QUORUM (2/3) should be satisfiable by
node1+node3 regardless of node2's state
The LIMIT 1 table scan succeeds (data is present somewhere), but
specific key lookups return empty. This points to a bug in how node2,
acting as coordinator after restart, routes single-partition reads —
most likely stale tablet routing metadata.
Add diagnostics to help distinguish data loss from a coordinator/routing
bug on the next failure:
- log which key is missing
- dump all rows visible at QUORUM
- query each node individually at ONE consistency for the missing key
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#29350
The real keyspace name of an Alternator table T is "alternator_T".
Expand the "alternator.T" format used in the audit_tables config flag
to the real keyspace name at parse time, so users don't need to spell
out the internal "alternator_T.T" form.
Add tests for the unhappy path of Alternator audit logging:
- Category filtering: operations are not logged when their category
(DML, QUERY, DDL) is excluded from audit_categories.
- Keyspace filtering: operations on a keyspace not listed in
audit_keyspaces are not logged.
- Error entries: a failed operation (thrown exception after audit_info
is set) produces an audit entry with error=true.
- Empty-keyspace bypass: global operations like ListTables and
DescribeEndpoints are logged regardless of audit_keyspaces because
should_log() short-circuits on an empty keyspace.
Prepare API in audit for auditing Alternator.
The API provides an externally-callable functions `inspect()`,
for both CQL and Alternator.
Both variants of the function would unpack parameters and merge into
calling a common `maybe_log()`, which can then call `log()` when
conditions are met.
Also, while I was at it, (const) references were favoured over raw
pointers.
The Alternator audit_info subclass (audit_info_alternator) carries an
optional consistency level — only data read/write operations have a
meaningful CL, while DDL and metadata queries store an empty string
in the audit table and syslog (matching the existing write_login
behavior). The storage helpers are updated accordingly.
Add a will_log(category, keyspace, table) method that checks whether
an operation should be audited (category check AND keyspace/table
filtering) without requiring a constructed audit_info object.
should_log() delegates to will_log().
Currently, the manifest advertises "powof2", which is wrong for
arbitrary count and boundaries.
Introduce a new kind of layout called "arbitrary", and produce it if
the tablet map doesn't conform to "powof2" layout.
We should also produce tablet boundaries in this case, but that's
worked on in a different PR: https://github.com/scylladb/scylladb/pull/28525
By default it's true, in which case tablet count of the table is
rounded up to a power of two. This option allows lifting this, in
which case the count can be arbitrary. This will allow testing the
logic of arbitrary tablet count.
This is a step towards more flexibility in managing tablets. A
prerequisite before we can split individual tablets, isolating hot
partitions, and evening-out tablet sizes by shifting boundaries.
After this patch, the system can handle tables with arbitrary tablet
count. Tablet allocator is still rounding up desired tablet count to
the nearest power of two when allocating tablets for a new table, so
unless the tablet map is allocated in some other way, the counts will
be still a power of two.
We plan to utilize arbitrary count when migrating from vnodes to
tablets, by creating a tablet map which matches vnode boundaries.
One of the reasons we don't give up on power-of-two by default yet is
that it creates an issue with merges. If tablet count is odd, one of
the tablets doesn't have a sibling and will not be merged. That can
obviously cause imbalance of token space and tablet sizes between
tablets. To limit the impact, this patch dynamically chooses which
tablet to isolate when initiating a merge. The largest tablet is
chosen, as that will minimize imbalance. Otherwise, if we always chose
the last tablet to isolate, its size would remain the same while other
tablets double in size with each odd-count merge, leading to
imbalance. The imbalance will still be there, but the difference in
tablet sizes is limited to 2x.
Example (3 tablets):
[0] owns 1/3 of tokens
[1] owns 1/3 of tokens
[2] owns 1/3 of tokens
After merge:
[0] owns 2/3 of tokens
[1] owns 1/3 of tokens
What we would like instead:
Step 1 (split [1]):
[0] owns 1/3 of tokens
[1] old 1.left, owns 1/6 of tokens
[2] old 1.right, owns 1/6 of tokens
[3] owns 1/3 of tokens
Step 2 (merge):
[0] owns 1/2 of tokens
[1] owns 1/2 of tokens
To do that, we need to be able to split individual tablets, but we're
not there yet.
We only assume that new tablets have boundaries which are equal
to some boundaries of old tablets.
In preparation for supporting arbitrary merge plan, where any replica
can be isolated (not merged with siblings) by the merge plan.
The test test/cluster/test_ttl_row.py::test_row_ttl_scheduling_group wants to
verify that the new CQL per-row TTL feature does all its work (expiration
scanning, deletion of expired items) on all nodes in the "streaming"
scheduling group, not in the statement scheduling group.
As originally written, the test couldn't require that it uses exactly zero
time in the statement scheduling group - because some things do happen
there - specifically the ALTER TABLE request we use to enable TTL.
So the test checked that the time in the "wrong" group is less than 0.2
of the total time, not zero.
But in one CI run, we got to exactly 0.2 and the test failed. Running
this test locally, I see the margin is pretty narrow: The test almost
always fails if I set the threshold ratio to 0.1.
The solution in this patch is to move the ALTER TABLE work to a different
scheduling group (by using an additional service level). After doing that
the CPU usage in sl:default goes down to exactly zero - not close to zero
but exactly zero.
However, it seems that there is always some rare background work in
sl:default and debug builds it can come out more than 0ms (e.g., in
one test we saw 1ms), so we keep checking that sl:default is much
lower than sl:stream - not exactly zero.
Incidentally, I converted the serial loop adding the 200 rows in the
test's setup to a parallel loop, to make the test setup slightly faster.
I also added to the test a sanity check that the scheduling group sl:default
that we are measuring that TTL does zero work in, is actually the scheduling
group that normal writes work in (to avoid the risk of having a test that
verifies that some irrelevant scheduling group is unsurprisingly getting
zero usage...).
Fixes SCYLLADB-1495.
Closesscylladb/scylladb#29447
We only assume that new tablets share boundaries with some old tablets.
In preparation for supporting arbitrary merge plan, where any replica
can be isolated (not merged with siblings) by the merge plan.
There are several reasons we want to do that.
One is that it will give us more flexibility in distributing the
load. We can subdivide tablets at any points, and achieve more
evenly-sized tablets. In particular, we can isolate large partitions
into separate tablets.
Another reason is vnode-to-tablet migration. We could construct a
tablet map which matches exactly the vnode boundaries, so migration
can happen transparently from the CQL-coordinator's point of view.
Implementation details:
We store a vector of tokens which represent tablet boundaries in the
tablet_id_map. tablet_id keeps its meaning, it's an index into vector
of tablets. To avoid logarithmic lookup of tablet_id from the token,
we introduce a lookup structure with power-of-two aligned buckets, and
store the tablet_id of the tablet which owns the first token in the
bucket. This way, lookup needs to consider tablet id range which
overlaps with one bucket. If boundaries are more or less aligned,
there are around 1-2 tablets overlapping with a bucket, and the lookup
is still O(1).
Amount of memory used increased, but not significantly relative to old
size (because tablet_info is currently fat):
For 131'072 tablets:
Before:
Size of tablet_metadata in memory: 57456 KiB
After:
Size of tablet_metadata in memory: 59504 KiB
And reimplement existing split-related methods around it.
This way we avoid calling dht::compaction_group_of(), and
assuming anything about tablet boundaries or tablet count
being a power of two.
This will make later refactoring easier.
Currently, hints that are sent to tablet replicas which are leaving due to RF-- can be lost, because `hint_sender` only checks if the destination host is leaving. To avoid this, we add a new method `effective_replication_map::is_leaving(host, token)` which checks if the tablet identified by the given token is leaving the host. This method is called by the `hint_sender` to check if the hint should be sent only to the destination host, or to all the replicas. This way, we increase consistency. For v-node based ERPs, `is_leaving()` calls `token_metadata::is_leaving(host)`.
Fixes: SCYLLADB-287
This is an improvement, and backport is not needed.
Closesscylladb/scylladb#28770
* github.com:scylladb/scylladb:
test: verify hints are delivered during tablet RF reduction
hint_sender: use per-tablet is_leaving() to avoid losing hints on RF reduction
erm: add is_leaving() to effective_replication_map
Fixes a race condition where tablet split can crash the server during truncation.
`truncate_table_on_all_shards()` disables compaction on all existing compaction groups, then later calls `discard_sstables()` which asserts that compaction is disabled. Between these two points, tablet split can call `set_split_mode()`, which creates new compaction groups via `make_empty_group()` — these start with `compaction_disabled_counter == 0`. When `discard_sstables()` checks its assertion, it finds these new groups and fires `on_internal_error`, aborting the server.
In `storage_group::set_split_mode()`, before creating new compaction groups, check whether the main compaction group has compaction disabled. If it does, bail out early and return `false` (not ready). This is safe because the split will be retried once truncation completes and re-enables compaction.
A new regression test `test_split_emitted_during_truncate` reproduces the
exact interleaving using two error injection points:
- **`database_truncate_wait`** — pauses truncation after compaction is disabled but before `discard_sstables()` runs.
- **`tablet_split_monitor_wait`** (new, in `service/storage_service.cc`) — pauses the split monitor at the start of `process_tablet_split_candidate()`.
The test creates a single-tablet table, triggers both operations, uses the injection points to force the problematic ordering, then verifies that truncation completes successfully and the split finishes afterward.
Fixes: SCYLLADB-1035
This needs to be backported to all currently supported version.
Closesscylladb/scylladb#29250
* github.com:scylladb/scylladb:
test: add test_split_emitted_during_truncate
table: fix race between tablet split and truncate