1759 Commits

Author SHA1 Message Date
Michael Litvak
5a5c7c6241 strong_consistency: wait for raft servers to start in create table
When creating a strongly consistent table, wait for the table's raft
servers to start and be ready to serve queries before completing the
operation. We want the create table operation to absorb the delay of
starting the raft groups instead of the first queries.

The create table coordinator commits and applies the schema statement,
then it waits for all hosts that have a tablet replica to create and
start the raft groups for the table's tablets. It does this by sending
an RPC to each relevant host; the RPC executes a group0 barrier to
ensure the table and its raft groups have been created, then waits for
all raft groups on that host to finish starting and become ready.
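
A minimal sketch of that fan-out (table_id, host_id and
send_wait_for_raft_servers_rpc are illustrative stand-ins, not the
actual ScyllaDB identifiers):

```cpp
#include <vector>
#include <seastar/core/coroutine.hh>
#include <seastar/coroutine/parallel_for_each.hh>

// Sketch only: ask every host that owns a tablet replica to barrier on
// group0 (so the table and its raft groups exist there) and then wait
// for those raft servers to be ready to serve.
seastar::future<> wait_for_table_raft_servers(table_id id, std::vector<host_id> hosts_with_replicas) {
    co_await seastar::coroutine::parallel_for_each(hosts_with_replicas,
            [&] (host_id host) { return send_wait_for_raft_servers_rpc(host, id); });
}
```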

Fixes SCYLLADB-807
2026-05-13 08:43:24 +02:00
Robert Bindar
b52e40e512 export make_storage_options_config from lib/test_services
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-05-12 10:40:21 +03:00
Robert Bindar
2f19d84ad7 Add system_distributed_keyspace.snapshot_sstables
This patch adds the snapshot_sstables table with the following
schema:
```cql
CREATE TABLE system_distributed.snapshot_sstables (
    snapshot_name text,
    keyspace text, table text,
    datacenter text, rack text,
    id uuid,
    first_token bigint, last_token bigint,
    toc_name text, prefix text,
    PRIMARY KEY ((snapshot_name, keyspace, table, datacenter, rack), first_token, id));
```
The table will be populated by the coordinator node during the restore
phase (and later on during the backup phase to accommodate live-restore).
The content of this table is meant to be consumed by the restore worker
nodes, which will use this data to filter sstables and perform file-based
downloads of them.
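
For illustration only (placeholder values, with the reserved-word columns
quoted), a restore worker's lookup is a single-partition read:

```cql
SELECT id, first_token, last_token, toc_name, prefix
  FROM system_distributed.snapshot_sstables
 WHERE snapshot_name = 'snap1' AND "keyspace" = 'ks' AND "table" = 't'
   AND datacenter = 'dc1' AND rack = 'r1';
```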

Fixes SCYLLADB-263

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-05-12 10:40:21 +03:00
Robert Bindar
31e9f04714 add get_system_distributed_keyspace to cql_test_env
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-05-12 10:17:40 +03:00
Botond Dénes
ad7ac62835 Merge 'Add a node_owner column (locator::host_id) to system.sstables and make it part of the partition key' from Dimitrios Symonidis
Add a node_owner column (locator::host_id) to system.sstables and make it part of the partition key, so the primary key becomes PRIMARY KEY ((table_id, node_owner), generation).

This is the first step toward moving the sstables registry into system_distributed: once distributed, each node's startup scan must read only the rows it owns, which requires the owning node to be part of the partition key. Partitioning by (table_id, node_owner) turns that scan into a single-partition read of exactly the local node's rows.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1562
No need to backport this; keyspace over object storage is an experimental feature.

Closes scylladb/scylladb#29659

* github.com:scylladb/scylladb:
  db, sstables: add node_owner to sstables registry primary key
  db, sstables: rename sstables registry column owner to table_id
2026-05-11 14:08:19 +03:00
Botond Dénes
eae15f4fdd Merge 'Share timeout_config between services' from Pavel Emelyanov
The timeout_config (more exactly -- updateable_timeout_config) is used by alternator/controller and transport/controller. Both create a local copy of that object by constructing one out of db::config. Also, some options from this config are needed by storage_proxy, but since it doesn't have access to any timeout_config-s, it just uses db::config by getting it from the database.

This PR introduces a top-level sharded<updateable_timeout_config>, initializes it from db::config values and makes existing users plus storage_proxy use it where required. Motivation -- remove more replica::database::get_config() users. A side effect -- timeout_config is not duplicated by the transport and alternator controllers.

Components' dependencies cleanup, not backporting.

Closes scylladb/scylladb#29636

* github.com:scylladb/scylladb:
  storage_proxy: Use shared updateable_timeout_config for CAS contention timeout
  alternator: Use shared updateable_timeout_config by reference
  cql_transport: Use shared updateable_timeout_config by reference
  storage_proxy: Use shared updateable_timeout_config by reference
  main: Introduce sharded<updateable_timeout_config>
  storage_proxy: Keep own updateable_timeout_config
2026-05-11 11:12:01 +03:00
Botond Dénes
9b2dfab2e5 Merge 'Don't use database.get_config() to fetch calculate_view_update_throttling_delay option' from Pavel Emelyanov
This option is used in two places: both the proxy and the view-update-generator need it to compute the calculate_view_update_throttling_delay() value. This PR moves the option onto the view_update_backlog top-level service, makes the calculating helper a method of that class, and patches the callers to use it. This eliminates more places that abuse database as a db::config accessor.

Code dependencies refactoring, not backporting

Closes scylladb/scylladb#29635

* github.com:scylladb/scylladb:
  view: Turn calculate_view_update_throttling_delay into node_update_backlog member
  view: Place view_flow_control_delay_limit_in_ms on node_update_backlog
  view: Add node_update_backlog reference to view_update_generator
2026-05-11 10:30:24 +03:00
Avi Kivity
5a887362e3 Merge 'Remove legacy tables creation code' from Gleb Natapov
Drop the creation code for the `service_levels` and `cdc_generation_descriptions_v2` tables since it is no longer needed. Old clusters will still have these tables because they were created earlier. Also, the series contains a small improvement around group0 creation.

No backport needed since this removes functionality.

Closes scylladb/scylladb#29482

* github.com:scylladb/scylladb:
  db/system_distributed_keyspace: remove system_distributed_everywhere since it is unused
  db/system_distributed_keyspace: drop CDC_TOPOLOGY_DESCRIPTION and CDC_GENERATIONS_V2
  db/system_distributed_keyspace: remove unused code
  db/system_distributed_keyspace: drop old cdc_generation_descriptions_v2 table
  db/system_distributed_keyspace: drop old service_levels table
  fix indent after the previous patch
  group0: call setup_group0 only when needed
2026-05-10 14:46:21 +03:00
Ernest Zaslavsky
a4ebe16517 test: make sstable test utilities natively async
The original make_memtable used seastar::thread::yield() for
preemption, which required all callers to run inside a
seastar::thread context. This prevented the utilities from being
used directly in coroutines or parallel_for_each lambdas.
Make the primary functions — make_memtable, make_sstable_containing,
and verify_mutation — return future<> directly. Callers now .get()
explicitly when in seastar::thread context, or co_await when in
a coroutine.
make_memtable now uses coroutine::maybe_yield() instead of
seastar::thread::yield(). verify_mutation is converted to
coroutines as well.
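A rough sketch of the two calling styles, with placeholder arguments
(env, muts) rather than the utilities' real parameters:

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>

// Coroutine caller: awaits the now-asynchronous utility directly.
seastar::future<> coroutine_style() {
    co_await make_sstable_containing(env, muts);   // placeholder arguments
}

// seastar::thread caller: blocks the thread fiber explicitly with .get().
void thread_style() {
    make_sstable_containing(env, muts).get();      // placeholder arguments
}
```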
Requested in:
https://github.com/scylladb/scylladb/pull/29416#pullrequestreview-4112296282
2026-04-28 16:59:37 +03:00
Ernest Zaslavsky
7c09f35ddf test: increase S3 max connections for compaction tests
Increase max_connections from the default to 32 for the S3 endpoint
used in tests. This allows more concurrent HTTP connections to the S3
backend, which is needed to benefit from parallel SSTable creation
that will be introduced in subsequent commits.
2026-04-28 16:59:37 +03:00
Dimitrios Symonidis
c40842f60a db, sstables: add node_owner to sstables registry primary key
Add a node_owner column (locator::host_id) to system.sstables and
make it part of the partition key, so the primary key becomes
  PRIMARY KEY ((table_id, node_owner), generation).

This is the first step toward moving the sstables registry into
system_distributed: once distributed, each node's startup scan
must read only the rows it owns, which requires the owning node
to be part of the partition key. Partitioning by (table_id,
node_owner) turns that scan into a single-partition read of
exactly the local node's rows.
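
Illustratively, the startup scan then becomes a single bind-marker query
of the local node's partition (a sketch, not the exact query the code
issues):

```cql
SELECT * FROM system.sstables WHERE table_id = ? AND node_owner = ?;
```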

The new column is populated via sstables_manager::get_local_host_id().
No backward compatibility is preserved; the feature is experimental
and gated by keyspace-storage-options.
2026-04-24 16:41:09 +02:00
Dimitrios Symonidis
ce78c5113e db, sstables: rename sstables registry column owner to table_id
The partition-key column in system.sstables named 'owner' actually
holds a table_id. Rename the CQL column and the matching C++
parameter and member names so the identifier describes what it
stores. No behavior change.

This prepares the schema for an upcoming node_owner partition-key
column (the local host id), which needs a free name.
2026-04-24 16:24:07 +02:00
Pavel Emelyanov
aa99c1fd6e storage_proxy: Use shared updateable_timeout_config by reference
Drop storage_proxy's own updateable_timeout_config member built from
db::config and take a reference to the shared sharded instance
introduced by the previous patch. Both main and cql_test_env pass
std::ref(timeout_cfg) into storage_proxy::start so each shard's
storage_proxy references its shard-local updateable_timeout_config.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-24 15:07:21 +03:00
Pavel Emelyanov
7b7295fde0 main: Introduce sharded<updateable_timeout_config>
Build a single sharded updateable_timeout_config from db::config in
both main and cql_test_env, sitting next to sharded<cql_config>.
Subsequent patches migrate storage_proxy, the CQL transport controller
and alternator server from their per-owner updateable_timeout_config
copies to references into this shared instance.
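
A rough sketch of the pattern, assuming updateable_timeout_config can be
constructed from db::config (the exact constructor arguments are
illustrative):

```cpp
#include <seastar/core/sharded.hh>

// One updateable_timeout_config instance per shard, built from db::config;
// later services (storage_proxy, controllers) receive std::ref(timeout_cfg)
// in their start() calls and keep a reference to the shard-local instance.
seastar::future<> start_timeout_config(seastar::sharded<updateable_timeout_config>& timeout_cfg,
                                       db::config& db_cfg) {
    return timeout_cfg.start(std::ref(db_cfg));
}
```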

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-24 15:03:35 +03:00
Pavel Emelyanov
ec2339e635 view: Add node_update_backlog reference to view_update_generator
Pass node_update_backlog explicitly to view_update_generator via its
constructor and start() call. This is plumbing only; no behavior change.
A subsequent patch will use this reference to compute view update
throttling delays without going through database::get_config().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-24 13:45:46 +03:00
Raphael S. Carvalho
474e962e01 compaction: Restrict tombstone GC sstable set to repaired sstables for tombstone_gc=repair mode
When tombstone_gc=repair, the repaired compaction view's sstable_set_for_tombstone_gc()
previously returned all sstables across all three views (unrepaired, repairing, repaired).
This is correct but unnecessarily expensive: the unrepaired and repairing sets are never
the source of a GC-blocking shadow when tombstone_gc=repair, for base tables.

The key ordering guarantee that makes this safe is:
- topology_coordinator sends send_tablet_repair RPC and waits for it to complete.
  Inside that RPC, mark_sstable_as_repaired() runs on all replicas, moving D from
  repairing → repaired (repaired_at stamped on disk).
- Only after the RPC returns does the coordinator commit repair_time + sstables_repaired_at
  to Raft.
- gc_before = repair_time - propagation_delay only advances once that Raft commit applies.

Therefore, when a tombstone T in the repaired set first becomes GC-eligible (its
deletion_time < gc_before), any data D it shadows is already in the repaired set on
every replica. This holds because:
- The memtable is flushed before the repairing snapshot is taken (take_storage_snapshot
  calls sg->flush()), capturing all data present at repair time.
- Hints and batchlog are flushed before the snapshot, ensuring remotely-hinted writes
  arrive before the snapshot boundary.
- Legitimate unrepaired data has timestamps close to 'now', always newer than any
  GC-eligible tombstone (USING TIMESTAMP to write backdated data is user error / UB).

Excluding the repairing and unrepaired sets from the GC shadow check cannot cause any
tombstone to be wrongly collected. The memtable check is also skipped for the same
reason: memtable data is either newer than the GC-eligible tombstone, or was flushed
into the repairing/repaired set before gc_before advanced.

Safety restriction — materialized views:
The optimization IS applied to materialized view tables. Two possible paths could inject
D_view into the MV's unrepaired set after MV repair: view hints and staging via the
view-update-generator. Both are safe:

(1) View hints: flush_hints() creates a sync point covering BOTH _hints_manager (base
mutations) AND _hints_for_views_manager (view mutations). It waits until ALL pending view
hints — including D_view entries queued in _hints_for_views_manager while the target MV
replica was down — have been replayed to the target node before take_storage_snapshot() is
called. D_view therefore lands in the MV's repairing sstable and is promoted to repaired.
When a repaired compaction then checks for shadows it finds D_view in the repaired set,
keeping T_mv non-purgeable.

(2) View-update-generator staging path: Base table repair can write a missing D_base to a
replica via a staging sstable. The view-update-generator processes the staging sstable
ASYNCHRONOUSLY: it may fire arbitrarily later, even after MV repair has committed
repair_time and T_mv has been GC'd from the repaired set. However, the staging processor
calls stream_view_replica_updates() which performs a READ-BEFORE-WRITE via
as_mutation_source_excluding_staging(): it reads the CURRENT base table state before
building the view update. If T_base was written to the base table (as it always is before
the base replica can be repaired and the MV tombstone can become GC-eligible), the
view_update_builder sees T_base as the existing partition tombstone. D_base's row marker
(ts_d < ts_t) is expired by T_base, so the view update is a no-op: D_view is never
dispatched to the MV replica. No resurrection can occur regardless of how long staging is
delayed.

A potential sub-edge-case is T_base being purged BEFORE staging fires (leaving D_base as
the sole survivor, so stream_view_replica_updates would dispatch D_view). This is blocked
by an additional invariant: for tablet-based tables, the repair writer stamps repaired_at
on staging sstables (repair_writer_impl::create_writer sets mark_as_repaired = true and
perform_component_rewrite writes repaired_at = sstables_repaired_at + 1 on every staging
sstable). After base repair commits sstables_repaired_at to Raft, the staging sstable
satisfies is_repaired(sstables_repaired_at, staging_sst) and therefore appears in
make_repaired_sstable_set(). Any subsequent base repair that advances sstables_repaired_at
further still includes the staging sstable (its repaired_at ≤ new sstables_repaired_at).
D_base in the staging sstable thus shadows T_base in every repaired compaction's shadow
check, keeping T_base non-purgeable as long as D_base remains in staging.

A base table hint also cannot bypass this. A base hint is replayed as a base mutation. The
resulting view update is generated synchronously on the base replica and sent to the MV
replica via _hints_for_views_manager (path 1 above), not via staging.

USING TIMESTAMP with timestamps predating (gc_before + propagation_delay) is explicitly
UB and excluded from the safety argument.

For tombstone_gc modes other than repair (timeout, immediate, disabled) the invariant
does not hold for base tables either, so the full storage-group set is returned.

Implementation:
- Add compaction_group::is_repaired_view(v): pointer comparison against _repaired_view.
- Add compaction_group::make_repaired_sstable_set(): iterates _main_sstables and inserts
  only sstables classified as repaired (repair::is_repaired(sstables_repaired_at, sst)).
- Add storage_group::make_repaired_sstable_set(): collects repaired sstables across all
  compaction groups in the storage group.
- Add table::make_repaired_sstable_set_for_tombstone_gc(): collects repaired sstables from
  all compaction groups across all storage groups (needed for multi-tablet tables).
- Add compaction_group_view::skip_memtable_for_tombstone_gc(): returns true iff the
  repaired-only optimization is active; used by get_max_purgeable_timestamp() in
  compaction.cc to bypass the memtable shadow check.
- is_tombstone_gc_repaired_only() private helper gates both methods: requires
  is_repaired_view(this) && tombstone_gc_mode == repair. No is_view() exclusion.
- Add error injection "view_update_generator_pause_before_processing" in
  process_staging_sstables() to support testing the staging-delay scenario.
- New test test_tombstone_gc_mv_optimization_safe_via_hints: stops servers[2], writes
  D_base + T_base (view hints queued for servers[2]'s MV replica), restarts, runs MV
  tablet repair (flush_hints delivers D_view + T_mv before snapshot), triggers repaired
  compaction, and asserts the MV row is NOT visible — T_mv preserved because D_view
  landed in the repaired set via the hints-before-snapshot path.
- New test test_tombstone_gc_mv_safe_staging_processor_delay: runs base repair before
  writing T_base so D_base is staged on servers[0] via row-sync; blocks the
  view-update-generator with an error injection; writes T_base + T_mv; runs MV repair
  (fast path, T_mv GC-eligible); triggers repaired compaction (T_mv purged — no D_view
  in repaired set); asserts no resurrection; releases injection; waits for staging to
  complete; asserts no resurrection after a second flush+compaction. Demonstrates that
  the read-before-write in stream_view_replica_updates() makes the optimization safe even
  when staging fires after T_mv has been GC'd.

The expected gain is reduced bloom filter and memtable key-lookup I/O during repaired
compactions: the unrepaired set is typically the largest (it holds all recent writes),
yet for tombstone_gc=repair it never influences GC decisions.
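
A rough sketch of the repaired-set construction named in the implementation
notes above; all identifiers below are illustrative stand-ins, not the actual
ScyllaDB code:

```cpp
// Illustrative only: keep just the sstables already classified as repaired,
// so the tombstone-GC shadow check skips the unrepaired/repairing sets.
sstables::sstable_set compaction_group::make_repaired_sstable_set() const {
    auto set = make_empty_sstable_set(_schema);   // placeholder factory
    for (const auto& sst : *_main_sstables) {
        if (repair::is_repaired(_sstables_repaired_at, sst)) {
            set.insert(sst);
        }
    }
    return set;
}
```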

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-20 16:59:09 -03:00
Botond Dénes
6ce0968960 compaction: release GC'ed sstables incrementally during compaction
Garbage collected sstables created during incremental compaction are
deleted only at the end of the compaction, which increases the memory
footprint. This is inefficient, especially considering that the related
input sstables are released regularly during compaction.

This commit implements incremental release of GC sstables after each
output sstable is sealed. Unlike regular input sstables, GC sstables
use a different exhaustion predicate: a GC sstable is only released
when its token range no longer overlaps with any remaining input
sstable. This is because GC sstables hold tombstones that may shadow
data in still-alive overlapping input sstables; releasing them
prematurely would cause data resurrection.
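
A minimal sketch of that exhaustion predicate over plain token bounds
(types and names are illustrative, not the compaction code's own):

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct token_range { int64_t first, last; };   // inclusive bounds, illustrative

// A GC sstable can be released only when no remaining input sstable's token
// range overlaps its own, i.e. none of its tombstones can still shadow live data.
bool can_release_gc_sstable(const token_range& gc, const std::vector<token_range>& remaining_inputs) {
    return std::none_of(remaining_inputs.begin(), remaining_inputs.end(), [&] (const token_range& in) {
        return gc.first <= in.last && in.first <= gc.last;   // ranges overlap
    });
}
```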

Fixes #5563

Closes scylladb/scylladb#28984
2026-04-17 18:20:47 +03:00
Avi Kivity
999e108139 Merge 'test: lib: fix broken retry in start_docker_service' from Dario Mirovic
The retry loop in `start_docker_service` passes the parse callbacks via `std::move` into `create_handler` on each iteration. After the first iteration, the moved-from `std::function` objects are empty. All subsequent retries skip output parsing entirely and immediately treat the service as successfully started. This defeats the entire purpose of the retry mechanism.

Fix by passing the callbacks by copy instead of move, so the original callbacks remain valid across retries.

Fixes SCYLLADB-1542

This is a CI stability issue and should be backported.

Closes scylladb/scylladb#29504

* github.com:scylladb/scylladb:
  test/lib: fix typos in proc_utils, gcs_fixture, and dockerized_service
  test: gcs_fixture: rename container from "local-kms" to "fake-gcs-server"
  test: fix proc_utils.cc formatting from previous commit
  test: lib: use unique container name per retry attempt
  test: lib: fix broken retry in start_docker_service
2026-04-16 21:48:25 +03:00
Botond Dénes
d006c4c476 Merge 'Untie (partially) cql3/statements from db::config' from Pavel Emelyanov
There's a bunch of db::config options that are used by cql3/statements/ code. To access them, that code uses data_dictionary/database as a proxy to get a db::config reference. This PR moves most of these accessed options onto cql_config.

Options migrated to cql_config:

   1. select_internal_page_size
   2. strict_allow_filtering
   3. enable_parallelized_aggregation
   4. batch_size_warn_threshold_in_kb
   5. batch_size_fail_threshold_in_kb
   6. 7 keyspace replication restriction options
   7. 2 TWCS restriction options
   8. restrict_future_timestamp
   9. strict_is_not_null_in_views (with view_restrictions struct)
   10. enable_create_table_with_compact_storage

Some options need special treatment and are still abused via database, namely:

  1. enable_logstor
  2. cluster_name
  3. partitioner
  4. endpoint_snitch

Fixing components inter-dependencies, not backporting

Closes scylladb/scylladb#29424

* github.com:scylladb/scylladb:
  cql3: Move enable_create_table_with_compact_storage to cql_config
  cql3: Move strict_is_not_null_in_views to cql_config
  cql3: Move restrict_future_timestamp to cql_config
  cql3: Move TWCS restriction options to cql_config
  cql3: Move keyspace restriction options to cql_config
  cql3: Move batch_size_fail_threshold_in_kb to cql_config
  cql3: Move batch_size_warn_threshold_in_kb to cql_config
  cql3: Move enable_parallelized_aggregation to cql_config
  cql3: Move strict_allow_filtering to cql_config
  cql3: Move select_internal_page_size to cql_config
  test: Fix cql_test_env to use updateable cql_config from db::config
  cql3: Add cql_config parameter to parsed_statement::prepare()
2026-04-16 14:04:43 +03:00
Dario Mirovic
50e498ac0d test/lib: fix typos in proc_utils, gcs_fixture, and dockerized_service
Fix assorted typos in comments, strings, and identifiers:
- path_preprend -> path_prepend (proc_utils.hh, proc_utils.cc)
- laúnch -> launch (proc_utils.cc)
- hand/fail -> hang/fail (dockerized_service.py)
- inconvinient -> inconvenient (dockerized_service.py)
- priviledges -> privileges (gcs_fixture.hh)
- remove double semicolon (gcs_fixture.cc)

Refs SCYLLADB-1542
2026-04-16 10:58:55 +02:00
Dario Mirovic
11b5997eaf test: gcs_fixture: rename container from "local-kms" to "fake-gcs-server"
The GCS fixture's fake-gcs-server container was named "local-kms",
copy-pasted from the AWS KMS fixture. It happened when both were
refactored to use the shared start_docker_service helper (bc544eb08e).

Rename to "fake-gcs-server" to match the Python-side naming and avoid
confusion in logs.

Refs SCYLLADB-1542
2026-04-16 10:58:52 +02:00
Dario Mirovic
dc7f848bf8 test: fix proc_utils.cc formatting from previous commit
Fix indentation of lines moved inside the for-loop in
start_docker_service (lines 208-225).

Refs SCYLLADB-1542
2026-04-16 10:55:48 +02:00
Dario Mirovic
be4d32c474 test: lib: use unique container name per retry attempt
The container name is generated once before the retry loop, so
all retry attempts reuse the same name. Move the name generation
inside the loop so each attempt gets a fresh name via the
incrementing counter, consistent with the comment "publish port
ephemeral, allows parallel instances".

Formatting changes (indentation) of lines 208-225 in test/lib/proc_utils.cc
will be fixed in the next commit.

Refs SCYLLADB-1542
2026-04-16 10:55:04 +02:00
Benny Halevy
ce00d61917 db: implement large_data virtual tables with feature flag gating
Replace the physical system.large_partitions, system.large_rows, and
system.large_cells CQL tables with virtual tables that read from
LargeDataRecords stored in SSTable scylla metadata (tag 13).

The transition is gated by a new LARGE_DATA_VIRTUAL_TABLES cluster
feature flag:

- Before the feature is enabled: the old physical tables remain in
  all_tables(), CQL writes are active, no virtual tables are registered.
  This ensures safe rollback during rolling upgrades.

- After the feature is enabled: old physical tables are dropped from
  disk via legacy_drop_table_on_all_shards(), virtual tables are
  registered on all shards, and CQL writes are skipped via
  skip_cql_writes() in cql_table_large_data_handler.

Key implementation details:

- Three virtual table classes (large_partitions_virtual_table,
  large_rows_virtual_table, large_cells_virtual_table) extend
  streaming_virtual_table with cross-shard record collection.

- generate_legacy_id() gains a version parameter; virtual tables
  use version 1 to get different UUIDs than the old physical tables.

- compaction_time is derived from SSTable generation UUID at display
  time via UUID_gen::unix_timestamp().

- Legacy SSTables without LargeDataRecords emit synthetic summary
  rows based on above_threshold > 0 in LargeDataStats.

- The activation logic uses two paths: when the feature is already
  enabled (test env, restart), it runs as a coroutine; when not yet
  enabled, it registers a when_enabled callback that runs inside
  seastar::async from feature_service::enable().

- sstable_3_x_test updated to use a simplified large_data_test_handler
  and validate LargeDataRecords in SSTable metadata directly.
2026-04-16 08:49:02 +03:00
Benny Halevy
cb6004b625 db: call initialize_virtual_tables from shard 0 only
Move the smp::invoke_on_all dispatch from the callers into
initialize_virtual_tables() itself, so the function is called
once from shard 0 and internally distributes the per-shard
virtual table setup to all shards.

This simplifies the callers and allows a single place to add
cross-shard coordination logic (e.g. feature-gated table
registration) in future commits.
2026-04-16 08:49:02 +03:00
Benny Halevy
90d4ff34fb test: add LargeDataRecords round-trip unit tests
Add three new test cases to sstable_3_x_test.cc that verify the
LargeDataRecords metadata written by the SSTable writer can be read
back after open_data():

- test_large_data_records_round_trip: verifies partition_size, row_size,
  and cell_size records are written with correct field semantics when
  thresholds are exceeded
- test_large_data_records_top_n_bounded: verifies the bounded min-heap
  keeps only the top-N largest entries per type
- test_large_data_records_none_when_below_threshold: verifies no records
  are written when data is below all thresholds

Also wire large_data_records_per_sstable from db_config into the test
env's sstables_manager::config so that config changes propagate through
the updateable_value chain to configure_writer().
2026-04-16 08:49:02 +03:00
Pavel Emelyanov
728eb20b42 test: Fix cql_test_env to use updateable cql_config from db::config
The test environment was creating cql_config with hardcoded default values that
were never updated when system.config was modified via CQL. This broke tests
that dynamically change configuration values (e.g., TWCS tests).

Fix by creating cql_config from db::config using sharded_parameter, which
ensures updateable_value fields track the actual db::config sources and reflect
changes made during test execution.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-04-16 07:57:26 +03:00
Dario Mirovic
336dab1eec test: lib: fix broken retry in start_docker_service
The retry loop in start_docker_service passes the parse callbacks
via std::move into create_handler on each iteration. After the
first iteration, the moved-from std::function objects are empty.
All subsequent retries skip output parsing entirely and
immediately treat the service as successfully started. This
defeats the entire purpose of the retry mechanism.

Fix by passing the callbacks by copy instead of move, so the
original callbacks remain valid across retries.
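
A self-contained illustration of this bug class (not the actual test code):
the std::function is empty after the first move, so only passing a copy
keeps retries working.

```cpp
#include <functional>
#include <iostream>

void create_handler(std::function<void(int)> cb) {
    if (cb) { cb(42); } else { std::cout << "no callback\n"; }
}

int main() {
    std::function<void(int)> parse = [] (int v) { std::cout << "parsed " << v << "\n"; };
    for (int attempt = 0; attempt < 3; ++attempt) {
        create_handler(std::move(parse));   // buggy: parse is empty after attempt 0
        // create_handler(parse);           // fix: pass a copy so later retries still parse
    }
}
```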

Fixes SCYLLADB-1542
2026-04-15 15:25:52 +02:00
Gleb Natapov
0ef06a34ed group0: call setup_group0 only when needed
setup_group0 and setup_group0_if_exist have hidden conditions inside that
make them no-ops. It is not clear at the call site that these functions
may do nothing. Change the code to check the conditions at the call sites
instead.
2026-04-15 15:48:48 +03:00
Tomasz Grabiec
b6a7023f68 tablets: Prepare for non-power-of-two tablet count
This is a step towards more flexibility in managing tablets: a
prerequisite for splitting individual tablets, isolating hot
partitions, and evening out tablet sizes by shifting boundaries.

After this patch, the system can handle tables with an arbitrary tablet
count. The tablet allocator still rounds the desired tablet count up to
the nearest power of two when allocating tablets for a new table, so
unless the tablet map is allocated in some other way, the counts will
still be a power of two.

We plan to utilize arbitrary count when migrating from vnodes to
tablets, by creating a tablet map which matches vnode boundaries.

One of the reasons we don't give up on power-of-two counts by default yet
is that odd counts create an issue with merges. If the tablet count is odd, one of
the tablets doesn't have a sibling and will not be merged. That can
obviously cause imbalance of token space and tablet sizes between
tablets. To limit the impact, this patch dynamically chooses which
tablet to isolate when initiating a merge. The largest tablet is
chosen, as that will minimize imbalance. Otherwise, if we always chose
the last tablet to isolate, its size would remain the same while other
tablets double in size with each odd-count merge, leading to
imbalance. The imbalance will still be there, but the difference in
tablet sizes is limited to 2x.

Example (3 tablets):
  [0] owns 1/3 of tokens
  [1] owns 1/3 of tokens
  [2] owns 1/3 of tokens

After merge:
  [0] owns 2/3 of tokens
  [1] owns 1/3 of tokens

What we would like instead:

Step 1 (split [1]):
  [0] owns 1/3 of tokens
  [1] old 1.left, owns 1/6 of tokens
  [2] old 1.right, owns 1/6 of tokens
  [3] owns 1/3 of tokens

Step 2 (merge):
  [0] owns 1/2 of tokens
  [1] owns 1/2 of tokens

To do that, we need to be able to split individual tablets, but we're
not there yet.
2026-04-15 10:40:55 +02:00
Avi Kivity
0ae22a09d4 LICENSE: Update to version 1.1
Updated terms of non-commercial use (must be a never-customer).
2026-04-12 19:46:33 +03:00
Avi Kivity
8ccee6803e Merge 'Remove upgrade view builder' from Gleb Natapov
Since we no longer support upgrades from versions that do not support
v2 of the "view building status" code (building status is managed by Raft), we can remove the v1 code and the upgrade code, and make sure we do not boot with an old "builder status" version.

The v2 version was introduced by 8d25a4d678, which is included in scylla-2025.1.0.

No backport needed since this is code removal.

Closes scylladb/scylladb#29105

* github.com:scylladb/scylladb:
  view: drop unused v1 builder code
  view: remove upgrade to raft code
2026-04-12 00:39:26 +03:00
Avi Kivity
ca80ee8586 Merge 'Introduce maintenance scheduling supergroup and do initial population' from Pavel Emelyanov
The supergroup replaces the streaming (a.k.a. maintenance) group, inherits its 200 shares, and consists of four sub-groups (all with equal shares of 200 within the new supergroup):

* maintenance_compaction. This group configures `compaction_manager::maintenance_sg()` group. User-triggered compaction runs in it
* backup. This group configures `snapshot_ctl::config::backup_sched_group`. Native backup activity runs there
* maintenance. It's a new "visible" name: everything that was called "maintenance" in the code used to run in the "streaming" group; now it will run in "maintenance". The activities include those that don't communicate over RPC (see below for why):
  * `tablet_allocator::balance_tablets()`
  * `sstables_manager::components_reclaim_reload_fiber()`
  * `tablet_storage_group_manager::merge_completion_fiber()`
  * metrics exporting http server altogether
* streaming. This is purely the existing streaming group, which just moves under the new supergroup. Everything else that ran there continues to do so, including:
  * hints sender
  * all view building related components (update generator, builder, workers)
  * repair
  * stream_manager
  * messaging service (except for verb handlers that switch groups)
  * join_cluster() activity
  * REST API
  * ... something else I forgot

The `--maintenance_io_throughput_mb_per_sec` option is introduced. It controls the IO throughput limit applied to the maintenance supergroup. If not set, the `--stream_io_throughput_mb_per_sec` option is used to preserve backward compatibility.

All new sched groups inherit `request_class::maintenance` (however, "backup" seems not to make any requests yet).

Moving more activities from "streaming" into "maintenance" (or their own groups) is possible, but one will need to take care of RPC group switching. The thing is that when a client makes an RPC call, the server may switch to one of the pre-negotiated scheduling groups. Verbs for existing activities that run in the "streaming" group are routed through an RPC index that negotiates the "streaming" group on the server side. If any of that client code moves to some other group, the server will still run the handlers in "streaming", which is not quite expected. That's one of the main reasons why only the selected fibers were moved to their own "maintenance" group. Similarly for backup: this code doesn't use RPC, so it can be moved. Restore code uses load-and-stream and the corresponding RPCs, so it cannot simply be moved into its own new group.

Fixes SCYLLADB-351

New feature, not backporting

Closes scylladb/scylladb#28542

* github.com:scylladb/scylladb:
  code: Add maintenance/maintenance group
  backup: Add maintenance/backup group
  compaction: Add maintenance/maintenance_compaction group
  main: Introduce maintenance supergroup
  main: Move all maintenance sched group into streaming one
  database: Use local variable for current_scheduling_group
  code: Live-update IO throughputs from main
2026-04-12 00:34:48 +03:00
Pavel Emelyanov
58e59e8c0d Merge 'test: add test_sstable_clone_preserves_staging_state' from Benny Halevy
Add a test that verifies filesystem_storage::clone preserves the sstable
state: an sstable in staging is cloned to a new generation, the clone is
re-loaded from the staging directory, and its state is asserted to still
be staging.

The change proves that https://scylladb.atlassian.net/browse/SCYLLADB-1205
is invalid, and can be closed.

* No functional change and no backport needed

Closes scylladb/scylladb#29209

* github.com:scylladb/scylladb:
  test: add test_sstable_clone_preserves_staging_state
  test: derive sstable state from directory in test_env::make_sstable
  sstables: log debug message in filesystem_storage::clone
2026-04-07 17:02:04 +03:00
Ernest Zaslavsky
437a581b04 sstable_utils: add get_storage and open_file helpers
Add a non-const `get_storage` accessor to expose underlying storage,
and an `open_file` helper to access sstable component files directly.
These are needed so compaction tests can read and write sstable
components.
2026-04-05 11:07:17 +03:00
Ernest Zaslavsky
2ad2dbae03 test_env: delay unplugging sstable registry
Unplugging the mock sstable_registry happened too early in the test
environment. During sstable destruction, components may still need
access to the registry, so the unplugging is moved to a later stage.
2026-04-05 11:07:17 +03:00
Marcin Maliszkiewicz
f988ec18cb test/lib: fix port in-use detection in start_docker_service
Previously, the result of when_all was discarded. when_all stores
exceptions in the returned futures rather than throwing, so the outer
catch(in_use&) could never trigger. Now we capture the when_all result
and inspect each future individually to properly detect in_use from
either stream.
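
A minimal sketch of the corrected pattern (simplified, illustrative names):
when_all never throws, so each returned future must be inspected.

```cpp
#include <exception>
#include <seastar/core/coroutine.hh>
#include <seastar/core/future.hh>
#include <seastar/core/when_all.hh>

seastar::future<> drain_both(seastar::future<> out, seastar::future<> err) {
    auto [f_out, f_err] = co_await seastar::when_all(std::move(out), std::move(err));
    for (auto* f : {&f_out, &f_err}) {
        if (f->failed()) {
            // e.g. rethrow so the caller's catch (in_use&) can finally trigger
            std::rethrow_exception(f->get_exception());
        }
    }
}
```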

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1216

Closes scylladb/scylladb#29219
2026-03-25 11:45:53 +02:00
Benny Halevy
22f2010477 test: derive sstable state from directory in test_env::make_sstable
Instead of always passing sstable_state::normal, infer the state from
the last component of the directory path by comparing against the known
state subdirectory constants (staging_dir, upload_dir, quarantine_dir).
Any unrecognized path component (the common case for normal-state
sstables) maps to sstable_state::normal.

When a non-normal state is detected, strip the state subdirectory from
dir so that the base table directory is passed to storage.
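
An illustrative sketch of that inference (the enum and directory names are
stand-ins for the real sstable_state and state-subdirectory constants):

```cpp
#include <filesystem>
#include <string>

enum class sstable_state { normal, staging, upload, quarantine };   // illustrative

sstable_state state_from_dir(const std::filesystem::path& dir) {
    const std::string last = dir.filename().string();
    if (last == "staging")    { return sstable_state::staging; }
    if (last == "upload")     { return sstable_state::upload; }
    if (last == "quarantine") { return sstable_state::quarantine; }
    return sstable_state::normal;   // unrecognized component: normal-state sstable
}
```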
2026-03-24 16:48:01 +02:00
Pavel Emelyanov
3b9398dfc8 Merge 'encryption: fix deadlock in encrypted_data_source::get()' from Ernest Zaslavsky
When encrypted_data_source::get() caches a trailing block in _next, the next call takes it directly — bypassing input_stream::read(), which checks _eof. It then calls input_stream::read_exactly() on the already-drained stream. Unlike read(), read_up_to(), and consume(), read_exactly() does not check _eof when the buffer is empty, so it calls _fd.get() on a source that already returned EOS.

In production this manifested as stuck encrypted SSTable component downloads during tablet restore: the underlying chunked_download_source hung forever on the post-EOS get(), causing 4 tablets to never complete. The stuck files were always block-aligned sizes (8k, 12k) where _next gets populated and the source is fully consumed in the same call.

Fix by checking _input.eof() before calling read_exactly(). When the stream already reached EOF, buf2 is known to be empty, so the call is skipped entirely.

A comprehensive test is added that uses a strict_memory_source which fails on post-EOS get(), reproducing the exact code path that caused the production deadlock.
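
A simplified sketch of the fix's shape (not the encryption code itself):
guard read_exactly() behind an eof() check so a drained stream is never
read again.

```cpp
#include <seastar/core/coroutine.hh>
#include <seastar/core/iostream.hh>
#include <seastar/core/temporary_buffer.hh>

seastar::future<seastar::temporary_buffer<char>>
read_tail(seastar::input_stream<char>& input, size_t remaining) {
    if (input.eof()) {
        // the stream already returned end-of-stream: skip read_exactly(),
        // which would otherwise call get() on the exhausted source
        co_return seastar::temporary_buffer<char>();
    }
    co_return co_await input.read_exactly(remaining);
}
```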

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1128

Backport to 2025.3/4 and 2026.1 is needed since it fixes a bug that may bite us in production, to be on the safe side

Closes scylladb/scylladb#29110

* github.com:scylladb/scylladb:
  encryption: fix deadlock in encrypted_data_source::get()
  test_lib: mark `limiting_data_source_impl` as not `final`
  Fix formatting after previous patch
  Fix indentation after previous patch
  test_lib: make limiting_data_source_impl available to tests
2026-03-23 17:12:44 +03:00
Pavel Emelyanov
6f43e8562e compaction: Add maintenance/maintenance_compaction group
The compaction manager distinguishes compaction_sched_group from
maintenance_compaction_sched_group. The latter, however, was set to the
"streaming" group. This patch adds a real maintenance_compaction group
under the maintenance supergroup and makes the compaction manager use it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-03-23 16:00:02 +03:00
Avi Kivity
6b259babeb Merge 'logstor: initial log-structured storage for key-value tables' from Michael Litvak
Introduce an initial and experimental implementation of an alternative log-structured storage engine for key-value tables.

Main flows and components:
* The storage is composed of 32MB files, each file divided into segments of size 128k. Records that contain a mutation and additional metadata are written to them sequentially. Records are written to a buffer first and then written to the active segment sequentially in 4k-sized blocks.
* The primary index in memory maps keys to their location on disk. It is a B-tree per-table that is ordered by tokens, similar to a memtable.
* On reads we calculate the key and look it up in the primary index, then read the mutation from disk with a single disk IO.
* On writes we write the record to a buffer, wait for it to be written to disk, then update the index with the new location, and free the previous record.
* We track the used space in each segment. When overwriting a record, we increase the free space counter for the segment of the previous record that becomes dead. We store the segments in a histogram by usage.
* The compaction process takes segments with low utilization, reads them and writes the live records to new segments, and frees the old segments.
* Segments are initially "mixed" - we write to the active segment records from all tables and all tablets. The "separator" process rewrites records from mixed segments into new segments that are organized by compaction groups (tablets), and frees the mixed segments. Each write is written to the active segment and to a separator buffer of the compaction group, which is eventually flushed to a new segment in the compaction group.

Currently this mode is experimental and requires an experimental flag to be enabled.
Some things are not supported yet: strong consistency, tablet migration, tablet split/merge, big mutations, tombstone GC, and TTL.

To use it, add to the config:
```
enable_logstor: true

experimental_features:
  - logstor
```

Create a table:
```
CREATE TABLE ks.t(pk int PRIMARY KEY, a int, v text) WITH storage_engine = 'logstor';
```

INSERT, SELECT, and DELETE work as expected.
UPDATE is not supported yet.

no backport - new feature

Closes scylladb/scylladb#28706

* github.com:scylladb/scylladb:
  logstor: trigger separator flush for buffers that hold old segments
  docs/dev: add logstor documentation
  logstor: recover segments into compaction groups
  logstor: range read
  logstor: change index to btree by token per table
  logstor: move segments to replica::compaction_group
  db: update dirty mem limits dynamically
  logstor: track memory usage
  logstor: logstor stats api
  logstor: compaction buffer pool
  logstor: separator: flush buffer when full
  logstor: hold segment until index updates
  logstor: truncate table
  logstor: enable/disable compaction per table
  logstor: separator buffer pool
  test: logstor: add separator and compaction tests
  logstor: segment and separator barrier
  logstor: separator debt controller
  logstor: compaction controller
  logstor: recovery: recover mixed segments using separator
  logstor: wait for pending reads in compaction
  logstor: separator
  logstor: compaction groups
  logstor: cache files for read
  logstor: recovery: initial
  logstor: add segment generation
  logstor: reserve segments for compaction
  logstor: index: buckets
  logstor: add buffer header
  logstor: add group_id
  logstor: record generation
  logstor: generation utility
  logstor: use RIPEMD-160 for index key
  test: add test_logstor.py
  api: add logstor compaction trigger endpoint
  replica: add logstor to db
  schema: add logstor cf property
  logstor: initial commit
  db: disable tablet balancing with logstor
  db: add logstor experimental feature flag
2026-03-20 00:18:09 +02:00
Ernest Zaslavsky
f74a54f005 test_lib: mark limiting_data_source_impl as not final 2026-03-19 13:54:54 +02:00
Ernest Zaslavsky
151e945d9f Fix formatting after previous patch 2026-03-19 13:54:44 +02:00
Ernest Zaslavsky
537747cf5d Fix indentation after previous patch 2026-03-19 13:48:53 +02:00
Ernest Zaslavsky
2535164542 test_lib: make limiting_data_source_impl available to tests
Relocate the `limiting_data_source_impl` declaration to the header file
so that test code can access it directly.
2026-03-19 13:48:53 +02:00
Michael Litvak
2128b1b15c replica: add logstor to db
Add a single logstor instance in the database that is used for writing
to and reading from tables with kv storage.
2026-03-18 19:24:26 +01:00
Gleb Natapov
77d3245e02 view: remove upgrade to raft code
Since we no longer support upgrades from versions that do not support
v2 of the view building code, we can remove the upgrade code and make
sure we do not boot with an old builder version.
2026-03-18 17:45:40 +02:00
Tomasz Grabiec
5ee61f067d test: cql_test_env: Respect enable-index-cache config
Mirrors the code in main.cc
2026-03-18 16:25:20 +01:00
Piotr Dulikowski
d8b283e1fb Merge 'Add CQL forwarding for strongly consistent tables' from Wojciech Mitros
In this series we add support for forwarding strongly consistent CQL requests to suitable replicas, so that clients can issue reads/writes to any node and have the request executed on an appropriate tablet replica (and, for writes, on the Raft leader). We return the same CQL response the user would get when sending the request to the correct replica, and we perform the same logging/stats updates on the request coordinator as if the coordinator were the appropriate replica.

The core mechanism of forwarding a strongly consistent request is sending an RPC containing the user's CQL request frame to the appropriate replica and receiving back a ready, serialized `cql_transport::response`. We do this in the CQL server: it is best prepared for handling these types, and forwarding a request containing a CQL frame allows us to reuse near-top-level methods for CQL request handling in the new RPC handler (such as the general `process`).

For sending the RPC, the CQL server needs to obtain the information about whom it should forward the request to. This requires knowledge about the tablet's raft group members and leader. We obtain this information during the execution of a `cql3/strong_consistency` statement, and we return it to the CQL server using the generalized `bounce_to_shard` `response_message`, where we now store either a shard or a specific replica to forward to. Similarly to `bounce_to_shard`, we need to handle this `result_message` in a loop: a replica may move during statement execution, or the Raft leader can change. We also use it for forwarding strongly consistent writes when we're not a member of the affected tablet's raft group; in that case we need to forward the statement twice: once to any replica of the affected tablet, which can then find the leader and return this information to the coordinator, allowing the second request to be directed to the leader.

This feature also allows passing through exception messages raised on the target replica while executing the statement. For that, many of `cql_transport::cql_server::connection`'s methods for creating error responses needed to be moved to `cql_transport::cql_server`. And for final exception handling on the coordinator, we added additional error info to the RPC response, so that the handling can be performed without having the `result_message::exception` or the `exception_ptr` itself.

Fixes [SCYLLADB-71](https://scylladb.atlassian.net/browse/SCYLLADB-71)


Closes scylladb/scylladb#27517

* github.com:scylladb/scylladb:
  test: add tests for CQL forwarding
  transport: enable CQL forwarding for strong consistency statements
  transport: add remote statement preparation for CQL forwarding
  transport: handle redirect responses in CQL forwarding
  transport: add exception handling for forwarded CQL requests
  transport: add basic CQL request forwarding
  idl: add a representation of client_state for forwarding
  cql_server: handle query, execute, batch in one case
  transport: inline process_on_shard in cql_server::process
  transport: extract process() to cql_server
  transport: add messaging_service to cql_server
  transport: add response reconstruction helpers for forwarding
  transport: generalize the bounce result message for bouncing to other nodes
  strong consistency: redirect requests to live replicas from the same rack
  transport: pass foreign_ptr into sleep_until_timeout_passes and move it to cql_server
  transport: extract the error handling from process_request_one
  transport: move error response helpers from connection to cql_server
2026-03-13 15:03:10 +01:00
Avi Kivity
b228eb26e6 Merge 'dbuild: Use slirp4netns network in dbuild nested containers' from Calle Wilund
Fixes #25084

Add slirp4netns and use it for nested containers. This will allow nested container port aliasing, helping CI stability.

Note: this contains an updated Dockerfile for the dbuild image, but because of the chicken-and-egg problem, for now the dbuild script force-installs slirp4netns before anything else.

Updates the mock server handling to use ephemeral ports and to query them from the container, ensuring we don't get port collisions (boost as well as pytest).

Includes a timeout increase, and a tweak to our scylla_cluster handling, ensuring we don't deadlock when the pipe size is less than required for our sys-notify messages.

Closes scylladb/scylladb#28727

* github.com:scylladb/scylladb:
  gcs_fixture: Change to use docker helper
  aws_kms_fixture: Modify to use docker helper
  test/lib/proc_util: Add docker helper
  pytest: use ephemeral port publish for docker mock servers
  dbuild: Use container network in dbuild nested containers
  scylla_cluster: Read notify sock in background to prevent deadlock
2026-03-12 23:49:25 +02:00