Commit Graph

1559 Commits

Author SHA1 Message Date
Michał Chojnowski
9e70df83ab db: get rid of sstables-format-selector
Our sstable format selection logic is weird, and hard to follow.

If I'm not misunderstanding, the pieces are:
1. There's the `sstable_format` config entry, which currently
   doesn't do anything, but in the past it used to disable
   cluster features for versions newer than the specified one.
2. There are deprecated and unused config entries for individual
   versions (`enable_sstables_mc_format`, `enable_sstables_md_format`,
   etc).
3. There is a cluster feature for each version:
   ME_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, etc.
   (Currently all sstable version features have been grandfathered,
   and aren't checked by the code anymore).
4. There's an entry in `system.scylla_local` which contains the
   latest enabled sstable version. (Why? Isn't this directly derived
   from cluster features anyway)?
5. There's `sstable_manager::_format` which contains the
   sstable version to be used for new writes.
   This field is updated by `sstables_format_selector`
   based on cluster features and the `system.scylla_local` entry.

I don't see why those pieces are needed. Version selection has the
following constraints:
1. New sstables must be written with a format that supports existing
   data. For example, range tombstones with an infinite bound are only
   supported by sstables since version "mc". So if a range tombstone
   with an infinite bound exists somewhere in the dataset,
   the format chosen for new sstables has to be at least as new as "mc".
2. A new format might only be used after a corresponding cluster feature
   is enabled. (Otherwise new sstables might become unreadable if they
   are sent to another node, or if a node is downgraded).
3. The user should have a way to inhibit format ugprades if he wishes.

So far, constraint (1) has been fulfilled by never using formats older
than the newest format ever enabled on the node. (With an exception
for resharding and reshaping system tables).
Constraint (2) has been fulfilled by calling `sstable_manager::set_format`
only after the corresponsing cluster feature is enabled.
Constraint (3) has been fulfilled by the ability to inhibit cluster
features by setting `sstable_format` by some fixed value.

The main thing I don't like about this whole setup is that it doesn't
let me downgrade the preferred sstable format. After a format is
enabled, there is no way to go back to writing the old format again.
That is no good -- after I make some performance-sensitive changes
in a new format, it might turn out to be a pessimization for the
particular workload, and I want to be able to go back.

This patch aims to give a way to downgrade formats without violating
the constraints. What it does is:
1. The entry in `system.scylla_local` becomes obsolete.
   After the patch we no longer update or read it.
   As far as I understand, the purpose of this entry is to prevent
   unwanted format downgrades (which is something cluster features
   are designed for) and it's updated if and only if relevant
   cluster features are updated. So there's no reason to have it,
   we can just directly use cluster features.
2. `sstable_format_selector` gets deleted.
   Without the `system.scylla_local` around, it's just a glorified
   feature listener.
3. The format selection logic is moved into `sstable_manager`.
   It already sees the `db::config` and the `gms::feature_service`.
   For the foreseeable future, the knowledge of enabled cluster features
   and current config should be enough information to pick the right formats.
4. The `sstable_format` entry in `db::config` is no longer intended to
   inhibit cluster features. Instead, it is intended to select the
   format for new sstables, and it becomes live-updatable.
5. Instead of writing new sstables with "highest supported" format,
   (which used to be set by `sstables_format_selector`) we write
   them with the "preferred" format, which is determined by
   `sstable_manager` based on the combination of enabled features
   and the current value of `sstable_format`.

Closes scylladb/scylladb#26092

[avi: Pavel found the reason for the scylla_local entry -
      it predates stable storage for cluster features]
2025-09-19 16:17:56 +03:00
Pavel Emelyanov
a1ea553fe1 code: Replace distributed<> with sharded<>
The latter is recommended in seastar, and the former was left as
compatibility alias. Latest seastar explicitly marks it as deprecated so
once the submodule is updated, compilation logs will explode.

Most of the patch is generated with

    for f in $(git grep -l '\<distributed<[A-Za-z0-9:_]*>') ; do sed -e 's/\<distributed<\([A-Za-z0-9:_]*\)>/sharded<\1>/g' -i $f; done
    for f in $(git grep -l distributed.hh); do sed -e 's/distributed.hh/sharded.hh/' -i $f ; done

and a small manual change in test/perf/perf.hh

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26136
2025-09-19 12:22:51 +02:00
Ernest Zaslavsky
ddf2588985 treewide: Move replica related files to replica directory
As requested in #22099, moved the files and fixed other includes and build system.

Moved files:
- cache_temperature.hh
- cell_locking.hh

Fixes: #22099

Closes scylladb/scylladb#25079
2025-09-18 08:00:35 +03:00
Ernest Zaslavsky
54aa552af7 treewide: Move type related files to a type directory As requested in #22110, moved the files and fixed other includes and build system.
Moved files:
- duration.hh
- duration.cc
- concrete_types.hh

Fixes: #22110

This is a cleanup, no need to backport

Closes scylladb/scylladb#25088
2025-09-17 17:32:19 +03:00
Ernest Zaslavsky
d624413ddd treewide: Move query related files to a new query directory
As requested in #22120, moved the files and fixed other includes and build system.

Moved files:
- query.cc
- query-request.hh
- query-result.hh
- query-result-reader.hh
- query-result-set.cc
- query-result-set.hh
- query-result-writer.hh
- query_id.hh
- query_result_merger.hh

Fixes: #22120

This is a cleanup, no need to backport

Closes scylladb/scylladb#25105
2025-09-16 23:40:47 +03:00
Michał Jadwiszczak
dc1ffd2c10 service/storage_service: drain view_building_worker earlier
Similarly to view builder, view building worker needs to be drained
in `storage_service::do_drain()`.

Storage service drain is happening at the same beginning of shutdown
procedure. Before this patch, the worker was still building views
after the storage service was drained and this caused errors like:
`Error applying view update to (named_gate_closed_exception)` and
`locator::no_such_tablet_map`.

Fixes scylladb/scylladb#25908

Closes scylladb/scylladb#25984
2025-09-15 11:29:19 +03:00
Avi Kivity
fc64333040 Merge 'sstables/trie: add BTI index readers and writers' from Michał Chojnowski
This is yet another part in the BTI index project.

Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25506/
Next part: plugging the BTI index readers and writers into sstable readers and writers.

The new code added in this PR isn't used outside of tests yet, but it's posted as a separate PR for reviewability.

This series implements, on top of the key translation logic, and abstract trie writing and traversal logic, a writer and a reader of sstable index files (which map primary keys to positions in Data.db), as described in f16fb6765b/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md.

Caveats:
1. I think the added test has reasonable coverage, but that depends on running it multiple times. (Though it shouldn't need more than a few runs to catch any bug it covers). It's somewhat awkward as a test meant for running in CI, it's better as something you run many times after a relevant change.
2. These readers and writers are intended to be compatible with Cassandra, but I did *NOT* do any compatibility testing. The writers and readers added here have only been tested against each other, not against Cassandra's readers and writers.
3. This didn't undergo any proper benchmarking and optimization work. I was doing some measurements in the past, but everything was rewritten so much since then that the my old measurements are effectively invalidated. Frankly I have no idea what the performance of all this branchy-branchy logic is now.

No backports needed, new functionality.

Closes scylladb/scylladb#25626

* github.com:scylladb/scylladb:
  test/manual: add bti_cassandra_compatibility_test
  test/lib/random_schema: add some constraints for generated uuid and time/date values
  test/lib/random_utils: add a variant of get_bytes which takes an `engine&`
  test/boost: add bti_index_test
  sstables/writer: add an accessor for the current write position in Data.db
  sstables/trie: introduce bti_index_reader
  sstables/trie: add bti_partition_index_writer.cc
  sstables/trie: add bti_row_index_writer.cc
  utils/bit_cast: add a new overload of write_unaligned()
  sstables/trie: add trie_writer::add_partial()
  sstables/consumer: add read_56()
  sstables/trie: make bti_node_reader::page_ptr copy-constructible
  sstables: extract abstract_index_reader from index_reader.hh to its own header
  sstables/trie: add an accessor to the file_writer under bti_node_sink
  sstables/types: make `deletion_time::operator tombstone()` const
  sstables/types: add sstables::deletion_time::make_live()
  sstables/trie: fix a special case in max_offset_from_child
  sstables/trie: handle `partition_region`s other than `clustered` in BTI position encoding
  sstables/trie: rewrite lcb_mismatch to handle fragment invalidation
  test/boost/bti_key_translation_test: fix a compilation error hidden behind `if constexpr`
2025-09-10 21:48:52 +03:00
Michał Chojnowski
77dcb2bcda test/lib/random_schema: add some constraints for generated uuid and time/date values
I want to write a test which generates a random table (random schema,
random data) and uses the Python driver to query it.
But it turns out that some values generated by test/lib/random_schema
can't be deserialized by the Python driver.
For example, it doesn't unknown uuid versions, dates before year 1
of after year 9999, or `time` values greater or equal to the number
of nanoseconds in a day.

AFAIK those "driver-illegal" values aren't particularly interesting
for tests which use `random_schema`, so we can just not generate
them.
2025-09-07 00:32:02 +02:00
Michał Chojnowski
3ce7b761ce test/lib/random_utils: add a variant of get_bytes which takes an engine&
I will need it in a test later.
2025-09-07 00:32:02 +02:00
Calle Wilund
5ead6ec420 proc-utils: Re-export waiting types from seastar
Just to make directly accessible from wrapper type
2025-09-01 18:03:44 +00:00
Calle Wilund
8169327553 proc-utils: Inherit environment from current process
In most cases, when launching a process from tests, we will want to
inherit our own env. Add option (default true) to do so.
2025-09-01 18:03:44 +00:00
Avi Kivity
bc5773f777 Merge 'Add out of space prevention mechanisms' from Łukasz Paszkowski
When a scaling out is delayed or fails, it is crucial to ensure that clusters remain operational
and recoverable even under extreme conditions. To achieve this, the following proactive measures
are implemented:
- reject writes
      - includes: inserts, updates, deletes, counter updates, hints, read+repair and lwt writes
      - applicable to: user tables, views, CDC log, audit, cql tracing
- stop running compactions/repairs and prevent from starting new ones
- reject incoming tablet migrations

The aforementioned mechanisms are automatically enabled when node's disk utilization reaches
the critical level (default: 98%) and disabled when the utilization drop below the threshold.

Apart from that, the series add tests that require mounted volumes to simulate out of space.
The paths to the volumes can be provided using the a pytest argument, i.e.  `--space-limited-dirs`.
When not provided, tests are skipped.

Test scenarios:

1. Start a cluster and write data until one of the nodes reaches 90% of the disk utilization
2. Perform an **operation** that would take the nodes over 100%
3. The nodes should not exceed the critical disk utilization (98% by default)
4. Scale out the cluster by adding one node per rack
5. Retry or wait for the **operation** from step 2

The **operation** is: writing data, running compactions, building materialized views, running repair,
migrating tablets (caused by RF change, decommission).

The test is successful, if no nodes run out of space, the **operation** from step 2 is
aborted/paused/timed out and the **operation** from step 5 is successful.

`perf-simple-query --smp 1 -m 1G` results obtained for fixed 400MHz frequency:

Read path (before)

```
instructions_per_op:
	mean=   39661.51 standard-deviation=34.53
	median= 39655.39 median-absolute-deviation=23.33
	maximum=39708.71 minimum=39622.61
```

Read path (after)

```
instructions_per_op:
	mean=   39691.68 standard-deviation=34.54
	median= 39683.14 median-absolute-deviation=11.94
	maximum=39749.32 minimum=39656.63
```

Write path (before):

```
instructions_per_op:
	mean=   50942.86 standard-deviation=97.69
	median= 50974.11 median-absolute-deviation=34.25
	maximum=51019.23 minimum=50771.60
```

Write path (after):

```
instructions_per_op:
	mean=   51000.15 standard-deviation=115.04
	median= 51043.93 median-absolute-deviation=52.19
	maximum=51065.81 minimum=50795.00
```

Fixes: https://github.com/scylladb/scylladb/issues/14067
Refs: https://github.com/scylladb/scylladb/issues/2871

No backport, as it is a new feature.

Closes scylladb/scylladb#23917

* github.com:scylladb/scylladb:
  tests/cluster: Add new storage tests
  test/scylla_cluster: Override workdir when passed via cmdline
  streaming: Reject incoming migrations
  storage_service: extend locator::load_stats to collect per-node critical disk utilization flag
  repair_service: Add a facility to disable the service
  compaction_manager: Subscribe to out of space controller
  compaction_manager: Replace enabled/disabled states with running state
  database: Add critical_disk_utilization mode database can be moved to
  disk_space_monitor: add subscription API for threshold-based disk space monitoring
  docs: Add feature documentation
  config: Add critical_disk_utilization_level option
  replica/exceptions: Add a new custom replica exception
2025-08-30 18:47:57 +03:00
Piotr Dulikowski
7ccb50514d Merge 'Introduce view building coordinator' from Michał Jadwiszczak
This patch introduces `view_building_coordinator`, a single entity within whole cluster responsible for building tablet-based views.

The view building coordinator takes slightly different approach than the existing node-local view builder. The whole process is split into smaller view building tasks, one per each tablet replica of the base table.
The coordinator builds one base table at a time and it can choose another when all views of currently processing base table are built.
The tasks are started by setting `STARTED` state and they are executed by node-local view building worker. The tasks are scheduled in a way, that each shard processes only one tablet at a time (multiple tasks can be started for a shard on a node because a table can have multiple views but then all tasks have the same base table and tablet (last_token)). Once the coordinator starts the tasks, it sends `work_on_view_building_tasks` RPC to start the tasks and receive their results.
This RPC is resilient to RPC failure or raft leader change, meaning if one RPC call started a batch of tasks but then failed (for instance the raft leader was changed and caller aborted waiting for the response), next RPC call will attach itself to the already started batch.

The coordinator plugs into handling tablet operations (migration/resize/RF change) and adjusts its tasks accordingly. At the start of each tablet operation, the coordinator aborts necessary view building tasks to prevent https://github.com/scylladb/scylladb/issues/21564. Then, new adjusted tasks are created at the end of the operation.
If the operation fails at any moment, aborted tasks are rollback.

The view building coordinator can also handle staging sstables using process_staging view building tasks. We do this because we don't want to start generating view updates from a staging sstable prematurely, before the writes are directed to the new replica (https://github.com/scylladb/scylladb/issues/19149).

For detailed description check: `docs/dev/view-building-coordinator.md`

Fixes https://github.com/scylladb/scylladb/issues/22288
Fixes https://github.com/scylladb/scylladb/issues/19149
Fixes https://github.com/scylladb/scylladb/issues/21564
Fixes https://github.com/scylladb/scylladb/issues/17603
Fixes https://github.com/scylladb/scylladb/issues/22586
Fixes https://github.com/scylladb/scylladb/issues/18826
Fixes https://github.com/scylladb/scylladb/issues/23930

---

This PR is reimplementation of https://github.com/scylladb/scylladb/pull/21942

Closes scylladb/scylladb#23760

* github.com:scylladb/scylladb:
  test/cluster: add view build status tests
  test/cluster: add view building coordinator tests
  utils/error_injection: allow to abort `injection_handler::wait_for_message()`
  test: adjust existing tests
  utils/error_injection: add injection with `sleep_abortable()`
  db/view/view_builder: ignore `no_such_keyspace` exception
  docs/dev: add view building coordinator documentation
  db/view/view_building_worker: work on `process_staging` tasks
  db/view/view_building_worker: register staging sstable to view building coordinator when needed
  db/view/view_building_worker: discover staging sstables
  db/view/view_building_worker: add method to register staging sstable
  db/view/view_update_generator: add method to process staging sstables instantly
  db/view/view_update_generator: extract generating updates from staging sstables to a method
  db/view/view_update_generator: ignore tablet-based sstables
  db/view/view_building_coordinator: update view build status on node join/left
  db/view/view_building_coordinator: handle tablet operations
  db/view: add view building task mutation builder
  service/topology_coordinator: run view building coordinator
  db/view: introduce `view_building_coordinator`
  db/view/view_building_worker: update built views locally
  db/view: introduce `view_building_worker`
  db/view: extract common view building functionalities
  db/view: prepare to create abstract `view_consumer`
  message/messaging_service: add `work_on_view_building_tasks` RPC
  service/topology_coordinator: make `term_changed_error` public
  db/schema_tables: create/cleanup tasks when an index is created/dropped
  service/migration_manager: cleanup view building state on drop keyspace
  service/migration_manager: cleanup view building state on drop view
  service/migration_manager: create view building tasks on create view
  test/boost: enable proxy remote in some tests
  service/migration_manager: pass `storage_proxy` to `prepare_keyspace_drop_announcement()`
  service/migration_manager: coroutinize `prepare_new_view_announcement()`
  service/storage_proxy: expose references to `system_keyspace` and `view_building_state_machine`
  service: reload `view_building_state_machine` on group0 apply()
  service/vb_coordinator: add currently processing base
  db/system_keyspace: move `get_scylla_local_mutation()` up
  db/system_keyspace: add `view_building_tasks` table
  db/view: add view_building_state and views_state
  db/system_keyspace: add method to get view build status map
  db/view: extract `system.view_build_status_v2` cql statements to system_keyspace
  db/system_keyspace: move `internal_system_query_state()` function earlier
  db/view: ignore tablet-based views in `view_builder`
  gms/feature_service: add VIEW_BUILDING_COORDINATOR feature
2025-08-29 17:28:44 +02:00
Łukasz Paszkowski
9539e80e54 compaction_manager: Subscribe to out of space controller 2025-08-29 14:56:07 +02:00
Łukasz Paszkowski
3d03b88719 database: Add critical_disk_utilization mode database can be moved to
When database operates in the critical disk utilization mode, all
mutation writes including inserts, updates, deletes, counter updates,
hints, read+repair, lwt writes) to user tables and other associated
with them tables like views, CDC log, audit are rejected, with a clear
error exception returned.

The mode is meant to be used with the disk space monitor in order
to prevent any user writes when node's disk utilization is too high.
2025-08-29 13:46:45 +02:00
Łukasz Paszkowski
3e740d25b5 disk_space_monitor: add subscription API for threshold-based disk space monitoring
Introduce the `subscribe` method to disk_space_monitor, allowing clients to
register callbacks triggered when disk utilization crosses a configurable
threshold.

The API supports flexible trigger options, including notifications on threshold
crossing and direction (above/below). This enables more granular and efficient
disk space monitoring for consumers.
2025-08-28 18:06:37 +02:00
Michał Jadwiszczak
233f4dcee3 db/view/view_building_worker: register staging sstable to view building coordinator when needed
Change return type of `check_needs_view_update_path()`. Instead of
retrning bool which tells whether to use staging directory (and register
to `view_update_generator`) or use normal directory.

Now the function returns enum with possible values:
- `normal_directory` - use normal directory for the sstable
- `staging_directly_to_generator` - use staging directory and register
      to `view_update_generator`
- `staging_managed_by_vbc` - use staging directory but don't register it
      to `view_update_generator` but create view building tasks for
      later

The third option is new, it's used when the table has any view which is
in building process currrently. In this case, registering it to `view_update_generator`
prematurely may lead to base-view inconsistency
(for example when a replica is in a pending state).
2025-08-27 10:23:03 +02:00
Michał Jadwiszczak
d2e1b6d44a service/storage_proxy: expose references to system_keyspace and view_building_state_machine
Those references are needed to manage view building tasks while a view
is created/dropped.
2025-08-27 08:55:47 +02:00
Michał Jadwiszczak
f2e7051a84 service: reload view_building_state_machine on group0 apply()
The state may be also reloaded on `topology_change` or `mixed_change`
because topology coordinator may change view building tasks during
tablet operations.
2025-08-27 08:55:47 +02:00
Dawid Mędrek
dd5a35dc67 service/qos: Add auth::service to auth_integration
The new service, `auth_integration`, has taken over the responsibility
over managing effective service levels from `service_level_controller`.
However, before these changes, it still accessed `auth::service` via
the service level controller. Let's change that.

Note that we also remove a check that `auth::service` has been
initialized. It's not necessary anymore because the lifetime of
`auth_integration` is strictly nested within the lifetime of `auth::service`.

In actuality, `service_level_controller` should lose its reference to
`auth::service` completely. All of the management over effective service
levels has already been moved to `auth_integration`. However, the
referernce is still needed when dropping a distributed service level
because we need to update the corresponding attribute for relevant
roles.

That should not lead to invalid accesses, though. Dropping a service level
should not be possible when `auth::service` is not initialized.
2025-08-26 18:41:43 +02:00
Dawid Mędrek
7d0086b093 service/qos: Introduce auth_integration
We introduce a new type, `auth_integration`, that will be used internally
by `service_level_controller`. Its purpose is to take over the responsibility
over managing effective service levels.

The main problem of the current implementation of service level controller
is its dependency on `auth::service` whose lifetime is strictly nested
within the lifetime of service level controller. That may and already have
led to invalid memory accesses; for an example, see issue
scylladb/scylladb#24792.

Our strategy is to split service level controller into smaller parts and
ensure that we access `auth::service` only when it's valid to do so.
This commit is the first step towards that.

We don't change anything in the logic yet, just add the new type. Further
adjustments will be made in following commits.
2025-08-26 18:41:34 +02:00
Dawid Mędrek
837d267cbf main: Log RF-rack-invalid keyspaces at startup
When the configuration option `rf_rack_valid_keyspaces` is enabled and there
is an RF-rack-invalid keyspace, starting a node fails. However, when the
configuration option is disabled, but there still is a keyspace that violates
the condition, we'd like Scylla to print a warning informing the user about
the fact. That's what happens in this commit.

We provide a validation test.
2025-08-21 19:35:33 +02:00
Botond Dénes
66db95c048 Merge 'Preserve PyKMIP logs from failed KMIP tests' from Nikos Dragazis
This PR extends the `tmpdir` class with an option to preserve the directory if the destructor is called during stack unwinding. It also uses this feature in KMIP tests, where the tmpdir contains PyKMIP server logs, which may be useful when diagnosing test failures.

Fixes #25339.

Not so important to be backported.

Closes scylladb/scylladb#25367

* github.com:scylladb/scylladb:
  encryption_at_rest_test: Preserve tmpdir from failing KMIP tests
  test/lib: Add option to preserve tmpdir on exception
2025-08-19 13:17:29 +03:00
Avi Kivity
611918056a Merge 'repair: Add tablet incremental repair support' from Asias He
The central idea of incremental repair is to allow repair participants
to select and repair only a portion of the dataset to speed up the
repair process. All repair participants must utilize an identical
selection method to repair and synchronize the same selected dataset.
There are two primary selection methods: time-based and file-based. The
time-based method selects data within a specified time frame. It is
versatile but it is less efficient because it requires reading all of
the dataset and omitting data beyond the time frame. The file-based
method selects data from unrepaired SSTables and is more efficient
because it allows the entire SSTable to be omitted. This document patch
implements the file-based selection method.

Incremental repair will only be supported for tablet tables; it will not
be supported for vnode tables. On one hand, the legacy vnode is less
important to support. On the other hand, the incremental repair for
vnode is much harder to implement. With vnodes, a SSTalbe could contain
data for multiple vnode ranges. When a given vnode range is repaired,
only a portion of the SSTable is repaired. This complicates the
manipulation of SSTables significantly during both repair and
compaction. With tablets, an entire tablet is repaired so that a
sstable is either fully repaired or not repaired which is a huge
simplification.

This patch uses the repaired_at from sstables::statistics component to
mark a sstable as repaired. It uses a virtual clock as the repair
timestamp, i.e., using a monotonically increasing number for the
repaired_at field of a SSTable and sstables_repaired_at column in
system.tablets table. Notice that when a sstable is not repaired, the
repaired_at field will be set to the default value 0 by default. The
being_repaired in memory field of a SSTable is used to explicitly mark
that a SSTable is being selected. The following variables are used for
incremental repair:

The repaired_at on disk field of a SSTable is used.
   - A 64-bit number increases sequentially

The sstables_repaired_at is added to the system.tablets table.
   - repaired_at <= sstables_repaired_at means the sstable is repaired

The being_repaired in memory field of a SSTable is added.
   - A repair UUID tells which sstable has participated in the repair

Initial test results:

    1) Medium dataset results
    Node amount: 3
    Instance type: i4i.2xlarge
    Disk usage per node: ~500GB
    Cluster pre-populated with ~500GB of data before starting repairs job.
    Results for Repair Timings:
    The regular repair run took 210 mins.
    Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s
    The speedup is: 183 mins  / 48s = 228X

    2) Small dataset results
    Node amount: 3
    Instance type: i4i.2xlarge
    Disk usage per node: ~167GB
    Cluster pre-populated with ~167GB of data before starting the repairs job.
    Regular repair 1st run took 110s,  2nd and 3rd runs took 110s.
    Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds.
    The speedup is: 110s / 1.5s = 73X

    3) Large dataset results
    Node amount: 6
    Instance type: i4i.2xlarge, 3 racks
    50% of base load, 50% read/write
    Dataset == Sum of data on each node

    Dataset     Non-incremental repair (minutes)
    1.3 TiB     31:07
    3.5 TiB     25:10
    5.0 TiB     19:03
    6.3 TiB     31:42

    Dataset     Incremental repair (minutes)
    1.3 TiB     24:32
    3.0 TiB     13:06
    4.0 TiB     5:23
    4.8 TiB     7:14
    5.6 TiB     3:58
    6.3 TiB     7:33
    7.0 TiB     6:55

Fixes #22472

Closes scylladb/scylladb#24291

* github.com:scylladb/scylladb:
  replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair
  compaction: Move compaction_reenabler to compaction_reenabler.hh
  topology_coordinator: Make rpc::remote_verb_error to warning level
  repair: Add metrics for sstable bytes read and skipped from sstables
  test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair
  test.py: Add tests for tablet incremental repair
  repair: Add tablet incremental repair support
  compaction: Add tablet incremental repair support
  feature_service: Add TABLET_INCREMENTAL_REPAIR feature
  tablet_allocator: Add tablet_force_tablet_count_increase and decrease
  repair: Add incremental helpers
  sstable: Add being_repaired to sstable
  sstables: Add set_repaired_at to metadata_collector
  mutation_compactor: Introduce add operator to compaction_stats
  tablet: Add sstables_repaired_at to system.tablets table
  test: Fix drain api in task_manager_client.py
2025-08-19 13:13:22 +03:00
Avi Kivity
e9928b31b8 Merge 'sstables/trie: add BTI key translation routines' from Michał Chojnowski
This is yet another part in the BTI index project.

Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25396
Next part: implementing sstable index writers and readers on top of the abstract trie writers/readers.

The new code added in this PR isn't used outside of tests yet, but it's posted as a separate PR for reviewability.

This series provides translation routines for ring positions and clustering positions
from Scylla's native in-memory structures to BTI's byte-comparable encoding.

This translation is performed whenever a new decorated key or clustering block
are added to a BTI index, and whenever a BTI index is queried for a range of positions.

For a description of the encoding, see
fad1f74570/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md (multi-component-sequences-partition-or-clustering-keys-tuples-bounds-and-nulls)

The translation logic, with all the fragment awareness, lazy
evaluation and avoidable copies, is fairly bloated for the common cases
of simple and small keys. This is a potential optimization target for later.

No backports needed, new functionality.

Closes scylladb/scylladb#25506

* github.com:scylladb/scylladb:
  sstables/trie: add BTI key translation routines
  tests/lib: extract generate_all_strings to test/lib
  tests/lib: extract nondeterministic_choice_stack to test/lib
  sstables/trie/trie_traversal: extract comparable_bytes_iterator to its own file
  sstables/mx: move clustering_info from writer.cc to types.hh
  sstables/trie: allow `comparable_bytes_iterator` to return a mutable span
  dht/ring_position: add ring_position_view::weight()
2025-08-18 11:55:26 +03:00
Asias He
f9021777d8 compaction: Add tablet incremental repair support
This patch addes incremental_repair support in compaction.

- The sstables are split into repaired and unrepaired set.

- Repaired and unrepaired set compact sperately.

- The repaired_at from sstable and sstables_repaired_at from
  system.tablets table are used to decide if a sstable is repaired or
  not.

- Different compactions tasks, e.g., minor, major, scrub, split, are
  serialized with tablet repair.
2025-08-18 11:01:21 +08:00
Michał Chojnowski
5e76708335 tests/lib: extract generate_all_strings to test/lib
This util will be used in another test file in a later commit,
so hoist it to `test/lib`.
2025-08-14 22:38:38 +02:00
Avi Kivity
fe6e1071d3 Merge 'locator: util: optimize describe_ring' from Benny Halevy
This change includes basic optimizations to
locator::describe_ring, mainly caching the per-endpoint information in an unordered_map instead of looking them up in every inner-loop.

This yields an improvement of 20% in cpu time.
With 45 nodes organized as 3 dcs, 3 racks per dc, 5 nodes per rack, 256 tokens per node, yielding 11520 ranges and 9 replicas per range, describe_ring took Before: 30 milliseconds (2.6 microseconds per range) After:  24 milliseconds (2.1 microseconds per range)

Add respective unit test for vnode keyspace
and for tablets.

Fixes #24887

* backport up to 2025.1 as describe_ring slowness was hit in the field with large clusters

Closes scylladb/scylladb#24889

* github.com:scylladb/scylladb:
  locator: util: optimize describe_ring
  locator: util: construct_range_to_endpoint_map: pass is_vnode=true to get_natural_replicas
  vnode_effective_replication_map: do_get_replicas: throw internal error if token not found in map
  locator: effective_replication_map: get_natural_replicas: get is_vnode param
  test: cluster: test_repair: add test_vnode_keyspace_describe_ring
2025-08-14 19:39:17 +03:00
Avi Kivity
66173c06a3 Merge 'Eradicate the ability to create new sstables with numerical sstable generation' from Benny Halevy
Remove support for generating numerical sstable generation for new sstables.
Loading such sstables is still supported but new sstables are always created with a uuid generation.
This is possible since:
* All live versions (since 5.4 / f014ccf369) now support uuid sstable generations.
* The `uuid_sstable_identifiers_enabled` config option (that is unused from version 2025.2 / 6da758d74c) controls only the use of uuid generations when creating new sstables. SSTables with uuid generations should still be properly loaded by older versions, even if `uuid_sstable_identifiers_enabled` is set to `false`.

Fixes #24248

* Enhancement, no backport needed

Closes scylladb/scylladb#24512

* github.com:scylladb/scylladb:
  streaming: stream_blob: use the table sstable_generation_generator
  replica: distributed_loader: process_upload_dir: use the table sstable_generation_generator
  sstables: sstable_generation_generator: stop tracking highest generation
  replica: table: get rid of update_sstables_known_generation
  sstables: sstable_directory: stop tracking highest_generation
  replica: distributed_loader: stop tracking highest_generation
  sstables: sstable_generation: get rid of uuid_identifiers bool class
  sstables_manager: drop uuid_sstable_identifiers
  feature_service: move UUID_SSTABLE_IDENTIFIERS to supported_feature_set
  test: cql_query_test: add test_sstable_load_mixed_generation_type
  test: sstable_datafile_test: move copy_directory helper to test/lib/test_utils
  test: database_test: move table_dir helper to test/lib/test_utils
2025-08-14 11:54:33 +03:00
Michał Chojnowski
72818a98e0 tests/lib: extract nondeterministic_choice_stack to test/lib
This util will be used in another test file in later commit,
so hoist it to `test/lib`.
2025-08-14 02:06:34 +02:00
Benny Halevy
50abeb1270 locator: util: optimize describe_ring
This change includes basic optimizations to
locator::describe_ring, mainly caching the per-endpoint
information in an unordered_map instead of looking
them up in every inner-loop.

This yields an improvement of 20% in cpu time.
With 45 nodes organized as 3 dcs, 3 racks per dc, 5 nodes per rack, 256 tokens per
node, yielding 11520 ranges and 9 replicas per range, describe_ring took
Before: 30 milliseconds (2.6 microseconds per range)
After:  24 milliseconds (2.1 microseconds per range)

Add respective unit test of describe_ring for tablets.
A unit test for vnodes already exists in
test/nodetool/test_describering.py

Fixes #24887

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-13 12:42:25 +03:00
Botond Dénes
614d17347a tombstone_gc: extract shared state into shared_tombstone_gc_state
Instead of storing it partially in tombstone_gc and partially in an
external map. Move all external parts into the new
shared_tombstone_gc_state. This new class is responsible for
keeping and updating the repair history. tombstone_gc_state just keeps
const pointers to the shared state as before and is only responsible for
querying the tombstone gc before times.
This separation makes the code easier to follow and also enables further
patching of tombstone_gc_state.
2025-08-11 07:09:14 +03:00
Benny Halevy
6cc964ef16 sstables: sstable_generation: get rid of uuid_identifiers bool class
Now that all call sites enable uuid_identifiers.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-08 11:46:21 +03:00
Benny Halevy
0ad1898f0a feature_service: move UUID_SSTABLE_IDENTIFIERS to supported_feature_set
The feature is supported by all live versions since
version 5.4 / 2024.1.

(Although up to 6da758d74c
it could be disabled using the config option)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-08 11:46:15 +03:00
Raphael S. Carvalho
beaaf00fac test: Add test that compaction doesn't cross logical group boundary
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:58:01 +03:00
Raphael S. Carvalho
9d3755f276 replica: Futurize retrieval of sstable sets in compaction_group_view
This will allow upcoming work to gently produce a sstable set for
each compaction group view. Example: repaired and unrepaired.

Locking strategy for compaction's sstable selection:
Since sstable retrieval path became futurized, tasks in compaction
manager will now hold the write lock (compaction_state::lock)
when retrieving the sstable list, feeding them into compaction
strategy, and finally registering selected sstables as compacting.
The last step prevents another concurrent task from picking the
same sstable. Previously, all those steps were atomic, but
we have seen stall in that area in large installations, so
futurization of that area would come sooner or later.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:58:00 +03:00
Raphael S. Carvalho
20c3301a1a treewide: Futurize estimation of pending compaction tasks
This is to allow futurization of compaction_group_view method that
retrieves sstable set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:51:29 +03:00
Raphael S. Carvalho
2c4a9ba70c treewide: Rename table_state to compaction_group_view
Since table_state is a view to a compaction group, it makes sense
to rename it as so.

With upcoming incremental repair, each replica::compaction_group
will be actually two compaction groups, so there will be two
views for each replica::compaction_group.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:51:28 +03:00
Benny Halevy
9b65856a26 test: sstable_datafile_test: move copy_directory helper to test/lib/test_utils
It's a generic helper that can be used by all tests.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-07 12:04:23 +03:00
Benny Halevy
7c9ce235d7 test: database_test: move table_dir helper to test/lib/test_utils
It's a generic helper that can be used by all tests.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-07 12:04:23 +03:00
Nikos Dragazis
1eb99fb5f5 test/lib: Add option to preserve tmpdir on exception
Extend the tmpdir class with an option to preserve the directory if the
destructor is called during stack unwinding (i.e., uncaught exception).
To be used in tests where the tmpdir contains non-temporary resources
that may help in diagnosing test failures (e.g., logs from external
services such as PyKMIP).

This will be used in the next patch.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-08-06 13:07:52 +03:00
Patryk Jędrzejczak
3299ffba51 Merge 'raft_group0: split shutdown into abort-and-drain and destroy' from Petr Gusev
Previously, `raft_group0::abort()` was called in `storage_service::do_drain` (introduced in #24418) to stop the group0 Raft server before destroying local storage. This was necessary because `raft::server` depends on storage (via `raft_sys_table_storage` and `group0_state_machine`).

However, this caused issues: services like `sstable_dict_autotrainer` and `auth::service`, which use `group0_client` but are not stopped by `storage_service`, could trigger use-after-free if `raft_group0` was destroyed too early. This can happen both during normal shutdown and when 'nodetool drain' is used.

This PR reworks the shutdown logic:
* Introduces `abort_and_drain()`, which aborts the server and waits for background tasks to finish, but keeps the server object alive. Clients will see `raft::stopped_error` if they try to access group0 after this method is called.
* Final destruction now happens in `abort_and_destroy()`, called later from `main.cc`, ensuring safe cleanup.

The `raft_server_for_group::aborted` is changed to a `shared_future`, as it is now awaited in both abort methods.

Node startup can fail before reaching `storage_service`, in which case `drain_on_shutdown()` and `abort_and_drain()` are never called. To ensure proper cleanup, `raft_group0` deinitialization logic must be included in both `abort_and_drain()` and `abort_and_destroy()`.

Refs #25115

Fixes #24625

Backport: the changes are complicated and not safe to backport, we'll backport a revert of the original patch (#24418) in a separate PR.

Closes scylladb/scylladb#25151

* https://github.com/scylladb/scylladb:
  raft_group0: split shutdown into abort_and_drain and destroy
  Revert "main.cc: fix group0 shutdown order"
2025-07-29 10:39:00 +02:00
Botond Dénes
f3ed27bd9e Merge 'Move feature-service config creation code out of feature-service itself' from Pavel Emelyanov
Nowadays the way to configure an internal service is

1. service declares its config struct
2. caller (main/test/tool) fills the respective config with values it wants
3. the service is started with the config passed by value

The feature service code behaves likewise, but provides a helper method to create its config out of db::config. This PR moves this helper out of gms code, so that it doesn't mess with system-wide db::config and only needs its own small struct feature_config.

For the reference: similar changes with other services: #23705 , #20174 , #19166

Closes scylladb/scylladb#25118

* github.com:scylladb/scylladb:
  gms,init: Move get_disabled_features_from_db_config() from gms
  code: Update callers generating feature service config
  gms: Make feature_config a simple struct
  gms: Split feature_config_from_db_config() into two
2025-07-29 08:17:49 +03:00
Nadav Har'El
b4fc3578fc Merge 'LWT: enable for tablet-based tables' from Petr Gusev
This PR enables **LWT (Lightweight Transactions)** support for tablet-based tables by leveraging **colocated tables**.

Currently, storing Paxos state in system tables causes two major issues:
* **Loss of Paxos state during tablet migration or base table rebuilds**
  * When a tablet is migrated or the base table is rebuilt, system tables don't retain Paxos state.
  * This breaks LWT correctness in certain scenarios.
  * Failing test cases demonstrating this:
      * test_lwt_state_is_preserved_on_tablet_migration
      * test_lwt_state_is_preserved_on_rebuild
* **Shard misalignment and performance overhead**
  * Tablets may be placed on arbitrary shards by the tablet balancer.
  * Accessing Paxos state in system tables could require a shard jump, degrading performance.

We move Paxos state into a dedicated Paxos table, colocated with the base table:
  * Each base table gets its own Paxos state table.
  * This table is lazily created on the first LWT operation.
  * Its tablets are colocated with those of the base table, ensuring:
    * Co-migration during tablet movement
    * Co-rebuilding with the base table
    * Shard alignment for local access to Paxos state

Some reasoning for why this is sufficient to preserve LWT correctness is discussed in [2].

This PR addresses two issues from the "Why doesn't it work for tablets" section  in [1]:
  * Tablet migration vs LWT correctness
  * Paxos table sharding

Other issues ("bounce to shard" and "locking for intranode_migration") have already been resolved in previous PRs.

References
[1] - [LWT over tablets design](https://docs.google.com/document/d/1CPm0N9XFUcZ8zILpTkfP5O4EtlwGsXg_TU4-1m7dTuM/edit?tab=t.0#heading=h.goufx7gx24yu)
[2] - [LWT: Paxos state and tablet balancer](https://docs.google.com/document/d/1-xubDo612GGgguc0khCj5ukmMGgLGCLWLIeG6GtHTY4/edit?tab=t.0)
[3] - [Colocated tables PR](https://github.com/scylladb/scylladb/pull/22906#issuecomment-3027123886)
[4] - [Possible LWT consistency violations after a topology change](https://github.com/scylladb/scylladb/issues/5251)

Backport: not needed because this is a new feature.

Closes scylladb/scylladb#24819

* github.com:scylladb/scylladb:
  create_keyspace: fix warning for tablets
  docs: fix lwt.rst
  docs: fix tablets.rst
  alternator: enable LWT
  random_failures: enable execute_lwt_transaction
  test_tablets_lwt: add test_paxos_state_table_permissions
  test_tablets_lwt: add test_lwt_for_tablets_is_not_supported_without_raft
  test_tablets_lwt: test timeout creating paxos state table
  test_tablets_lwt: add test_lwt_concurrent_base_table_recreation
  test_tablets_lwt: add test_lwt_state_is_preserved_on_rebuild
  test_tablets_lwt: migrate test_lwt_support_with_tablets
  test_tablets_lwt: add test_lwt_state_is_preserved_on_tablet_migration
  test_tablets_lwt: add simple test for LWT
  check_internal_table_permissions: handle Paxos state tables
  client_state: extract check_internal_table_permissions
  paxos_store: handle base table removal
  database: get_base_table_for_tablet_colocation: handle paxos state table
  paxos_state: use node_local_only mode to access paxos state
  query_options: add node_local_only mode
  storage_proxy: handle node_local_only in query
  storage_proxy: handle node_local_only in mutate
  storage_proxy: introduce node_local_only flag
  abstract_replication_strategy: remove unused using
  storage_proxy: add coordinator_mutate_options
  storage_proxy: rename create_write_response_handler -> make_write_response_handler
  storage_proxy: simplify mutate_prepare
  paxos_state: lazily create paxos state table
  migration_manager: add timeout to start_group0_operation and announce
  paxos_store: use non-internal queries
  qp: make make_internal_options public
  paxos_store: conditional cf_id filter
  paxos_store: coroutinize
  feature_service: add LWT_WITH_TABLETS feature
  paxos_state: inline system_keyspace functions into paxos_store
  paxos_state: extract state access functions into paxos_store
2025-07-28 13:19:23 +03:00
Avi Kivity
1930f3e67f Merge 'sstables/mx/reader: accommodate inexact partition indexes' from Michał Chojnowski
Unlike the currently-used sstable index files, BTI indexes don't store the entire partition keys. They only store prefixes of decorated keys, up to the minimum length needed to differentiate a key from its neighbours in the sstable. This saves space.

However, it means that a BTI index query might be off by one partition (on each end of the queried partition range) with respect to the optimal Data position.

For example, if the index stores prefixes `a`, `b`, `c`,
the index has no way to know if the first index entry after key `bb`
is `b` (which might correspond to `ba` as well as `bc`), or `c`.
So the index reader conservatively has to pick the wider Data range, and the Data reader must ignore the superfluous partitions. (And there's no way around that.)

Before this patch, the sstable reader expects the index query to return an exact (optimal) Data range. This patch adjusts the logic of the sstable reader to allow for inexact ranges.

Note: the patch is more complicated that it looks. The logic of the sstable reader was already fairly hard to follow and this adds even more flags, more weird special states and more edge cases. I think I managed to write a decent test and it did find three or four edge cases I wouldn't have noticed otherwise. I think it should cover all the added logic, but I didn't verify code coverage. (Do our scripts for that even work nowadays)? Simplification ideas are welcome.

Preparation for new functionality, no backporting needed.

Closes scylladb/scylladb#25093

* github.com:scylladb/scylladb:
  sstables/index_reader: weaken some exactness guarantees in abstract_index_reader
  test/boost: add a test for inexact index lookups
  sstables/mx/reader: allow passing a custom index reader to the constructor
  sstables/index_reader: remove advance_to
  sstables/mx/reader: handle inexact lookups in `advance_context()`
  sstables/mx/reader: handle inexact lookups in `advance_to_next_partition()`
  sstables/index_reader: make the return value of `get_partition_key` optional
  sstables/mx/reader: handle "backward jumps" in forward_to
  sstables/mx/reader: filter out partitions outside the queried range
  sstables/mx/reader: update _pr after `fast_forward_to`
2025-07-27 19:39:36 +03:00
Petr Gusev
8b8b7adbe5 raft_group0: split shutdown into abort_and_drain and destroy
Previously, raft_group0::abort() was called in
storage_service::do_drain (introduced in #24418) to
stop the group0 Raft server before destroying local storage.
This was necessary because raft::server depends on storage
(via raft_sys_table_storage and group0_state_machine).

However, this caused issues: services like
sstable_dict_autotrainer and auth::service, which use
group0_client but are not stopped by storage_service,
could trigger use-after-free if raft_group0 was destroyed
too early. This can happen both during normal shutdown
and when 'nodetool drain' is used.

This commit reworks the shutdown logic:
* Introduces abort_and_drain(), which aborts the server
and waits for background tasks to finish, but keeps the
server object alive. Clients will see raft::stopped_error if
they try to access group0 after abort_and_drain().
* Final destruction happens in a separate method destroy(),
called later from main.cc.

The raft_server_for_group::aborted is changed to a
shared_future -- abort_server now returns a future so that
we can wait for it in abort_and_drain(), it should return
the future from the previous abort_server call, which can
happen in the on_background_error callback.

Node startup can fail before reaching storage_service,
in which case ss.drain_on_shutdown() and abort_and_drain()
are never called. To ensure proper cleanup,
abort_and_drain() is called from main.cc before destroy().

Clients of raft_group_registry are expected to call
destroy_server() for the servers they own. Currently,
the only such client is raft_group0, which satisfies
this requirement. As a result,
raft_group_registry::stop_servers() is no longer needed.
Instead, raft_group_registry::stop() now verifies that all
servers have been properly destroyed.
If any remain, it calls on_internal_error().

The call to drain_on_shutdown() in cql_test_env.cc
appears redundant. The only source of raft::server
instances in raft_group_registry is group0_service, and
if group0_service.start() succeeds, both abort_and_drain()
and destroy() are guaranteed to be called during shutdown.
2025-07-25 17:16:14 +02:00
Michał Chojnowski
141895f9eb sstables/index_reader: make the return value of get_partition_key optional
BTI indexes only store encoded prefixes of partition keys,
not the whole keys. They can't reliably implement `get_partition_key`.
The index reader interface must be weakened and callers must
be adapted.
2025-07-25 11:00:18 +02:00
Ernest Zaslavsky
d2c5765a6b treewide: Move keys related files to a new keys directory
As requested in #22102, #22103 and #22105 moved the files and fixed other includes and build system.

Moved files:
- clustering_bounds_comparator.hh
- keys.cc
- keys.hh
- clustering_interval_set.hh
- clustering_key_filter.hh
- clustering_ranges_walker.hh
- compound_compat.hh
- compound.hh
- full_position.hh

Fixes: #22102
Fixes: #22103
Fixes: #22105

Closes scylladb/scylladb#25082
2025-07-25 10:45:32 +03:00
Petr Gusev
ac4bc3f816 paxos_state: lazily create paxos state table
We call paxos_store::ensure_initialized in the beginning of
storage_proxy::cas to create a paxos state table for a user table if
it doesn't exist. When the LWT coordinator sends RPCs to replicas,
some of them may not yet have the paxos schema. In
paxos_store::get_paxos_state_schema we just wait for them to appear,
or throw 'no_such_column_family' if the base table was dropped.
2025-07-24 19:48:08 +02:00
Petr Gusev
6e87a6cdb0 paxos_state: extract state access functions into paxos_store
Introduce paxos_store abstraction to isolate Paxos state access.
Prepares for supporting either system.paxos or a co-located
table as the storage backend.
2025-07-24 16:39:50 +02:00