Commit Graph

4758 Commits

Author SHA1 Message Date
Botond Dénes
3158e9b017 doc: reorganize properties in config.cc and config.hh
This commit moves the "Ungrouped properties" category to the end of the
properties list. The properties are now published in the documentation,
and it doesn't look good if the list starts with ungrouped properties.

This patch was taken over from Anna Stuchlik <anna.stuchlik@scylladb.com>.

Closes scylladb/scylladb#28343
2026-01-29 11:27:42 +03:00
Pavel Emelyanov
937d008d3c Merge 'Clean up partition_snapshot_reader' from Botond Dénes
Move to `replica/`, drop `flat` from name and drop unused usages as well as unused includes.

Code cleanup, no backport

Closes scylladb/scylladb#28353

* github.com:scylladb/scylladb:
  replica/partition_snapshot_reader: remove unused includes
  partition_snapshot_reader: remove "flat" from name
  mv partition_snapshot_reader.hh -> replica/
2026-01-29 11:22:15 +03:00
Botond Dénes
482ffe06fd Merge 'Improve load shedding on the replica side' from Łukasz Paszkowski
When reads arrive, they have to wait for admission on the reader
concurrency semaphore. If the node is overloaded, the reads will
be queued. They can time out while in the queue, but will not time
out once admitted.

Once the shard is sufficiently loaded, it is possible that most
queued reads will time out, because the average time it takes to
for a queued read to be admitted is around that of the timeout.

If a read times out, any work we already did, or are about to do
on it is wasted effort. Therefore, the patch tries to prevent it
by checking if an admitted read has a chance to complete in time
and abort it if not. It uses the following criteria:

if read's remaining time <= read's timeout when arrived to the semaphore * live updateable preemptive_abort_factor;
the read is rejected and the next one from the wait list is considered.

Fixes https://github.com/scylladb/scylladb/issues/14909
Fixes: SCYLLADB-353

Backport is not needed. Better to first observe its impact.

Closes scylladb/scylladb#21649

* github.com:scylladb/scylladb:
  reader_concurrency_semaphore: Check during admission if read may timeout
  permit_reader::impl: Replace break with return after evicting inactive permit on timeout
  reader_concurrency_semaphore: Add preemptive_abort_factor to constructors
  config: Add parameters to control reads' preemptive_abort_factor
  permit_reader: Add a new state: preemptive_aborted
  reader_concurrency_semaphore: validate waiters counter when dequeueing a waiting permit
  reader_concurrency_semaphore: Remove cpu_concurrency's default value
2026-01-29 08:27:22 +02:00
Piotr Dulikowski
ec6a2661de Merge 'Keep view_builder background fiber in maintenance scheduling group' from Pavel Emelyanov
In fact, it's partially there already. When view_builder::start() is called is first calls initialization code (the start_in_background() method), then kicks do_build_step() that runs a background fiber to perform build steps. The starting code inherits scheduling group from main(). And the step fiber code needs to run itself in a maintenance scheduling group, so it explicitly grabs one via database->db_config.

This PR mainly gets rid of the call to database::get_streaming_scheduling_group() from do_build_step() as preparation to splitting the streaming scheduling group into parts (see SCYLLADB-351). To make it happen the do_build_step() is patched to inherit its scheduling group from view_builder::start() and the start() itself is called by main from maintenance scheduling group (like for other view building services).

New feature (nested scheduling group), not backporting

Closes scylladb/scylladb#28386

* github.com:scylladb/scylladb:
  view_builder: Start background in maintenance group
  view_builder: Wake-up step fiber with condition variable
2026-01-28 20:49:19 +01:00
Pavel Emelyanov
3ebd02513a view_builder: Start background in maintenance group
Currently view_builder::start() is called in default scheduling group.
Once it initializes itself, it wakes up the step fiber that explicitly
switches to maintenance scheduling group.

This explicit switch made sence before previous patch, when the fiber
was implemented as a serialized action. Now the fiber starts directly
from .start() method and can inherit scheduling group from it.

Said that, main code calls view_builder::start() in maintenance
scheduling group killing two birds with one stone. First, the step fiber
no longer needs borrow its scheduling group indirectly via database.
Second, the start_in_background() code itself runs in a more suitable
scheduling group.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-28 18:34:59 +03:00
Pavel Emelyanov
2439d27b60 view_builder: Wake-up step fiber with condition variable
View builder runs a background fiber that perform build steps. To kick
the fiber it uses serizlized action, but it's an overkill -- nobody
waits for the action to finish, but on stop, when it's joined.

This patch uses condition variable to kick the fiber, and starts it
instantly, in the place where serialized action was first kicked.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-28 18:34:58 +03:00
Łukasz Paszkowski
21348050e8 config: Add parameters to control reads' preemptive_abort_factor 2026-01-28 14:20:01 +01:00
Botond Dénes
ee631f31a0 Merge 'Do not export system keyspace from raft_group0_client' from Pavel Emelyanov
There are few places that use raft_group0_client as a way to get to system_keyspace. Mostly they can live without it -- either the needed reference is already at hand, or it's (ab)used to get to the database reference. The only place that really needs the system keyspace is the state merger code that needs last state ID. For that, the explicit helper method is added to group0_client.

Refining API between components, not backporting

Closes scylladb/scylladb#28387

* github.com:scylladb/scylladb:
  raft_group0_client: Dont export system keyspace
  raft_group0_client: Add and use get_last_group0_state_id()
  group0_state_machine: Call ensure_group0_sched() with data_dictionary
  view_building_worker: Use its own system_keyspace& reference
2026-01-28 13:24:32 +02:00
Emil Maskovsky
834961c308 db/view: add missing include for coroutine::all to fix build without precompiled headers
When building with `--disable-precompiled-header`, view.cc failed to
compile due to missing <seastar/coroutine/all.hh> include, which provides
`coroutine::all`.

The problem doesn't manifest when precompiled headers are used, which is
the default. So that's likely why it was missed by the CI.

Adding the explicit include fixes the build.

Fixes: scylladb/scylladb#28378
Ref: scylladb/scylladb#28093

No backport: This problem is only present in master.

Closes scylladb/scylladb#28379
2026-01-27 18:56:56 +01:00
Pavel Emelyanov
20a2b944df view_building_worker: Use its own system_keyspace& reference
Some code in the worker need to mess with system_keyspace&. While
there's a reference on it from the worker object, it gets one via
group0 -> group0_client, which is a bit an overkill.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-01-27 14:46:48 +03:00
Pavel Emelyanov
c61d855250 hints: Provide explicit scheduling group for hint_sender
Currently it grabs one from database, but it's not nice to use database
as config/sched-groups provider.

This PR passes the scheduling group to use for sending hints via manager
which, in turn, gets one from proxy via its config (proxy config already
carries configuration for hints manager). The group is initialized in
main.cc code and is set to the maintenance one (nowadays it's the same
as streaming group).

This will help splitting the streaming scheduling group into more
elaborated groups under the maintenance supergroup: SCYLLADB-351

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28358
2026-01-27 12:50:11 +02:00
Piotr Dulikowski
5d5e829107 Merge 'db: view: refactor usage and building of semaphore in create and drop views plus change continuation to co routine style' from Alex Dathskovsky
db: view: refactor semaphore usage in create/drop view paths
Refactor the construction and usage of semaphore units in the create and drop view flows.
The previous semaphore handling was hard to follow (as noted while working on https://github.com/scylladb/scylladb/pull/27929), so this change restructures unit creation and movement to follow a clearer and symmetric pattern across shards.

The semaphore usage model is now documented with a detailed in-code comment to make the intended behavior and invariants explicit.

As part of the refactor, the control flow is modernized by replacing continuation-based logic with coroutine-style code, improving readability and maintainability.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-250

backport: not required, this is a refactor

Closes scylladb/scylladb#28093

* github.com:scylladb/scylladb:
  db: view: extend try/catch scope in handle_create_view_local The try/catch region is extended to cover step functions and inner helpers, which may throw or abort during view creation. This change is safe because we are just swolowing more parts that may throw due to semaphore abortion or any other abortion request, and doesnt change the logic
  db: view: refine create/drop coroutine signatures Refactor the create/drop coroutine interfaces to accept parameters as const references, enabling a clearer workflow and safer data flow.
  db: view: switch from continuations to coroutines Refactor the flow and style of create and drop view to use coroutines instead of continuations. This simplifies the logic, improves readability, and makes the code easier to maintain and extend. This commit also utilizes the get_view_builder_units function that was added in the previous commit. this commit also introduces a new alisasing for optional unit type for simpler and more readable functions that use this type
  db: view: introduce helper to acquire or reuse semaphore units Introduce a small helper that acquires semaphore units when needed or reuses units provided by the caller. This centralizes semaphore handling, simplifies the current logic, and enables refactoring the view create/drop path to a coroutine-based implementation instead of continuation-style code.
  db: view: add detailed comments on semaphore bookkeeping and serialized create/drop on shard 0
2026-01-26 17:16:01 +01:00
Botond Dénes
756837c5b4 partition_snapshot_reader: remove "flat" from name
The "flat" migration is long done, this distinction is no longer
meaningful.
2026-01-26 16:52:46 +02:00
Botond Dénes
9d1933492a mv partition_snapshot_reader.hh -> replica/
The partition snapshot lives in mutation/, however mutation/ is a lower
level concept than a mutation reader. The next best place for this
reader is the replica/ directory, where the memtable, its main user,
also lives.

Also move the code to the replica namespace.

test/boost/mvcc_test.cc includes this header but doesn't use anything
from it. Instead of updating the include path, just drop the unused
include.
2026-01-26 16:52:08 +02:00
Alex
954d18903e db: view: extend try/catch scope in handle_create_view_local
The try/catch region is extended to cover step functions and inner helpers,
which may throw or abort during view creation.
This change is safe because we are just swolowing more parts that may throw due to semaphore abortion
or any other abortion request, and doesnt change the logic
2026-01-26 13:10:37 +02:00
Alex
2c3ab8490c db: view: refine create/drop coroutine signatures
Refactor the create/drop coroutine interfaces to accept parameters as
const references, enabling a clearer workflow and safer data flow.
2026-01-26 13:10:34 +02:00
Alex
114f88cb9b db: view: switch from continuations to coroutines
Refactor the flow and style of create and drop view to use coroutines instead of continuations.
This simplifies the logic, improves readability, and makes the code
easier to maintain and extend. This commit also utilizes the get_view_builder_units function that was added in the previous commit.
this commit also introduces a new alisasing for optional unit type for simpler and more readable functions that use this type
2026-01-26 13:08:24 +02:00
Alex
87c1c6f40f db: view: introduce helper to acquire or reuse semaphore units
Introduce a small helper that acquires semaphore units when needed or
reuses units provided by the caller.
This centralizes semaphore handling, simplifies the current logic, and
enables refactoring the view create/drop path to a coroutine-based
implementation instead of continuation-style code.
2026-01-26 13:03:26 +02:00
Alex
1aadedc596 db: view: add detailed comments on semaphore bookkeeping and serialized create/drop on shard 0 2026-01-25 14:29:09 +02:00
Piotr Dulikowski
fe9237fdc9 Merge 'alternator: don't require rf_rack flag for indexes, validate instead' from Michael Litvak
In 8df61f6d99 we changed the requirements for creating materialized
views and MV-based indexes - instead of requiring the
rf_rack_valid_keyspaces flag to be set, we now require the keyspace to
be RF-rack-valid at the time of creation, and it is enforced to remain
RF-rack-valid while the MV exists. This validation is done in the cql
create view/index statements.

The same should be done also for alternator - when creating a table with
GSI or LSI, or when adding a GSI to an existing table, previously we
required the flag rf_rack_valid_keyspaces to be set. Now we change it to
instead check if the keyspace is RF-rack-valid, and if not the operation
fails with an appropriate error.

Fixes https://github.com/scylladb/scylladb/issues/28214

backport to 2025.4 to add RF-rack-valid enforcements in alternator

Closes scylladb/scylladb#28154

* github.com:scylladb/scylladb:
  locator: document the exception type of assert_rf_rack_valid_keyspace
  alternator: don't require rf_rack flag for indexes, validate instead
2026-01-23 11:49:02 +01:00
Botond Dénes
c4c2f87be7 Merge 'db: fail reads and writes with local consistency level to a DC with RF=0' from null
When read or write operations are performed on a DC with RF=0 with LOCAL_QUORUM
or LOCAL_ONE consistency level, Cassandra throws `Unavailable` exception.
Scylla allowed such read operations and failed write operations with a cryptic:
"broken promise" error. This occured because the initial availability
check passed (quorum of 0 requires 0 replicas), but execution failed
later when no replicas existed to process the mutation.

This patch adds an explicit RF=0 validation for LOCAL_ONE and LOCAL_QUORUM that
throws before attempting operation execution.

The change also requires `test_query_dc_with_rf_0_does_not_crash_db` to be
upgraded. This testcase was asserting somewhat similar scenario, but wasn't
taking into account the whole matrix of combinations:
- scenarios: successful vs unsuccesful operation outcome
- local consistency levels: LOCAL_QUORUM & LOCAL_ONE
- operations: SELECT (read) & INSERT (write)

and so it's been extended to cover both the pre-existing and the current issues
and the whole matrix of combinations.

Fixes: scylladb/scylladb#27893

A minor change, no need to backport.

Closes scylladb/scylladb#27894

* github.com:scylladb/scylladb:
  db: fail reads and writes with local consistencty level to a DC with RF=0
  db: consistency_level: split `local_quorum_for()`
  db: consistency_level: fix nrs -> nts abbreviation
2026-01-23 12:36:20 +02:00
Patryk Jędrzejczak
4e984139b2 Merge 'strongly consistent tables: basic implementation' from Petr Gusev
In this PR we add a basic implementation of the strongly-consistent tables:
* generate raft group id when a strongly-consistent table is created
* persist it into system.tables table
* start raft groups on replicas when a strongly-consistent tablet_map reaches them
* add strongly-consistent version of the storage_proxy, with the `query` and `mutate` methods
* the `mutate` method submits a command to the tablets raft group, the query method reads the data with `raft.read_barrier()`
* strongly-consistent versions of the `select_statement` and `modification_statement` are added
* a basic `test_strong_consistency.py/test_basic_write_read` is added which to check that we can write and read data in a strongly consistent fashion.

Limitations:
* for now the strongly consistent tables can have tablets only on shard zero. This is because we (ab/re) use the existing raft system tables which live only on shard0. In the next PRs we'll create separate tables for the new tablets raft groups.
* No Scylla-side proxying - the test has to figure out who is the leader and submit the command to the right node. This will be fixed separately.
* No tablet balancing -- migration/split/merges require separate complicated code.

The new behavior is hidden behind `STRONGLY_CONSISTENT_TABLES` feature, which is enabled when the `STRONGLY_CONSISTENT_TABLES` experimental feature flag is set.

Requirements, specs and general overview of the feature can be found [here](https://scylladb.atlassian.net/wiki/spaces/RND/pages/91422722/Strong+Consistency). Short term implementation plan is [here](https://docs.google.com/document/d/1afKeeHaCkKxER7IThHkaAQlh2JWpbqhFLIQ3CzmiXhI/edit?tab=t.0#heading=h.thkorgfek290)

One can check the strongly consistent writes and reads locally via cqlsh:
scylla.yaml:
```
experimental_features:
  - strongly-consistent-tables
```

cqlsh:
```
CREATE KEYSPACE IF NOT EXISTS my_ks WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 1} AND tablets = {'initial': 1} AND consistency = 'local';
CREATE TABLE my_ks.test (pk int PRIMARY KEY, c int);
INSERT INTO my_ks.test (pk, c) VALUES (10, 20);
SELECT * FROM my_ks.test WHERE pk = 10;
```

Fixes SCYLLADB-34
Fixes SCYLLADB-32
Fixes SCYLLADB-31
Fixes SCYLLADB-33
Fixes SCYLLADB-56

backport: no need

Closes scylladb/scylladb#27614

* https://github.com/scylladb/scylladb:
  test_encryption: capture stderr
  test/cluster: add test_strong_consistency.py
  raft_group_registry: disable metrics for non-0 groups
  strong consistency: implement select_statement::do_execute()
  cql: add select_statement.cc
  strong consistency: implement coordinator::query()
  cql: add modification_statement
  cql: add statement_helpers
  strong consistency: implement coordinator::mutate()
  raft.hh: make server::wait_for_leader() public
  strong_consistency: add coordinator
  modification_statement: make get_timeout public
  strong_consistency: add groups_manager
  strong_consistency: add state_machine and raft_command
  table: add get_max_timestamp_for_tablet
  tablets: generate raft group_id-s for new table
  tablet_replication_strategy: add consistency field
  tablets: add raft_group_id
  modification_statement: remove virtual where it's not needed
  modification_statement: inline prepare_statement()
  system_keyspace: disable tablet_balancing for strongly_consistent_tables
  cql: rename strongly_consistent statements to broadcast statements
2026-01-23 09:52:33 +01:00
Michael Litvak
d5009882c6 locator: document the exception type of assert_rf_rack_valid_keyspace
The function assert_rf_rack_valid_keyspace uses the exception type
std::invalid_argument when the RF-rack validation fails. Document it and
change all callers to catch this specific exception type when checking
for RF-rack validation failures, so that other exception types can be
propagated properly.
2026-01-22 16:11:35 +01:00
Pavel Emelyanov
cb6ee05391 Merge 'Extend snapshot manifest.json with tablet-aware metadata' from Benny Halevy
This series extends the json manifest file we create when taking snapshots.
It adds the following metadata:
- manifesr version and scope
- snapshot name
- created_at and expires_at timestamps (#24061)
- node metadata (host_id, dc, rack)
- keyspace and table metadat
- tablet_count (#26352)
- per-sstable metadata (#26352)

Fixes [SCYLLADB-189](https://scylladb.atlassian.net/browse/SCYLLADB-189)
Fixes [SCYLLADB-195](https://scylladb.atlassian.net/browse/SCYLLADB-195)
Fixes [SCYLLADB-196](https://scylladb.atlassian.net/browse/SCYLLADB-196)

* Enhancement, no backport needed

[SCYLLADB-189]: https://scylladb.atlassian.net/browse/SCYLLADB-189?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[SCYLLADB-195]: https://scylladb.atlassian.net/browse/SCYLLADB-195?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ
[SCYLLADB-196]: https://scylladb.atlassian.net/browse/SCYLLADB-196?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#27945

* github.com:scylladb/scylladb:
  snapshot: keep per-sstable metadata in manifest.json
  snapshot: add table info and tablet_count to manifest.json
  snapshot: add basic support for snapshot ttl in manifest.json
  table: snapshot_on_all_shards: take snapshot_options
  db: snapshot_ctl: move skip_flush to struct snapshot_options
  snapshot: add snapshot name in manifest.json
  test: lib: cql_test_env: apply db::config::tablets_mode_for_new_keyspaces
  snapshot: add node info to manifest.json
  snapshot: add manifest info to manifest.json
  test: database_test: snapshot_works: add validate_manifest
2026-01-22 15:19:11 +03:00
Patryk Jędrzejczak
67045b5f17 Merge 'raft_topology, tablets: Drain tablets in parallel with other topology operations' from Tomasz Grabiec
Allows other topology operations to execute while tablets are being
drained on decommission. In particular, bootstrap on scale-out. This
is important for elasticity.

Allows multiple decommission/removenode to happen in parallel, which
is important for efficiency.

Flow of decommission/removenode request:
  1) pending and paused, has tablet replicas on target node.
     Tablet scheduler will start draining tablets.
  2) No tablets on target node, request is pending but not paused
  3) Request is scheduled, node is in transition
  4) Request is done

Nodes are considered draining as soon as there is a leave or remove
request on them. If there are tablet replicas present on the target
node, the request is in a paused state and will not be picked by
topology coordinator. The paused state is computed from topology state
automatically on reload.

When request is not paused, its execution starts in
write_both_read_old state. The old tablet_draining state is not
entered (it's deprecated now).

Tablet load balancing will yield the state machine as soon as some
request is no longer paused and ready to be scheduled, based on
standard preemption mechanics.

Fixes #21452

Closes scylladb/scylladb#24129

* https://github.com/scylladb/scylladb:
  docs: Document parallel decommission and removenode and relevant task API
  test: Add tests for parallel decommission/removenode
  test: util: Introduce ensure_group0_leader_on()
  test: tablets: Check that there are no migrations scheduled on draining nodes
  test: lib: topology_builder: Introduce add_draining_request()
  topology_coordinator, tablets: Fail draining operations when tablet migration fails due to critical disk utilization
  tablets: topology_coordinator: Refactor to propagate reason for migration rollback
  tablet_allocator: Skip co-location on draining nodes
  node_ops: task_manager_module: Populate entity field also for active requests
  tasks: node_ops: Put node id in the entity field
  tasks, node_ops: Unify setting of task_stats in get_status() and get_stats()
  topology: Protect against empty cancelation reason
  tasks, topology: Make pending node operations abortable
  doc: topology-over-raft.md: Fix diagram for replacing, tablet_draining is not engaged
  raft_topology, tablets: Drain tablets in parallel with other topology operations
  virtual_tables: Show draining and excluded fields in system.cluster_status and system.load_by_node
  locator: topology: Add "draining" flag to a node
  topology_coordinator: Extract generate_cancel_request_update()
  storage_service: Drop dependency in topology_state_machine.hh in the header
  locator: Extract common code in assert_rf_rack_valid_keyspace()
  topology_coordinator, storage_service: Validate node removal/decommission at request submission time
2026-01-22 13:06:53 +01:00
Piotr Smaron
d4c28690e1 db: fail reads and writes with local consistencty level to a DC with RF=0
When read or write operations are performed on a DC with RF=0 with LOCAL_QUORUM
or LOCAL_ONE consistency level, Cassandra throws `Unavailable` exception.
Scylla allowed such read operations and failed write operations with a cryptic:
"broken promise" error. This occured because the initial availability
check passed (quorum of 0 requires 0 replicas), but execution failed
later when no replicas existed to process the mutation.

This patch adds an explicit RF=0 validation for LOCAL_ONE and LOCAL_QUORUM that
throws before attempting operation execution.

The change also requires `test_query_dc_with_rf_0_does_not_crash_db` to be
upgraded. This testcase was asserting somewhat similar scenario, but wasn't
taking into account the whole matrix of combinations:
- scenarios: successful vs unsuccesful operation outcome
- local consistency levels: LOCAL_QUORUM & LOCAL_ONE
- operations: SELECT (read) & INSERT (write)

and so it's been extended to cover both the pre-existing and the current issues
and the whole matrix of combinations.

Fixes: scylladb/scylladb#27893
2026-01-22 12:49:45 +01:00
Piotr Smaron
9475659ae8 db: consistency_level: split local_quorum_for()
The core of `local_quorum_for()` has been extracted to
`get_replication_factor_for_dc()`, which is going to be used later,
while `local_quorum_for()` itself has been recreated using the exracted
part.
2026-01-22 12:49:23 +01:00
Piotr Smaron
0b3ee197b6 db: consistency_level: fix nrs -> nts abbreviation
`network_topology_strategy` was abbreviated with `nrs`, and not `nts`. I
think someone incorrectly assumed it's 'network Replication strategy', hence
nrs.
2026-01-22 12:48:37 +01:00
Botond Dénes
7d2e6c0170 Merge 'config: add enforce_rack_list option' from Aleksandra Martyniuk
Add enforce_rack_list option. When the option is set to true,
all tablet keyspaces have rack list replication factor.

When the option is on:
- CREATE STATEMENT always auto-extends rf to rack lists;
- ALTER STATEMENT fails when there is numeric rf in any DC.

The flag is set to false by default and a node needs to be restarted
in order to change its value. Starting a node with enforce_rack_list
option will fail, if there are any tablet keyspaces with numeric rf
in any DC.

enforce_rack_list is a per-node option and a user needs to ensure
that no tablet keyspace is altered or created while nodes in
the cluster don't have the consistent value.

Mark rf_rack_valid_keyspaces as deprecated.

Fixes: https://github.com/scylladb/scylladb/issues/26399.

New feature; no backport needed

Closes scylladb/scylladb#28084

* github.com:scylladb/scylladb:
  test: add test for enforce_rack_list option
  db: mark rf_rack_valid_keyspaces as deprecated
  config: add enforce_rack_list option
  Revert "alternator: require rf_rack_valid_keyspaces when creating index"
2026-01-22 10:27:35 +02:00
Benny Halevy
91df129e21 snapshot: add basic support for snapshot ttl in manifest.json
Store the snapshot `created_at` time and an optional
`expires_at` time.

Fixes SCYLLADB-189

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Benny Halevy
49a3e0914d db: snapshot_ctl: move skip_flush to struct snapshot_options
So we can easily extend it and add more options.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-01-22 09:12:56 +02:00
Botond Dénes
4281d18c2e Merge 'schema: Apply sstable_compression_user_table_options to CQL aux and Alternator tables' from Nikos Dragazis
In PR 5b6570be52 we introduced the config option `sstable_compression_user_table_options` to allow adjusting the default compression settings for user tables. However, the new option was hooked into the CQL layer and applied only to CQL base tables, not to the whole spectrum of user tables: CQL auxiliary tables (materialized views, secondary indexes, CDC log tables), Alternator base tables, Alternator auxiliary tables (GSIs, LSIs, Streams).

This gap also led to inconsistent default compression algorithms after we changed the option’s default algorithm from LZ4 to LZ4WithDicts (adf9c426c2).

This series introduces a general “schema initializer” mechanism in `schema_builder` and uses it to apply the default compression settings uniformly across all user tables. This ensures that all base and aux tables take their default compression settings from config.

Fixes #26914.

Backport justification: LZ4WithDicts is the new default since 2025.4, but the config option exists since 2025.2. Based on severity, I suggest we backport only to 2025.4 to maintain consistency of the defaults.

Closes scylladb/scylladb#27204

* github.com:scylladb/scylladb:
  db/config: Update sstable_compression_user_table_options description
  schema: Add initializer for compression defaults
  schema: Generalize static configurators into schema initializers
  schema: Initialize static properties eagerly
  db: config: Add accessor for sstable_compression_user_table_options
  test: Check that CQL and Alternator tables respect compression config
2026-01-22 06:50:48 +02:00
Petr Gusev
998ee5b7fb system_keyspace: disable tablet_balancing for strongly_consistent_tables
We don't yet support tablet balancing for strongly consistent tables.
2026-01-21 14:56:00 +01:00
Botond Dénes
a53f989d2f db/row_cache: make_nonpopulating_reader(): pass cache tracker to snapshot
The API contract in partition_version.hh states that when dealing with
evictable entries, a real cache tracker pointer has to be passed to all
methods that ask for it. The nonpopulating reader violates this, passing
a nullptr to the snapshot. This was observed to cause a crash when a
concurrent cache read accessed the snapshot with the null tracker.

A reproducer is included which fails before and passes after the fix.

Fixes: #26847

Closes scylladb/scylladb#28163
2026-01-20 12:34:37 +01:00
Aleksandra Martyniuk
7dc371f312 db: mark rf_rack_valid_keyspaces as deprecated
Mark rf_rack_valid_keyspaces option as deprecated. User should
use enforce_rack_list option instead.

The option can still be used and it does not change it's behavior.

Docs is updated accordingly.
2026-01-20 09:58:57 +01:00
Aleksandra Martyniuk
761ace4f05 config: add enforce_rack_list option
Add enforce_rack_list option. When the option is set to true,
all tablet keyspaces have rack list replication factor.

When the option is on:
- CREATE STATEMENT always auto-extends rf to rack lists;
- ALTER STATEMENT fails when there is numeric rf in any DC.

The flag is set to false by default and a node needs to be restarted
in order to change its value. Starting a node with enforce_rack_list
option will fail, if there are any tablet keyspaces with numeric rf
in any DC.

enforce_rack_list is a per-node option and a user needs to ensure
that no tablet keyspace is altered or created while nodes in
the cluster don't have the consistent value.
2026-01-20 09:58:51 +01:00
Avi Kivity
36347c3ce9 Merge 'db/system_keyspace: remove namespace v3' from Botond Dénes
Cassandra changed their system tables in 3.0. We migrated to the new system table layout in 2017, in ScyllaDB 2.0.
System tables introduced in Cassandra 3.0, as well as the 3.0 variant of pre-existing system tables were added to the db::system_table::v3 namespace.
We ended up adding some new ScyllaDB-only system tables to this namespace as well.

As the dust settled, most of the v3 system tables ended up being either simple aliases to non-v3 tables, or new tables.
Either way, the codebase uses just one variant of each table for a long time now the v3:: distinction is pointless.

Remove the v3 namespace and unify the table listing under the top-level db::system_keyspace scope.

Code cleanup, no backport

Closes scylladb/scylladb#28146

* github.com:scylladb/scylladb:
  db/system_keyspace: move remining tables out of v3 keyspace
  db/system_keyspace: relocate truncated() and commitlog_cleanups()
  db/system_keyspace: drop v3::local()
  db/system_keyspace: remove duplicate table names from v3
2026-01-19 20:54:38 +02:00
Botond Dénes
e01041d3ee db/system_keyspace: move remining tables out of v3 keyspace
The last remining tables in the v3 keyspace are those that are genuinely
distinct -- added by Cassandra 3.0 or >= ScyllaDB 2.0.
Move these out of the v3 keyspace too, with this the v3 keyspace is
defunct and removed.
2026-01-19 12:32:21 +02:00
Botond Dénes
ce57ef94bd db/system_keyspace: relocate truncated() and commitlog_cleanups()
The name variables of these tables is outside the v3 namespace but the
method defining their schema is in the v3 namespace. Relocate the
methods out from the v3 namespace, to the scope where the name variables
live.
The methods are moved to the private: part of system_keyspace, as they
don't have external users currently.
2026-01-19 12:32:21 +02:00
Botond Dénes
2ccb8ff666 db/system_keyspace: drop v3::local()
It is unused, the non-v3 variant is used instead.
2026-01-19 12:32:21 +02:00
Botond Dénes
b52a3f3a43 db/system_keyspace: remove duplicate table names from v3
Those table names that are effectively just an alias of the their
counterpart outside of the v3 namespace (struct).

scylla_local() is made public. Currently it is private, but it has
external users, working around the private designation by using the
public v3::scylla_local() alias. This change just makes the existing
status clear.
2026-01-19 12:32:21 +02:00
Ferenc Szili
3e0362ec67 virtual_table: add effective_capacity to load_per_node
This change adds effective_capacity to the virtual table load_per_node.
This value can be useful for debugging size based load balancing.
2026-01-18 16:52:13 +01:00
Tomasz Grabiec
e38ee160fc virtual_tables: Show draining and excluded fields in system.cluster_status and system.load_by_node
It gives a more accurate picture of what happens in the cluster.
2026-01-18 15:36:04 +01:00
Botond Dénes
1e09a34686 replica: add abort polling to memtable and cache readers
Continuing the read once it is aborted (e.g. due to timeout) is a waste
of resources, as the produced results will be discarded.
Poll the permit's abort exception in the memtable and cache reader's
fill_buffer(). This results in one poll per buffer filled (8KB of data).

We already have similar poll for sstable readers, as disk reads are
usually much heavier and therefore it is more important to stop them
ASAP after abort. Cache and memtable reads are usually quick but not
always, hence it is important to also have polling in the cache and
memtable readers.

Refs: #11469
Fixes: #28148

Closes scylladb/scylladb#28149
2026-01-16 18:03:04 +01:00
Avi Kivity
bd08b6e5b2 Merge 'Unify configuration of object storage endpoints (take 2)' from Pavel Emelyanov
To configure S3 storage, one needs to do

```
object_storage_endpoints:
  - name: s3.us-east-1.amazonaws.com
    port: 443
    https: true
    aws_region: us-east-1
```

and for GCS it's

```
object_storage_endpoints:
  - name: https://storage.googleapis.com:433
    type: gs
    credentials_file: <gcp account credentials json file>
```

This PR updates the S3 part to look like

```
object_storage_endpoints:
  - name: https://s3.us-east-1.amazonaws.com:443
    aws_region: us-east-1
```

fixes: #26570

This is 2nd attempt, previous one (#27360) was reverted because it reported endpoint configs in new format via API and CQL always, even if the endpoint was configured in the old way. This "broke" scylla manager and some dtests. This version has this bug fixed, and endpoints are reported in the same format as they were configured with.

About correctness of the changes.

No modifications to existing tests are made here, so old format is respected correctly (as far as it's covered by tests). To prove the new format works the the test_get_object_store_endpoints is extended to validate both options. Some preparations to this test to make this happen come on their own with the PR #28111  to show that they are valid and pass before changing the core code.

Enhancing the way configuration is made, likely no need to backport.

Closes scylladb/scylladb#28112

* github.com:scylladb/scylladb:
  test: Validate S3 endpoints new format works
  docs: Update docs according to new endpoints config option format
  object_storage: Create s3 client with "extended" endpoint name
  s3/storage: Tune config updating
  sstable: Shuffle args for s3_client_wrapper
  test: Rename badconf variable into objconf
  test: Split the object_store/test_get_object_store_endpoints test
2026-01-14 18:29:03 +02:00
Gleb Natapov
bee5f63cb6 topology coordinator: complete pending operation for a replaced node
A replaced node may have pending operation on it. The replace operation
will move the node into the 'left' state and the request will never be
completed. More over the code does not expect left node to have a
request. It will try to process the request and will crash because the
node for the request will not be found.

The patch checks is the replaced node has peening request and completes
it with failure. It also changes topology loading code to skip requests
for nodes that are in a left state. This is not strictly needed, but
makes the code more robust.

Fixes #27990

Closes scylladb/scylladb#28009
2026-01-14 13:11:27 +01:00
Nikos Dragazis
7fa1f87355 db/config: Update sstable_compression_user_table_options description
Clarify what "user table" means.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 20:45:59 +02:00
Nikos Dragazis
d5ec66bc0c schema: Generalize static configurators into schema initializers
Extend the `static_configurator` mechanism to support initialization of
arbitrary schema properties, not only static ones, by passing a
`schema_builder` reference to the configurator interface.

As part of this change, rename `static_configurator` to
`schema_initializer` to better reflect its broader responsibility.

Add a checkpoint/restore mechanism to allow de-registering an
initializer (useful for testing; will be used in the next patch).

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 20:45:59 +02:00
Nikos Dragazis
76b2d0f961 db: config: Add accessor for sstable_compression_user_table_options
The `sstable_compression_user_table_options` config option determines
the default compression settings for user tables.

In patch 2fc812a1b9, the default value of this option was changed from
LZ4 to LZ4WithDicts and a fallback logic was introduced during startup
to temporarily revert the option to LZ4 until the dictionary compression
feature is enabled.

Replace this fallback logic with an accessor that returns the correct
settings depending on the feature flag. This is cleaner and more
consistent with the way we handle the `sstable_format` option, where the
same problem appears (see `get_preferred_sstable_version()`).

As a consequence, the configuration option must always be accessed
through this accessor. Add a comment to point this out.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-01-13 18:30:38 +02:00
Avi Kivity
c6dfae5661 treewide: #include Seastar headers with angle brackets
Seastar is an external library from the point of view of
ScyllaDB, so should be included with angle brackets.

Closes scylladb/scylladb#27947
2026-01-13 14:56:15 +02:00