Commit Graph

1105 Commits

Author SHA1 Message Date
Tomasz Grabiec
eef798d84f Merge 'Distribute data evenly among primary replicas during restore' from Robert Bindar
Most likely 817fdad uncovered the fact that our choice of primary replica was resonating with tablet allocation and we were ending up picking the same replica as primary within a scope instead of rotating primaryship among all replicas in the scope.
This created situations where for instance, restoring into a 9 nodes with primary_replica_only=true would put all data into 3 nodes, leaving the other 6 unused. The balancing of the dataset was performed by the subsequent repair step.

This PR fixes this by changing the formula for picking up the primary replica out of a set of eligible replicas from within the passed scope.
The PR also extends the testing scenarios in `test_backup.py` so we get to run restore for a set of topologies, for all combinations of scope, primary_replica_only and min_tablet_counts.
Most of the work was done by @bhalevy [here](https://github.com/scylladb/scylladb/compare/master...bhalevy:scylla:load-balance-primary-replica), this PR just splitted it and did touchups here and there.

Fixes #27281

Closes scylladb/scylladb#27397

* github.com:scylladb/scylladb:
  test: reduce dataset and number of test cases or debug builds
  test: bump repair timeout up, it's sometimes not enough in CI
  test: refactor test_refresh.py to match test_restore_with_streaming_scopes.
  test: extend test_restore_with_streaming_scopes
  test: Adjust test_restore_primary_replica_different_dc_scope_all
  test: Refactor restoring code in test_backup to match SM pattern
  test: add check_mutation_replicas calls after fresh creation of dataset
  test: extend create_dataset to accept consistency_level
  test: refactor check_mutation_replicas so it's more readable
  test: make create_dataset async and refactor so it's configurable
  test: use defaultdict in collect_mutations
  test: add log marks to facilitate reusing server for restore
  locator: tablets: Distribute data evenly among primary replicas during restore
2026-01-14 18:57:55 +01:00
Robert Bindar
d88036db48 locator: tablets: Distribute data evenly among primary replicas during restore
Most likely 817fdad uncovered the fact that our choice of
primary replica was resonating with tablet allocation and we were ending up
picking the same replica as primary within a scope instead of rotating
primaryship among all replicas in the scope.
This created situations where for instance, restoring into a 9 nodes cluster
with primary_replica_only=true would put all data into 3 nodes, leaving
the other 6 unused. The balancing of the dataset was performed by the
subsequent repair step.

split from bhalevy/load-balance-primary-replica

Fixes #27281

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-01-13 11:44:20 +02:00
Botond Dénes
04b8f72946 Merge 'repair: Implement auto repair for tablet repair' from Asias He
repair: Implement auto repair for tablet repair

This patch implements the basic auto repair support for tablet repair.

It was decided to add no per table configuration for the initial
implementation, so two scylla yaml config options are introduced to set
the default auto repair configs for all the tablet tables.

- auto_repair_enabled_default

Set true to enable auto repair for tablet tables by default. The value
will be overridden by the per keyspace or per table configuration which
is not implemented yet.

- auto_repair_threshold_default_in_seconds

Set the default time in seconds for the auto repair threshold for tablet
tables. If the time since last repair is bigger than the configured
time, the tablet is eligible for auto repair. The value will be
overridden by the per keyspace or per table configuration which is not
implemented yet.

The following metrcis are added:

- auto_repair_needs_repair_nr

The number of tablets with auto repair enabled that needs repair

- auto_repair_enabled_nr

The number of tablets with auto repair enabled

The metrics are useful to tell if auto repair is falling behind.

In the future, more auto repair scheduling will be added, e.g.,
scheduling based on the repaired and unrepaired sstable set size,
tombstone ratio and so on, in addition to the time based scheduling.

Fixes SCYLLADB-99

New feature. No backport.

Closes scylladb/scylladb#27534

* github.com:scylladb/scylladb:
  topology_coordinator: Add metrics for tablet repair
  repair: Implement auto repair for tablet repair
2026-01-12 14:16:01 +02:00
Asias He
7ba7b25bdd repair: Implement auto repair for tablet repair
This patch implements the basic auto repair support for tablet repair.

It was decided to add no per table configuration for the initial
implementation, so two scylla yaml config options are introduced to set
the default auto repair configs for all the tablet tables.

- auto_repair_enabled_default

Set true to enable auto repair for tablet tables by default. The value
will be overridden by the per keyspace or per table configuration which
is not implemented yet.

- auto_repair_threshold_default_in_seconds

Set the default time in seconds for the auto repair threshold for tablet
tables. If the time since last repair is bigger than the configured
time, the tablet is eligible for auto repair. The value will be
overridden by the per keyspace or per table configuration which is not
implemented yet.

The following metrcis are added:

- auto_repair_needs_repair_nr

The number of tablets with auto repair enabled that needs repair

- auto_repair_enabled_nr

The number of tablets with auto repair enabled

The metrics are useful to tell if auto repair is falling behind.

In the future, more auto repair scheduling will be added, e.g.,
scheduling based on the repaired and unrepaired sstable set size,
tombstone ratio and so on, in addition to the time based scheduling.

Fixes SCYLLADB-99
2026-01-09 16:11:39 +08:00
Botond Dénes
60570d7114 Merge 'topology coordinator: restrict node join/remove to preserve RF-rack validity' from Michael Litvak
Allow creating materialized views and secondary indexes in a tablets keyspace only if it's RF-rack-valid, and enforce RF-rack-validity while the keyspace has views by restricting some operations:
* Altering a keyspace's RF if it would make the keyspace RF-rack-invalid
* Adding a node in a new rack
* Removing / Decommissioning the last node in a rack

Previously the config option `rf_rack_valid_keyspaces` was required for creating views. We now remove this restriction - it's not needed because we always maintain RF-rack-validity for keyspaces with views.

The restrictions are relevant only for keyspaces with numerical RF. Keyspace with rack-list-based RF are always RF-rack-valid.

Fixes scylladb/scylladb#23345
Fixes https://github.com/scylladb/scylladb/issues/26820

backport to relevant versions for materialized views with tablets since it depends on rf-rack validity

Closes scylladb/scylladb#26354

* github.com:scylladb/scylladb:
  docs: update RF-rack restrictions
  cql3: don't apply RF-rack restrictions on vector indexes
  cql3: add warning when creating mv/index with tablets about rf-rack
  service/tablet_allocator: always allow tablet merge of tables with views
  locator: extend rf-rack validation for rack lists
  test: test rf-rack validity when creating keyspace during node ops
  locator: fix rf-rack validation during node join/remove
  test: test topology restrictions for views with tablets
  test: add test_topology_ops_with_rf_rack_valid
  topology coordinator: restrict node join/remove to preserve RF-rack validity
  topology coordinator: add validation to node remove
  locator: extend rf-rack validation functions
  view: change validate_view_keyspace to allow MVs if RF=Racks
  db: enforce rf-rack-validity for keyspaces with views
  replica/db: add enforce_rf_rack_validity_for_keyspace helper
  db: remove enforce parameter from check_rf_rack_validity
  test: adjust test to not break rf-rack validity
2026-01-09 10:01:23 +02:00
Łukasz Paszkowski
62313a6264 load_sketch: Allow populating load_sketch with normalized current load
Currently, tablet allocation intentionally ignores current load (
introduced by the commit #1e407ab) which could cause identical shard
selection when allocating a small number of tablets in the same topology.
When a tablet allocator is asked to allocate N tablets (where N is smaller
than the number of shards on a node), it selects the first N lowest shards.
If multiple such tables are created, each allocator run picks the same
shards, leading to tablet imbalance across shards.

This change initializes the load sketch with the current shard load,
scaled into the [0,1] range, ensuring allocation still remains even
while starting from globally least-loaded shards.

Fixes https://github.com/scylladb/scylladb/issues/27620

Closes scylladb/scylladb#27802
2026-01-07 11:49:01 +01:00
Ferenc Szili
621cb19045 load_sketch: use tablet sizes in load computation
This commit changes load_sketch so that it computes node and shard load
based on tablet sizes instead of tablet count.
2025-12-27 10:37:23 +01:00
Ferenc Szili
1c9ec9a76d load_stats: add get_tablet_size_in_transition()
This patch adds a method to load_stats which searches for the tablet
size during tablet transition. In case of tablet migration, the tablet
will be searched on the leaving replica, and during rebuild we will
return the average tablet size of the pending replicas.
2025-12-27 10:37:23 +01:00
Michael Litvak
07d85af433 locator: extend rf-rack validation for rack lists
Extend the RF-rack validation in `assert_rf_rack_valid_keyspace` to
validate rack-list-based replication as well. Previously, validation was
done only for numeric replication.

If the replication is based on a rack list, we validate that all racks
that are required for replication are present in the topology rack map.
If some rack is needed for replication but is missing, or it doesn't
have normal token owner nodes, the validation fails with an error.
2025-12-22 09:14:30 +01:00
Michael Litvak
a738905a4b locator: fix rf-rack validation during node join/remove
If a keyspace is created while a node is joining or being removed, it could
break the rf-rack invariant. For example:
1. We have 3 nodes in 3 racks, no keyspaces
2. A new node starts to join in a new rack - passes validation because
   there are no keyspaces
3. Create a keyspace with rf=3 - passes validation because the joining
   node is not a normal token owner yet
4. The new node becomes a normal token owner
5. The rf-rack invariant is broken. We have rf=3 and 4 racks

To fix this, we change the rf-rack check to consider a node as a token
owner if it's either a normal token owner or it has bootstrap tokens and
is about to become a normal token owner.

Now the condition can't be broken. Consider keyspace creation at
different stages of adding a node in our example:
* Before the node is assigned bootstrap tokens: the node is not
  considered. We can create a keyspace with rf=3 as if the node doesn't
  exist, and then node join will fail in the group0 operation that
  assigns bootstrap tokens, because during this operation we check
  rf-rack validity.
* Assigning bootstrap tokens is a single group0 operation that is
  serialized with keyspace creation. During this operation we check that
  adding the node as a token owner will maintain rf-rack validity for all
  keyspaces.
* After the node is assigned bootstrap tokens and until it becomes a
  normal token owner: it is considered as a transitioning token owner by
  the rf-rack check and the rack is considered a transitioning rack. We
  can't count the rack as a normal rack because the node join may still
  fail and rollback. Trying to create a keyspace with either rf=3 or
  rf=4 will fail because we can end up with either 3 or 4 racks.

Similarly, when removing a node, we validate that removing the node will
maintain rf-rack validity in the same group0 operation that changes the
node state to removing/decommissioning, after which the node becomes a
leaving endpoint, and it's not considered a normal token owner anymore
for the rf-rack check.
2025-12-22 09:14:30 +01:00
Michael Litvak
9e1f78d162 locator: extend rf-rack validation functions
Extend the locator function assert_rf_rack_valid_keyspace to accept
arbitrary topology dc-rack maps and nodes instead of using the current
token metadata.

This allows us to add a new variant of the function that checks rf-rack
validity given a topology change that we want to apply. we will use it
to check that rf-rack validity will be maintained before applying the
topology change.

The possible topology changes for the check are node add and node remove
/ decommission. These operations can change the number of normal racks -
if a new node is added to a new rack, or the last node is removed from a
rack.
2025-12-22 09:14:29 +01:00
Michael Litvak
de1bb84fca db: enforce rf-rack-validity for keyspaces with views
Extend the RF-rack-validity enforcement to keyspaces that have views,
regardless of the option `rf_rack_valid_keyspaces`.

Previously, RF-rack-validity was enforced when `rf_rack_valid_keyspaces`
was set for all keyspaces. Now we want to allow creating MVs in tablet
keyspaces that are RF-rack-valid and enforce the RF-rack-validity even
if the config option is not set.
2025-12-22 09:13:49 +01:00
Dawid Mędrek
1e14c08eee locator/token_metadata: Remove get_host_id()
The function is declared, but it's not defined or used anywhere.

Closes scylladb/scylladb#27374
2025-12-15 10:36:52 +01:00
Benny Halevy
c8cff94a5a api: storage_service/tablets/repair: disable incremental repair by default
Change the default incremental_mode to `disabled` due to
https://github.com/scylladb/scylladb/issues/26041 and
https://github.com/scylladb/scylladb/issues/27414

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-11 14:25:21 +02:00
Tomasz Grabiec
d6c14de380 Merge 'locator/node: include _excluded in missing places' from Patryk Jędrzejczak
We currently ignore the `_excluded` field in `node::clone()` and the verbose
formatter of `locator::node`. The first one is a bug that can have
unpredictable consequences on the system. The second one can be a minor
inconvenience during debugging.

We fix both places in this PR.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-72

This PR is a bugfix that should be backported to all supported branches.

Closes scylladb/scylladb#27265

* github.com:scylladb/scylladb:
  locator/node: include _excluded in verbose formatter
  locator/node: preserve _excluded in clone()
2025-11-26 18:29:59 +01:00
Patryk Jędrzejczak
287c9eea65 locator/node: include _excluded in verbose formatter
It can be helpful during debugging.
2025-11-26 13:26:17 +01:00
Patryk Jędrzejczak
4160ae94c1 locator/node: preserve _excluded in clone()
We currently ignore the `_excluded` field in `clone()`. Losing
information about exclusion can have unpredictable consequences. One
observed effect (that led to finding this issue) is that the
`/storage_service/nodes/excluded` API endpoint sometimes misses excluded
nodes.
2025-11-26 13:26:11 +01:00
Patryk Jędrzejczak
cc273e867d Merge 'fix notification about expiring erm held for to long' from Gleb Natapov
Commit 6e4803a750 broke notification about expired erms held for too long since it resets the tracker without calling its destructor (where notification is triggered). Fix the assign operator to call the destructor like it should.

Fixes https://github.com/scylladb/scylladb/issues/27141

Closes scylladb/scylladb#27140

* https://github.com/scylladb/scylladb:
  test: test that expired erm that held for too long triggers notification
  token_metadata: fix notification about expiring erm held for to long
2025-11-26 12:59:00 +01:00
Gleb Natapov
9f97c376f1 token_metadata: fix notification about expiring erm held for to long
Commit 6e4803a750 broke notification about expired erms held for too
long since it resets the tracker without calling its destructor (where
notification is triggered). Fix assign operator to call destructor.
2025-11-25 13:35:24 +02:00
Radosław Cybulski
d589e68642 Add precompiled headers to CMakeLists.txt
Add precompiled header support to CMakeLists.txt and configure.py -
it improves compilation time by approximately 10%.

New header `stdafx.hh` is added, don't include it manually -
the compiler will include it for you. The header contains includes from
external libraries used by Scylla - seastar, standard library,
linux headers and zlib.

The feature is enabled by default, use CMake option `Scylla_USE_PRECOMPILED_HEADER`
or configure.py --disable-precompiled-header to disable.

The feature should be disabled, when trying to check headers - otherwise
you might get false negatives on missing includes from seastar / abseil and so on.

Note: following configuration needs to be added to ccache.conf:

    sloppiness = pch_defines,time_macros,include_file_mtime,include_file_ctime

Closes scylladb/scylladb#26617
2025-11-21 12:27:41 +02:00
Asias He
d51b1fea94 tablets: Allow tablet merge when repair tasks exist
Currently we do not allow tablet merge if either of the tablets contain
a tablet repair request. This could block the tablet merge for a very
long time if the repair requests could not be scheduled and executed.

We can actually merge the repair tasks in most of the cases. This is
because most of the time all tablets are requested to be repaired by a
single API request, so they share the same task_id, request_type and
other parameters. We can merge the repair task info and executes the
repair after the merge.  If they do not share the task info, we could
not merge and have to wait for the repair before merge, which is both
rare and ok.

Another case is that one of the tablet has a repair task info (t1) while
the other tablet (t2) does not have, it is possible the t2 has finished
repair by the same repair request or t2 is not requested to be repaired
at all. We allow merge in this case too to avoid blocking the tablet
merge, with the price of reparing a bit more.

Fixes #26844

Closes scylladb/scylladb#26922
2025-11-20 16:01:23 +01:00
Pavel Emelyanov
f47f2db710 Merge 'Support local primary-replica-only for native restore' from Robert Bindar
This PR extends the restore API so that it accepts primary_replica_only as parameter and it combines the concepts of primary-replica-only with scoped streaming so that with:
- `scope=all primary_replica_only=true` The restoring node will stream to the global primary replica only
- `scope=dc primary_replica_only=true` The restoring node will stream to the local primary replica only.
- `scope=rack primary_replica_only=true` The restoring node will stream only to the primary replica from within its own rack (with rf=#racks, the restoring node will stream only to itself)
- `scope=node primary_replica_only=true` is not allowed, the restoring node will always stream only to itself so the primary_replica_only parameter wouldn't make sense.

The PR also adjusts the `nodetool refresh` restriction on running restore with both primary_replica_only and scope, it adds primary_replica_only to `nodetool restore` and it adds cluster tests for primary replica within scope.

Fixes #26584

Closes scylladb/scylladb#26609

* github.com:scylladb/scylladb:
  Add cluster tests for checking scoped primary_replica_only streaming
  Improve choice distribution for primary replica
  Refactor cluster/object_store/test_backup
  nodetool restore: add primary-replica-only option
  nodetool refresh: Enable scope={all,dc,rack} with primary_replica_only
  Enable scoped primary replica only streaming
  Support primary_replica_only for native restore API
2025-11-13 12:11:18 +03:00
Tomasz Grabiec
10b893dc27 Merge 'load_stats: fix bug in migrate_tablet_size()' from Ferenc Szili
`topology_cooridinator::migrate_tablet_size()` was introduced in 10f07fb95a. It has a bug where the has_tablet_size() lambda always returns false because of bad comparison of iterators after a table and tablet search:

```
if (auto table_i = tables.find(gid.table); table_i != tables.find(gid.table)) {
    if (auto size_i = table_i->second.find(trange); size_i != table_i->second.find(trange)) {
```

This change also fixes a problem where the `migrate_tablet_size()` would crash with a `std::out_of_range` if the pending node was not present in load_stats.

This change fixes these two problems and moves the functionality into a separate method of `load_stats`. It also adds tests for the new method.

A version containing this bug has not been released yet, so no backport is needed.

Closes scylladb/scylladb#26946

* github.com:scylladb/scylladb:
  load_stats: add test for migrate_tablet_size()
  load_stats: fix problem with tablet size migration
2025-11-12 23:48:37 +01:00
Ferenc Szili
b77ea1b8e1 load_stats: fix problem with tablet size migration
This patch fixes a bug with tablet size migration in load_stats.
has_tablet_size() lambda in topology_coordinator::migrate_tablet_size()
was returning false in all cases due to incorrect search iterator
comparison after a table and tablet saeach.

This change moves load_stats migrate_tablet_sizes() functionaility
into a separate method of load_stats.
2025-11-11 14:26:09 +01:00
Benny Halevy
a290505239 utils: stall_free: add dispose_gently
dispose_gently consumes the object moved to it,
clearing it gently before it's destroyed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#26356
2025-11-11 12:20:18 +02:00
Robert Bindar
817fdadd49 Improve choice distribution for primary replica
I noticed during tests that `maybe_get_primary_replica`
would not distribute uniformly the choice of primary replica
because `info.replicas` on some shards would have an order whilst
on others it'd be ordered differently, thus making the function choose
a node as primary replica multiple times when it clearly could've
chosen a different nodes.

This patch sorts the replica set before passing it through the
scope filter.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2025-11-11 09:18:01 +02:00
Avi Kivity
d458dd41c6 Merge 'Avoid input_/output_stream-s default initialization and move-assignment' from Pavel Emelyanov
Recent seastar update deprecated in/out streams usage pattern when a stream is default constructed early and them move-assigned with the proper one (see scylladb/seastar#3051). This PR fixes few places in Scylla that still use one.

Adopting newer seastar API, no need to backport

Closes scylladb/scylladb#26747

* github.com:scylladb/scylladb:
  commitlog: Remove unused work::r stream variable
  ec2_snitch: Fix indentation after previous patch
  ec2_snitch: Coroutinize the aws_api_call_once()
  sstable: Construct output_stream for data instantly
  test: Don't reuse on-stack input stream
2025-10-31 21:22:41 +02:00
Tomasz Grabiec
1c0d847281 Merge 'load_balancer: load_stats reconcile after tablet migration and table resize' from Ferenc Szili
This change adds the ability to move tablets sizes in load_stats after a tablet migration or table resize (split/merge). This is needed because the size based load balancer needs to have tablet size data which is as accurate as possible, in order to work on fresh tablet size distribution and issue correct tablet migrations.

This is the second part of the size based load balancing changes:

- First part for tablet size collection via load_stats: #26035
- Second part reconcile load_stats: #26152
- The third part for load_sketch changes: #26153
- The fourth part which performs tablet load balancing based on tablet size: #26254

This is a new feature and backport is not needed.

Closes scylladb/scylladb#26152

* github.com:scylladb/scylladb:
  load_balancer: load_stats reconcile after tablet migration and table resize
  load_stats: change data structure which contains tablet sizes
2025-10-31 09:58:25 +01:00
Tomasz Grabiec
28f6bdc99b cql3: ks_prop_defs: Expand numeric RF to rack list
Auto-exands numeric RF in CREATE/ALTER KEYSPACE statements for
new DCs specified in the statement.

Doesn't auto-expand existing options, as the rack choice may not be in
line with current replica placement. This requires co-locating tablet
replicas, and tracking of co-location state, which is not implemented yet.

Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>
2025-10-29 23:32:59 +01:00
Tomasz Grabiec
35166809cb locator: Move rack_list to topology.hh
So that we can use it in locator/tablets.hh and avoid circular dependency
between that header and abstract_replication_strategy.hh
2025-10-29 23:32:58 +01:00
Pavel Emelyanov
92462e502f ec2_snitch: Fix indentation after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-10-28 19:31:08 +03:00
Pavel Emelyanov
7640ade04d ec2_snitch: Coroutinize the aws_api_call_once()
The method connects a socket, grabs in/out streams from it then writes
HTTP request and reads+parses the response. For that it uses class
variables for socket and streams, but there's no real need for that --
all three actually exists throughput the method "lifetime".

To fix it, coroutinizes the method. The same could be achieved my moving
the connected socket and streams into do_with() context, but coroutine
is better than that.

(indentation is left broken)

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-10-28 19:29:25 +03:00
Ferenc Szili
10f07fb95a load_balancer: load_stats reconcile after tablet migration and table resize
This change adds the ability to move tablets sizes in load_stats after a
tablet migration or table resize (split/merge). This is needed because
the size based load balancer needs to have tablet size data which is as
accurate as possible, in order to issue migrations which improve
load balance.
2025-10-28 12:12:09 +01:00
Aleksandra Martyniuk
910cd0918b locator: use get_primary_replica for get_primary_endpoints
Currently, tablet_sstable_streamer::get_primary_endpoints is out of
sync with tablet_map::get_primary_replica. The get_primary_replica
optimizes the choice of the replica so that the work is fairly
distributes among nodes. Meanwhile, get_primary_endpoints always
chooses the first replica.

Use get_primary_replica for get_primary_endpoints.

Fixes: https://github.com/scylladb/scylladb/issues/21883.

Closes scylladb/scylladb#26385
2025-10-28 09:56:08 +02:00
Patryk Jędrzejczak
e1c3f666c9 Merge 'vnode cleanup: add missing barriers and fix race conditions' from Petr Gusev
Problems addressed by this PR

* Missing barrier before cleanup: If a node was bootstrapped before cleanup, some request coordinators could still be in `write_both_read_new` and send stale requests to replicas being cleaned up.
* Sessions not drained before cleanup: We lacked protection against stale streaming or repair operations.
* `sstable_vnodes_cleanup_fiber()` calling `flush_all_tables()` under group0 lock: This caused SCT test failures (see [this comment](https://github.com/scylladb/scylladb/issues/25333#issuecomment-3298859046) for details).
* Issues with `storage_proxy::start_write()` used by `sstable_vnodes_cleanup_fiber`:
  * The result of `start_write()` was not held during `abstract_write_response_handler::apply_locally`, so coordinator-local writes were not properly awaited.
  * Synchronization was racy — `start_write()` was not atomic with the fence check, allowing stale writes to sneak in if `fence_version` changed in between.
  * It waited for all writes, including local tables and tablet-based tables, which is redundant because `sstable_vnodes_cleanup_fiber` does not apply to them.
  * It also waited for writes with versions greater than the current `fence_version`, which is unnecessary.

Fixes scylladb/scylladb#26150

backport: this PR fixes several issues with the vnodes cleanup procedure, but it doesn't seem they are critical enough to deserve backporting

Closes scylladb/scylladb#26315

* https://github.com/scylladb/scylladb:
  test_automatic_cleanup: add test_cleanup_waits_for_stale_writes
  test_fencing: fix due to new version increment
  test_automatic_cleanup: clean it up
  storage_proxy: wait for closing sessions in sstable cleanup fiber
  storage_proxy: rename await_pending_writes -> await_stale_pending_writes
  storage_proxy: use run_fenceable_write
  storage_proxy: abstract_write_response_handler: apply_locally: extract post fence check
  storage_proxy: introduce run_fenceable_write
  storage_proxy: move update_fence_version from shared_token_metadata
  storage_proxy: fix start_write() operation scope in apply_locally
  storage_proxy: move post fence check into handle_write
  storage_proxy: move fencing into mutate_counter_on_leader_and_replicate
  storage_proxy::handle_read: add fence check before get_schema
  storage_service: rebrand cleanup_fiber to vnodes_cleanup_fiber
  sstable_cleanup_fiber: use coroutine::parallel_for_each
  storage_service: sstable_cleanup_fiber: move flush_all_tables out of the group0 lock
  topology_coordinator: barrier before cleanup
  topology_coordinator: small start_cleanup refactoring
  global_token_metadata_barrier: add fenced flag
2025-10-27 12:35:13 +01:00
Ferenc Szili
b4ca12b39a load_stats: change data structure which contains tablet sizes
This patch changes the tablet size map in load_stats. Previously, this
data structure was:

std::unordered_map<range_based_tablet_id, uint64_t> tablet_sizes;

and is changed into:

std::unordered_map<table_id, std::unordered_map<dht::token_range, uint64_t>> tablet_sizes;

This allows for improved performance of tablet tablet size reconciliation.
2025-10-24 14:37:00 +02:00
Petr Gusev
c5f447224a storage_proxy: move update_fence_version from shared_token_metadata
Future commits will extend update_fence_version, and it is simpler to do
so if the function resides in storage_proxy. Additionally, fence_version
is the only field this function accesses, and it is used solely within
storage_proxy, making this change natural on its own.
2025-10-22 16:31:43 +02:00
Petr Gusev
b23f2a2425 tablet_metadata_guard: fix split/merge handling
The guard should stop refreshing the ERM when the number of tablets
changes. Tablet splits or merges invalidate the tablet_id field
(_tablet), which means the guard can no longer correctly protect
ongoing operations from tablet migrations.

Fixes scylladb/scylladb#26437
2025-10-22 11:32:37 +02:00
Petr Gusev
ec6fba35aa tablet_metadata_guard: add debug logs 2025-10-22 11:32:37 +02:00
Tomasz Grabiec
c4a87453a2 Merge 'Add experimental feature flag for strongly consistent tables and extend kesypace creation syntax to allow specifying consistency mode.' from Gleb Natapov
The series adds an experimental flag for strongly consistent tables  and extends "CREATE KEYSPACE" ddl with `consistency` option that allows specifying the consistency mode for the keyspace.

Closes scylladb/scylladb#26116

* github.com:scylladb/scylladb:
  schema: Allow configuring consistency setting for a keyspace
  db: experimental consistent-tablets option
2025-10-16 21:48:06 +02:00
Gleb Natapov
c255740989 schema: Allow configuring consistency setting for a keyspace
We want to add strongly consistent tables as an option. We will have
two kind of strongly consistent tables: globally consistent and locally
consistent. The former means that requests from all DCs will be globally
linearisable while the later - only requests to the same DCs will be
linearisable.  To allow configuring all the possibilities the patch
adds new parameter to a keyspace definition "consistency" that can be
configured to be `eventual`, `global` or `local`. Non eventual setting
is supported for tablets enabled keyspaces only. Since we want to start
with implementing local consistency configuring global consistency will
result in an error for now.
2025-10-16 13:34:49 +03:00
Marcin Maliszkiewicz
d67632bfe2 replica: schema_applier: obtain copy of token_metadata at the beginning of schema merge
This copy is now used during the whole duration of schema merge.
If it changes due to tablet_hint then it's replicated to all shards as before.
2025-10-14 10:56:36 +02:00
Marcin Maliszkiewicz
46bff28a38 db: schema_applier: move pending_token_metadata to locator
It never belonged to tables and views and its placement stems
from location of _tablet_hint handling code.

In the follwing commits we'll reference it in storage_service.cc.
2025-10-14 10:56:26 +02:00
Marcin Maliszkiewicz
c112916215 db: refactor new_token_metadata into pending_token_metadata
It prepares pending_token_metadata to handle both new and copy
of existing metadata for consistent usage in later commit.

It also adds shared_token_metatada getter so that we don't
need to get it from db.
2025-10-14 10:56:26 +02:00
Asias He
13dd88b010 repair: Rename incremental mode name
Using the name regular as the incremental mode could be confusing, since
regular might be interpreted as the non-incremental repair. It is better
to use incremental directly.

Before:

- regular (standard incremental repair)
- full (full incremental repair)
- disabled (incremental repair disabled)

After:

- incremental (standard incremental repair)
- full (full incremental repair)
- disabled (incremental repair disabled)

Fixes #26503

Closes scylladb/scylladb#26504
2025-10-10 15:21:54 +03:00
Piotr Dulikowski
380f243986 Merge ' Support replication factor rack list for tablet-based keyspaces' from Tomasz Grabiec
This change extends the CQL replication options syntax so the replication factor can be stated as a list of rack names.
For example: { 'mydatacenter': [ 'myrack1', 'myrack2', 'myrack4' ] }

Rack-list based RF can coexist with the old numerical RF, even in the same keyspace for different DCs.

Specifying the rack list also allows to add replicas on the specified racks (increasing the replication factor), or decommissioning certain racks from their replicas (by omitting them from the current datacenter rack-list). This will allow us to keep the keyspace rf-rack-valid, maintaining guarantees, while allowing adding/removing racks. In particular, this will allow us to add a new DC, which happens by incrementally increasing RF in that DC to cover existing racks.

Migration from numerical RF to rack-list is not supported yet. Migration from rack-list to numerical RF is not planned to be supported.

New feature, no backport required.

Co-authored with @bhalevy

Fixes https://github.com/scylladb/scylladb/issues/25269
Fixes https://github.com/scylladb/scylladb/issues/23525

Closes scylladb/scylladb#26358

* github.com:scylladb/scylladb:
  tablets: load_balancer: Recognize that tablets are confined to racks when computing desired tablet count
  locator: Make hasher for endpoint_dc_rack globally accessible
  test: tablets: Add test for replica allocation on rack list changes
  test: lib: topology_builder: generate unique rack names
  test: Add tests for rack list RF
  doc: Document rack-list replication factor
  topology_coordinator: Restore formatting
  topology_coordinator: Cancel keyspace alter on broader set of errors
  topology_coordinator: Make keyspace alter process options through as_ks_metadata_update()
  cql3: ks_prop_defs: Preserve old options
  cql3: ks_prop_defs: Introduce flattened()
  locator: Recognize rack list RF as valid in assert_rf_rack_valid_keyspace()
  tablet_allocator: Respect binding replicas to racks
  locator: network_topology_strategy: Respect rack list when reallocating tablets
  cql3: ks_prop_defs: Fail with more information when options are not in expected format
  locator, cql3: Support rack lists in replication options
  cql3: Fail early on vnode/tablet flavor alter
  cql3: Extract convert_property_map() out of Cql.g
  schema: Use definition from the header instead of open-coding it
  locator: Abstract obtaining the number of replicas from replication_strategy_config_option
  cql3, locator: Use type aliases for option maps
  locator: Add debug logging
  locator: Pass topology to replication strategy constructor
  abstract_replication_strategy, network_topology_strategy: add replication_factor_data class
2025-10-06 14:14:09 +02:00
Ferenc Szili
20aeed1607 load balancing: extend locator::load_stats to collect tablet sizes
This commit extend the TABLE_LOAD_STATS RPC with data about the tablet
replica sizes and effective disk capacity.
Effective disk capacity of a node is computed as a sum of the sizes of
all tablet replicas on a node and available disk space.

This is the first change in the size based load balancing series.

Closes scylladb/scylladb#26035
2025-10-03 13:37:22 +02:00
Tomasz Grabiec
6962464be7 locator: Make hasher for endpoint_dc_rack globally accessible 2025-10-02 19:45:00 +02:00
Tomasz Grabiec
6b7b0cb628 locator: Recognize rack list RF as valid in assert_rf_rack_valid_keyspace() 2025-10-02 19:42:39 +02:00
Tomasz Grabiec
6de342ed3e locator: network_topology_strategy: Respect rack list when reallocating tablets 2025-10-02 19:42:39 +02:00