Commit Graph

8620 Commits

Dawid Mędrek
10589e966f test/cluster/mv: Adjust test to RF-rack-valid keyspaces
We adjust the tests in the directory so that all of the used
keyspaces are RF-rack-valid throughout their execution.

Refs scylladb/scylladb#23428

Closes scylladb/scylladb#23490
2025-04-11 14:03:21 +02:00
Patryk Jędrzejczak
07a7a75b98 Merge 'raft: implement the limited voters feature' from Emil Maskovsky
Currently, if raft is enabled, all nodes are voters in group0. However, it is not necessary for all nodes to be voters - it only slows down raft group operation (since the quorum is large) and makes deployments with asymmetrical DCs problematic (2 DCs with 5 nodes alongside 1 DC with 10 nodes will lose the majority if the large DC is isolated).

The topology coordinator will now maintain a state where there is only a limited number of voters, evenly distributed across the DCs and racks.

After each node addition or removal the voters are recalculated and rebalanced if necessary. That means:
* When a new node is added, it might become a voter depending on the current distribution of voters - either if there are still some voter "slots" available, or if the new node is a better candidate than some existing voter (in which case the existing node voter status might be revoked).
* When a voter node is removed or stopped (shut down), its voter status is revoked and another node might become a voter instead (this can also depend on other circumstances, like e.g. changing the number of DCs).
* If a node addition or removal causes a change in number of data centers (DCs) or racks, the rebalance action might become wider (as there are some special rules applying to 1 vs 2 vs more DCs, also changing the number of racks might cause similar effects in the voters distribution)

Special conditions for various number of DCs:
* 1 DC: Can have up to the maximum allowed number of voters (5 - see below)
* 2 DCs: The distribution of the voters will be asymmetric (if possible), meaning that we can tolerate the loss of the DC with the smaller number of voters (if both had the same number of voters, we'd lose the majority if either DC were lost). For example, if we have 2 DCs with 2 nodes each, one of them will only have 1 voter (despite the limit of 5). Also, if one of the 2 DCs has more racks than the other and the node count allows it, the DC with more racks will have more voters.
* 3 and more DCs: The voters will be distributed so that every DC has strictly less than half of the total voters (so the loss of any single DC cannot lead to the loss of the majority). Again, DCs with more racks are preferred in the voter distribution.

At the moment we handle zero-token nodes in the same way as regular nodes (i.e. zero-token nodes do not take any priority in the voter distribution). Technically it doesn't make much sense to have a zero-token node that is not a voter (when there are regular nodes in the same DC being voters), but currently the intended purpose of zero-token nodes is to form an "arbiter DC" (in the case of 2 DCs, creating a third DC with zero-token nodes only), so for that intended purpose no special handling is needed and it will work out of the box. If a preference for zero-token nodes is eventually needed/requested, it will be added separately from this PR.

The maximum of 5 voters has been chosen as the smallest "safe" value. We can lose the majority when multiple nodes (possibly in different DCs and racks) die independently in a short time span. With fewer than 5 voters, we would lose the majority if 2 voters died, which is very unlikely but not entirely impossible. With 5 voters, at least 3 voters must die to lose the majority, which can be safely considered impossible in the case of independent failures.

Currently the limit will not be configurable (we might introduce configurable limits later if that would be needed/requested).
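
To make the rules above concrete, the following is a minimal, self-contained sketch of the per-DC voter-count calculation (illustrative Python only - not the topology-coordinator code; the function name and the round-robin strategy are assumptions made for this example):

```python
# Illustrative sketch of the per-DC voter-count rules described above.
# Not ScyllaDB code; names and structure are invented for this example.

MAX_VOTERS = 5

def voters_per_dc(nodes_per_dc: dict[str, int]) -> dict[str, int]:
    """Return how many voters each DC gets, following the rules above."""
    dcs = sorted(nodes_per_dc, key=nodes_per_dc.get, reverse=True)
    total = min(MAX_VOTERS, sum(nodes_per_dc.values()))
    if len(dcs) == 2:
        # Asymmetric split: the smaller side must stay strictly below the
        # larger one so that losing it cannot cost us the majority.
        if total % 2 == 0:
            total -= 1
    counts = {dc: 0 for dc in dcs}
    # Round-robin voters over the DCs, capped by each DC's node count; for
    # 3+ DCs also cap at "strictly less than half of the total voters".
    while sum(counts.values()) < total:
        progressed = False
        for dc in dcs:
            if sum(counts.values()) == total:
                break
            cap = nodes_per_dc[dc]
            if len(dcs) >= 3:
                cap = min(cap, (total - 1) // 2)
            if counts[dc] < cap:
                counts[dc] += 1
                progressed = True
        if not progressed:
            break  # no DC can take more voters
    return counts

# Example from the description: 2 DCs with 2 nodes each -> a 2/1 split.
print(voters_per_dc({"dc1": 2, "dc2": 2}))            # {'dc1': 2, 'dc2': 1}
print(voters_per_dc({"dc1": 5, "dc2": 5, "dc3": 10})) # no DC holds half or more
```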

Tests added:
* boost/group0_voter_registry_test.cc: run time on CI: ~3.5s
* topology_custom/test_raft_voters.py: parametrized with 1 or 3 nodes per DC; run time on CI: ~20s (1 node per DC) and ~40s (3 nodes per DC), approx. 1 min total

Fixes: scylladb/scylladb#18793

No backport: This is a new feature that will not be backported.

Closes scylladb/scylladb#21969

* https://github.com/scylladb/scylladb:
  raft: distribute voters by rack inside DC
  raft/test: fix lint warnings in `test_raft_no_quorum`
  raft/test: add the upgrade test for limited voters feature
  raft topology: handle on_up/on_down to add/remove node from voters
  raft: fix the indentation after the limited voters changes
  raft: implement the limited voters feature
  raft: drop the voter removal from the decommission
  raft/test: disable the `stop_before_becoming_raft_voter` test
  raft/test: stop the server less gracefully in the voters test
2025-04-10 15:29:15 +02:00
Avi Kivity
9559e53f55 Merge 'Adjust tablet-mon.py for capacity-aware load balancing' from Tomasz Grabiec
After the load balancer was made capacity-aware, it no longer equalizes the tablet count per shard, but rather the utilization of each shard's storage. This makes the old presentation mode not useful for assessing whether balance was reached, since nodes with less capacity will get fewer tablets in the balanced state. This PR adds a new default presentation mode which scales a tablet's size by its storage utilization so that tablets with equal shard utilization take equal space on the graph.

To facilitate that, a new virtual table was added: system.load_per_node, which allows the tool to learn about the load balancer's view of per-node capacity. It can also serve as a debugging interface for viewing the current balance according to the load balancer.
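
As a rough illustration of the new presentation mode, here is a tiny sketch of the scaling idea (illustrative Python only - not the actual tablet-mon.py code; the parameter names are invented):

```python
# Sketch of the idea behind the new presentation mode: the on-screen size of
# a tablet is scaled by the utilization it contributes to its shard, so that
# equal shard utilization draws as equal area. Names are illustrative only.

def display_height(tablet_size_bytes: float, shard_capacity_bytes: float,
                   pixels_per_unit_utilization: float = 1000.0) -> float:
    """Height of a tablet's box, proportional to its share of shard capacity."""
    utilization = tablet_size_bytes / shard_capacity_bytes
    return utilization * pixels_per_unit_utilization

# Two tablets of different absolute size, on shards of different capacity,
# contribute the same utilization and therefore take equal space on the graph.
print(display_height(2e9, 200e9))   # 10.0
print(display_height(1e9, 100e9))   # 10.0
```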

Closes scylladb/scylladb#23584

* github.com:scylladb/scylladb:
  tablet-mon.py: Add presentation mode which scales tablet size by its storage utilization
  tablet-mon.py: Center tablet id text properly in the vertical axis
  tablet-mon.py: Show migration stage tag in table mode only when migrating
  virtual-tables: Introduce system.load_per_node
  virtual_tables: memtable_filling_virtual_table: Propagate permit to execute()
  docs: virtual-tables: Fix instructions
  service: tablets: Keep load_stats inside tablet_allocator
2025-04-10 14:59:08 +03:00
Pavel Emelyanov
88318d3b50 topology_coordinator: Use shorter fault-injection overloads
There are a few places that want to pause until a message is received from
the test. There's a convenient one-line sugar to do it.

One test needs to update its expectations about the log message that appears
when Scylla steps on it and actually starts waiting.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23390
2025-04-10 14:05:46 +03:00
Botond Dénes
d67202972a mutation/frozen_mutation: frozen_mutation_consumer_adaptor: fix end-of-partition handling
This adaptor adapts a mutation reader pausable consumer to the frozen
mutation visitor interface. The pausable consumer protocol allows the
consumer to skip the remaining parts of the partition and resume the
consumption with the next one. To do this, the consumer just has to
return stop_iteration::yes from one of the consume() overloads for
clustering elements, then return stop_iteration::no from
consume_end_of_partition(). Due to a bug in the adaptor, this sequence
leads to terminating the consumption completely -- so any remaining
partitions are also skipped.

This protocol implementation bug has user-visible effects when the
only user of the adaptor -- read repair -- happens during a query which
limits the amount of content in each partition.
There are two such queries: select distinct ... and select ... with a
per partition limit. When converting the repaired mutation to a query
result, these queries will trigger the skip sequence in the consumer and,
due to the bug described above, will skip the remaining partitions,
omitting them from the final query result.

This patch fixes the protocol bug: the return value of the underlying
consumer's consume_end_of_partition() is now respected.

A unit test is also added which reproduces the problem both with select
distinct ... and select ... per partition limit.
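
The protocol and the fix can be modeled with a small, self-contained sketch (pure-Python illustration; this is not the C++ adaptor, and the names are invented):

```python
# Toy model of the pausable-consumer protocol described above. The consumer
# returns True ("stop_iteration::yes") from a clustering-element consume() to
# skip the rest of the current partition; consume_end_of_partition() then
# decides whether the whole stream stops. The bug treated the per-partition
# skip as "stop the whole stream"; the fix respects consume_end_of_partition().

class OneRowPerPartitionConsumer:
    def consume_row(self, row) -> bool:
        # Stop after the first row of each partition (e.g. "per partition limit 1").
        return True           # skip the rest of this partition
    def consume_end_of_partition(self) -> bool:
        return False          # ...but keep consuming further partitions

def consume_partitions(partitions, consumer):
    consumed = []
    for pkey, rows in partitions:
        for row in rows:
            consumed.append((pkey, row))
            if consumer.consume_row(row):
                break                      # skip remaining rows of this partition
        if consumer.consume_end_of_partition():
            break                          # fixed: stop the stream only if asked to
    return consumed

parts = [("pk1", ["r1", "r2"]), ("pk2", ["r1"]), ("pk3", ["r1", "r2"])]
print(consume_partitions(parts, OneRowPerPartitionConsumer()))
# With the fix: one row from each of pk1, pk2 and pk3.
# With the bug (stream terminated after the first skip): only pk1 would appear.
```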

Follow-up work:
* frozen_mutation_consumer_adaptor::on_end_of_partition() calls the
  underlying consumer's on_end_of_stream(), so when consuming multiple
  frozen mutations, the underlying's on_end_of_stream() is called for
  each partition. This is incorrect but benign.
* Improve documentation of mutation_reader::consume_pausable().

Fixes: #20084

Closes scylladb/scylladb#23657
2025-04-10 13:19:57 +03:00
Dawid Mędrek
0ed21d9cc1 test/cluster/test_tablets.py: Fix erroneous test indentation
Some of the statements in the test are not indented properly
and, as a result, are never run. It's most likely a small mistake,
so let's fix it.

Closes scylladb/scylladb#23659
2025-04-10 11:06:01 +03:00
Nadav Har'El
258213f73b Merge 'Alternator batch count histograms' from Amnon Heiman
This series adds a histogram for get and write batch sizes.
It uses the estimated_histogram implementation, which starts from 1 with a 1.2 exponential factor; this gives
very tight resolution up to 20 while still covering all the way to 100.

Histograms will be reported per node.

**Backport to 2025.1 so we'll have information about user batch size limitation**

Closes scylladb/scylladb#23379

* github.com:scylladb/scylladb:
  alternator: Add tests for the batch items histograms
  alternator: Add histogram for batch item count
2025-04-09 22:41:14 +03:00
Tomasz Grabiec
b5211cca85 Merge 'tablets: rebuild: use repair for tablet rebuild' from Aleksandra Martyniuk
Currently, when we rebuild a tablet, we stream data from all
replicas. This creates a lot of redundancy, wastes bandwidth
and CPU resources.

In this series, we split the streaming stage of tablet rebuild into
two phases: first we stream the tablet's data from only one replica
and then repair the tablet.
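
A high-level sketch of the resulting flow (illustrative Python pseudocode; none of these function names are ScyllaDB APIs):

```python
# Sketch of the two-phase rebuild described above: stream a full copy from a
# single source replica, then repair the tablet so the new replica converges
# with the replicas that were not used as the streaming source.

def stream_tablet_data(source, target, tablet):
    print(f"streaming {tablet} from {source} to {target}")

def repair_tablet(tablet, replicas):
    print(f"repairing {tablet} across {replicas}")

def rebuild_tablet_v2(tablet, new_replica, existing_replicas):
    source = existing_replicas[0]                    # phase 1: a single source
    stream_tablet_data(source, new_replica, tablet)
    repair_tablet(tablet, existing_replicas + [new_replica])  # phase 2

rebuild_tablet_v2("tablet-42", "node4", ["node1", "node2", "node3"])
```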

Fixes: https://github.com/scylladb/scylladb/issues/17174.

Needs backport to 2025.1 to prevent running out of space during streaming

Closes scylladb/scylladb#23187

* github.com:scylladb/scylladb:
  test: add test for rebuild with repair
  locator: service: move to rebuild_v2 transition if cluster is upgraded
  locator: service: add transition to rebuild_repair stage for rebuild_v2
  locator: service: add rebuild_repair tablet transition stage
  locator: add maybe_get_primary_replica
  locator: service: add rebuild_v2 tablet transition kind
  gms: add REPAIR_BASED_TABLET_REBUILD cluster feature
2025-04-09 21:35:37 +02:00
Avi Kivity
ed3e4f33fd Merge 'generic_server: throttle and shed incoming connections according to semaphore limit' from Marcin Maliszkiewicz
Adds new live updatable config: uninitialized_connections_semaphore_cpu_concurrency.

It should help reduce CPU usage by limiting CPU concurrency for new connections. As a last resort, when those connections have been waiting for initial processing for too long (over 1 minute), they are shed.

New connections_shed and connections_blocked metrics are added for tracking.
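
The throttle-and-shed behavior can be modeled with a minimal asyncio sketch (illustrative Python only - this is not the Seastar implementation, and the names and values below are stand-ins):

```python
# Minimal model of the behavior described above: new connections must acquire
# a semaphore slot before initial processing; a connection that waits too long
# for a slot is shed. Counters stand in for the new metrics.

import asyncio

CONCURRENCY = 8     # stand-in for uninitialized_connections_semaphore_cpu_concurrency
SHED_AFTER = 60.0   # seconds a pending connection may wait before being shed

new_connection_slots = asyncio.Semaphore(CONCURRENCY)
connections_blocked = 0
connections_shed = 0

async def handle_new_connection(conn_id: int) -> None:
    global connections_blocked, connections_shed
    if new_connection_slots.locked():
        connections_blocked += 1            # metric: had to wait for a slot
    try:
        await asyncio.wait_for(new_connection_slots.acquire(), timeout=SHED_AFTER)
    except asyncio.TimeoutError:
        connections_shed += 1               # metric: waited too long, shed it
        return
    try:
        await asyncio.sleep(0.01)           # stand-in for the initial handshake
    finally:
        new_connection_slots.release()

async def main() -> None:
    await asyncio.gather(*(handle_new_connection(i) for i in range(100)))
    print("blocked:", connections_blocked, "shed:", connections_shed)

asyncio.run(main())
```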

Testing:
 - manually, via a simple program creating a high number of connections and constantly re-connecting
 - added benchmark

Following are benchmark results:

Before:
```
> build/release/test/perf/perf_generic_server --smp=1
170101.41 tps ( 13.1 allocs/op,   0.0 logallocs/op,   7.0 tasks/op,    4695 insns/op,    3178 cycles/op,        0 errors)
[...]
throughput: mean=173850.06 standard-deviation=1844.48 median=174509.66 median-absolute-deviation=874.23 maximum=175087.49 minimum=170588.54
instructions_per_op: mean=4725.59 standard-deviation=13.35 median=4729.38 median-absolute-deviation=12.49 maximum=4738.61 minimum=4709.96
  cpu_cycles_per_op: mean=3135.08 standard-deviation=32.13 median=3122.68 median-absolute-deviation=22.29 maximum=3179.38 minimum=3103.15
```

After:
```
> build/release/test/perf/perf_generic_server --smp=1
167373.19 tps ( 13.1 allocs/op,   0.0 logallocs/op,   7.0 tasks/op,    4821 insns/op,    3371 cycles/op,        0 errors)
[...]
throughput:
  mean=   171199.55 standard-deviation=2484.58
  median= 171667.06 median-absolute-deviation=2087.63
  maximum=173689.11 minimum=167904.76
instructions_per_op:
  mean=   4801.90 standard-deviation=16.54
  median= 4796.78 median-absolute-deviation=9.32
  maximum=4830.71 minimum=4789.81
cpu_cycles_per_op:
  mean=   3245.26 standard-deviation=32.28
  median= 3230.44 median-absolute-deviation=16.52
  maximum=3297.39 minimum=3215.62
```

The patch adds around 67 insns/op, so its effect on performance should be negligible.

Fixes: https://github.com/scylladb/scylladb/issues/22844

Closes scylladb/scylladb#22828

* github.com:scylladb/scylladb:
  transport: move on_connection_close into connection destructor
  test: perf: make aggregated_perf_results formatting more human readable
  transport: add blocked and shed connection metrics
  generic_server: throttle and shed incoming connections according to semaphore limit
  generic_server: add data source and sink wrappers bookkeeping network IO
  generic_server: coroutinize part of server::do_accepts
  test: add benchmark for generic_server
  test: perf: add option to count multiple ops per time_parallel iteration
  generic_server: add semaphore for limiting new connections concurrency
  generic_server: add config to the constructor
  generic_server: add on_connection_ready handler
2025-04-09 21:41:38 +03:00
Tomasz Grabiec
0b9a75d7b6 virtual-tables: Introduce system.load_per_node
Can be used to query per-node stats about load as seen by the load
balancer.

In particular, a node's capacity will be used by tablet-mon.py to
scale tablet columns so that equal height means equal node utilization.
2025-04-09 20:21:51 +02:00
Tomasz Grabiec
668094dc58 virtual_tables: memtable_filling_virtual_table: Propagate permit to execute()
So that population can access the read's timeout and mark the permit as awaiting.
2025-04-09 20:21:51 +02:00
Marcin Maliszkiewicz
619944555f test: perf: make aggregated_perf_results formatting more human readable
Before:
throughput: mean=170728.58 standard-deviation=1921.76 median=171084.16 median-absolute-deviation=1501.58 maximum=172913.36 minimum=167288.97
instructions_per_op: mean=4685.89 standard-deviation=12.46 median=4683.92 median-absolute-deviation=9.68 maximum=4706.53 minimum=4666.70
cpu_cycles_per_op: mean=3090.94 standard-deviation=52.69 median=3103.43 median-absolute-deviation=24.55 maximum=3192.99 minimum=3003.00

After:
throughput:
	mean=   168224.81 standard-deviation=854.48
	median= 168829.02 median-absolute-deviation=604.21
	maximum=168829.02 minimum=167620.60
instructions_per_op:
	mean=   4837.02 standard-deviation=20.89
	median= 4851.79 median-absolute-deviation=14.77
	maximum=4851.79 minimum=4822.24
cpu_cycles_per_op:
	mean=   3271.42 standard-deviation=46.29
	median= 3304.16 median-absolute-deviation=32.73
	maximum=3304.16 minimum=3238.69
2025-04-09 10:49:20 +02:00
Marcin Maliszkiewicz
719d04d501 test: add benchmark for generic_server
Changes in configure.py are needed because we don't want to embed
this benchmark in the scylla binary like perf_simple_query or perf_alternator:
it doesn't directly translate to Scylla performance, but we want to use
aggregated_perf_results for precise CPU measurements, so we need
different dependencies.
2025-04-09 10:48:42 +02:00
Marcin Maliszkiewicz
b957cedace test: perf: add option to count multiple ops per time_parallel iteration 2025-04-09 10:30:58 +02:00
Marcin Maliszkiewicz
ed82bede39 generic_server: add semaphore for limiting new connections concurrency
It will be used in the following commits.
2025-04-09 10:30:58 +02:00
Marcin Maliszkiewicz
33122d3f93 generic_server: add config to the constructor 2025-04-09 10:30:58 +02:00
Botond Dénes
b65a76ab6f Merge 'nodetool: cluster repair: add a command to repair tablet keyspaces' from Aleksandra Martyniuk
Add a new nodetool cluster super-command. Add nodetool
cluster repair command to repair tablet keyspaces.
It uses the new /storage_service/tablets/repair API.

The nodetool cluster repair command allows you to specify
the keyspace and tables to be repaired. A cluster repair of many
tables will request /storage_service/tablets/repair and wait for
the result synchronously for each table.

The nodetool repair command, which was previously used to repair
keyspaces of any type, now repairs only vnode keyspaces.
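
A rough sketch of what the synchronous per-table flow amounts to (illustrative Python using the REST endpoint mentioned above; the query-parameter names and the completion handling are assumptions made for this example, not the actual ScyllaDB REST contract):

```python
# For each table, call the tablet repair REST API and wait for the result
# before moving on to the next table. The endpoint path comes from the
# description above; the "ks"/"table" parameter names are assumed.

import requests

def cluster_repair(api_base: str, keyspace: str, tables: list[str]) -> None:
    for table in tables:
        resp = requests.post(
            f"{api_base}/storage_service/tablets/repair",
            params={"ks": keyspace, "table": table},   # assumed parameter names
            timeout=3600,
        )
        resp.raise_for_status()
        # The real command waits for this table's repair to complete before
        # starting the next one; simplified here to a single blocking call.
        print(f"repair of {keyspace}.{table} finished: {resp.text}")

# cluster_repair("http://127.0.0.1:10000", "my_ks", ["t1", "t2"])
```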

Fixes: https://github.com/scylladb/scylladb/issues/22409.

Needs backport to 2025.1 that introduces the new tablet repair API

Closes scylladb/scylladb#22905

* github.com:scylladb/scylladb:
  docs: nodetool: update repair and add tablet-repair docs
  test: nodetool: add tests for cluster repair command
  nodetool: add cluster repair command
  nodetool: repair: extract getting hosts and dcs to functions
  nodetool: repair: warn about repairing tablet keyspaces
  nodetool: repair: move keyspace_uses_tablets function
2025-04-09 08:20:34 +03:00
Botond Dénes
5f697d373f test/cqlpy/test_tools.py: use AIO backend in scylla-sstable query tests
These tests seem to be hitting the io-uring bug in the kernel from
time to time, making CI flaky. Force the use of the AIO backend in these
tests, as a workaround until fixed kernels (>=6.8.13) are available.

Fixes: #23517
Fixes: #23546

Closes scylladb/scylladb#23648
2025-04-08 20:29:58 +03:00
Tomasz Grabiec
06b49bdf69 Merge 'row_cache: don't garbage-collect tombstones which cover data in memtables' from Botond Dénes
The row cache can garbage-collect tombstones in two places:
1) When populating the cache - the underlying reader pipeline has a `compacting_reader` in it;
2) During reads - reads now compact data including garbage collection;

In both cases, garbage collection has to do overlap checks against memtables, to avoid collecting tombstones which cover data in the memtables.
This PR includes fixes for (2), which was not handled at all until now.
(1) was already supposed to be fixed, see https://github.com/scylladb/scylladb/issues/20916, but the test added in this PR showed that the fix is incomplete: https://github.com/scylladb/scylladb/issues/23291. A fix for this issue is also included.
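
The overlap check boils down to the following idea (a simplified, self-contained sketch - not the actual get_max_purgeable implementation):

```python
# Simplified model: a tombstone may only be garbage-collected if its timestamp
# is below the oldest data for the same key still sitting in a memtable;
# otherwise purging it could resurrect that data once the memtable is flushed.

def max_purgeable_timestamp(key, memtables) -> float:
    """Highest timestamp that can be safely purged for `key`."""
    timestamps = [ts for mt in memtables for (k, ts) in mt if k == key]
    return min(timestamps) if timestamps else float("inf")

def can_gc_tombstone(key, tombstone_ts, memtables, gc_grace_expired=True) -> bool:
    return gc_grace_expired and tombstone_ts < max_purgeable_timestamp(key, memtables)

memtables = [[("pk1", 50)]]                       # older write for pk1 still in a memtable
print(can_gc_tombstone("pk1", 100, memtables))    # False: purging would resurrect the ts=50 write
print(can_gc_tombstone("pk2", 100, memtables))    # True: no overlapping memtable data for pk2
```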

Fixes: https://github.com/scylladb/scylladb/issues/23291
Fixes: https://github.com/scylladb/scylladb/issues/23252

The fix will need backport to all live releases.

Closes scylladb/scylladb#23255

* github.com:scylladb/scylladb:
  test/boost/row_cache_test: add memtable overlap check tests
  replica/table: add error injection to memtable post-flush phase
  utils/error_injection: add a way to set parameters from error injection points
  test/cluster: add test_data_resurrection_in_memtable.py
  test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts
  replica/mutation_dump: don't assume cells are live
  replica/database: do_apply() add error injection point
  replica: improve memtable overlap checks for the cache
  replica/memtable: add is_merging_to_cache()
  db/row_cache: add overlap-check for cache tombstone garbage collection
  mutation/mutation_compactor: copy key passed-in to consume_new_partition()
2025-04-08 17:26:58 +02:00
Robert Bindar
4e3eb2fdac Move direct_failure_detector from root to service/
direct_failure_detector used to be used by gms/ as well,
but that's not the case anymore, so raft/ is the only user.

Fixes #23133

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#23248
2025-04-08 13:03:24 +03:00
Aleksandra Martyniuk
372b562f5e test: add test for rebuild with repair 2025-04-08 10:42:02 +02:00
Aleksandra Martyniuk
02fb71da42 test: nodetool: add tests for cluster repair command 2025-04-08 09:13:14 +02:00
Aleksandra Martyniuk
b81c81c7f4 nodetool: repair: warn about repairing tablet keyspaces
Warn about an attempt to repair a tablet keyspace with nodetool repair.

A nodetool cluster repair command to repair tablet keyspaces will
be added in the following patches.
2025-04-08 09:13:14 +02:00
Raphael S. Carvalho
0f59deffaa replica: Fix truncate and drop table after tablet migration happens
When running those operations after a tablet replica is migrated away from
a shard, an assert can fail resulting in a crash.

Status quo (around the assert in truncate procedure):

1) Highest RP seen by table is saved in low_mark, and the current time in
low_mark_at.
2) Then compaction is disabled in order to not mix data written before truncate,
and data written later.
3) Then memtable is flushed in order for the data written before truncate to be
available in sstables and then removed.
4) Now, current time is saved in truncated_at, which is supposedly the time of
truncate to decide which sstables to remove.

Note: truncated_at is likely above low_mark_at due to steps 2 and 3.

The interesting part of the assert is:
    (truncated_at <= low_mark_at ? rp <= low_mark : low_mark <= rp)

Note: RP in the assert above is the highest RP among all sstables generated
before truncated_at. RP is retrieved by table::discard_sstables().

If truncated_at > low_mark_at, maybe newer data was written during steps 2 and
3, and memtable's RP becomes greater than low_mark, resulting in a SSTable with
RP > low_mark.
So assert's 2nd condition is there to defend against the scenario above.

truncated_at and low_mark_at use millisecond granularity, so even if
truncated_at == low_mark_at, data could have been written in steps 2 and 3
(during the same millisecond window), failing the assert. This is fragile.

Reproducer:

To reproduce the problem, truncated_at must be > low_mark_at, which can easily
happen with both drop table and truncate due to steps 2 and 3.

If a shard has 2 or more tablets, the table's highest RP refers to just one
tablet in that shard.
If the tablet with the highest RP is migrated away, then the sstables in that
shard will have a lower RP than the recorded highest RP (it's a table-wide state,
which makes sense since the CL is shared among tablets).

So when either drop table or truncate runs, low_mark will be potentially bigger
than highest RP retrieved from sstables.

Proposed solution:

The current assert is hacked not to fail if writes sneak in during steps 2 and
3, but it's still fragile and seems not to serve its real purpose, since it
allows RP > low_mark.

We should be able to say that low_mark >= RP, as a way of asserting we're not
leaving data targeted by truncate behind (or that we're not removing the wrong
data).

But the problem is that we're saving low_mark in step 1, before the preparation
steps (2 and 3). When truncated_at is recorded in step 4, it's a way of saying
that all data written so far is targeted for removal. But as of today, low_mark
refers to all data written up to step 1. So low_mark is now set only just
before issuing the flush, and thus also accounts for all potentially flushed data.
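
The ordering issue can be summarized with a tiny model of the assert (an illustration with made-up values, not the actual replica code):

```python
# `rp` is the highest replay position found in the sstables discarded by
# truncate; `low_mark` is the highest RP the table had seen when step 1 ran.

def truncate_assert_holds(rp, low_mark, truncated_at, low_mark_at) -> bool:
    # The existing assert: if low_mark was taken at (or after) truncated_at,
    # the discarded sstables must not exceed it; otherwise writes may have
    # snuck in during steps 2-3 and low_mark is only a lower bound.
    return rp <= low_mark if truncated_at <= low_mark_at else low_mark <= rp

# Tablet-migration scenario: the tablet holding the highest RP was migrated
# away, so the sstables left on this shard carry an RP *below* low_mark,
# while truncated_at > low_mark_at (steps 2-3 took time). The second branch
# then demands low_mark <= rp, so the assert fails and the node crashes.
print(truncate_assert_holds(rp=80, low_mark=100, truncated_at=2, low_mark_at=1))  # False
```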

Fixes #18059.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#23560
2025-04-08 07:32:58 +03:00
Botond Dénes
0d39091df2 test/boost/row_cache_test: add memtable overlap check tests
Similar to test/cluster/test_data_resurrection_in_memtable.py, but works
on a single node and uses a more low-level mechanism. These tests can also
reproduce more advanced scenarios, like concurrent reads, with some of them
reading from flushed memtables.
2025-04-08 00:11:36 -04:00
Botond Dénes
34b18d7ef4 test/cluster: add test_data_resurrection_in_memtable.py
Reproducers for #23252 and #23291 -- cache garbage
collecting tombstones resurrecting data in the memtable.
2025-04-08 00:11:36 -04:00
Botond Dénes
e5afd9b5fb test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts
Such that a given index in the returned hosts refers to the same
underlying Scylla instance as the same index in the passed-in nodes
list. This is what users of this method intuitively expect, but
currently the returned hosts list is unordered (has random order).
2025-04-08 00:11:36 -04:00
Botond Dénes
6b5b563ef7 db/row_cache: add overlap-check for cache tombstone garbage collection
The cache should not garbage-collect tombstones which cover data in the
memtable. Add overlap checks (get_max_purgeable) to garbage collection
to detect tombstones which cover data in the memtable and to prevent
their garbage collection.
2025-04-08 00:11:35 -04:00
Avi Kivity
8d2a41db82 Merge "Fixes for gossiper conversion to host id" from Gleb
"
The series contains fixes to gossiper conversion to host id. There are
two fixes where we could erroneously send outdated entry in a gossiper
message and a fix for force_remove_endpoint which was not converted to
work on host id and this caused it to not delete the entry in some cases
(in replace with the same ip case).
"

* 'gleb/host-id-fixes' of github.com:scylladb/scylla-dev:
  gossiper: send newest entry in a digest message
  gossiper: change make_random_gossip_digest to return value instead of modifying passed parameter
  gossiper: move force_remove_endpoint to work on host id
  gossiper: do not send outdated endpoint in gossiper round
2025-04-07 17:04:28 +03:00
Emil Maskovsky
76ceaf129b raft: distribute voters by rack inside DC
Distribute the voters evenly across racks in the datacenters.

When distributing the voters across datacenters, the datacenters with
more racks will be preferred in case of a tie. Also, in case of
asymmetric voter distribution (2 DCs), the DC with more racks will have
more voters (if the node counts allow it).

In case of a single datacenter, the voters will be distributed across
racks evenly (in a similar manner to how it is done across datacenters).

The intention is that, similar to losing a datacenter, we want to avoid
losing the majority if a rack goes down - so if there are multiple racks,
we want to distribute the voters across them in such a way that losing
a whole rack will not cause the loss of the majority (if possible).
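
The rack-level spreading can be pictured with a small round-robin sketch (illustrative Python only - not the group0 voter-calculator code; the names are invented for this example):

```python
# Spread a DC's voter budget over its racks so that no single rack holds more
# voters than necessary, keeping the impact of losing one rack as small as possible.

def spread_voters_over_racks(voter_budget: int, nodes_per_rack: dict[str, int]) -> dict[str, int]:
    counts = {rack: 0 for rack in nodes_per_rack}
    racks = sorted(nodes_per_rack, key=nodes_per_rack.get, reverse=True)
    while voter_budget > 0:
        placed = False
        for rack in racks:
            if voter_budget == 0:
                break
            if counts[rack] < nodes_per_rack[rack]:
                counts[rack] += 1
                voter_budget -= 1
                placed = True
        if not placed:
            break   # every rack is already saturated
    return counts

# A single-DC cluster with 3 racks and a budget of 5 voters: a 2/2/1 spread,
# so no single rack holds a majority of the voters.
print(spread_voters_over_racks(5, {"rack1": 3, "rack2": 3, "rack3": 2}))
```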
2025-04-07 12:31:37 +02:00
Emil Maskovsky
831fae4bff raft/test: fix lint warnings in test_raft_no_quorum
Code cleanup - fixed lint warnings in `test_raft_no_quorum` test.
2025-04-07 12:31:37 +02:00
Emil Maskovsky
92f6662cd1 raft/test: add the upgrade test for limited voters feature
We test the upgrade scenario of the limited voters feature - first we
start the cluster with the limited voters feature disabled ("old code"),
then we upgrade the cluster to the version with the limited voters
feature enabled ("new code").

The nodes are being upgraded one by one and we test that the cluster
still works (doesn't e.g. lose the majority).
2025-04-07 12:31:37 +02:00
Emil Maskovsky
1d06ea3a5a raft: implement the limited voters feature
Currently, if raft is enabled, all nodes are voters in group0. However, it
is not necessary for all nodes to be voters - it only slows down
raft group operation (since the quorum is large) and makes
deployments with asymmetrical DCs problematic (2 DCs with 5 nodes alongside
1 DC with 10 nodes will lose the majority if the large DC is isolated).

The topology coordinator will now maintain a state where there is only a
limited number of voters, evenly distributed across the DCs and racks.

After each node addition or removal the voters are recalculated and
rebalanced if necessary. That means:
* When a new node is added, it might become a voter depending on the
  current distribution of voters - either if there are still some voter
  "slots" available, or if the new node is a better candidate than some
  existing voter (in which case the existing node voter status might be
  revoked).
* When a voter node is removed or stopped (shut down), its voter status
  is revoked and another node might become a voter instead (this can also
  depend on other circumstances, like e.g. changing the number of DCs).
* If a node addition or removal causes a change in number of datacenters
  (DCs) or racks, the rebalance action might become wider (as there are
  some special rules applying to 1 vs 2 vs more DCs, also changing the
  number of racks might cause similar effects in the voters distribution)

Special conditions for various number of DCs:
* 1 DC: Can have up to the maximum allowed number of voters (5 - see below)
* 2 DCs: The distribution of the voters will be asymmetric (if possible),
  meaning that we can tolerate a loss of the DC with the smaller number
  of voters (if both had the same number of voters, we'd lose the
  majority if either DC were lost).
  For example, if we have 2 DCs with 2 nodes each, one of them will only
  have 1 voter (despite the limit of 5). Also, if one of the 2 DCs has
  more racks than the other and the node count allows it, the DC with
  more racks will have more voters.
* 3 and more DCs: The voters will be distributed so that every
  DC has strictly less than half of the total voters (so the loss of any
  single DC cannot lead to the loss of the majority). Again, DCs with more
  racks are preferred in the voter distribution.

At the moment we handle the zero-token nodes in the same way
as regular nodes (i.e. the zero-token nodes do not take any
priority in the voter distribution). Technically it doesn't make much
sense to have a zero-token node that is not a voter (when there are
regular nodes in the same DC being voters), but currently the intended
purpose of zero-token nodes is to form an "arbiter DC" (in the case of
2 DCs, creating a third DC with zero-token nodes only), so for that
intended purpose no special handling is needed and it will work out of
the box. If a preference for zero-token nodes is eventually
needed/requested, it will be added separately from this PR.

Currently the voter limits will not be configurable (we might introduce
configurable limits later if that would be needed/requested).

The feature is enabled by the `group0_limited_voters` feature flag
to avoid issues with cluster upgrade (the feature will be only enabled
once all nodes in the cluster are upgraded to the version supporting
the feature).

Fixes: scylladb/scylladb#18793
2025-04-07 12:31:18 +02:00
Emil Maskovsky
00794af94d raft/test: disable the stop_before_becoming_raft_voter test
The workflow of becoming a voter changes with the "limited voters"
feature, as the node will no longer become a voter on its own, but the
votership is being managed by the topology coordinator. This therefore
breaks the `stop_before_becoming_raft_voter` test, as that injection
relies on the old behavior.

We will disable the test for this particular case for now and address
either fixing of complete removal of the test in a follow-up task.

Refs: scylladb/scylladb#23418
2025-04-07 12:23:25 +02:00
Emil Maskovsky
57df5d013e raft/test: stop the server less gracefully in the voters test
Stopping the server gracefully might hide some issues; therefore we want
to stop it forcefully to make sure that the code can handle it.

Added a parameter to stop gracefully or less gracefully (so that we test
both cases).
2025-04-07 12:22:19 +02:00
Pavel Emelyanov
10376b5b85 db: Re-use database::snapshot_table_on_all_shards()
There are two snapshot-on-all-shards methods on the database -- the one
that snapshots a keyspace and the one that snapshots a vector of tables.
The latter snapshots a single table with a neat helper, while the former
has the helper open-coded.

Re-using the helper in the keyspace snapshot is worth it, but requires patching
the helper to work on a uuid rather than a ks:cf pair of strings.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#23532
2025-04-07 11:55:43 +02:00
Nadav Har'El
84fd52315f alternator: in GetRecords, enforce Limit to be <= 1000
Alternator Streams' "GetRecords" operation has a "Limit" parameter on
how many records to return. The DynamoDB documentation says that the
upper limit on this Limit parameter is 1000 - but Alternator didn't
enforce this. In this patch we begin enforcing this highest Limit, and
also add a test for verifying this enforcement. As usual, the new test
passes on DynamoDB, and after this patch - also on Alternator.

The reason why it's useful to have *some* upper limit on Limit is that
the existing executor::get_records() implementation does not really have
preemption points in all the necessary places. In particular, we have a
loop on all returned records without preemption points. We also store
the returned records in a RapidJson vector, which requires a contiguous
allocation.

Even before this patch, GetRecords had a hard limit of 1 MB of results.
But still, in some cases 1 MB of results may be a lot of results, and we
can see stalls in the aforementioned places being O(number of results).
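
A minimal sketch of the new bound (illustrative Python, not the C++ executor code; the exception type here is just a stand-in for the API's validation error):

```python
# GetRecords' Limit is now validated against DynamoDB's documented cap of 1000;
# requests above the cap are rejected rather than silently clamped.

MAX_GET_RECORDS_LIMIT = 1000

def validated_limit(requested_limit: int | None) -> int:
    if requested_limit is None:
        return MAX_GET_RECORDS_LIMIT                  # fall back to the upper bound
    if not (1 <= requested_limit <= MAX_GET_RECORDS_LIMIT):
        raise ValueError(
            f"Limit must be between 1 and {MAX_GET_RECORDS_LIMIT}, got {requested_limit}")
    return requested_limit

print(validated_limit(100))    # 100
print(validated_limit(None))   # 1000
# validated_limit(5000)        # would raise
```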

Fixes #23534

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23547
2025-04-07 12:52:03 +03:00
Gleb Natapov
6f53611337 gossiper: move force_remove_endpoint to work on host id
Since the gossiper works on host ids now, it is incorrect to leave this
function working on ips. It makes it impossible to delete an outdated entry,
since the "gossiper.get_host_id(endpoint) != id" check will always be
false for such entries (get_host_id() always returns the most up-to-date
mapping).
2025-04-06 18:39:24 +03:00
Amnon Heiman
b55f24c14d alternator: Add tests for the batch items histograms
This patch adds a test for the batch-items histogram for both get and
write operations.

It updates the check_increases_metric_exact helper function so that it
takes a list of expected values and labels (labels can be None).
This makes it easy to test multiple buckets in a histogram.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2025-04-06 18:22:23 +03:00
Marcin Maliszkiewicz
b94acfb37b test: remove alternator code from perf-simple-query
This kind of benchmark was superseded by perf-alternator,
which has more options and workflows and, most importantly,
measures the overhead of the HTTP server layer (including JSON parsing).

There is no need to maintain the additional code in perf-simple-query.

Closes scylladb/scylladb#23474
2025-04-06 18:15:16 +03:00
Nadav Har'El
431de48df9 test/alternator: test for item with many attributes
A user complained that he couldn't read or write an item with more than
16 attributes (!) in Alternator. This isn't true, but I realized that we
don't have a simple test for this case - all tests use just a few attributes.
So let's add such a test, doing PutItem, UpdateItem and GetItem with 400
attributes. Unsurprisingly, the test passes.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23568
2025-04-03 22:35:49 +03:00
Nadav Har'El
a9a6f9eecc test/alternator: increase timeout in Alternator RBAC test
On our testing infrastructure, tests often run a hundred times (!)
slower than usual, for various reasons that we can't always avoid.
This is why all our test frameworks drastically increase the default
timeouts.

We forgot to increase the timeout in one place - where Alternator tests
use CQL. This is needed for the Alternator role-based access control
(RBAC) tests: RBAC is configured via CQL, and therefore these Alternator
tests unusually use CQL.

So in this patch we increase the timeout of the CQL driver used by
Alternator tests to the same high timeouts (60-120 seconds) used by
the regular CQL tests. As the famous saying goes, these timeouts should
be enough for anyone.

Fixes #23569.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23578
2025-04-03 22:31:08 +03:00
Botond Dénes
1198213000 Merge 'tablets: Make tablet allocation equalize per-shard load ' from Tomasz Grabiec
Before, it was equalizing per-node load (tablet count), which is wrong
in heterogeneous clusters. Nodes with fewer shards will end up with
overloaded shards.

Refs #23378

Closes scylladb/scylladb#23478

* github.com:scylladb/scylladb:
  tablets: Make tablet allocation equalize per-shard load
  tablets: load_balancer: Fix reporting of total load per node
2025-04-03 16:32:53 +03:00
Botond Dénes
fcdae20fd1 Merge 'Add tablet enforcing option' from Benny Halevy
This series adds a new config option, `tablets_mode_for_new_keyspaces`, that replaces the existing
`enable_tablets` option. It can be set to the following values:
    disabled: New keyspaces use vnodes by default, unless enabled by the tablets={'enabled':true} option
    enabled:  New keyspaces use tablets by default, unless disabled by the tablets={'disabled':true} option
    enforced: New keyspaces must use tablets. Tablets cannot be disabled using the CREATE KEYSPACE option

`tablets_mode_for_new_keyspaces=disabled` or `tablets_mode_for_new_keyspaces=enabled` control whether
tablets are disabled or enabled by default for new keyspaces, respectively.
In either case, tablets can be opted in or out using the `tablets={'enabled':...}`
keyspace option when the keyspace is created.

`tablets_mode_for_new_keyspaces=enforced` enables tablets by default for new keyspaces,
like `tablets_mode_for_new_keyspaces=enabled`.
However, it does not allow opting out when creating
new keyspaces by setting `tablets = {'enabled': false}`.
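
The three modes map onto a simple decision rule for CREATE KEYSPACE (a Python sketch of the behavior described above, not ScyllaDB code; the function name is invented):

```python
# Given the cluster-wide tablets_mode_for_new_keyspaces value and the optional
# per-keyspace tablets={'enabled': ...} option, decide whether a new keyspace
# uses tablets or whether the request must be rejected.

def keyspace_uses_tablets(mode: str, ks_option: bool | None) -> bool:
    if mode == "enforced":
        if ks_option is False:
            raise ValueError("tablets cannot be disabled: mode is 'enforced'")
        return True
    default = (mode == "enabled")            # 'disabled' -> vnodes by default
    return default if ks_option is None else ks_option

print(keyspace_uses_tablets("disabled", None))   # False (vnodes)
print(keyspace_uses_tablets("disabled", True))   # True  (opted in)
print(keyspace_uses_tablets("enabled", False))   # False (opted out)
# keyspace_uses_tablets("enforced", False)       # raises: opt-out not allowed
```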

Refs scylladb/scylla-enterprise#4355

* Requires backport to 2025.1

Closes scylladb/scylladb#22273

* github.com:scylladb/scylladb:
  boost/tablets_test: verify failure to create keyspace with tablets and non network replication strategy
  tablets: enforce tablets using tablets_mode_for_new_keyspaces=enforced config option
  db/config: add tablets_mode_for_new_keyspaces option
2025-04-03 16:32:19 +03:00
Tomasz Grabiec
fe8187e594 Merge 'repair: release erm in repair_writer_impl::create_writer when possible' from Aleksandra Martyniuk
Currently, repair_writer_impl::create_writer keeps erm to ensure that a sharder is valid. If we repair a tablet, erm blocks the state machine and no operation on any tablet of this table can be performed.

Use auto_refreshing_sharder and topology_guard to ensure that the operation is safe and that tablet operations on the whole table aren't blocked.

Fixes: #23453.

Needs backport to 2025.1 that introduces the tablet repair scheduler.

Closes scylladb/scylladb#23455

* github.com:scylladb/scylladb:
  \test: add test to check concurrent migration and repair of two different tablets
  repair: release erm in repair_writer_impl::create_writer when possible
2025-04-03 11:15:08 +02:00
Botond Dénes
7bbfa5293f test/cluster/test_read_repair.py: increase read request timeout
This test enables trace-level logging for the mutation_data logger,
which seems to be too much in debug mode, so the test read times out.
Increase the timeout to 1 minute to avoid this.

Fixes: #23513

Closes scylladb/scylladb#23558
2025-04-03 10:42:11 +03:00
Evgeniy Naydanov
84a5037056 test.py: cluster/suite.yaml: update test filters
After switching to subfolders, the `run_in_debug` filter for
the random failures test was just copied as is, but it actually needs to include
the subfolder.

Also, `test_old_ip_notification_repro` was deleted, so we
don't need it in the `skip_in_debug` list.

Closes scylladb/scylladb#23492
2025-04-02 19:29:27 +03:00
Aleksandra Martyniuk
bae6711809 \test: add test to check concurrent migration and repair of two different tablets 2025-04-02 15:30:17 +02:00
Radosław Cybulski
c36614e16d alternator: add size check to BatchItemWrite
Add a size check for the BatchWriteItem command - if the item count is
bigger than the configuration value `alternator_maximum_batch_write_size`,
an error will be raised and no modification will happen.

This is done to align with DynamoDB, where the maximum size of
BatchWriteItem is 25. To avoid complaints from clients who use
our feature of BatchWriteItem being limitless, we set the default value
to 100.
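
A minimal sketch of the check (illustrative Python, not the Alternator C++ code; the request shape is simplified):

```python
# Reject a BatchWriteItem request up front if it carries more items than the
# configured alternator_maximum_batch_write_size (default 100, vs. DynamoDB's 25),
# so that nothing is written when the limit is exceeded.

DEFAULT_MAX_BATCH_WRITE_SIZE = 100

def check_batch_write_size(request_items: dict[str, list],
                           max_size: int = DEFAULT_MAX_BATCH_WRITE_SIZE) -> None:
    total = sum(len(ops) for ops in request_items.values())
    if total > max_size:
        raise ValueError(f"BatchWriteItem has {total} items, limit is {max_size}")

check_batch_write_size({"tbl": [{"PutRequest": {}}] * 50})      # ok
# check_batch_write_size({"tbl": [{"PutRequest": {}}] * 150})   # would raise
```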

Fixes #5057

Closes scylladb/scylladb#23232
2025-04-02 14:48:00 +03:00
Avi Kivity
882f405eed Merge "Convert gossiper's endpoint state map to be host id based" from Gleb
"
The series makes the endpoint state map in the gossiper addressable by host
id instead of ip. The transition has implications outside of the
gossiper as well. Gossiper-based topology operations are affected by
this change since they assume that the mapping is ip based.

The on-wire protocol is not affected by the change, as maps that are sent by
the gossiper protocol remain ip based. If an old node sends two different
entries for the same host id, the one with the newer generation is applied.
If a new node has two ids that are mapped to the same ip, the newer one is
added to the outgoing map.

Interoperability was verified manually by running mixed cluster.

The series concludes the conversion of the system to be host id based.
"

* 'gleb/gossipper-endpoint-map-to-host-id-v2' of github.com:scylladb/scylla-dev:
  gossiper: make examine_gossiper private
  gossiper: rename get_nodes_with_host_id to get_node_ip
  treewide: drop id parameter from gossiper::for_each_endpoint_state
  treewide: move gossiper to index nodes by host id
  gossiper: drop ip from replicate function parameters
  gossiper: drop ip from apply_new_states parameters
  gossiper: drop address from handle_major_state_change parameter list
  gossiper: pass rpc::client_info to gossiper_shutdown verb handler
  gossiper: add try_get_host_id function
  gossiper: add ip to endpoint_state
  serialization: fix std::map de-serializer to not invoke value's default constructor
  gossiper: drop template from wait_alive_helper function
  gossiper: move get_supported_features and its users to host id
  storage_service: make candidates_for_removal host id based
  gossiper: use peers table to detect address change
  storage_service: use std::views::keys instead of std::views::transform that returns a key
  gossiper: move _pending_mark_alive_endpoints to host id
  gossiper: do not allow to assassinate endpoint in raft topology mode
  gossiper: fix indentation after previous patch
  gossiper: do not allow to assassinate non existing endpoint
2025-04-02 12:30:00 +03:00