The idea behind readers/ is that each reader has its minimal header with
just a factory method declaration. The delegating reader is defined in
the factory header because it has a derived class in row_cache_test.cc.
Move the definition to delegating_impl.hh so users not interested in
deriving from it don't pay the price in header include cost.
In this PR, we adjust tests in the cqlpy test suite so they
only use RF-rack-valid keyspaces. After that, we enable
the configuration option `rf_rack_valid_keyspaces` in the
suite by default.
Refs scylladb/scylladb#23428
Backport: backporting to 2025.1 so we can test the option there too.
Closesscylladb/scylladb#23489
* github.com:scylladb/scylladb:
test/cqlpy: Enable rf_rack_valid_keyspaces by default
test: Move test_alter_tablet_keyspace_rf to cluster suite
test/cqlpy: Adjust tests to RF-rack-valid keyspaces
test/cqlpy/cassandra_tests: Adjust to RF-rack-valid keyspaces
A recent commit 370707b111 (re)introduced
a timeout for every group0 Raft operation. This timeout was set to 60
seconds, which, paraphrasing Bill Gates, "ought to be enough for anybody".
However, one of the things we do as a group0 operation is schema
changes, and we already noticed a few years ago, see commit
0b2cf21932, that in some extremely
overloaded test machines where tests run hundreds of times (!) slower
than usual, a single big schema operation - such as Alternator's
DeleteTable deleting a table and multiple of its CDC or view tables -
sometimes takes more than 60 seconds. The above fix changed the
client's timeout to wait for 300 seconds instead of 60 seconds,
but now we also need to increase our Raft timeout, or the server can
time out. We've seen this happening recently making some tests flaky
in CI (issue #23543).
So let's make this timeout configurable, as a new configuration option
group0_raft_op_timeout_in_ms. This option defaults to 60000 (i.e,
60 seconds), the same as the existing default. The test framework
overrides this default with a a higher 300 second timeout, matching
the client-side timeout.
Before this patch, this timeout was already configurable in a strange
way, using injections. But this was a misstep: We already have more
than a dozen timeouts configurable through the normal configration,
and this one should have been configured in the same way. There is
nothing "holy" about the default of 60 seconds we chose, and who
knows maybe in the future we might need to tweek it in the field,
just like we made the other timeouts tweakable. Injections cannot
be used in release mode, but configuration options can.
Fixes#23543
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#23717
Name the gates and phased barriers we use
to make it easy to debug gate_closed_exception
Refs https://github.com/scylladb/seastar/pull/2688
* Enhancement only, no backport needed
Closesscylladb/scylladb#23329
* github.com:scylladb/scylladb:
utils: loading_cache: use named_gate
utils: flush_queue: use named_gate
sstables_manager: use named gate
sstables_loader: use named gate
utils: phased_barrier, pluggable: use named gate
utils: s3::client::multipart_upload: use named gate
utils: s3::client: use named_gate
transport: controller: use named gate
tracing: trace_keyspace_helper: use named gate
task_manager: module: use named gate
topology_coordinator: use named gate
storage_service: use named gate
storage_proxy: wait_for_hint_sync_point: use named gate
storage_proxy: remote: use named gate
service: session: use named gate
service: raft: raft_rpc: use named gate
service: raft: raft_group0: use named gate
service: raft: persistent_discovery: use named gate
service: raft: group0_state_machine: use named gate
service: migration_manager: use named gate
replica: table: use named gate
replica: compaction_group, storage_group: use named gate
redis: query_processor: use named gate
repair: repair_meta: use named gate
reader_concurrency_semaphore: use named gate
raft: server_impl: use named gate
querier_cache: use named gate
gms: gossiper: use named gate
generic_server: use named gate
db: sstables_format_listener: use named gate
db: snapshot: backup_task: use named gate
db: snapshot_ctl: use named gate
hints: hints_sender: use named gate
hints: manager: use named gate
hints: hint_endpoint_manager: use named gate
commitlog: segment_manager: use named gate
db: batchlog_manager: use named gate
query_processor: remote: use named gate
compaction: compaction_state: use named gate
alternator/server: use named_gate
The latter class is invented to let tests access private fields of an
sstable (mostly methods). The former is in fact an extended version of
that also does some checks. Howerver, they don't inherit from each
other, and the sstable_assertions partially duplicates some funtionality
of the test one.
Add the inheritance, remove the duplicated methods from the child class,
update the callers (the test class returns future<>s, the assertions one
"knows" it runs in seastar thread) and marm sstable::read_toc() private.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#23697
Fix the incorrect log file names between conftest and scylla_manager.
This regression issue, was introduced in #22960.
Currently, scylla manager will output it's logs to the file with the
next pattern:
suite_name.path_to_the_test_file_with_subfolders.run_id.function_name.mode.run_id_cluster.log
On the same time pytest will try to find this log with next name:
suite_name.file_name_without_subfolders_path.py.run_id.function_name.mode.run_id_cluster.log
This inconsistency leads to the situation when the test failed, scylla
manager log file will not be copied to the failed_test directory and
test will have exception on teardown.
Closesscylladb/scylladb#23596
Test suites with `type: Python` are using single Scylla node
created by test.py, but it's handy to print a path to a log
file in pytest log too to make it easier to find the file
on failures.
Closesscylladb/scylladb#23683
We move the test `test_alter_tablet_keyspace_rf` from the cqlpy to the
cluster test suite. The reason behind the change is that the test cannot
be run with `rf_rack_valid_keyspaces` turned on in the configuration.
During the test, we make the keyspace RF-rack-invalid multiple times.
Since RF-rack-validity is a very strong constraint, adjust the test
otherwise is impossible.
By moving it to the cluster test suite, we're able to change the
configuration of the node used in the test, and so the test can work
again.
We adjust three existing Cassandra tests so that they don't create
RF-rack-invalid keyspaces. We modify the replication factor used
in the problematic tests. The changes don't affect the tests as
the value of the RF is unrelated to what they verify. Thanks to
that, we can run them now even with enforced RF-rack-valid keyspaces.
The drawback is that the modified ALTER statements do not modify
the RF at all. However, since the tests seem to verify that the code
responsible for VALIDATING a request works as intended, that should
have little to no impact on them.
Currently if raft is enabled all nodes are voters in group0. However it is not necessary to have all nodes to be voters - it only slows down the raft group operation (since the quorum is large) and makes deployments with asymmetrical DCs problematic (2 DCs with 5 nodes along 1 DC with 10 nodes will lose the majority if large DC is isolated).
The topology coordinator will now maintain a state where there are only limited number of voters, evenly distributed across the DCs and racks.
After each node addition or removal the voters are recalculated and rebalanced if necessary. That means:
* When a new node is added, it might become a voter depending on the current distribution of voters - either if there are still some voter "slots" available, or if the new node is a better candidate than some existing voter (in which case the existing node voter status might be revoked).
* When a voter node is removed or stopped (shut down), its voter status is revoked and another node might become a voter instead (this can also depend on other circumstances, like e.g. changing the number of DCs).
* If a node addition or removal causes a change in number of data centers (DCs) or racks, the rebalance action might become wider (as there are some special rules applying to 1 vs 2 vs more DCs, also changing the number of racks might cause similar effects in the voters distribution)
Special conditions for various number of DCs:
* 1 DC: Can have up to the maximum allowed number of voters (5 - see below)
* 2 DCs: The distribution of the voters will be asymmetric (if possible), meaning that we can tolerate a loss of the DC with the smaller number of voters (if both would have the same number of voters we'd lose majority if any of the DCs is lost). For example, if we have 2 DCs with 2 nodes each, one of them will only have 1 voter (despite the limit of 5). Also, if one of the 2 DCs has more racks than the other and the node count allows it, the DC with the more racks will have more voters.
* 3 and more DCs: The distribution of the voters will be so that every DC has strictly less than half of the total voters (so a loss of any of the DCs cannot lead to the majority loss). Again, DCs with more racks are being preferred in the voter distribution.
At the moment we will be handling the zero-token nodes in the same way as the regular nodes (i.e. the zero-token nodes will not take any priority in the voter distribution). Technically it doesn't make much sense to have a zero-token node that is not a voter (when there are regular nodes in the same DC being voters), but currently the intended purpose of zero-token nodes is to form an "arbiter DC" (in case of 2 DCs, creating a third DC with zero-token nodes only), so for that intended purpose no special handling is needed and will work out of the box. If a preference of zero token nodes will eventually be needed/requested, it will be added separately from this PR.
The maximum number of voters of 5 has been chosen as the smallest "safe" value. We can lose majority when multiple nodes (possibly in different dcs and racks) die independently in a short time span. With less than 5 voters, we would lose majority if 2 voters died, which is very unlikely to happen but not entirely impossible. With 5 voters, at least 3 voters must die to lose majority, which can be safely considered impossible in the case of independent failures.
Currently the limit will not be configurable (we might introduce configurable limits later if that would be needed/requested).
Tests added:
* boost/group0_voter_registry_test.cc: run time on CI: ~3.5s
* topology_custom/test_raft_voters.py: parametrized with 1 or 3 nodes per DC, the run time on CI: 1: ~20s. 3: ~40s, approx 1 min total
Fixes: scylladb/scylladb#18793
No backport: This is a new feature that will not be backported.
Closesscylladb/scylladb#21969
* https://github.com/scylladb/scylladb:
raft: distribute voters by rack inside DC
raft/test: fix lint warnings in `test_raft_no_quorum`
raft/test: add the upgrade test for limited voters feature
raft topology: handle on_up/on_down to add/remove node from voters
raft: fix the indentation after the limited voters changes
raft: implement the limited voters feature
raft: drop the voter removal from the decommission
raft/test: disable the `stop_before_becoming_raft_voter` test
raft/test: stop the server less gracefully in the voters test
After load-balancer was made capacity-aware it no longer equalizes tablet count per shard, but rather utilization of shard's storage. This makes the old presentation mode not useful in assessing whether balance was reached, since nodes with less capacity will get fewer tablets when in balanced state. This PR adds a new default presentation mode which scales tablet size by its storage utilization so that tablets which have equal shard utilization take equal space on the graph.
To facilitate that, a new virtual table was added: system.load_per_node, which allows the tool to learn about load balancer's view on per-node capacity. It can also serve as a debugging interface to get a view of current balance according to the load-balancer.
Closesscylladb/scylladb#23584
* github.com:scylladb/scylladb:
tablet-mon.py: Add presentation mode which scales tablet size by its storage utilization
tablet-mon.py: Center tablet id text properly in the vertical axis
tablet-mon.py: Show migration stage tag in table mode only when migrating
virtual-tables: Introduce system.load_per_node
virtual_tables: memtable_filling_virtual_table: Propagate permit to execute()
docs: virtual-tables: Fix instructions
service: tablets: Keep load_stats inside tablet_allocator
There are few places that want to pause until a message is received from
the test. There's a convenience one-line suger to do it.
One test needs update its expectations about log message that appears
when scylle steps on it and actually starts waiting.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#23390
This adaptor adapts a mutation reader pausable consumer to the frozen
mutation visitor interface. The pausable consumer protocol allows the
consumer to skip the remaining parts of the partition and resume the
consumption with the next one. To do this, the consumer just has to
return stop_iteration::yes from one of the consume() overloads for
clustering elements, then return stop_iteration::no from
consume_end_of_partition(). Due to a bug in the adaptor, this sequence
leads to terminating the consumption completely -- so any remaining
partitions are also skipped.
This protocol implementation bug has user-visible effects, when the
only user of the adaptor -- read repair -- happens during a query which
has limitations on the amount of content in each partition.
There are two such queries: select distinct ... and select ... with
partition limit. When converting the repaired mutation to to query
result, these queries will trigger the skip sequence in the consumer and
due to the above described bug, will skip the remaining partitions in
the results, omitting these from the final query result.
This patch fixes the protocol bug, the return value of the underlying
consumer's consume_end_of_partition() is now respected.
A unit test is also added which reproduces the problem both with select
distinct ... and select ... per partition limit.
Follow-up work:
* frozen_mutation_consumer_adaptor::on_end_of_partition() calls the
underlying consumer's on_end_of_stream(), so when consuming multiple
frozen mutations, the underlying's on_end_of_stream() is called for
each partition. This is incorrect but benign.
* Improve documentation of mutation_reader::consume_pausable().
Fixes: #20084Closesscylladb/scylladb#23657
Some of the statements in the test are not indented properly
and, as a result, are never run. It's most likely a small mistake,
so let's fix it.
Closesscylladb/scylladb#23659
This series adds a histogram for get and write batch sizes.
It uses the estimated_histogram implementation which starts from 1 with 1.2 exponential factor, which works
extremely tight to 20 but still covers all the way to 100.
Histograms will be reported per node.
**Backport to 2025.1 so we'll have information about user batch size limitation**
Closesscylladb/scylladb#23379
* github.com:scylladb/scylladb:
alternator: Add tests for the batch items histograms
alternator: Add histogram for batch item count
Currently, when we rebuild a tablet, we stream data from all
replicas. This creates a lot of redundancy, wastes bandwidth
and CPU resources.
In this series, we split the streaming stage of tablet rebuild into
two phases: first we stream tablet's data from only one replica
and then repair the tablet.
Fixes: https://github.com/scylladb/scylladb/issues/17174.
Needs backport to 2025.1 to prevent out of space during streaming
Closesscylladb/scylladb#23187
* github.com:scylladb/scylladb:
test: add test for rebuild with repair
locator: service: move to rebuild_v2 transition if cluster is upgraded
locator: service: add transition to rebuild_repair stage for rebuild_v2
locator: service: add rebuild_repair tablet transition stage
locator: add maybe_get_primary_replica
locator: service: add rebuild_v2 tablet transition kind
gms: add REPAIR_BASED_TABLET_REBUILD cluster feature
Adds new live updatable config: uninitialized_connections_semaphore_cpu_concurrency.
It should help to reduce cpu usage by limiting cpu concurrency for new connections. As a last resort when those connections are waiting for initial processing too long (over 1m) they are shed.
New connections_shed and connections_blocked metrics are added for tracking.
Testing:
- manually via simple program creating high number of connection and constantly re-connecting
- added benchmark
Following are benchmark results:
Before:
```
> build/release/test/perf/perf_generic_server --smp=1
170101.41 tps ( 13.1 allocs/op, 0.0 logallocs/op, 7.0 tasks/op, 4695 insns/op, 3178 cycles/op, 0 errors)
[...]
throughput: mean=173850.06 standard-deviation=1844.48 median=174509.66 median-absolute-deviation=874.23 maximum=175087.49 minimum=170588.54
instructions_per_op: mean=4725.59 standard-deviation=13.35 median=4729.38 median-absolute-deviation=12.49 maximum=4738.61 minimum=4709.96
cpu_cycles_per_op: mean=3135.08 standard-deviation=32.13 median=3122.68 median-absolute-deviation=22.29 maximum=3179.38 minimum=3103.15
```
After:
```
> build/release/test/perf/perf_generic_server --smp=1
167373.19 tps ( 13.1 allocs/op, 0.0 logallocs/op, 7.0 tasks/op, 4821 insns/op, 3371 cycles/op, 0 errors)
[...]
throughput:
mean= 171199.55 standard-deviation=2484.58
median= 171667.06 median-absolute-deviation=2087.63
maximum=173689.11 minimum=167904.76
instructions_per_op:
mean= 4801.90 standard-deviation=16.54
median= 4796.78 median-absolute-deviation=9.32
maximum=4830.71 minimum=4789.81
cpu_cycles_per_op:
mean= 3245.26 standard-deviation=32.28
median= 3230.44 median-absolute-deviation=16.52
maximum=3297.39 minimum=3215.62
```
The patch adds around 67 insns/op so it's effect on performance should be negligible.
Fixes: https://github.com/scylladb/scylladb/issues/22844Closesscylladb/scylladb#22828
* github.com:scylladb/scylladb:
transport: move on_connection_close into connection destructor
test: perf: make aggregated_perf_results formatting more human readable
transport: add blocked and shed connection metrics
generic_server: throttle and shed incoming connections according to semaphore limit
generic_server: add data source and sink wrappers bookkeeping network IO
generic_server: coroutinize part of server::do_accepts
test: add benchmark for generic_server
test: perf: add option to count multiple ops per time_parallel iteration
generic_server: add semaphore for limiting new connections concurrency
generic_server: add config to the constructor
generic_server: add on_connection_ready handler
Can be used to query per-node stats about load as seen by the load
balancer.
In particular, node's capacity will be used by tablet-mon.py to
scale tablet columns so that equal height is equal node utilization.
Changes in configure.py are needed becuase we don't want to embed
this benchmark in scylla binary as perf_simple_query or perf_alternator,
it doesn't directly translate to Scylla performance but we want to use
aggregated_perf_results for precise cpu measurements so we need
different dependecies.
Add a new nodetool cluster super-command. Add nodetool
cluster repair command to repair tablet keyspaces.
It uses the new /storage_service/tablets/repair API.
The nodetool cluster repair command allows you to specify
the keyspace and tables to be repaired. A cluster repair of many
tables will request /storage_service/tablets/repair and wait for
the result synchronously for each table.
The nodetool repair command, which was previously used to repair
keyspaces of any type, now repairs only vnode keyspaces.
Fixes: https://github.com/scylladb/scylladb/issues/22409.
Needs backport to 2025.1 that introduces the new tablet repair API
Closesscylladb/scylladb#22905
* github.com:scylladb/scylladb:
docs: nodetool: update repair and add tablet-repair docs
test: nodetool: add tests for cluster repair command
nodetool: add cluster repair command
nodetool: repair: extract getting hosts and dcs to functions
nodetool: repair: warn about repairing tablet keyspaces
nodetool: repair: move keyspace_uses_tablets function
These tests seem to be hitting the io-uring bug in the kernel from
time-to-time, making CI flaky. Force the use of the AIO backend in these
tests, as a workaround until fixed kernels (>=6.8.13) are available.
Fixes: #23517Fixes: #23546Closesscylladb/scylladb#23648
The row cache can garbage-collect tombstones in two places:
1) When populating the cache - the underlying reader pipeline has a `compacting_reader` in it;
2) During reads - reads now compact data including garbage collection;
In both cases, garbage collection has to do overlap checks against memtables, to avoid collecting tombstones which cover data in the memtables.
This PR includes fixes for (2), which were not handled at all currently.
(1) was already supposed to be fixed, see https://github.com/scylladb/scylladb/issues/20916. But the test added in this PR showed that the test is incomplete: https://github.com/scylladb/scylladb/issues/23291. A fix for this issue is also included.
Fixes: https://github.com/scylladb/scylladb/issues/23291
Fixes: https://github.com/scylladb/scylladb/issues/23252
The fix will need backport to all live release.
Closesscylladb/scylladb#23255
* github.com:scylladb/scylladb:
test/boost/row_cache_test: add memtable overlap check tests
replica/table: add error injection to memtable post-flush phase
utils/error_injection: add a way to set parameters from error injection points
test/cluster: add test_data_resurrection_in_memtable.py
test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts
replica/mutation_dump: don't assume cells are live
replica/database: do_apply() add error injection point
replica: improve memtable overlap checks for the cache
replica/memtable: add is_merging_to_cache()
db/row_cache: add overlap-check for cache tombstone garbage collection
mutation/mutation_compactor: copy key passed-in to consume_new_partition()
Warn about an attempt to repair tablet keysapce with nodetool repair.
A nodetool cluster repair command to repair tablet keyspaces will
be added in the following patches.
When running those operations after a tablet replica is migrated away from
a shard, an assert can fail resulting in a crash.
Status quo (around the assert in truncate procedure):
1) Highest RP seen by table is saved in low_mark, and the current time in
low_mark_at.
2) Then compaction is disabled in order to not mix data written before truncate,
and data written later.
3) Then memtable is flushed in order for the data written before truncate to be
available in sstables and then removed.
4) Now, current time is saved in truncated_at, which is supposedly the time of
truncate to decide which sstables to remove.
Note: truncated_at is likely above low_mark_at due to steps 2 and 3.
The interesting part of the assert is:
(truncated_at <= low_mark_at ? rp <= low_mark : low_mark <= rp)
Note: RP in the assert above is the highest RP among all sstables generated
before truncated_at. RP is retrieved by table::discard_sstables().
If truncated_at > low_mark_at, maybe newer data was written during steps 2 and
3, and memtable's RP becomes greater than low_mark, resulting in a SSTable with
RP > low_mark.
So assert's 2nd condition is there to defend against the scenario above.
truncated_at and low_mark_at uses millisecond granularity, so even if
truncated_at == low_mark_at, data could have been written in steps 2 and 3
(during same MS window), failing the assert. This is fragile.
Reproducer:
To reproduce the problem, truncated_at must be > low_mark_at, which can easily
happen with both drop table and truncate due to steps 2 and 3.
If a shard has 2 or more tablets, the table's highest RP refer to just one
tablet in that shard.
If the tablet with the highest RP is migrated away, then the sstables in that
shard will have lower RP than the recorded highest RP (it's a table wide state,
which makes sense since CL is shared among tablets).
So when either drop table or truncate runs, low_mark will be potentially bigger
than highest RP retrieved from sstables.
Proposed solution:
The current assert is hacked to not fail if writes sneak in, during steps 2 and
3, but it's still fragile and seems not to serve its real purpose, since it's
allowing for RP > low_mark.
We should be able to say that low_mark >= RP, as a way of asserting we're not
leaving data targeted by truncate behind (or that we're not removing the wrong
data).
But the problem is that we're saving low_mark in step 1, before preparation
steps (2 and 3). When truncated_at is recorded in step 4, it's a way of saying
all data written so far is targeted for removal. But as of today, low_mark
refers to all data written up to step 1. So low_mark is now only one set
before issuing flush, and also accounts for all potentially flushed data.
Fixes#18059.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#23560
Similar to test/cluster/test_data_resurrection_in_memtable.py but works
on a single node and uses more low-level mechanism. These tests can also
reproduce more advanced scenarios, like concurrent reads, with some
reading from flushed memtables.
Such that a given index in the return hosts refers to the same
underlying Scylla instance, as the same index in the passed-in nodes
list. This is what users of this method intuitively expect, but
currently the returned hosts list is unordered (has random order).