Commit Graph

9334 Commits

Author SHA1 Message Date
Patryk Jędrzejczak
03cc34e3a0 test: test_maintenance_socket: use cluster_con for driver sessions
The test creates all driver sessions by itself. As a consequence, all
sessions use the default request timeout of 10s. This can be too low for
the debug mode, as observed in scylladb/scylla-enterprise#5601.

In this commit, we change the test to use `cluster_con`, so that the
sessions have the request timeout set to 200s from now on.

Fixes scylladb/scylla-enterprise#5601

This commit changes only the test and is a CI stability improvement,
so it should be backported all the way to 2024.2. 2024.1 doesn't have
this test.

Closes scylladb/scylladb#25510
2025-08-15 09:32:20 +03:00
Taras Veretilnyk
30ff5942c6 database_test: fix race in test_drop_quarantined_sstables
The test_drop_quarantined_sstables test could fail due to a race between
compaction and quarantining of SSTables. If compaction selects
an SSTable before it is moved to quarantine, and change_state is called during
compaction, the SSTable may already be removed, resulting in a
std::filesystem_error due to missing files.

This patch resolves the issue by wrapping the quarantine operation inside
run_with_compaction_disabled(). This ensures compaction is paused on the
compaction group view while SSTables are being quarantined, preventing the
race.

Additionally, updates the test to quarantine up to 1/5 SSTables instead
of one randomly and increases the number of sstables genereted to improve
test scenario.

Fixes scylladb/scylladb#25487

Closes scylladb/scylladb#25494
2025-08-14 20:23:42 +03:00
Taras Veretilnyk
367eaf46c5 keys: from_nodetool_style_string don't split single partition keys
Users with single-column partition keys that contain colon characters
were unable to use certain REST APIs and 'nodetool' commands, because the
API split key by colon regardless of the partition key schema.

Affected commands:
- 'nodetool getendpoints'
- 'nodetool getsstables'
Affected endpoints:
- '/column_family/sstables/by_key'
- '/storage_service/natural_endpoints'

Refs: #16596 - This does not fully fix the issue, as users with compound
keys will face the issue if any column of the partition key contains
a colon character.

Closes scylladb/scylladb#24829
2025-08-14 19:52:04 +03:00
Avi Kivity
1ef6697949 Merge 'service/vector_store_client: Add live configuration update support' from Karol Nowacki
Enable runtime updates of vector_store_uri configuration without
requiring server restart.
This allows to dynamically enable, disable, or switch the vector search service endpoint on the fly.

To improve the clarity the seastar::experimental::http::client is now wrapped in a private http_client class that also holds the host, address, and port information.

Tests have been added to verify that the client correctly handles transitions between enabled/disabled states and successfully switches traffic to a new endpoint after a configuration update.

Closes: VECTOR-102

No backport is needed as this is a new feature.

Closes scylladb/scylladb#25208

* github.com:scylladb/scylladb:
  service/vector_store_client: Add live configuration update support
  test/boost/vector_store_client_test.cc: Refactor vector store client test
  service/vector_store_client: Refactor host_port struct created
  service/vector_store_client: Refactor HTTP request creation
2025-08-14 19:45:06 +03:00
Avi Kivity
fe6e1071d3 Merge 'locator: util: optimize describe_ring' from Benny Halevy
This change includes basic optimizations to
locator::describe_ring, mainly caching the per-endpoint information in an unordered_map instead of looking them up in every inner-loop.

This yields an improvement of 20% in cpu time.
With 45 nodes organized as 3 dcs, 3 racks per dc, 5 nodes per rack, 256 tokens per node, yielding 11520 ranges and 9 replicas per range, describe_ring took Before: 30 milliseconds (2.6 microseconds per range) After:  24 milliseconds (2.1 microseconds per range)

Add respective unit test for vnode keyspace
and for tablets.

Fixes #24887

* backport up to 2025.1 as describe_ring slowness was hit in the field with large clusters

Closes scylladb/scylladb#24889

* github.com:scylladb/scylladb:
  locator: util: optimize describe_ring
  locator: util: construct_range_to_endpoint_map: pass is_vnode=true to get_natural_replicas
  vnode_effective_replication_map: do_get_replicas: throw internal error if token not found in map
  locator: effective_replication_map: get_natural_replicas: get is_vnode param
  test: cluster: test_repair: add test_vnode_keyspace_describe_ring
2025-08-14 19:39:17 +03:00
Abhinav Jha
a0ee5e4b85 raft: replication test: change rpc_propose_conf_change test to SEASTAR_THREAD_TEST_CASE
RAFT_TEST_CASE macro creates 2 test cases, one with random 20% packet
loss named name_drops. The framework makes hard coded assumptions about
leader which doesn't hold well in case of packet losses.

This short term fix disables the packet drop variant of the specified test.
It should be safe to re-enable it once the whole framework is re-worked to
remove these hard coded assumptions.

This PR fixes a bug. Hence we need to backport it.

Fixes: scylladb/scylladb#23816

Closes scylladb/scylladb#25489
2025-08-14 13:15:16 +02:00
Avi Kivity
66173c06a3 Merge 'Eradicate the ability to create new sstables with numerical sstable generation' from Benny Halevy
Remove support for generating numerical sstable generation for new sstables.
Loading such sstables is still supported but new sstables are always created with a uuid generation.
This is possible since:
* All live versions (since 5.4 / f014ccf369) now support uuid sstable generations.
* The `uuid_sstable_identifiers_enabled` config option (that is unused from version 2025.2 / 6da758d74c) controls only the use of uuid generations when creating new sstables. SSTables with uuid generations should still be properly loaded by older versions, even if `uuid_sstable_identifiers_enabled` is set to `false`.

Fixes #24248

* Enhancement, no backport needed

Closes scylladb/scylladb#24512

* github.com:scylladb/scylladb:
  streaming: stream_blob: use the table sstable_generation_generator
  replica: distributed_loader: process_upload_dir: use the table sstable_generation_generator
  sstables: sstable_generation_generator: stop tracking highest generation
  replica: table: get rid of update_sstables_known_generation
  sstables: sstable_directory: stop tracking highest_generation
  replica: distributed_loader: stop tracking highest_generation
  sstables: sstable_generation: get rid of uuid_identifiers bool class
  sstables_manager: drop uuid_sstable_identifiers
  feature_service: move UUID_SSTABLE_IDENTIFIERS to supported_feature_set
  test: cql_query_test: add test_sstable_load_mixed_generation_type
  test: sstable_datafile_test: move copy_directory helper to test/lib/test_utils
  test: database_test: move table_dir helper to test/lib/test_utils
2025-08-14 11:54:33 +03:00
Patryk Jędrzejczak
6ad2b71d04 Merge 'LWT: communicate RPC errors to the user' from Petr Gusev
Currently, if the accept or prepare verbs fail on the replica side, the user only receives a generic error message of the form "something went wrong for this table", which provides no insight into the root cause. Additionally, these error messages are not logged by default, requiring the user to restart the node with trace or debug logging to investigate the issue.

This PR improves error handling for the accept and prepare verbs by preserving and propagating the original error messages, making it easier to diagnose failures.

backport: not needed, not a bug

Closes scylladb/scylladb#25318

* https://github.com/scylladb/scylladb:
  test_tablets_lwt: add test_error_message_for_timeout_due_to_uncertainty
  storage_proxy: preserve accept error messages
  storage_proxy: preserve prepare error message
  storage_proxy: fix log message
  exceptions.hh: fix message argument passing
  exceptions: add constructors that accept explicit error messages
2025-08-14 10:23:32 +02:00
Nadav Har'El
2d3c0eb25a test/alternator: speed up test_ttl_expiration_lsi_key
The Alternator test test_ttl.py::test_ttl_expiration_lsi_key is
currently the second-slowest test/alternator test, run a "whopping"
2.6 seconds (the total of two parameterizations - with vnodes and
tables).

This patch reduces it to 0.9 seconds.

The fix is simple: Unfortunately, tests that need to wait for actual
TTL expiration take time, but the test framework configures the TTL
scanner to have a period of half a second, so the wait should be on
average around 0.25 seconds. But the test code by mistake slept 1.2
seconds between retries. We even had a good "sleep" variable for the
amount of time we should sleep between retries, but forgot to use it.

So after lowering the sleep between retries, this test is still not
instantenous - it still needs to wait up to 0.5 seconds for the
expirations to occur - but it's almost 3 times faster than before.

While working on this test, I also used the opportunity to update its
comment which excused why we are testing LSI and not GSI. Its
suggestions of what is planned for GSI have already become a reality,
so let's update the comment to say so.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25386
2025-08-14 11:21:52 +03:00
Pavel Emelyanov
eaec7c9b2e Merge 'cql3: add default replication strategy to create_keyspace_statement' from Dario Mirovic
When creating a new keyspace, both replication strategy and replication
factor must be stated. For example:
`CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'replication_factor' : 3 };`

This syntax is verbose, and in all but some testing scenarios
`NetworkTopologyStrategy` is used.

This patch allows skipping replication strategy name, filling it with
`NetworkTopologyStrategy` when that happens. The following syntax is now
valid:
`CREATE KEYSPACE ks WITH REPLICATION = { 'replication_factor' : 3 };`
and will give the same result as the previous, more explicit one.

Fixes https://github.com/scylladb/scylladb/issues/16029

Backport is not needed. This is an enhancement for future releases.

Closes scylladb/scylladb#25236

* github.com:scylladb/scylladb:
  docs/cql: update documentation for default replication strategy
  test/cqlpy: add keyspace creation default strategy test
  cql3: add default replication strategy to `create_keyspace_statement`
2025-08-14 11:18:36 +03:00
Andrzej Jackowski
bf8be01086 test: audit: add logging of get_audit_log_list and set_of_rows_before
Without those logs, analysing some test failures is difficult.

Refs: scylladb/scylladb#25442

Closes scylladb/scylladb#25485
2025-08-14 09:53:05 +03:00
Ernest Zaslavsky
dd51e50f60 s3_client: add memory fallback in chunked_download_source
Introduce fallback logic in `chunked_download_source` to handle
memory exhaustion. When memory is low, feed the `deque` with only
one uncounted buffer at a time. This allows slow but steady progress
without getting stuck on the memory semaphore.

Fixes: https://github.com/scylladb/scylladb/issues/25453
Fixes: https://github.com/scylladb/scylladb/issues/25262

Closes scylladb/scylladb#25452
2025-08-14 09:52:10 +03:00
Wojciech Mitros
2ece08ba43 test: run mv tests depending on metrics on a standalone instance
The test_base_partition_deletion_with_metrics test case (and the batch
variant) uses the metric of view updates done during its runtime to check
if we didn't perform too many of them. The test runs in the cqlpy suite,
which  runs all test cases sequentially on one Scylla instance. Because
of this, if another test case starts a process which generates view
updates and doesn't wait for it to finish before it exists, we may
observe too many view updates in test_base_partition_deletion_with_metrics
and fail the test.
In all test cases we make sure that all tables that were created
during the test are dropped at the end. However, that doesn't
stop the view building process immediately, so the issue can happen
even if we drop the view. I confirmed it by adding a test just before
test_base_partition_deletion_with_metrics which builds a big
materialized view and drops it at the end - the metrics check still failed.

The issue could be caused by any of the existing test cases where we create
a view and don't wait for it to be built. Note that even if we start adding
rows after creating the view, some of them may still be included in the view
building, as the view building process is started asynchronously. In such
a scenario, the view building also doesn't cause any issues with the data in
these tests - writes performed after view creation generate view updates
synchronously when they're local (and we're running a single Scylla server),
the corresponding view udpates generated during view building are redundant.

Because we have many test cases which could be causing this issue, instead
of waiting for the view building to finish in every single one of them, we
move the susceptible test cases to be run on separate Scylla instances, in
the "cluster" suite. There, no other test cases will influence the results.

Fixes https://github.com/scylladb/scylladb/issues/20379

Closes scylladb/scylladb#25209
2025-08-13 15:08:50 +03:00
Petr Gusev
3f287275b8 test_tablets_lwt: add test_error_message_for_timeout_due_to_uncertainty 2025-08-13 14:03:57 +02:00
Benny Halevy
50abeb1270 locator: util: optimize describe_ring
This change includes basic optimizations to
locator::describe_ring, mainly caching the per-endpoint
information in an unordered_map instead of looking
them up in every inner-loop.

This yields an improvement of 20% in cpu time.
With 45 nodes organized as 3 dcs, 3 racks per dc, 5 nodes per rack, 256 tokens per
node, yielding 11520 ranges and 9 replicas per range, describe_ring took
Before: 30 milliseconds (2.6 microseconds per range)
After:  24 milliseconds (2.1 microseconds per range)

Add respective unit test of describe_ring for tablets.
A unit test for vnodes already exists in
test/nodetool/test_describering.py

Fixes #24887

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-13 12:42:25 +03:00
Benny Halevy
f22a870a04 test: cluster: test_repair: add test_vnode_keyspace_describe_ring
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-13 12:39:40 +03:00
Yaniv Michael Kaul
b75799c21c skip instead of xfail test_change_replication_factor_1_to_0
It's a waste of good machine time to xfail this rather than just skip.
It takes >3m just to run the test and xfail.
We have a marker for it, we know why we skip it.

Fixes: https://github.com/scylladb/scylladb/issues/25310
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#25311
2025-08-13 10:32:22 +02:00
Ernest Zaslavsky
380c73ca03 s3_client: make memory semaphore acquisition abortable
Add `abort_source` to the `get_units` call for the memory semaphore
in the S3 client, allowing the acquisition process to be aborted.

Fixes: https://github.com/scylladb/scylladb/issues/25454

Closes scylladb/scylladb#25469
2025-08-13 08:48:55 +03:00
Dario Mirovic
ef63d343ba test/cqlpy: add keyspace creation default strategy test
Add a test case for create keyspace default replication strategy.
It is expected that the default replication strategy is `NetworkTopologyStrategy`.

Refs: #16029
2025-08-13 01:52:00 +02:00
Dario Mirovic
bc8bb0873d cql3: add default replication strategy to create_keyspace_statement
When creating a new keyspace, both replication strategy and replication
factor must be stated. For example:
`CREATE KEYSPACE ks WITH REPLICATION = { 'class' : 'NetworkTopologyStrategy', 'replication_factor' : 3 };`

This syntax is verbose, and in all but some testing scenarios
`NetworkTopologyStrategy` is used.

This patch allows skipping replication strategy name, filling it with
`NetworkTopologyStrategy` when that happens. The following syntax is now
valid:
`CREATE KEYSPACE ks WITH REPLICATION = { 'replication_factor' : 3 };`
and will give the same result as the previous, more explicit one.

Fixes #16029
2025-08-13 01:51:53 +02:00
Taras Veretilnyk
b7097b2993 database_test: fix abandoned futures in test_drop_quarantined_sstables
The lambda passed to do_with_cql_env_thread() in test_drop_quarantined_sstables
was mistakenly written as a coroutine.
This change replaces co_await with .get() calls on futures
and changes lambda return type to void.

Fixes scylladb/scylladb#25427

Closes scylladb/scylladb#25431
2025-08-12 13:31:06 +03:00
Patryk Jędrzejczak
a1b2f99dee Merge 'test: test_mv_backlog: fix to consider internal writes' from Michael Litvak
The PR fixes a test flakiness issue in test_mv_backlog related to reading metrics.

The first commit fixes a more general issue in the ScyllaMetrics helper class where it doesn't return the value of all matching lines when a specific shard is requested, but it breaks after the first match.

The second commit fixes a test issue where it expects exactly one write to be throttled, not taking into account other internal writes that may be executed during this time.

Fixes https://github.com/scylladb/scylladb/issues/23139

backport to improve CI stability - test only change

Closes scylladb/scylladb#25279

* https://github.com/scylladb/scylladb:
  test: test_mv_backlog: fix to consider internal writes
  test/pylib/rest_client: fix ScyllaMetrics filtering
2025-08-12 10:05:15 +02:00
Karol Nowacki
22a133df9b service/vector_store_client: Add live configuration update support
Enable runtime updates of vector_store_uri configuration without
requiring server restart.
This allows to dynamically enable, disable, or switch the vector search node endpoint on the fly.
2025-08-12 08:12:53 +02:00
Karol Nowacki
152274735e test/boost/vector_store_client_test.cc: Refactor vector store client test
Consolidate consecutive setup functions into a dedicated helper.
Extract test table creation into a separate function.
Remove redundant assertions to improve clarity.
2025-08-12 08:12:53 +02:00
Tomasz Grabiec
9fd312d157 Merge 'row_cache: add memtable overlap checks elision optimization for tombstone gc' from Botond Dénes
https://github.com/scylladb/scylladb/issues/24962 introduced memtable overlap checks to cache tombstone GC. This was observed to be very strict and greatly reduce the effectiveness of tombstone GC in the cache, especially for MV workloads, which regularly recycle old timestamp into new writes, so the memtable often has smaller min live timestamp than the timestamp of the tombstones in the cache.

When creating a new memtable, save a snapshot of the tombstone gc state. This snapshot is used later to exclude this memtable from overlap checks for tombstones, whose token have an expiry time larger than that of the tombstone, meaning: all writes in this memtable were produced at a point in time when the current tombstone has already expired. This has the following implications:
* The partition the tombstone is part of was already repaired at the time the memtable was created.
* All writes in the memtable were produced *after* this tombstone's expiry time, these writes cannot be possibly relevant for this tombstone.

Based on this, such memtables are excluded from the overlap checks. With adequately frequent memtable flushes -- so that the tombstone gc state snapshot is refreshed -- most memtables should be excluded from overlap checks, greatly helping the cache's tombstone GC efficiency.

Fixes: https://github.com/scylladb/scylladb/issues/24962

Fixes a regression introduced by https://github.com/scylladb/scylladb/pull/23255 which was backported to all releases, needs backport to all releases as well

Closes scylladb/scylladb#25033

* github.com:scylladb/scylladb:
  docs/dev/tombstone.md: document the memtable overlap check elision optimization
  test/boost/row_cache_test: add test for memtable overlap check elision
  db/cache_mutation_reader: obtain gc-before and min-live-ts lazily
  mutation/mutation_compactor: use max_purgeable::can_purge and max_purgeable::purge_result
  db/cache_mutation_reader: use max_purgeable::can_purge()
  replica/table: get_max_purgeable_fn_for_cache_underlying_reader(): use max_purgable::combine()
  replica/database: memtable_list::get_max_purgeable(): set expiry-treshold
  compaction/compaction_garbage_collector: max_purgeable: add expiry_treshold
  replica/table: propagate gc_state to memtable_list
  replica/memtable_list: add tombstone_gc_state* member
  replica/memtable: add tombstone_gc_state_snapshot
  tombstone_gc: introduce tombstone_gc_state_snapshot
  tombstone_gc: extract shared state into shared_tombstone_gc_state
  tombstone_gc: per_table_history_maps::_group0_gc_time: make it a value
  tombstone_gc: fold get_group0_gc_time() into its caller
  tombstone_gc: fold get_or_create_group0_gc_time() into update_group0_refresh_time()
  tombstone_gc: fold get_or_create_repair_history_for_table() into update_repair_time()
  tombstone_gc: refactor get_or_greate_repair_history_for_table()
  replica/memtable_list: s/min_live_timestamp()/get_max_purgeable()/
  db/read_context: return max_purgeable from get_max_purgeable()
  compaction/compaction_garbage_collector: add formatter for max_purgeable
  mutation: move definition of gc symbols to compaction.cc
  compaction/compaction_garbage_collector: refactor max_purgeable into a class
  test/boost/row_cache_test: refactor test_populating_reader_tombstone_gc_with_data_in_memtable
  test: rewrite test_compacting_reader_tombstone_gc_with_data_in_memtable in C++
  test/boost/row_cache_test: refactor cache tombstone GC with memtable overlap tests
2025-08-11 23:54:59 +02:00
Michał Chojnowski
3017dbb204 sstables/trie: add trie traversal routines
`trie::node_reader`, added in a previous series, contains
encoding-aware logic for traversing a single node
(or a batch of nodes) during a trie search.

This commits adds encoding-agnostic functions which drive the
the `trie::node_reader` in a loop to traverse the whole branch.

Together, the added functions (`traverse`, `step`, `step_back`)
and the data structure they modify (`ancestor_trail`) constitute
a trie cursor. We might later wrap them into some `trie_cursor`
class, but regardless of whether we are going to do that,
keeping them (also) as free functions makes them easier to test.

Closes scylladb/scylladb#25396
2025-08-11 19:15:09 +03:00
Botond Dénes
65c770f21a test/boost/row_cache_test: add test for memtable overlap check elision 2025-08-11 17:20:12 +03:00
Botond Dénes
cfac9691ff compaction/compaction_garbage_collector: max_purgeable: add expiry_treshold
Allow possibly avoiding overlap checks in the case where the source of
the min-live timestamp is known to only contain data which was written
*after* expiry treshold. Expiry treshold is the upper bound of
tombstone.deletion_time that was already expired at the time of
obtaining this expiry treshold value. Meaning that any write originating
from after this point in time, was generated at a time when such
tombstone was already expired. Hence these writes are not relevant for
the purposes of overlap checks with the tombstone and so their min-live
timestamp can be ignored.
This is important for MV workloads, where writes generated now can have
timestamps going far back in time, possibly blocking tombstone GC of
much older [shadowable] tombstones.
2025-08-11 17:20:11 +03:00
Patryk Jędrzejczak
e14c5e3890 Merge 'raft: enforce odd number of voters in group0' from Emil Maskovsky
raft: enforce odd number of voters in group0

Implement odd number voter enforcement in the group0 voter calculator to ensure proper Raft consensus behavior. Raft consensus requires a majority of voters to make decisions, and odd numbers of voters is preferred because an even number doesn't add additional reliability but introduces
the risk of scenarios where no group can make progress. If an even number of voters is divided into two groups of equal size during a network
partition, neither group will have majority and both will be unable to commit new entries. With an odd number of voters, such equal partition
scenarios are impossible (unless the network is partitioned into at least three groups).

Fixes: scylladb/scylladb#23266

No backport: This is a new change that is to be only deployed in the new version, so it will not be backported.

Closes scylladb/scylladb#25332

* https://github.com/scylladb/scylladb:
  raft: enforce odd number of voters in group0
  test/raft: adapt test_tablets_lwt.py for odd voter number enforcement
  test/raft: adapt test_raft_no_quorum.py for odd voter enforcement
2025-08-11 15:44:21 +02:00
Tomasz Grabiec
f7c001deff Merge 'key: clustering_bounds_comparator: avoid thread_local initialization guard overhead' from Avi Kivity
I noticed clustering_bounds_comparator was running an unnecessary
thread_local initialization guard. This series switches the variable
to constinit initialization, removing the guard.

Performance measurements (perf-simple-query) show an unimpressive
20 instruction per op reduction. However, each instruction counts!

Before:

```
throughput:
	mean=   203642.54 standard-deviation=1102.99
	median= 204328.69 median-absolute-deviation=955.56
	maximum=204624.13 minimum=202222.19
instructions_per_op:
	mean=   42097.59 standard-deviation=40.07
	median= 42111.83 median-absolute-deviation=30.65
	maximum=42139.88 minimum=42044.91
cpu_cycles_per_op:
	mean=   22664.81 standard-deviation=131.28
	median= 22581.10 median-absolute-deviation=111.57
	maximum=22832.30 minimum=22553.24
```

After:

```
throughput:
	mean=   204397.73 standard-deviation=2277.71
	median= 204942.95 median-absolute-deviation=2191.54
	maximum=207588.30 minimum=202162.80
instructions_per_op:
	mean=   42087.21 standard-deviation=27.30
	median= 42092.75 median-absolute-deviation=20.33
	maximum=42108.33 minimum=42041.51
cpu_cycles_per_op:
	mean=   22589.79 standard-deviation=219.24
	median= 22544.82 median-absolute-deviation=191.98
	maximum=22835.11 minimum=22303.52
```

(Very) minor performance improvement, no backport suggestd.

Closes scylladb/scylladb#25259

* github.com:scylladb/scylladb:
  keys: clustering_bounds_comparator: make thread_local _empty_prefix constinit
  keys: make empty creation clustering_key_prefix constexpr
  managed_bytes: make empty managed_bytes constexpr friendly
  keys: clustering_bounds_comparator: make _empty_prefix a prefix
2025-08-11 13:20:38 +02:00
Botond Dénes
ab633590f1 tombstone_gc: introduce tombstone_gc_state_snapshot
Returns gc-before times, identical to what tombstone_gc_state would have
returned at the point of taking the snapshot.
2025-08-11 07:09:14 +03:00
Botond Dénes
614d17347a tombstone_gc: extract shared state into shared_tombstone_gc_state
Instead of storing it partially in tombstone_gc and partially in an
external map. Move all external parts into the new
shared_tombstone_gc_state. This new class is responsible for
keeping and updating the repair history. tombstone_gc_state just keeps
const pointers to the shared state as before and is only responsible for
querying the tombstone gc before times.
This separation makes the code easier to follow and also enables further
patching of tombstone_gc_state.
2025-08-11 07:09:14 +03:00
Botond Dénes
ef7d49cd21 compaction/compaction_garbage_collector: refactor max_purgeable into a class
Make members private, add getters and constructors.
This struct will get more functionality soon, so class is a better fit.
2025-08-11 07:09:13 +03:00
Botond Dénes
c150bdd59c test/boost/row_cache_test: refactor test_populating_reader_tombstone_gc_with_data_in_memtable
This test currently uses gc_grace_seconds=0. The introduction
of memtable overlap elision will break these tests because the
optimization is always active with this tombstone-gc.
Switch the tests to use tombstone-gc=repair, which allows for greater
control over when the memtable overlap elision is triggered.
This requires a move to vnodes, as tombstone-gc=repair doesn't
work with RF=1 currently, and using RF=3 won't work with tablets.
2025-08-11 07:09:13 +03:00
Botond Dénes
c052f2ad1d test: rewrite test_compacting_reader_tombstone_gc_with_data_in_memtable in C++
This test will soon need to be changed to use tombstone-gc=repair. This
cannot work as of now, as the test uses a single-node cluster.
The options are the following:
* Make it use more than one nodes
* Make repair work with single node clusters
* Rewrite in C++ where repair can be done synthetically

We chose the last option, it is the simplest one both in terms of code
and runtime footprint.

The new test is in test/boost/row_cache_test.cc
Two changes were done during the migration
* Change the name to
  test_populating_reader_tombstone_gc_with_data_in_memtable
  to better express which cache component this test is targetting;
* Use NullCompactionStrategy on the table instead of disabling
  auto-compaction.
2025-08-11 07:09:13 +03:00
Botond Dénes
e4c048ada1 test/boost/row_cache_test: refactor cache tombstone GC with memtable overlap tests
These tests currently use tombstone-gc=immediate. The introduction
of memtable overlap elision will break these tests because the
optimization is always active with this tombstone-gc.
Switch the tests to use tombstone-gc=repair, which allows for greater
control over when the memtable overlap elision is triggered.
This requires a move to vnodes, as tombstone-gc=repair doesn't
work with RF=1 currently, and using RF=3 won't work with tablets.
2025-08-11 07:09:13 +03:00
Michael Litvak
276a09ac6e test: test_mv_backlog: fix to consider internal writes
The test executes a single write, fetching metrics before and after the
write, and expects the total throttled writes count to be increased
exactly by one.

However, other internal writes (compaction for example) may be executed
during this time and be throttled, causing the metrics to be increased
by more than expected.

To address this, we filter the metrics by the scheduling group label of
the user write, to filter out the compaction writes that run in the
compaction scheduling group.

Fixes scylladb/scylladb#23139
2025-08-10 10:31:02 +02:00
Michael Litvak
5c28cffdb4 test/pylib/rest_client: fix ScyllaMetrics filtering
In the ScyllaMetrics `get` function, when requesting the value for a
specific shard, it is expected to return the sum of all values of
metrics for that shard that match the labels.

However, it would return the value of the first matching line it finds
instead of summing all matching lines.

For example, if we have two lines for one shard like:
some_metric{scheduling_group_name="compaction",shard="0"} 1
some_metric{scheduling_group_name="sl:default",shard="0"} 2

The result of this call would be 1 instead of 3:
get('some_metric', shard="0")

We fix this to sum all matching lines.

The filtering of lines by labels is fixed to allow specifying only some
of the labels. Previously, for the line to match the filter, either the
filter needs to be empty, or all the labels in the metric line had to be
specified in the filter parameter and match its value, which is
unexpected, and breaks when more labels are added.

We also simplify the function signature and the implementation - instead
of having the shard as a separate parameter, it can be specified as a
label, like any other label.
2025-08-10 10:16:00 +02:00
Avi Kivity
c2a2e11c40 Merge 'Prepare the way for incremental repair' from Botond Dénes
With incremental repair, each replica::compaction_group will have 3 logical compaction groups, repaired, repairing and unrepaired. The definition of group is a set of sstables that can be compacted together. The logical groups will share the same instance of sstable_set, but each will have its own logical sstable set. Existing compaction::table_state is a view for a logical compaction group. So it makes sense that each replica::compaction_group will have multiple views. Each view will provide to compaction layer only the sstables that belong to it. That way, we preserve the existing interface between replica and compaction layer, where each compaction::table_state represents a single logical group.
The idea is that all the incremental repair knowledge is confined to repair and replica layer, compaction doesn't want to know about it, it just works on logical groups, what each represents doesn't matter from the perspective of the subsystem. This is the best way forward to not violate layers and reduce the maintenance burden in the long run.
We also proceed to rename table_state to compaction_group_view, since it's a better description. Working with multiple terms is confusing. The placeholder for implementing the sstable classifier is also left in tablet_storage_group_manager, by the time being, all sstables will go to the unrepaired logical set, which preserves the current behavior.

New functionality, no backport required

Closes scylladb/scylladb#25287

* github.com:scylladb/scylladb:
  test: Add test that compaction doesn't cross logical group boundary
  replica: Introduce views in compaction_group for incremental repair
  compaction: Allow view to be added with compaction disabled
  replica: Futurize retrieval of sstable sets in compaction_group_view
  treewide: Futurize estimation of pending compaction tasks
  replica: Allow compaction_group to have more than one view
  Move backlog tracker to replica::compaction_group
  treewide: Rename table_state to compaction_group_view
  tests: adjust for incremental repair
2025-08-09 17:21:17 +03:00
Emil Maskovsky
7c54401d3d raft: enforce odd number of voters in group0
Implement odd number voter enforcement in the group0 voter calculator to
ensure proper Raft consensus behavior. Raft consensus requires a majority
of voters to make decisions, and odd numbers of voters is preferred
because an even number doesn't add additional reliability but introduces
the risk of scenarios where no group can make progress. If an even number
of voters is divided into two groups of equal size during a network
partition, neither group will have majority and both will be unable to
commit new entries. With an odd number of voters, such equal partition
scenarios are impossible (unless the network is partitioned into at least
three groups).

Fixes: scylladb/scylladb#23266
2025-08-08 19:49:20 +02:00
Emil Maskovsky
29ddb2aa18 test/raft: adapt test_tablets_lwt.py for odd voter number enforcement
The test_lwt_timeout_while_creating_paxos_state_table was failing after
implementing odd number voter enforcement in the group0 voter calculator.

Previously with 2 nodes:
- 2 nodes → 2 voters → stop 1 node → 1/2 voters (no quorum) → expected Raft timeout

With odd voter count enforcement:
- 2 nodes → 1 voter → stop 1 node → 0/1 voters → Cassandra availability error

This change updates the test to use 3 nodes instead of 2, ensuring proper
no-quorum scenarios:
- 3 nodes → 3 voters → stop 2 nodes → 1/3 voters (no quorum) → Raft timeout

The test now correctly validates LWT timeout behavior while being compatible
with the odd number voter enforcement requirement.
2025-08-08 19:49:10 +02:00
Emil Maskovsky
7fc75aff3e test/raft: adapt test_raft_no_quorum.py for odd voter enforcement
Update the no-quorum cluster tests to work correctly with the new odd
number voter enforcement in the group0 voter calculator. The tests now
properly account for the changed voter counts when validating no-quorum
scenarios.
2025-08-08 19:48:58 +02:00
Benny Halevy
0a20834d2a replica: table: get rid of update_sstables_known_generation
It is not needed anymore.
With that database::_sstable_generation_generator can
be a regular member rather than optional and initialized
later.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-08 11:46:21 +03:00
Benny Halevy
42cb25c470 sstables: sstable_directory: stop tracking highest_generation
It is not needed anymore as we always generate
uuid generations.

Convert sstable_directory_test_table_simple_empty_directory_scan
to use the newly added empty() method instead of
checking the highest generation seen.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-08 11:46:21 +03:00
Benny Halevy
b01524c5a3 replica: distributed_loader: stop tracking highest_generation
It is not needed anymore as we always generate
uuid generations.

Move highest_generation_seen(sharded<sstables::sstable_directory>& directory)
to sstables/sstable_directory module.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-08 11:46:21 +03:00
Benny Halevy
6cc964ef16 sstables: sstable_generation: get rid of uuid_identifiers bool class
Now that all call sites enable uuid_identifiers.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-08 11:46:21 +03:00
Benny Halevy
0ad1898f0a feature_service: move UUID_SSTABLE_IDENTIFIERS to supported_feature_set
The feature is supported by all live versions since
version 5.4 / 2024.1.

(Although up to 6da758d74c
it could be disabled using the config option)

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-08-08 11:46:15 +03:00
Botond Dénes
70aa81990b Merge 'Alternator - add the ability to write, not just read, system tables' from Nadav Har'El
In commit 44a1daf we added the ability to read Scylla system tables with Alternator. This feature is useful, among other things, in tests that want to read Scylla's configuration through the system table system.config. But tests often want to modify system.config, e.g., to temporarily reduce some threshold to make tests shorter. Until now, this was not possible

This series add supports for writing to system tables through Alternator, and examples of tests using this capability (and utility functions to make it easy).

Because the ability to write to system tables may have non-obvious security consequences, it is turned off by default and needs to be enabled with a new configuration option "alternator_allow_system_table_write"

No backports are necessary - this feature is only intended for tests. We may later decide to backport if we want to backport new tests, but I think the probability we'll want to do this is low.

Fixes #12348

Closes scylladb/scylladb#19147

* github.com:scylladb/scylladb:
  test/alternator: utility functions for changing configuration
  alternator: add optional support for writing to system table
  test/alternator: reduce duplicated code
2025-08-08 09:13:15 +03:00
Raphael S. Carvalho
beaaf00fac test: Add test that compaction doesn't cross logical group boundary
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:58:01 +03:00
Raphael S. Carvalho
d351b0726b replica: Introduce views in compaction_group for incremental repair
Wired the unrepaired, repairing and repaired views into compaction_group.

Also the repaired filter was wired, so tablet_storage_group_manager
can implement the procedure to classify the sstable.

Based on this classifier, we can decide which view a sstable belongs
to, at any given point in time.

Additionally, we made changes changes to compaction_group_view
to return only sstables that belong to the underlying view.

From this point on, repaired, repairing and unrepaired sets are
connected to compaction manager through their views. And that
guarantees sstables on different groups cannot be compacted
together.
Repairing view specifically has compaction disabled on it altogether,
we can revert this later if we want, to allow repairing sstables
to be compacted with one another.

The benefit of this logical approach is having the classifier
as the single source of truth. Otherwise, we'd need to keep the
sstable location consistest with global metadata, creating
complexity

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:58:00 +03:00