Commit Graph

48884 Commits

Author SHA1 Message Date
Artsiom Mishuta
d589f36645 test.py: small refactoring of how boost test arguments make
During migration, boost tests to pytest, a big portion of the logic was
used "as is" with bad code and bugs

This PR refactors the function that makes an argument for the pytest command:

1)refactor how modes are provided
2)refactor how --skip provided
3)remove shlex.split woraround
2025-08-14 14:45:28 +02:00
Tomasz Grabiec
9fd312d157 Merge 'row_cache: add memtable overlap checks elision optimization for tombstone gc' from Botond Dénes
https://github.com/scylladb/scylladb/issues/24962 introduced memtable overlap checks to cache tombstone GC. This was observed to be very strict and greatly reduce the effectiveness of tombstone GC in the cache, especially for MV workloads, which regularly recycle old timestamp into new writes, so the memtable often has smaller min live timestamp than the timestamp of the tombstones in the cache.

When creating a new memtable, save a snapshot of the tombstone gc state. This snapshot is used later to exclude this memtable from overlap checks for tombstones, whose token have an expiry time larger than that of the tombstone, meaning: all writes in this memtable were produced at a point in time when the current tombstone has already expired. This has the following implications:
* The partition the tombstone is part of was already repaired at the time the memtable was created.
* All writes in the memtable were produced *after* this tombstone's expiry time, these writes cannot be possibly relevant for this tombstone.

Based on this, such memtables are excluded from the overlap checks. With adequately frequent memtable flushes -- so that the tombstone gc state snapshot is refreshed -- most memtables should be excluded from overlap checks, greatly helping the cache's tombstone GC efficiency.

Fixes: https://github.com/scylladb/scylladb/issues/24962

Fixes a regression introduced by https://github.com/scylladb/scylladb/pull/23255 which was backported to all releases, needs backport to all releases as well

Closes scylladb/scylladb#25033

* github.com:scylladb/scylladb:
  docs/dev/tombstone.md: document the memtable overlap check elision optimization
  test/boost/row_cache_test: add test for memtable overlap check elision
  db/cache_mutation_reader: obtain gc-before and min-live-ts lazily
  mutation/mutation_compactor: use max_purgeable::can_purge and max_purgeable::purge_result
  db/cache_mutation_reader: use max_purgeable::can_purge()
  replica/table: get_max_purgeable_fn_for_cache_underlying_reader(): use max_purgable::combine()
  replica/database: memtable_list::get_max_purgeable(): set expiry-treshold
  compaction/compaction_garbage_collector: max_purgeable: add expiry_treshold
  replica/table: propagate gc_state to memtable_list
  replica/memtable_list: add tombstone_gc_state* member
  replica/memtable: add tombstone_gc_state_snapshot
  tombstone_gc: introduce tombstone_gc_state_snapshot
  tombstone_gc: extract shared state into shared_tombstone_gc_state
  tombstone_gc: per_table_history_maps::_group0_gc_time: make it a value
  tombstone_gc: fold get_group0_gc_time() into its caller
  tombstone_gc: fold get_or_create_group0_gc_time() into update_group0_refresh_time()
  tombstone_gc: fold get_or_create_repair_history_for_table() into update_repair_time()
  tombstone_gc: refactor get_or_greate_repair_history_for_table()
  replica/memtable_list: s/min_live_timestamp()/get_max_purgeable()/
  db/read_context: return max_purgeable from get_max_purgeable()
  compaction/compaction_garbage_collector: add formatter for max_purgeable
  mutation: move definition of gc symbols to compaction.cc
  compaction/compaction_garbage_collector: refactor max_purgeable into a class
  test/boost/row_cache_test: refactor test_populating_reader_tombstone_gc_with_data_in_memtable
  test: rewrite test_compacting_reader_tombstone_gc_with_data_in_memtable in C++
  test/boost/row_cache_test: refactor cache tombstone GC with memtable overlap tests
2025-08-11 23:54:59 +02:00
Michał Chojnowski
3017dbb204 sstables/trie: add trie traversal routines
`trie::node_reader`, added in a previous series, contains
encoding-aware logic for traversing a single node
(or a batch of nodes) during a trie search.

This commits adds encoding-agnostic functions which drive the
the `trie::node_reader` in a loop to traverse the whole branch.

Together, the added functions (`traverse`, `step`, `step_back`)
and the data structure they modify (`ancestor_trail`) constitute
a trie cursor. We might later wrap them into some `trie_cursor`
class, but regardless of whether we are going to do that,
keeping them (also) as free functions makes them easier to test.

Closes scylladb/scylladb#25396
2025-08-11 19:15:09 +03:00
Botond Dénes
660ea9202a docs/dev/tombstone.md: document the memtable overlap check elision optimization 2025-08-11 17:20:12 +03:00
Botond Dénes
65c770f21a test/boost/row_cache_test: add test for memtable overlap check elision 2025-08-11 17:20:12 +03:00
Botond Dénes
7adbb1bd17 db/cache_mutation_reader: obtain gc-before and min-live-ts lazily
Obtaining the gc-before time, or the min-live timestamps (with the
expiry threshold) is not always trivial, so defer it until we know it is
needed. Not all reads will attempt to garbage-collect tombstones, these
reads can now avoid this work.
The downside is that the partition key has to be copied and stored, as
it is necessary for obtaining the min-live timestamp later.
2025-08-11 17:20:12 +03:00
Botond Dénes
f4b0c384fb mutation/mutation_compactor: use max_purgeable::can_purge and max_purgeable::purge_result
Use the optimized can_purge() check instead of the old stricter
direct timestamp comparison method.
2025-08-11 17:20:12 +03:00
Botond Dénes
92e8d2f9b2 db/cache_mutation_reader: use max_purgeable::can_purge()
Use the optimized can_purge() check instead of the old stricter
direct timestamp comparison method.
2025-08-11 17:20:12 +03:00
Botond Dénes
4e15d32151 replica/table: get_max_purgeable_fn_for_cache_underlying_reader(): use max_purgable::combine()
To combine the max purgable values, instead of just combining the
timestamp values. The former way is still correct, but loses the
timestamp explosion optimization, which allows the cache reader to drop
timestamps from the overlap checks.
2025-08-11 17:20:12 +03:00
Botond Dénes
bd32d41cad replica/database: memtable_list::get_max_purgeable(): set expiry-treshold
Use the newly introduced expiry_treshold field of max_purgeable, to help
exclude memtables from the overlap check if possible.
2025-08-11 17:20:12 +03:00
Botond Dénes
cfac9691ff compaction/compaction_garbage_collector: max_purgeable: add expiry_treshold
Allow possibly avoiding overlap checks in the case where the source of
the min-live timestamp is known to only contain data which was written
*after* expiry treshold. Expiry treshold is the upper bound of
tombstone.deletion_time that was already expired at the time of
obtaining this expiry treshold value. Meaning that any write originating
from after this point in time, was generated at a time when such
tombstone was already expired. Hence these writes are not relevant for
the purposes of overlap checks with the tombstone and so their min-live
timestamp can be ignored.
This is important for MV workloads, where writes generated now can have
timestamps going far back in time, possibly blocking tombstone GC of
much older [shadowable] tombstones.
2025-08-11 17:20:11 +03:00
Patryk Jędrzejczak
e14c5e3890 Merge 'raft: enforce odd number of voters in group0' from Emil Maskovsky
raft: enforce odd number of voters in group0

Implement odd number voter enforcement in the group0 voter calculator to ensure proper Raft consensus behavior. Raft consensus requires a majority of voters to make decisions, and odd numbers of voters is preferred because an even number doesn't add additional reliability but introduces
the risk of scenarios where no group can make progress. If an even number of voters is divided into two groups of equal size during a network
partition, neither group will have majority and both will be unable to commit new entries. With an odd number of voters, such equal partition
scenarios are impossible (unless the network is partitioned into at least three groups).

Fixes: scylladb/scylladb#23266

No backport: This is a new change that is to be only deployed in the new version, so it will not be backported.

Closes scylladb/scylladb#25332

* https://github.com/scylladb/scylladb:
  raft: enforce odd number of voters in group0
  test/raft: adapt test_tablets_lwt.py for odd voter number enforcement
  test/raft: adapt test_raft_no_quorum.py for odd voter enforcement
2025-08-11 15:44:21 +02:00
Tomasz Grabiec
f7c001deff Merge 'key: clustering_bounds_comparator: avoid thread_local initialization guard overhead' from Avi Kivity
I noticed clustering_bounds_comparator was running an unnecessary
thread_local initialization guard. This series switches the variable
to constinit initialization, removing the guard.

Performance measurements (perf-simple-query) show an unimpressive
20 instruction per op reduction. However, each instruction counts!

Before:

```
throughput:
	mean=   203642.54 standard-deviation=1102.99
	median= 204328.69 median-absolute-deviation=955.56
	maximum=204624.13 minimum=202222.19
instructions_per_op:
	mean=   42097.59 standard-deviation=40.07
	median= 42111.83 median-absolute-deviation=30.65
	maximum=42139.88 minimum=42044.91
cpu_cycles_per_op:
	mean=   22664.81 standard-deviation=131.28
	median= 22581.10 median-absolute-deviation=111.57
	maximum=22832.30 minimum=22553.24
```

After:

```
throughput:
	mean=   204397.73 standard-deviation=2277.71
	median= 204942.95 median-absolute-deviation=2191.54
	maximum=207588.30 minimum=202162.80
instructions_per_op:
	mean=   42087.21 standard-deviation=27.30
	median= 42092.75 median-absolute-deviation=20.33
	maximum=42108.33 minimum=42041.51
cpu_cycles_per_op:
	mean=   22589.79 standard-deviation=219.24
	median= 22544.82 median-absolute-deviation=191.98
	maximum=22835.11 minimum=22303.52
```

(Very) minor performance improvement, no backport suggestd.

Closes scylladb/scylladb#25259

* github.com:scylladb/scylladb:
  keys: clustering_bounds_comparator: make thread_local _empty_prefix constinit
  keys: make empty creation clustering_key_prefix constexpr
  managed_bytes: make empty managed_bytes constexpr friendly
  keys: clustering_bounds_comparator: make _empty_prefix a prefix
2025-08-11 13:20:38 +02:00
Anna Stuchlik
1322f301f6 doc: add support for RHEL 10
This commit adds RHEL 10 to the list of supported platforms.

Fixes https://github.com/scylladb/scylladb/issues/25436

Closes scylladb/scylladb#25437
2025-08-11 13:13:37 +02:00
Israel Fruchter
2da26d1fc1 Update tools/cqlsh submodule (v6.0.26)
* tools/cqlsh 02ec7c57...aa1a52c1 (6):
  > build-push.yaml: upgrade cibuildwheel to latest
  > build-push.yml: skip python 3.8 and PyPy builds
  > cqlshlib: make NetworkTopologyStrategy default for autocomplete
  > default to setuptools_scm based version when not packaged
  > chore(deps): update pypa/cibuildwheel action to v2.23.0

Closes scylladb/scylladb#25420
2025-08-11 13:07:47 +03:00
Artsiom Mishuta
dac04a5b97 fix(test.py) incorrect markers argument in boost tests
pytest parkers argument can be space separated like "not unstable"
to pass such argument propperly in CLI(bash) command we should use double quates
due to using  shlex.split with space separation
while we are not support markers in C++ tests we are passig all pytest
arguments

tested locally on command:
./tools/toolchain/dbuild  ./test.py --markers="not unstable" test/boost/auth_passwords_test.cc

before change: no tests ran in 1.12s
after: 8 passed in 2.45s

Closes scylladb/scylladb#25394
2025-08-11 10:43:34 +03:00
Patryk Jędrzejczak
7b77c6cc4a docs: Raft recovery procedure: recommend verifying participation in Raft recovery
This instruction adds additional safety. The faster we notice that
a node didn't restart properly, the better.

The old gossip-based recovery procedure had a similar recommendation
to verify that each restarting node entered `RECOVERY` mode.

Fixes #25375

This is a documentation improvement. We should backport it to all
branches with the new recovery procedure, so 2025.2 and 2025.3.

Closes scylladb/scylladb#25376
2025-08-11 09:21:29 +03:00
Avi Kivity
f49b63f696 tools: toolchain: dbuild: forward container registry credentials
Docker hub rate-limits unauthenticated image pulls, so forward
the host's credentials to the container. This prevents rate limit
errors when running nested containers.

Try the locations for the credentials in order and bind-mount the
first that exists to a location that gets picked up.

Verified with `podman login --get-login docker.io` in the container.

Closes scylladb/scylladb#25354
2025-08-11 09:05:57 +03:00
Botond Dénes
3b1f414fcf replica/table: propagate gc_state to memtable_list 2025-08-11 07:09:19 +03:00
Botond Dénes
9d00d7e08d replica/memtable_list: add tombstone_gc_state* member
To be passed down to the memtable.
2025-08-11 07:09:19 +03:00
Botond Dénes
ef8a21b4cf replica/memtable: add tombstone_gc_state_snapshot
To be used for possibly excluding the memtable from overlap checks with
the cache/sstables, in memtable_list::get_max_purgeable().
2025-08-11 07:09:19 +03:00
Botond Dénes
ab633590f1 tombstone_gc: introduce tombstone_gc_state_snapshot
Returns gc-before times, identical to what tombstone_gc_state would have
returned at the point of taking the snapshot.
2025-08-11 07:09:14 +03:00
Botond Dénes
614d17347a tombstone_gc: extract shared state into shared_tombstone_gc_state
Instead of storing it partially in tombstone_gc and partially in an
external map. Move all external parts into the new
shared_tombstone_gc_state. This new class is responsible for
keeping and updating the repair history. tombstone_gc_state just keeps
const pointers to the shared state as before and is only responsible for
querying the tombstone gc before times.
This separation makes the code easier to follow and also enables further
patching of tombstone_gc_state.
2025-08-11 07:09:14 +03:00
Botond Dénes
3a54379330 tombstone_gc: per_table_history_maps::_group0_gc_time: make it a value
No reason for it to be a shared pointer, or even a pointer at all. When
the pointer is not initialized, gc_clock::time_point::min() is used as
the group0 gc time, so we can just replace with a gc_clock::time_point
value initialized to min() and do away with an unnecessary indirection
as well as an allocation. This latter will be even more important after
the next patches.
2025-08-11 07:09:14 +03:00
Botond Dénes
aa43396aac tombstone_gc: fold get_group0_gc_time() into its caller
It has just one caller. This fold makes the code simpler and facilitates
further patching.
2025-08-11 07:09:14 +03:00
Botond Dénes
faa2b5b4d4 tombstone_gc: fold get_or_create_group0_gc_time() into update_group0_refresh_time()
Its only caller. Makes the code simpler and facilitates further
patching.
2025-08-11 07:09:13 +03:00
Botond Dénes
e9d211bbcd tombstone_gc: fold get_or_create_repair_history_for_table() into update_repair_time()
Its only caller. Makes the code simpler and facilitates further
patching.
2025-08-11 07:09:13 +03:00
Botond Dénes
b9f0cabead tombstone_gc: refactor get_or_greate_repair_history_for_table()
This method has 3 lookups into the reconcile history maps in the worst
case. Reduce to just one. Makes the code more streamlined and prepares
the groundwork for the next patch.
2025-08-11 07:09:13 +03:00
Botond Dénes
1d3a3163a3 replica/memtable_list: s/min_live_timestamp()/get_max_purgeable()/
Also change to the return type to max_purgeable, instead of raw
timestamp. Prepares for further patching of this code.
2025-08-11 07:09:13 +03:00
Botond Dénes
5d69ef5e8b db/read_context: return max_purgeable from get_max_purgeable()
Instead of just the timestamp. Soon more fields will be used.
2025-08-11 07:09:13 +03:00
Botond Dénes
1d2cc6ef12 compaction/compaction_garbage_collector: add formatter for max_purgeable
It is more than just a timestamp already, and it is about to receive
some additional fields.
2025-08-11 07:09:13 +03:00
Botond Dénes
6078c15116 mutation: move definition of gc symbols to compaction.cc
We are used to symbols definition being grouped in one .cc file, but a
symbol declaration and definition living in separate modules
(subfolders) is surprising.
Relocate always_gc, never_gc, can_always_purge and can_never_purge to
compaction/compaction.cc, from mutatiobn/mutation_partition.cc. The
declarations of these symbols is in
compaction/compaction_garbage_collector.hh.
2025-08-11 07:09:13 +03:00
Botond Dénes
ef7d49cd21 compaction/compaction_garbage_collector: refactor max_purgeable into a class
Make members private, add getters and constructors.
This struct will get more functionality soon, so class is a better fit.
2025-08-11 07:09:13 +03:00
Botond Dénes
c150bdd59c test/boost/row_cache_test: refactor test_populating_reader_tombstone_gc_with_data_in_memtable
This test currently uses gc_grace_seconds=0. The introduction
of memtable overlap elision will break these tests because the
optimization is always active with this tombstone-gc.
Switch the tests to use tombstone-gc=repair, which allows for greater
control over when the memtable overlap elision is triggered.
This requires a move to vnodes, as tombstone-gc=repair doesn't
work with RF=1 currently, and using RF=3 won't work with tablets.
2025-08-11 07:09:13 +03:00
Botond Dénes
c052f2ad1d test: rewrite test_compacting_reader_tombstone_gc_with_data_in_memtable in C++
This test will soon need to be changed to use tombstone-gc=repair. This
cannot work as of now, as the test uses a single-node cluster.
The options are the following:
* Make it use more than one nodes
* Make repair work with single node clusters
* Rewrite in C++ where repair can be done synthetically

We chose the last option, it is the simplest one both in terms of code
and runtime footprint.

The new test is in test/boost/row_cache_test.cc
Two changes were done during the migration
* Change the name to
  test_populating_reader_tombstone_gc_with_data_in_memtable
  to better express which cache component this test is targetting;
* Use NullCompactionStrategy on the table instead of disabling
  auto-compaction.
2025-08-11 07:09:13 +03:00
Botond Dénes
e4c048ada1 test/boost/row_cache_test: refactor cache tombstone GC with memtable overlap tests
These tests currently use tombstone-gc=immediate. The introduction
of memtable overlap elision will break these tests because the
optimization is always active with this tombstone-gc.
Switch the tests to use tombstone-gc=repair, which allows for greater
control over when the memtable overlap elision is triggered.
This requires a move to vnodes, as tombstone-gc=repair doesn't
work with RF=1 currently, and using RF=3 won't work with tablets.
2025-08-11 07:09:13 +03:00
Avi Kivity
6daa6178b1 scripts: pull_github_pr.sh: reject unintended submodule changes
It is easy for submodule changes to slip through during rebase (if
the developer uses the terrible `git add -u`  command) and
for a maintainer to miss it (if they don't go over each change after
a rebase).

Protect against such mishaps by checking if a submodule was updated
(or .gitmodules itself was changes) and aborting the operation.

If the pull request title contains "submodule", assume the operation
was intended.

Allow bypassing the check with --allow-submodule.

Closes scylladb/scylladb#25418
2025-08-10 11:48:34 +03:00
Avi Kivity
c2a2e11c40 Merge 'Prepare the way for incremental repair' from Botond Dénes
With incremental repair, each replica::compaction_group will have 3 logical compaction groups, repaired, repairing and unrepaired. The definition of group is a set of sstables that can be compacted together. The logical groups will share the same instance of sstable_set, but each will have its own logical sstable set. Existing compaction::table_state is a view for a logical compaction group. So it makes sense that each replica::compaction_group will have multiple views. Each view will provide to compaction layer only the sstables that belong to it. That way, we preserve the existing interface between replica and compaction layer, where each compaction::table_state represents a single logical group.
The idea is that all the incremental repair knowledge is confined to repair and replica layer, compaction doesn't want to know about it, it just works on logical groups, what each represents doesn't matter from the perspective of the subsystem. This is the best way forward to not violate layers and reduce the maintenance burden in the long run.
We also proceed to rename table_state to compaction_group_view, since it's a better description. Working with multiple terms is confusing. The placeholder for implementing the sstable classifier is also left in tablet_storage_group_manager, by the time being, all sstables will go to the unrepaired logical set, which preserves the current behavior.

New functionality, no backport required

Closes scylladb/scylladb#25287

* github.com:scylladb/scylladb:
  test: Add test that compaction doesn't cross logical group boundary
  replica: Introduce views in compaction_group for incremental repair
  compaction: Allow view to be added with compaction disabled
  replica: Futurize retrieval of sstable sets in compaction_group_view
  treewide: Futurize estimation of pending compaction tasks
  replica: Allow compaction_group to have more than one view
  Move backlog tracker to replica::compaction_group
  treewide: Rename table_state to compaction_group_view
  tests: adjust for incremental repair
2025-08-09 17:21:17 +03:00
Emil Maskovsky
7c54401d3d raft: enforce odd number of voters in group0
Implement odd number voter enforcement in the group0 voter calculator to
ensure proper Raft consensus behavior. Raft consensus requires a majority
of voters to make decisions, and odd numbers of voters is preferred
because an even number doesn't add additional reliability but introduces
the risk of scenarios where no group can make progress. If an even number
of voters is divided into two groups of equal size during a network
partition, neither group will have majority and both will be unable to
commit new entries. With an odd number of voters, such equal partition
scenarios are impossible (unless the network is partitioned into at least
three groups).

Fixes: scylladb/scylladb#23266
2025-08-08 19:49:20 +02:00
Emil Maskovsky
29ddb2aa18 test/raft: adapt test_tablets_lwt.py for odd voter number enforcement
The test_lwt_timeout_while_creating_paxos_state_table was failing after
implementing odd number voter enforcement in the group0 voter calculator.

Previously with 2 nodes:
- 2 nodes → 2 voters → stop 1 node → 1/2 voters (no quorum) → expected Raft timeout

With odd voter count enforcement:
- 2 nodes → 1 voter → stop 1 node → 0/1 voters → Cassandra availability error

This change updates the test to use 3 nodes instead of 2, ensuring proper
no-quorum scenarios:
- 3 nodes → 3 voters → stop 2 nodes → 1/3 voters (no quorum) → Raft timeout

The test now correctly validates LWT timeout behavior while being compatible
with the odd number voter enforcement requirement.
2025-08-08 19:49:10 +02:00
Emil Maskovsky
7fc75aff3e test/raft: adapt test_raft_no_quorum.py for odd voter enforcement
Update the no-quorum cluster tests to work correctly with the new odd
number voter enforcement in the group0 voter calculator. The tests now
properly account for the changed voter counts when validating no-quorum
scenarios.
2025-08-08 19:48:58 +02:00
Anna Stuchlik
f3d9d0c1c7 doc: add new and removed metrics to the 2025.3 upgrade guide
This commit adds the list of new and removed metrics to the already existing upgrade guide
from 2025.2 to 2025.3.

Fixes https://github.com/scylladb/scylladb/issues/24697

Closes scylladb/scylladb#25385
2025-08-08 13:25:51 +02:00
Avi Kivity
ab45a0edb5 Update seastar submodule
* seastar 60b2e7da...1520326e (36):
  > Merge 'http/client: Fix content length body overflow check (and a bit more)' from Pavel Emelyanov
    test/http: Add test for http_content_length_data_sink
    test/http: Implement some missing methods for memory data sink
    http/client: Fix content length body overflow check
    http/client: Fix misprint in overflow exception message
  > dns: Use TCP connection data_sink directly
  > iostream: Update "used stream" check for output_stream::detach()
  > Update dpdk submodule
  > rpc: server::process: coroutinize
  > iostream: Remove deprecated constructor
  > Merge 'foreign_ptr: add unwrap_on_owner_shard method' from Benny Halevy
    foreign_ptr: add unwrap_on_owner_shard method
    foreign_ptr: release: check_shard with SEASTAR_DEBUG_SHARED_PTR
  > enum: Replace static_assert() with concept
  > rpc: reindent connection::negotiate()
  > rpc: client: use structured binding
  > rpc.cc: reindent
  > queue: Remove duplicating static assertion
  > Merge 'rpc: client: convert main loop to a coroutine' from Avi Kivity
    rpc: client::loop(): restore indentation
    rpc: client: coroutinize client::loop()
    rpc: client: split main loop function
  > Merge 'treewide: replace remaining std::enable_if with constraints' from Avi Kivity
    optimized_optional: replace std::enable_if with constraint
    log: replace std::enable_if with constraint
    rpc: replace std::enable_if with constraint
    when_all: replace std::enable_if with constraints
    transfer: replace std::enable_if with constraints
    sstring: replace std::enable_if with constraint
    simple-stream: replace std::enable_if with constraints
    shared_ptr: replace std::enable_if with constraints
    sharded: replace std::enable_if with constraints for sharded_has_stop
    sharded: replace std::enable_if with constraints for peering_sharded_service
    scollectd: replace std::enable_if with constraints for type inference
    scollectd: replace std::enable_if with constraints for ser/deser
    metrics: replace std::enable_if with constraints
    chunked_fifo: replace std::enable_if with constraint
    future: replace std::enable_if with constraints
  > websocket: Avoid sending scattered_message to output_stream
  > websocket: Remove unused scattered_message.hh inclusion
  > aio: Squash aio_nowait_supported into fs_info::nowait_works
  > Merge 'reactor: coroutinize spawn()' from Avi Kivity
    reactor: restore indentation for spawn()
    reactor: coroutinize spawn()
  > modules: export coroutine facilities
  > Merge 'reactor: coroutinize some file-related functions' from Avi Kivity
    reactor: adjust indentation
    reactor: coroutinize reactor::make_pipe()
    reactor: coroutinize reactor::inotify_add_watch()
    reactor: coroutinize reactor::read_directory()
    reactor: coroutinize reactor::file_type()
    reactor: coroutinize reactor::chmod()
    reactor: coroutinize reactor::link_file()
    reactor: coroutinize reactor::rename_file()
    reactor: coroutinize open_file_dma()
  > memory: inline disable_abort_on_alloc_failure_temporarily
  > Merge 'addr2line timing and optimizations' from Travis Downs
    addr2line: add basic timing support
    addr2line: do a quick check for 0x in the line
    addr2line: don't load entire file
    addr2line: typing fixing
  > posix: Replace static_assert with concept
  > tls: Push iovec with the help of put(vector<temporary_buffer>)
  > io_queue: Narrow down friendship with reactor
  > util: drop concepts.hh
  > reactor: Re-use posix::to_timespec() helper
  > Fix incorrect defaults for io queue iops/bandwidth
  > net: functions describing ssl connection
  > Add label values to the duplicate metrics exception
  > Merge 'Nested scheduling groups (CPU only)' from Pavel Emelyanov
    test: Add unit test for cross-sched-groups wakeups
    test: Add unit test for fair CPU scheduling
    test: Add unit test for basic supergrops manipulations
    test: Add perf test for context switch latency
    scheduling: Add an internal method to get group's supergroup
    reactor: Add supergroup get_shares() API
    reactor: Add supergroup::set_shares() API
    reactor: Create scheduling groups in supergroups
    reactor: Supergroups destroying API
    reactor: Supergroups creating API
    reactor: Pass parent pointer to task_queue from caller
    reactor: Wakeup queue group on child activation
    reactor: Add pure virtual sched_entity::run_tasks() method
    reactor: Make task_queue_group be sched_entity too
    reactor: Split task_queue_group::run_some_tasks()
    reactor: Count and limit supergroup children
    reactor: Link sched entity to its parent
    reactor: Switch activate(task_queue*) to work on sched_entity
    reactor: Move set_shares() to sched_entity()
    reactor: Make account_runtime() work with sched_entity
    reactor: Make insert_activating_task_queue() work on sched_entity
    reactor: Make pop_active_task_queue() work on sched_entity
    reactor: Make insert_active_task_queue() work on sched_entity
    reactor: Move timings to sched_entity
    reactor: Move active bit to sched_entity
    reactor: Move shares to sched_entity
    reactor: Move vruntime to sched_entity
    reactor: Introduce sched_entity
    reactor: Rename _activating_task_queues -> _activating
    reactor: Remove local atq* variable
    reactor: Rename _active_task_queues -> _active
    reactor: Move account_runtime() to task_queue_group
    reactor: Move vruntime update from task_queue into _group
    reactor: Simplify task_queue_group::run_some_tasks()
    reactor: Move run_some_tasks() into task_queue_group
    reactor: Move insert_activating_task_queues() into task_queue_group
    reactor: Move pop_active_task_queue() into task_queue_group
    reactor: Move insert_active_task_queue() into task_queue_group
    reactor: Introduce and use task_queue_group::activate(task_queue)
    reactor: Introduce task_queue_group::active()
    reactor: Wrap scheduling fields into task_queue_group
    reactor: Simplify task_queue::activate()
    reactor: Rename task_queue::activate() -> wakeup()
    reactor: Make activate() method of class task_queue
    reactor: Make task_queue::run_tasks() return bool
    reactor: Simplify task_queue::run_tasks()
    reactor: Make run_tasks() method of class task_queue
  > Fix hang in io_queue for big write ioproperties numbers
  > split random io buffer size in 2 options
  > reactor: document run_in_background
  > Merge 'Add io_queue unit test for checking request rates' from Robert Bindar
    Add unit test for validating computed params in io_queue
    Move `disk_params` and `disk_config_params` to their own unit
    Add an overload for `disk_config_params::generate_config`

Closes scylladb/scylladb#25404
2025-08-08 12:24:39 +03:00
Botond Dénes
70aa81990b Merge 'Alternator - add the ability to write, not just read, system tables' from Nadav Har'El
In commit 44a1daf we added the ability to read Scylla system tables with Alternator. This feature is useful, among other things, in tests that want to read Scylla's configuration through the system table system.config. But tests often want to modify system.config, e.g., to temporarily reduce some threshold to make tests shorter. Until now, this was not possible

This series add supports for writing to system tables through Alternator, and examples of tests using this capability (and utility functions to make it easy).

Because the ability to write to system tables may have non-obvious security consequences, it is turned off by default and needs to be enabled with a new configuration option "alternator_allow_system_table_write"

No backports are necessary - this feature is only intended for tests. We may later decide to backport if we want to backport new tests, but I think the probability we'll want to do this is low.

Fixes #12348

Closes scylladb/scylladb#19147

* github.com:scylladb/scylladb:
  test/alternator: utility functions for changing configuration
  alternator: add optional support for writing to system table
  test/alternator: reduce duplicated code
2025-08-08 09:13:15 +03:00
Raphael S. Carvalho
beaaf00fac test: Add test that compaction doesn't cross logical group boundary
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:58:01 +03:00
Raphael S. Carvalho
d351b0726b replica: Introduce views in compaction_group for incremental repair
Wired the unrepaired, repairing and repaired views into compaction_group.

Also the repaired filter was wired, so tablet_storage_group_manager
can implement the procedure to classify the sstable.

Based on this classifier, we can decide which view a sstable belongs
to, at any given point in time.

Additionally, we made changes changes to compaction_group_view
to return only sstables that belong to the underlying view.

From this point on, repaired, repairing and unrepaired sets are
connected to compaction manager through their views. And that
guarantees sstables on different groups cannot be compacted
together.
Repairing view specifically has compaction disabled on it altogether,
we can revert this later if we want, to allow repairing sstables
to be compacted with one another.

The benefit of this logical approach is having the classifier
as the single source of truth. Otherwise, we'd need to keep the
sstable location consistest with global metadata, creating
complexity

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:58:00 +03:00
Raphael S. Carvalho
61cb02f580 compaction: Allow view to be added with compaction disabled
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:58:00 +03:00
Raphael S. Carvalho
9d3755f276 replica: Futurize retrieval of sstable sets in compaction_group_view
This will allow upcoming work to gently produce a sstable set for
each compaction group view. Example: repaired and unrepaired.

Locking strategy for compaction's sstable selection:
Since sstable retrieval path became futurized, tasks in compaction
manager will now hold the write lock (compaction_state::lock)
when retrieving the sstable list, feeding them into compaction
strategy, and finally registering selected sstables as compacting.
The last step prevents another concurrent task from picking the
same sstable. Previously, all those steps were atomic, but
we have seen stall in that area in large installations, so
futurization of that area would come sooner or later.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:58:00 +03:00
Raphael S. Carvalho
20c3301a1a treewide: Futurize estimation of pending compaction tasks
This is to allow futurization of compaction_group_view method that
retrieves sstable set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:51:29 +03:00
Raphael S. Carvalho
af3592c658 replica: Allow compaction_group to have more than one view
In order to support incremental repair, we'll allow each
replica::compaction_group to have two logical compaction groups
(or logical sstable sets), one for repaired, another for unrepaired.

That means we have to adapt a few places to work with
compaction_group_view instead, such that no logical compaction
group is missed when doing table or tablet wide operations.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-08-08 06:51:29 +03:00