Obtaining the gc-before time, or the min-live timestamps (with the
expiry threshold), is not always trivial, so defer it until we know it is
needed. Not all reads attempt to garbage-collect tombstones; these
reads can now avoid this work.
The downside is that the partition key has to be copied and stored, as
it is necessary for obtaining the min-live timestamp later.
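
A minimal sketch of the deferral idea, not the actual implementation: the class name `LazyGcBefore`, the `lookup` callback and `expensive_lookup` are hypothetical stand-ins. It shows the trade-off described above: the key is copied up front, and the expensive lookup runs at most once, only if somebody asks for the value.

```python
from typing import Callable, Optional


class LazyGcBefore:
    """Defers a potentially expensive gc-before lookup until first use."""

    def __init__(self, partition_key: bytes, lookup: Callable[[bytes], int]) -> None:
        # Copy the key: the lookup runs later, after the caller's buffer may be gone.
        self._key = bytes(partition_key)
        self._lookup = lookup
        self._value: Optional[int] = None
        self.lookups = 0  # counts how often the expensive path actually ran

    def get(self) -> int:
        if self._value is None:
            self._value = self._lookup(self._key)
            self.lookups += 1
        return self._value


def expensive_lookup(key: bytes) -> int:
    # Hypothetical stand-in for consulting repair history or similar state.
    return 1000 + len(key)


lazy = LazyGcBefore(b"pk1", expensive_lookup)
# A read that never garbage-collects tombstones never calls lazy.get().
```

A read that does reach tombstone GC calls `lazy.get()`; repeated calls reuse the cached value.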
Combine the max purgeable values, instead of just combining the
timestamp values. The former way is still correct, but loses the
timestamp explosion optimization, which allows the cache reader to drop
timestamps from the overlap checks.
Allow possibly avoiding overlap checks in the case where the source of
the min-live timestamp is known to only contain data which was written
*after* the expiry threshold. The expiry threshold is the upper bound of
tombstone.deletion_time that was already expired at the time of
obtaining this expiry threshold value. This means that any write originating
from after this point in time was generated at a time when such a
tombstone was already expired. Hence these writes are not relevant for
the purposes of overlap checks with the tombstone and so their min-live
timestamp can be ignored.
This is important for MV workloads, where writes generated now can have
timestamps going far back in time, possibly blocking tombstone GC of
much older [shadowable] tombstones.
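
The rule can be sketched as follows. This is a simplified model under stated assumptions, not the real code: `Source`, `oldest_write_time` and `can_purge` are hypothetical names, all times are plain integers, and treating the threshold as inclusive is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Source:
    min_live_timestamp: int  # smallest write timestamp of live data in the source
    oldest_write_time: int   # wall-clock time at which its earliest write arrived


def can_purge(tomb_timestamp: int, tomb_deletion_time: int,
              expiry_threshold: int, sources: Iterable[Source]) -> bool:
    if tomb_deletion_time > expiry_threshold:
        return False  # tombstone has not expired yet
    for src in sources:
        if src.oldest_write_time > expiry_threshold:
            # Everything in this source was written after the tombstone
            # expired, so its min-live timestamp is ignored.
            continue
        if src.min_live_timestamp <= tomb_timestamp:
            return False  # source may hold data the tombstone still covers
    return True
```

In the MV case, a memtable filled now may carry a very old `min_live_timestamp`, but because its `oldest_write_time` is past the threshold it no longer blocks the purge.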
Instead of storing it partially in tombstone_gc and partially in an
external map, move all external parts into the new
shared_tombstone_gc_state. This new class is responsible for
keeping and updating the repair history. tombstone_gc_state just keeps
const pointers to the shared state as before and is only responsible for
querying the tombstone gc before times.
This separation makes the code easier to follow and also enables further
patching of tombstone_gc_state.
No reason for it to be a shared pointer, or even a pointer at all. When
the pointer is not initialized, gc_clock::time_point::min() is used as
the group0 gc time, so we can just replace with a gc_clock::time_point
value initialized to min() and do away with an unnecessary indirection
as well as an allocation. The latter will be even more important after
the next patches.
This method has 3 lookups into the reconcile history maps in the worst
case. Reduce to just one. Makes the code more streamlined and prepares
the groundwork for the next patch.
We are used to symbol definitions being grouped in one .cc file, but a
symbol declaration and definition living in separate modules
(subfolders) is surprising.
Relocate always_gc, never_gc, can_always_purge and can_never_purge to
compaction/compaction.cc, from mutation/mutation_partition.cc. The
declarations of these symbols are in
compaction/compaction_garbage_collector.hh.
This test currently uses gc_grace_seconds=0. The introduction
of memtable overlap elision will break these tests because the
optimization is always active with this tombstone-gc.
Switch the tests to use tombstone-gc=repair, which allows for greater
control over when the memtable overlap elision is triggered.
This requires a move to vnodes, as tombstone-gc=repair doesn't
work with RF=1 currently, and using RF=3 won't work with tablets.
This test will soon need to be changed to use tombstone-gc=repair. This
cannot work as of now, as the test uses a single-node cluster.
The options are the following:
* Make it use more than one node
* Make repair work with single node clusters
* Rewrite in C++ where repair can be done synthetically
We chose the last option, as it is the simplest one both in terms of code
and runtime footprint.
The new test is in test/boost/row_cache_test.cc
Two changes were made during the migration:
* Change the name to
test_populating_reader_tombstone_gc_with_data_in_memtable
to better express which cache component this test is targeting;
* Use NullCompactionStrategy on the table instead of disabling
auto-compaction.
These tests currently use tombstone-gc=immediate. The introduction
of memtable overlap elision will break these tests because the
optimization is always active with this tombstone-gc.
Switch the tests to use tombstone-gc=repair, which allows for greater
control over when the memtable overlap elision is triggered.
This requires a move to vnodes, as tombstone-gc=repair doesn't
work with RF=1 currently, and using RF=3 won't work with tablets.
It is easy for submodule changes to slip through during rebase (if
the developer uses the terrible `git add -u` command) and
for a maintainer to miss it (if they don't go over each change after
a rebase).
Protect against such mishaps by checking if a submodule was updated
(or .gitmodules itself was changed) and aborting the operation.
If the pull request title contains "submodule", assume the operation
was intended.
Allow bypassing the check with --allow-submodule.
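
A rough sketch of the decision logic, written in Python for illustration only; it is not the script's code. The function names and the `allow_submodule` parameter (mirroring --allow-submodule) are hypothetical, and matching the title case-insensitively is an assumption.

```python
from typing import Iterable, List


def submodule_changes(changed_paths: Iterable[str],
                      submodule_paths: Iterable[str]) -> List[str]:
    """Returns the changed paths that are submodules or .gitmodules itself."""
    watched = set(submodule_paths) | {".gitmodules"}
    return sorted(p for p in changed_paths if p in watched)


def may_merge(changed_paths: Iterable[str], submodule_paths: Iterable[str],
              title: str, allow_submodule: bool = False) -> bool:
    if not submodule_changes(changed_paths, submodule_paths):
        return True  # nothing suspicious was touched
    # A submodule changed: proceed only if that was clearly intended.
    return allow_submodule or "submodule" in title.lower()
```

For example, `may_merge(["seastar", "main.cc"], ["seastar"], "Fix typo")` rejects the merge, while the same change with a title such as "Update seastar submodule" passes.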
Closes scylladb/scylladb#25418
With incremental repair, each replica::compaction_group will have 3 logical compaction groups, repaired, repairing and unrepaired. The definition of group is a set of sstables that can be compacted together. The logical groups will share the same instance of sstable_set, but each will have its own logical sstable set. Existing compaction::table_state is a view for a logical compaction group. So it makes sense that each replica::compaction_group will have multiple views. Each view will provide to compaction layer only the sstables that belong to it. That way, we preserve the existing interface between replica and compaction layer, where each compaction::table_state represents a single logical group.
The idea is that all the incremental repair knowledge is confined to repair and replica layer, compaction doesn't want to know about it, it just works on logical groups, what each represents doesn't matter from the perspective of the subsystem. This is the best way forward to not violate layers and reduce the maintenance burden in the long run.
We also proceed to rename table_state to compaction_group_view, since it's a better description. Working with multiple terms is confusing. The placeholder for implementing the sstable classifier is also left in tablet_storage_group_manager. For the time being, all sstables will go to the unrepaired logical set, which preserves the current behavior.
New functionality, no backport required
Closes scylladb/scylladb#25287
* github.com:scylladb/scylladb:
test: Add test that compaction doesn't cross logical group boundary
replica: Introduce views in compaction_group for incremental repair
compaction: Allow view to be added with compaction disabled
replica: Futurize retrieval of sstable sets in compaction_group_view
treewide: Futurize estimation of pending compaction tasks
replica: Allow compaction_group to have more than one view
Move backlog tracker to replica::compaction_group
treewide: Rename table_state to compaction_group_view
tests: adjust for incremental repair
* seastar 60b2e7da...1520326e (36):
> Merge 'http/client: Fix content length body overflow check (and a bit more)' from Pavel Emelyanov
test/http: Add test for http_content_length_data_sink
test/http: Implement some missing methods for memory data sink
http/client: Fix content length body overflow check
http/client: Fix misprint in overflow exception message
> dns: Use TCP connection data_sink directly
> iostream: Update "used stream" check for output_stream::detach()
> Update dpdk submodule
> rpc: server::process: coroutinize
> iostream: Remove deprecated constructor
> Merge 'foreign_ptr: add unwrap_on_owner_shard method' from Benny Halevy
foreign_ptr: add unwrap_on_owner_shard method
foreign_ptr: release: check_shard with SEASTAR_DEBUG_SHARED_PTR
> enum: Replace static_assert() with concept
> rpc: reindent connection::negotiate()
> rpc: client::loop: use structured binding
> rpc.cc: reindent
> queue: Remove duplicating static assertion
> Merge 'rpc: client: convert main loop to a coroutine' from Avi Kivity
rpc: client::loop(): restore indentation
rpc: client: coroutinize client::loop()
rpc: client: split main loop function
> Merge 'treewide: replace remaining std::enable_if with constraints' from Avi Kivity
optimized_optional: replace std::enable_if with constraint
log: replace std::enable_if with constraint
rpc: replace std::enable_if with constraint
when_all: replace std::enable_if with constraints
transfer: replace std::enable_if with constraints
sstring: replace std::enable_if with constraint
simple-stream: replace std::enable_if with constraints
shared_ptr: replace std::enable_if with constraints
sharded: replace std::enable_if with constraints for sharded_has_stop
sharded: replace std::enable_if with constraints for peering_sharded_service
scollectd: replace std::enable_if with constraints for type inference
scollectd: replace std::enable_if with constraints for ser/deser
metrics: replace std::enable_if with constraints
chunked_fifo: replace std::enable_if with constraint
future: replace std::enable_if with constraints
> websocket: Avoid sending scattered_message to output_stream
> websocket: Remove unused scattered_message.hh inclusion
> aio: Squash aio_nowait_supported into fs_info::nowait_works
> Merge 'reactor: coroutinize spawn()' from Avi Kivity
reactor: restore indentation for spawn()
reactor: coroutinize spawn()
> modules: export coroutine facilities
> Merge 'reactor: coroutinize some file-related functions' from Avi Kivity
reactor: adjust indentation
reactor: coroutinize reactor::make_pipe()
reactor: coroutinize reactor::inotify_add_watch()
reactor: coroutinize reactor::read_directory()
reactor: coroutinize reactor::file_type()
reactor: coroutinize reactor::chmod()
reactor: coroutinize reactor::link_file()
reactor: coroutinize reactor::rename_file()
reactor: coroutinize open_file_dma()
> memory: inline disable_abort_on_alloc_failure_temporarily
> Merge 'addr2line timing and optimizations' from Travis Downs
addr2line: add basic timing support
addr2line: do a quick check for 0x in the line
addr2line: don't load entire file
addr2line: typing fixing
> posix: Replace static_assert with concept
> tls: Push iovec with the help of put(vector<temporary_buffer>)
> io_queue: Narrow down friendship with reactor
> util: drop concepts.hh
> reactor: Re-use posix::to_timespec() helper
> Fix incorrect defaults for io queue iops/bandwidth
> net: functions describing ssl connection
> Add label values to the duplicate metrics exception
> Merge 'Nested scheduling groups (CPU only)' from Pavel Emelyanov
test: Add unit test for cross-sched-groups wakeups
test: Add unit test for fair CPU scheduling
test: Add unit test for basic supergrops manipulations
test: Add perf test for context switch latency
scheduling: Add an internal method to get group's supergroup
reactor: Add supergroup get_shares() API
reactor: Add supergroup::set_shares() API
reactor: Create scheduling groups in supergroups
reactor: Supergroups destroying API
reactor: Supergroups creating API
reactor: Pass parent pointer to task_queue from caller
reactor: Wakeup queue group on child activation
reactor: Add pure virtual sched_entity::run_tasks() method
reactor: Make task_queue_group be sched_entity too
reactor: Split task_queue_group::run_some_tasks()
reactor: Count and limit supergroup children
reactor: Link sched entity to its parent
reactor: Switch activate(task_queue*) to work on sched_entity
reactor: Move set_shares() to sched_entity()
reactor: Make account_runtime() work with sched_entity
reactor: Make insert_activating_task_queue() work on sched_entity
reactor: Make pop_active_task_queue() work on sched_entity
reactor: Make insert_active_task_queue() work on sched_entity
reactor: Move timings to sched_entity
reactor: Move active bit to sched_entity
reactor: Move shares to sched_entity
reactor: Move vruntime to sched_entity
reactor: Introduce sched_entity
reactor: Rename _activating_task_queues -> _activating
reactor: Remove local atq* variable
reactor: Rename _active_task_queues -> _active
reactor: Move account_runtime() to task_queue_group
reactor: Move vruntime update from task_queue into _group
reactor: Simplify task_queue_group::run_some_tasks()
reactor: Move run_some_tasks() into task_queue_group
reactor: Move insert_activating_task_queues() into task_queue_group
reactor: Move pop_active_task_queue() into task_queue_group
reactor: Move insert_active_task_queue() into task_queue_group
reactor: Introduce and use task_queue_group::activate(task_queue)
reactor: Introduce task_queue_group::active()
reactor: Wrap scheduling fields into task_queue_group
reactor: Simplify task_queue::activate()
reactor: Rename task_queue::activate() -> wakeup()
reactor: Make activate() method of class task_queue
reactor: Make task_queue::run_tasks() return bool
reactor: Simplify task_queue::run_tasks()
reactor: Make run_tasks() method of class task_queue
> Fix hang in io_queue for big write ioproperties numbers
> split random io buffer size in 2 options
> reactor: document run_in_background
> Merge 'Add io_queue unit test for checking request rates' from Robert Bindar
Add unit test for validating computed params in io_queue
Move `disk_params` and `disk_config_params` to their own unit
Add an overload for `disk_config_params::generate_config`
Closes scylladb/scylladb#25404
In commit 44a1daf we added the ability to read Scylla system tables with Alternator. This feature is useful, among other things, in tests that want to read Scylla's configuration through the system table system.config. But tests often want to modify system.config, e.g., to temporarily reduce some threshold to make tests shorter. Until now, this was not possible.
This series adds support for writing to system tables through Alternator, along with examples of tests using this capability (and utility functions to make it easy).
Because the ability to write to system tables may have non-obvious security consequences, it is turned off by default and needs to be enabled with a new configuration option "alternator_allow_system_table_write".
No backports are necessary - this feature is only intended for tests. We may later decide to backport if we want to backport new tests, but I think the probability we'll want to do this is low.
Fixes #12348
Closes scylladb/scylladb#19147
* github.com:scylladb/scylladb:
test/alternator: utility functions for changing configuration
alternator: add optional support for writing to system table
test/alternator: reduce duplicated code
Wired the unrepaired, repairing and repaired views into compaction_group.
Also the repaired filter was wired, so tablet_storage_group_manager
can implement the procedure to classify the sstable.
Based on this classifier, we can decide which view a sstable belongs
to, at any given point in time.
Additionally, we made changes to compaction_group_view
to return only sstables that belong to the underlying view.
From this point on, repaired, repairing and unrepaired sets are
connected to compaction manager through their views. And that
guarantees sstables on different groups cannot be compacted
together.
Repairing view specifically has compaction disabled on it altogether,
we can revert this later if we want, to allow repairing sstables
to be compacted with one another.
The benefit of this logical approach is having the classifier
as the single source of truth. Otherwise, we'd need to keep the
sstable location consistent with global metadata, creating
complexity.
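
The idea of views over one shared set can be sketched like this. It is an illustrative model only: the `repaired_at` and `being_repaired` fields and the `classify` function are hypothetical and do not reflect the real classifier.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, List


class Group(Enum):
    UNREPAIRED = "unrepaired"
    REPAIRING = "repairing"
    REPAIRED = "repaired"


@dataclass
class SSTable:
    name: str
    repaired_at: int = 0          # hypothetical marker; 0 means never repaired
    being_repaired: bool = False  # hypothetical flag


def classify(sst: SSTable) -> Group:
    if sst.being_repaired:
        return Group.REPAIRING
    return Group.REPAIRED if sst.repaired_at > 0 else Group.UNREPAIRED


class CompactionGroup:
    def __init__(self, classifier: Callable[[SSTable], Group]) -> None:
        self.sstables: List[SSTable] = []  # the single shared sstable set
        self.classifier = classifier       # single source of truth

    def view(self, group: Group) -> List[SSTable]:
        # A view exposes only the sstables the classifier assigns to it.
        return [s for s in self.sstables if self.classifier(s) is group]


cg = CompactionGroup(classify)
cg.sstables += [SSTable("a"), SSTable("b", repaired_at=7),
                SSTable("c", being_repaired=True)]
```

Because membership is computed from the classifier on demand, an sstable changes views as soon as its state changes, with no second copy of the location to keep in sync.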
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
This will allow upcoming work to gently produce a sstable set for
each compaction group view. Example: repaired and unrepaired.
Locking strategy for compaction's sstable selection:
Since sstable retrieval path became futurized, tasks in compaction
manager will now hold the write lock (compaction_state::lock)
when retrieving the sstable list, feeding them into compaction
strategy, and finally registering selected sstables as compacting.
The last step prevents another concurrent task from picking the
same sstable. Previously, all those steps were atomic, but
we have seen stalls in that area in large installations, so
futurization of that area would come sooner or later.
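
A small asyncio sketch of why the lock is needed once retrieval can yield. It is an analogy under assumptions, not the Seastar code: a plain `asyncio.Lock` stands in for the write lock, and taking the first two candidates stands in for the compaction strategy.

```python
import asyncio
from typing import List


class CompactionState:
    def __init__(self, sstables: List[str]) -> None:
        self.lock = asyncio.Lock()
        self.sstables = list(sstables)
        self.compacting = set()


async def get_sstables(state: CompactionState) -> List[str]:
    await asyncio.sleep(0)  # retrieval is asynchronous and may yield here
    return [s for s in state.sstables if s not in state.compacting]


async def pick_for_compaction(state: CompactionState, max_count: int = 2) -> List[str]:
    # Retrieval, selection and registration happen under one lock, so a
    # concurrent task cannot observe the list between these steps.
    async with state.lock:
        candidates = await get_sstables(state)
        selected = candidates[:max_count]  # stand-in for the strategy
        state.compacting.update(selected)
        return selected


async def main():
    state = CompactionState(["a", "b", "c", "d"])
    return await asyncio.gather(pick_for_compaction(state),
                                pick_for_compaction(state))
```

Running `asyncio.run(main())` yields two selections that never share an sstable; without the lock, both tasks could read the same candidate list before either registers its choice.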
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
In order to support incremental repair, we'll allow each
replica::compaction_group to have two logical compaction groups
(or logical sstable sets), one for repaired, another for unrepaired.
That means we have to adapt a few places to work with
compaction_group_view instead, such that no logical compaction
group is missed when doing table or tablet wide operations.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Since there will be only one physical sstable set, it makes sense to move
backlog tracker to replica::compaction_group. With incremental repair,
it still makes sense to compute the backlog accounting for both logical sets,
since the compound backlog influences the overall read amplification,
and the total backlog across repaired and unrepaired sets can help
driving decisions like giving up on incremental repair when unrepaired
set is almost as large as the repaired set, causing an amplification
of 2.
Also it's needed for correctness because a sstable can move quickly
across the logical sets, and having one tracker for each logical
set could cause the sstable to not be erased in the old set it
belonged to.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Since table_state is a view to a compaction group, it makes sense
to rename it as so.
With upcoming incremental repair, each replica::compaction_group
will be actually two compaction groups, so there will be two
views for each replica::compaction_group.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
The separation of sstables into the logical repaired and unrepaired
virtual sets requires some adjustments for certain tests, in particular
for those that look at the number of compaction tasks or the number of sstables.
The following tests need adjustment:
* test/cluster/tasks/test_tablet_tasks.py
* test/boost/memtable_test.cc
The adjustments are done in such a way that they accommodate both the
case where there is separate repaired/unrepaired states and when there
isn't.
Add the possibility to limit the execution time of a single test in pytest.
Add --session-timeout to limit the execution time of the test.py and/or
pytest session.
Closes scylladb/scylladb#25185
* Fix discovery of application default credentials by using fully expanded pathnames (no tildes).
* Fix grant type in token request with user credentials.
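
For the first point, a minimal Python illustration of tilde expansion (the actual fix lives in the C++ encryption code and is not shown here). The helper name `resolve_credentials_path` is hypothetical; the path used is gcloud's well-known default location on Linux.

```python
import os


def resolve_credentials_path(raw: str) -> str:
    # expanduser replaces a leading "~" with the user's home directory;
    # abspath then removes any remaining relative components.
    return os.path.abspath(os.path.expanduser(raw))


default_path = resolve_credentials_path(
    "~/.config/gcloud/application_default_credentials.json")
```

File APIs do not interpret `~` themselves (that is a shell feature), which is why an unexpanded path fails to open.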
Fixes #25345.
Closes scylladb/scylladb#25351
* github.com:scylladb/scylladb:
encryption: gcp: Fix the grant type for user credentials
encryption: gcp: Expand tilde in pathnames for credentials file
With greedy matching, an sstable path in a snapshot
directory with a tag that resembles a name-<uuid>
would match the dir regular expression as the longest match,
while a non-greedy regular expression would correctly match
the real keyspace and table as the shortest match.
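
The difference can be reproduced with a simplified pattern. The regular expressions and the path below are illustrative assumptions and are not the ones used in the code; only the leading `.*` versus `.*?` differs between them.

```python
import re

UUID = r"[0-9a-f]{32}"
TAIL = rf"/(?P<ks>[^/]+)/(?P<table>[^/]+)-(?P<uuid>{UUID})(?:/|$)"

GREEDY = re.compile(r"^.*" + TAIL)   # prefers the last candidate directory
LAZY = re.compile(r"^.*?" + TAIL)    # prefers the first candidate directory

# A snapshot whose tag happens to look like name-<uuid>.
path = ("/var/lib/scylla/data/ks1/tbl1-" + "a" * 32 +
        "/snapshots/backup-" + "b" * 32 + "/me-1-big-Data.db")

greedy = GREEDY.match(path)
lazy = LAZY.match(path)
```

Here `greedy` reports keyspace `snapshots` and table `backup`, because the greedy prefix swallows the real table directory, while `lazy` reports `ks1` and `tbl1`.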
Also, add a regression unit test reproducing the issue and
validating the fix.
Fixes #25242
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#25323
Derive both vnode_effective_replication_map
and local_effective_replication_map from
static_effective_replication_map as both are static and per-keyspace.
However, local_effective_replication_map does not need vnodes
for the mapping of all tokens to the local node.
Refs #22733
* No backport required
Closes scylladb/scylladb#25222
* github.com:scylladb/scylladb:
locator: abstract_replication_strategy: implement local_replication_strategy
locator: vnode_effective_replication_map: convert clone_data_gently to clone_gently
locator: abstract_replication_map: rename make_effective_replication_map
locator: abstract_replication_map: rename calculate_effective_replication_map
replica: database: keyspace: rename {create,update}_effective_replication_map
locator: effective_replication_map_factory: rename create_effective_replication_map
locator: abstract_replication_strategy: rename vnode_effective_replication_map_ptr et. al
locator: abstract_replication_strategy: rename global_vnode_effective_replication_map
keyspace: rename get_vnode_effective_replication_map
dht: range_streamer: use naked e_r_m pointers
storage_service: use naked e_r_m pointers
alternator: ttl: use naked e_r_m pointers
locator: abstract_replication_strategy: define is_local
We adjust most of the tests in `cqlpy/test_describe.py`
so that they work against both Scylla and Cassandra.
This PR doesn't cover all of them, just those I authored.
Refs scylladb/scylladb#11690
Backport: not needed. This is effectively a code cleanup.
Closes scylladb/scylladb#25060
* github.com:scylladb/scylladb:
test/cqlpy/test_describe.py: Adjust test_create_role_with_hashed_password_authorization to work with Cassandra
test/cqlpy/test_describe.py: Adjust test_desc_restore to work with Cassandra
test/cqlpy/test_describe.py: Mark Scylla-only tests as such
This is the next part in the BTI index project.
Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25154
Next part: implementing a trie cursor (the "set to key, step forwards, step backwards" thing) on top of the `node_reader` added here.
The new code added here is not used for anything yet, but it's posted as a separate PR
to keep things reviewably small.
This part implements the BTI trie node encoding, as described in https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md#trie-nodes.
It contains the logic for encoding the abstract in-memory `writer_node`s (added in the previous PR)
into the on-disk format, and the logic for traversing the on-disk nodes during a read.
New functionality, no backporting needed.
Closes scylladb/scylladb#25317
* github.com:scylladb/scylladb:
sstables/trie: add tests for BTI node serialization and traversal
sstables/trie: implement BTI node traversal
sstables/trie: implement BTI serialization
utils/cached_file: add get_shared_page()
utils/cached_file: replace a std::pair with a named struct
The previous way of executing repeats was to launch pytest for each repeat.
That was resource-consuming, since pytest performed test discovery
each time. Now all repeats are done inside one pytest process.
Backport to 2025.3 is needed, since this functionality is framework-only, and 2025.3 is affected by these slow repeats as well.
Closes scylladb/scylladb#25073
* github.com:scylladb/scylladb:
test.py: add repeats in pytest
test.py: add directories and filename to the log files
test.py: rename log sink file for boost tests
test.py: better error handling in boost facade
The `pull_github_pr.sh` script has been fetching the username
from the owner of the source branch.
The owner of the branch is not always the author of the PR.
For example the branch might come from a fork managed by organization
or group of people.
This led to the author being referred to as `null` in merge commits
(if the name was not set for the group), or to a name being mentioned
that does not belong to the author of the patch.
Instead of looking for the owner of the source branch, the script should
look for the name of the PR's author.
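
The distinction can be illustrated on a pull request payload. This sketch assumes a dictionary shaped like the GitHub REST API pull request object, where `user` is the PR author and `head.user` is the owner of the source branch; the helper names and the sample logins are hypothetical, and the script's own queries are not reproduced here.

```python
def pr_author_login(pr: dict) -> str:
    """Login of the person who opened the pull request."""
    return pr["user"]["login"]


def branch_owner_login(pr: dict) -> str:
    """Login of the account that owns the source branch."""
    return pr["head"]["user"]["login"]


# Hypothetical payload: the branch lives in an organization's fork.
sample_pr = {
    "user": {"login": "alice"},
    "head": {"ref": "fix-something", "user": {"login": "some-org"}},
}
```

With this payload the two helpers disagree, which is exactly the case where the merge commit used to credit the wrong account.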
Closes scylladb/scylladb#25363
Otherwise it is accessed right when exiting the if block.
Add a unit test reproducing the issue and validating the fix.
Fixes #25325
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#25326
This patch sets, for alternator test suite, all 'alternator-*' loggers and 'paxos' logger to trace level. This should significantly ease debugging of failed tests, while it has no effect on test time and increases log size only by 7%.
This affects running alternator tests only with `test.py`, not with `test/alternator/run`.
Closes #24645
Closes scylladb/scylladb#25327
Derive both vnode_effective_replication_map
and local_effective_replication_map from
static_effective_replication_map as both are static and per-keyspace.
However, local_effective_replication_map does not need vnodes
for the mapping of all tokens to the local node.
Note that everywhere_replication_strategy is not abstracted in a similar
way, although it could be, since the plan is to get rid of it
once all system keyspaces are converted to local or tablets replication
(and propagated everywhere if needed using raft group0).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>