Following a number of similar code cleanup PR, this one aims to be the last one, definitely dropping flat from all reader and related names.
Similarly, v2 is also dropped from reader names, although it still persists in mutation_fragment_v2, mutation_v2 and related names. This won't change in the foreseeable future, as we don't have plans to drop mutation (the v1 variant).
The changes in this PR are entirely mechanical, mostly just search-and-replace.
Code cleanup, no backport required.
Closesscylladb/scylladb#24087
* github.com:scylladb/scylladb:
test/boost/mutation_reader_another_test: drop v2 from reader and related names
test/boost/mutation_reader: s/puppet_reader_v2/puppet_reader/
test/boost/sstable_datafile_test: s/sstable_reader_v2/sstable_mutation_reader/
test/boost/mutation_test: s/consumer_v2/consumer/
test/lib/mutation_reader_assertions: s/flat_reader_assertions_v2/mutation_reader_assertions/
readers/mutation_readers: s/generating_reader_v2/generating_reader/
readers/mutation_readers: s/delegating_reader_v2/delegating_reader/
readers/mutation_readers: s/empty_flat_reader_v2/empty_mutation_reader/
readers/mutation_source: s/make_reader_v2/make_mutation_reader/
readers/mutation_source: s/flat_reader_v2_factory_type/mutation_reader_factory/
readers/mutation_reader: s/reader_consumer_v2/mutation_reader_consumer/
mutation/mutation_compactor: drop v2 from compactor and related names
replica/table: s/make_reader_v2/make_mutation_reader/
mutation_writer: s/bucket_writer_v2/bucket_writer/
readers/queue: drop v2 from reader and related names
readers/multishard: drop v2 from reader and related names
readers/evictable: drop v2 from reader and related names
readers/multi_range: remove flat from name
Some background:
When merge happens, a background fiber wakes up to merge compaction
groups of sibling tablets into main one. It cannot happen when
rebuilding the storage group list, since token metadata update is
not preemptable. So a storage group, post merge, has the main
compaction group and two other groups to be merged into the main.
When the merge happens, those two groups are empty and will be
freed.
Consider this scenario:
1) merge happens, from 2 to 1 tablet
2) produces a single storage group, containing main and two
other compaction groups to be merged into main.
3) take_storage_snapshot(), triggered by migration post merge,
gets a list of pointer to all compaction groups.
4) t__s__s() iterates first on main group, yields.
5) background fiber wakes up, moves the data into main
and frees the two groups
6) t__s__s() advances to other groups that are now freed,
since step 5.
7) segmentation fault
In addition to memory corruption, there's also a potential for
data to escape the iteration in take_storage_snapshot(), since
data can be moved across compaction groups in background, all
belonging to the same storage group. That could result in
data loss.
Readers should all operate on storage group level since it can
provide a view on all the data owned by a tablet replica.
The movement of sstable from group A to B is atomic, but
iteration first on A, then later on B, might miss data that
was moved from B to A, before the iteration reached B.
By switching to storage group in the interface that retrieves
groups by token range, we guarantee that all data of a given
replica can be found regardless of which compaction group they
sit on.
Fixes#23162.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#24058
This is passed by reference to the constructor, but a copy is saved into
the _table_shared_data member. A reference to this member is passed down
to all memtable readers. Because of the copy, the memtable readers save
a reference to the memtable_list's member, which goes away together with
the memtable_list when the storage_group is destroyed.
This causes use-after-free when a storage group is destroyed while a
memtable read is still ongoing. The memtable reader keeps the memtable
alive, but its reference to the memtable_table_shared_data becomes
stale.
Fix by saving a reference in the memtable_list too, so memtable readers
receive a reference pointing to the original replica::table member,
which is stable accross tablet migrations and merges.
The copy was introduced by 2a76065e3d.
There was a copy even before this commit, but in the previous vnode-only
world this was fine -- there was one memtable_list per table and it was
around until the table itself was. In the tablet world, this is no
longer given, but the above commit didn't account for this.
A test is included, which reproduces the use-after-free on memtable
migration. The test is somewhat artificial in that the use-after-free
would be prevented by holding on to an ERM, but this is done
intentionaly to keep the test simple. Migration -- unlike merge where
this use-after-free was originally observed -- is easy to trigger from
unit tests.
Fixes: #23762Closesscylladb/scylladb#23984
Scylla operations use concurrency semaphores to limit the number of concurrent operations and prevent resource exhaustion. The semaphore is selected based on the current scheduling group.
For RAFT group operations, it is essential to use a system semaphore to avoid queuing behind user operations. This patch ensures that RAFT operations use the `gossip` scheduling group to leverage the system semaphore.
Fixesscylladb/scylladb#21637
Backport: 6.2 and 6.1
Closesscylladb/scylladb#22779
* github.com:scylladb/scylladb:
Ensure raft group0 RPCs use the gossip scheduling group
Move RAFT operations verbs to GOSSIP group.
Scylla operations use concurrency semaphores to limit the number
of concurrent operations and prevent resource exhaustion. The
semaphore is selected based on the current scheduling group.
For Raft group operations, it is essential to use a system semaphore to
avoid queuing behind user operations.
This commit adds a check to ensure that the raft group0 RPCs are
executed with the `gossiper` scheduling group.
The motivation behind this change to free up disk space as early as possible.
The reason is that snapshot locks the space of all SSTables in the snapshot,
and deleting form the table, for example, by compaction, or tablet migration,
won't free-up their capacity until they are uploaded to object storage and deleted from the snapshot.
This series adds prioritization of deleted sstables in two cases:
First, after the snapshot dir is processed, the list of SSTable generation is cross-referenced with the
list of SSTables presently in the table and any generation that is not in the table is prioritized to
be uploaded earlier.
In addition, a subscription mechanism was added to sstables_manager
and it is used in backup to prioritize SSTables that get deleted from the table directory
during backup.
This is particularly important when backup happens during high disk utilization (e.g. 90%).
Without it, even if the cluster is scaled up and tablets are migrated away from the full nodes
to new nodes, tablet cleanup might not free any space if all the tablet sstables are hardlinked to the
snapshot taken for backup.
* Enhancement, no backport needed
Closesscylladb/scylladb#23241
* github.com:scylladb/scylladb:
db: snapshot: backup_task: prioritize sstables deleted during upload
sstables_manager: add subscriptions
db: snapshot: backup_task: limit concurrency
sstables: directory_semaphore: expose get_units
db: snapshot: backup_task: add sharded sstables_manager
database: expose get_sstables_manager(schema)
db: snapshot: backup_task: do_backup: prioritize sstables that are already deleted from the table
db: snapshot-ctl: pass table_id to backup_task
db: snapshot-ctl: expose sharded db() getter
db: snapshot: backup_task: do_backup: organize components by sstable generation
db: snapshot: coroutinize backup_task
db: snapshot: backup_task: refactor backup_file out of uploads_worker
db: snapshot: backup_task: refactor uploads_worker out of do_backup
db: snapshot: backup_task: process_snapshot_dir: initialize total progress
utils/s3: upload_progress: init members to 0
db: snapshot: backup_task: do_backup: refactor process_snapshot_dir
db: snapshot: backup_task: keep expection as member
The row cache can garbage-collect tombstones in two places:
1) When populating the cache - the underlying reader pipeline has a `compacting_reader` in it;
2) During reads - reads now compact data including garbage collection;
In both cases, garbage collection has to do overlap checks against memtables, to avoid collecting tombstones which cover data in the memtables.
This PR includes fixes for (2), which were not handled at all currently.
(1) was already supposed to be fixed, see https://github.com/scylladb/scylladb/issues/20916. But the test added in this PR showed that the test is incomplete: https://github.com/scylladb/scylladb/issues/23291. A fix for this issue is also included.
Fixes: https://github.com/scylladb/scylladb/issues/23291
Fixes: https://github.com/scylladb/scylladb/issues/23252
The fix will need backport to all live release.
Closesscylladb/scylladb#23255
* github.com:scylladb/scylladb:
test/boost/row_cache_test: add memtable overlap check tests
replica/table: add error injection to memtable post-flush phase
utils/error_injection: add a way to set parameters from error injection points
test/cluster: add test_data_resurrection_in_memtable.py
test/pylib/utils: wait_for_cql_and_get_hosts(): sort hosts
replica/mutation_dump: don't assume cells are live
replica/database: do_apply() add error injection point
replica: improve memtable overlap checks for the cache
replica/memtable: add is_merging_to_cache()
db/row_cache: add overlap-check for cache tombstone garbage collection
mutation/mutation_compactor: copy key passed-in to consume_new_partition()
When running those operations after a tablet replica is migrated away from
a shard, an assert can fail resulting in a crash.
Status quo (around the assert in truncate procedure):
1) Highest RP seen by table is saved in low_mark, and the current time in
low_mark_at.
2) Then compaction is disabled in order to not mix data written before truncate,
and data written later.
3) Then memtable is flushed in order for the data written before truncate to be
available in sstables and then removed.
4) Now, current time is saved in truncated_at, which is supposedly the time of
truncate to decide which sstables to remove.
Note: truncated_at is likely above low_mark_at due to steps 2 and 3.
The interesting part of the assert is:
(truncated_at <= low_mark_at ? rp <= low_mark : low_mark <= rp)
Note: RP in the assert above is the highest RP among all sstables generated
before truncated_at. RP is retrieved by table::discard_sstables().
If truncated_at > low_mark_at, maybe newer data was written during steps 2 and
3, and memtable's RP becomes greater than low_mark, resulting in a SSTable with
RP > low_mark.
So assert's 2nd condition is there to defend against the scenario above.
truncated_at and low_mark_at uses millisecond granularity, so even if
truncated_at == low_mark_at, data could have been written in steps 2 and 3
(during same MS window), failing the assert. This is fragile.
Reproducer:
To reproduce the problem, truncated_at must be > low_mark_at, which can easily
happen with both drop table and truncate due to steps 2 and 3.
If a shard has 2 or more tablets, the table's highest RP refer to just one
tablet in that shard.
If the tablet with the highest RP is migrated away, then the sstables in that
shard will have lower RP than the recorded highest RP (it's a table wide state,
which makes sense since CL is shared among tablets).
So when either drop table or truncate runs, low_mark will be potentially bigger
than highest RP retrieved from sstables.
Proposed solution:
The current assert is hacked to not fail if writes sneak in, during steps 2 and
3, but it's still fragile and seems not to serve its real purpose, since it's
allowing for RP > low_mark.
We should be able to say that low_mark >= RP, as a way of asserting we're not
leaving data targeted by truncate behind (or that we're not removing the wrong
data).
But the problem is that we're saving low_mark in step 1, before preparation
steps (2 and 3). When truncated_at is recorded in step 4, it's a way of saying
all data written so far is targeted for removal. But as of today, low_mark
refers to all data written up to step 1. So low_mark is now only one set
before issuing flush, and also accounts for all potentially flushed data.
Fixes#18059.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#23560
The current memtable overlap check that is used by the cache
-- table::get_max_purgeable_fn_for_cache_underlying_reader() -- only
checks the active memtable, so memtables which are either being flushed
or are already flushed and also have active reads against them do not
participate in the overlap check.
This can result in temporary data resurrection, where a cache read can
garbage-collect a tombstone which still covers data in a flushing or
flushed memtable, which still have active read against it.
To prevent this, extend the overlap check to also consider all of the
memtable list. Furthermore, memtable_list::erase() now places the removed
(flushed) memtable in an intrusive list. These entries are alive only as
long as there are readers still keeping an `lw_shared_ptr<memtable>`
alive. This list is now also consulted on overlap checks.
There are two snapshot-on-all-shards methods on the database -- the one
that snapshots a keyspace and the one that snapshots a vector of tables.
The latter snapshots a single table with a neat helper, while the former
has the helper open-coded.
Re-using the helper in keyspace snapshot is worth it, but needs to patch
the helper to work on uuid, rather than ks:cf pair of strings.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#23532
We want a method that will allow us to take a stable snapshot of
SSTables, to asynchronously compute some stats on them.
But `take_storage_snapshot` is overly invasive for that, because
it flushes memtables on each call.
(If `take_storage_snapshot` was, for example, called repetitively,
it could create a ton of small memtables and lead to trouble).
This commit adds a weaker version which only takes a snapshot of
*existing SSTables*, and doesn't flush memtables by itself.
This will be useful for dictionary training, which doesn't
care about the semantics of SSTables, only their rough statistical
properties.
Create a `sstable_compressor_factory_impl` in `scylla_main`,
and pipe it through constructors into `sstables_manager`.
In next commits, the factory available through the `sstables_manager`
will be used to create compressors for SSTable readers and writers.
`safe_foreach_sstable` doesn't do its job correctly.
It iterates over an sstable set under the sstable deletion
lock in an attempt to ensure that SSTables aren't deleted during the iteration.
The thing is, it takes the deletion lock after the SSTable set is
already obtained, so SSTables might get unlinked *before* we take the lock.
Remove this function and fix its usages to obtain the set and iterate
over it under the lock.
Closesscylladb/scylladb#23397
This PR is an introductory step towards enforcing
RF-rack-valid keyspaces in Scylla.
The scope of changes:
* defining RF-rack-valid keyspaces,
* introducing a configuration option enforcing RF-rack-valid
keyspaces,
* restricting the CREATE and ALTER KEYSPACE statements
so that they never lead to RF-rack invalid keyspaces,
* during the initialization of a node, it verifies that all existing
keyspaces are RF-rack-valid. If not, the initialization fails.
We provide tests verifying that the changes behave as intended.
---
Note that there are a number of things that still need to be implemented.
That includes, for instance, restricting topology operations too.
---
Implementation strategy (going beyond the scope of this PR):
1. Introduce the new configuration option `rf_rack_valid_keyspaces`.
2. Start enforcing RF-rack-validity in keyspaces if the option is enabled.
3. Adjust the tests: in the tree and out of it. Explicitly enable the option in all tests.
4. Once the tests have been adjusted, change the default value of the option to enabled.
5. Stop explicitly enabling the option in tests.
6. Get rid of the option.
---
Fixesscylladb/scylladb#20356Fixesscylladb/scylladb#23276Fixesscylladb/scylladb#23300
---
Backport: this is part of the requirements for releasing 2025.1.
Closesscylladb/scylladb#23138
* github.com:scylladb/scylladb:
main: Refuse to start node when RF-rack-invalid keyspace exists
cql3: Ensure that CREATE and ALTER never lead to RF-rack-invalid keyspaces
db/config: Introduce RF-rack-valid keyspaces
When a node is started with the option `rf_rack_valid_keyspaces`
enabled, the initialization will fail if there is an RF-rack-invalid
keyspace. We want to force the user to adjust their existing
keyspaces when upgrading to 2025.* so that the invariant that
every keyspace is RF-rack-valid is always satisfied.
Fixesscylladb/scylladb#23300
Actually, the main goal of this PR was to remove parse_tables() helpers from api/ in favor of more flexible (yet same complex) parse_table_infos(), but it turned out that it also saves some lookups in database maps.
There are several places in API and schema_tables that have table_id at hand, but at some point drop it and carry keyspace and table names over to a place that maps ks:cf back to table_id and then uses it to find the table object. This PR keeps the table_id with the help of table_info struct in those places. This change allows removing the aforementioned parse_table() helpers from api/ and also saves few lookups in database maps.
Removing the parse_tables() from api/ is the continuation of previous effort that reduces the set of helpers in api/ code that help handlers "parse" keyspaces and tables names see #22742#21533Closesscylladb/scylladb#23216
* github.com:scylladb/scylladb:
api: Remove the remaining parse_tables() overload
database: Sanitize flush_tables_on_all_shards()
schema_tables: Remove all_table_names()
database: Make tables flushing helper use table_info-s, not names
api: Make keyspace flush endpoint use parse_table_infos() (and a bit more)
schema_tables,client_state: Switch to using all_table_infos()
schema_tables: Tune up some methods to benefit from table_infos
schema_tables: Introduce all_table_infos()
This series adds an async guard to system_keyspace operations
and adds a deferred action to stop the system_keyspace in main() before destroying the service.
This helps to make sure that sys_ks is unplugged from its users and that all async operations using it are drained once it's stopped.
* Enhancement, no backport needed
Closesscylladb/scylladb#23113
* github.com:scylladb/scylladb:
main: stop system keyspace
system_keyspace: call shutdown from stop
system_keyspace: shutdown: allow calling more than once
database, compaction_manager, large_data_handler: use pluggable<system_keysapce>
utils: add class pluggable
Previous patch left this method with few uglinesses
- the vector<table_id> argument is named table_names
- the sstring keyspace argument is unused
- the keyspace argument is captured for no use
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The database::flush_tables_on_all_shards() method accepts a keyspace
name and a vector of table names. Then it converts ks:cf pair for each
of the table name into a table-id and flushes the table with the ID.
All the callers of that method already have or can easily get the vector
of table_id-s, not just names, so make use of this.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There are two semaphores in table for synchronizing changes to sstable list:
sstable_set_mutation_sem: used to serialize two concurrent operations updating
the list, to prevent them from racing with each other.
sstable_deletion_sem: A deletion guard, used to serialize deletion and
iteration over the list, to prevent iteration from finding deleted files on
disk.
they're always taken in this order to avoid deadlocks:
sstable_set_mutation_sem -> sstable_deletion_sem.
problem:
A = tablet cleanup
B = take_snapshot()
1) A acquires sstable_set_mutation_sem for updating list
2) A acquires sstable_deletion_sem, then delete sstable before updating list
3) A releases sstable_deletion_sem, then yield
4) B acquires sstable_deletion_sem
5) B iterates through list and bumps sstable deleted in step 2
6) B fails since it cannot find the file on disk
Initial reaction is to say that no procedure must delete sstable before
updating the list, that's true.
But we want a iteration, running concurrently to cleanup, to not find sstables
being removed from the system. Otherwise, e.g. snapshot works with sstables
of a tablet that was just cleaned up. That's achieved by serializing iteration
with list update.
Since sstable_deletion_sem is used within the scope of deletion only, it's
useless for achieving this. Cleanup could acquire the deletion sem when
preparing list updates, and then pass the "permit" to deletion function, but
then sstable_deletion_sem would essentially become sstable_set_mutation_sem,
which was created exactly to protect the list update.
That being said, it makes sense to merge both semaphores. Also things become
easier to reason about, and we don't have to worry about deadlocks anymore.
The deletion goes through sstable_list_builder, which holds a permit throughout
its lifetime, which guarantees that list updates and deletion are atomic to
other concurrent operations. The interface becomes less error prone with that.
It allowed us to find discard_sstables() was doing deletion without any permit,
meaning another race could happen between truncate and snapshot.
So we're fixing race of (truncate|cleanup) with take_snapshot, as far as we
know. It's possible another unknown races are fixed as well.
Fixes#23049.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#23117
To allow safe plug and unplug of the system_keyspace.
This patch follows-up on 917fdb9e53
(more specifically - f9b57df471)
Since just keeping a shared_ptr<system_keyspace> doesn't prevent
stopping the system_keyspace shards, while using the `pluggable`
interface allows safe draining of outstanding async calls
on shutdown, before stopping the system_keyspace.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Currently for system keyspace part of config members are configured
outside of this helper, in the caller. It's more consistent to have full
config initialization in one place.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Currently, when reducing RF, we may drop replicas from
the view before dropping replicas from the base table.
Since get_view_natural_endpoint is allowed to return
a disengaged optional if it can't find a pair for the
base replica, replcace the exiting assertion with code
handling this case, and count those events in a new
table metric: total_view_updates_failed_pairing.
Note that this does not fix the root cause for the issue
which is the unsynchronized dropping of replicas, that
should be atomic, using a single group0 transaction.
Refs scylladb/scylladb#21492
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
In this change, tablet_virtual_task starts supporting tablet
resize (i.e. split and merge).
Users can see running resize tasks - finished tasks are not
presented with the task manager API.
A new task state "suspended" is added. If a resize was revoked,
it will appear to users as suspended. We assume that the resize was revoked
when the tablet number didn't change.
Fixes: #21366.
Fixes: #21367.
No backport, new feature
Closesscylladb/scylladb#21891
* github.com:scylladb/scylladb:
test: boost: check resize_task_info in tablet_test.cc
test: add tests to check revoked resize virtual tasks
test: add tests to check the list of resize virtual tasks
test: add tests to check spilt and merge virtual tasks status
test: test_tablet_tasks: generalize functions
replica: service: add split virtual task's children
replica: service: pass parent info down to storage_group::split
tasks: children of virtual tasks aren't internal by default
tasks: initialize shard in task_info ctor
service: extend tablet_virtual_task::abort
service: retrun status_helper struct from tablet_virtual_task::get_status_helper
service: extend tablet_virtual_task::wait
tasks: add suspended task state
service: extend tablet_virtual_task::get_status
service: extend tablet_virtual_task::contains
service: extend tablet_virtual_task::get_stats
service: add service::task_manager_module::get_nodes
tasks: add task_manager::get_nodes
tasks: drop noexcept from module::get_nodes
replica: service: add resize_task_info static column to system.tablets
locator: extend tablet_task_info to cover resize tasks
The methods to resolve a key/token/range to a table are all noexcept.
Yet the method below all of these, `storage_group_for_id()` can throw.
This means that if due to any mistake a tablet without local replica is
attempted to be looked up, it will result in a crash, as the exception
bubbles up into the noexcept methods.
There is no value in pretending that looking up the tablet replica is
noexcept, remove the noexcept specifiers so that any bad lookup only
fails the operation at hand and doesn't crash the node. This is
especially relevant to replace, which still has a window where writes
can arrive for tablets that don't (yet) have a local replica. Currently,
this results in a crash. After this patch, this will only fail the
writes and the replace can move on.
Fixes: #21480Closesscylladb/scylladb#22251
these unused includes were identifier by clang-include-cleaner. after
auditing these source files, all of the reports have been confirmed.
please note, because quite a few source files relied on
`utils/to_string.hh` to pull in the specialization of
`fmt::formatter<std::optional<T>>`, after removing
`#include <fmt/std.h>` from `utils/to_string.hh`, we have to
include `fmt/std.h` directly.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Pass task_info down to storage_group::split.
In the following patches, it will be used to set the parent
of offstrategy_compaction_task_executor and split_compaction_task_executor
running as a part of the split. The task_info param will contain task
info of a split virtual task.
Replace the reader concurrency semaphores for user reads and view
updates with the newly introduced reader concurrency semaphore group,
which assigns a semaphore for each service level.
Each group is statically assigned to some pool of memory on startup and
dynamically distribute this memory between the semaphores, relative to
the number of shares of the corresponding scheduling group.
The intent of having a separate reader concurrency semaphore for each
scheduling group is to prevent priority inversion issues due to reads
with different priorities waiting on the same semaphore, as well as make
memory allocation more fair between service levels due to the adjusted
number of shares.
Commit f2ff701489 introduced
a yield in update_effective_replication_map that might
cause the storage_group manager to be inconsistent with the
new effective_replication_map (e.g. if yielding right
before calling `handle_tablet_split_completion`.
Also, yielding inside storage_service::replicate_to_all_cores
update loop means that base tables and their views
aren't updated atomically, that caused scylladb/scylladb#17786
This change essentially reverts f2ff701489
and makes handle_tablet_split_completion synchronous too.
The stopped compaction groups future is kept as a member and
storage_group_manager::stop() consumes this future during table::stop().
- storage_service: replicate_to_all_cores: update base and view tables atomically
Currently, the loop updating all tables (including views) with the
new effective_replication_map may yield, and therefore expose
a state where the base and view tables effective_replication_map
and topology are out of sync (as seen in scylladb/scylladb#17786)
To prevent that, loop over all base tables and for each table
update the base table and all views atomically, without yielding,
and so allow yielding only between base tables.
* Regression was introduced in f2ff701489, so backport is required to 6.x, 2024.2
Closesscylladb/scylladb#21781
* github.com:scylladb/scylladb:
storage_service: replicate_to_all_cores: clear_gently pending erms
test_mv_topology_change: drop delay_after_erm_update injection case
storage_service: replicate_to_all_cores: update base and view tables atomically
table: make update_effective_replication_map sync again
To reduce test executable size and speed up compilation time, compile unit
tests into a single executable.
Here is a file size comparison of the unit test executable:
- Before applying the patch
$ du -h --exclude='*.o' --exclude='*.o.d' build/release/test/boost/ build/debug/test/boost/
11G build/release/test/boost/
29G build/debug/test/boost/
- After applying the patch
du -h --exclude='*.o' --exclude='*.o.d' build/release/test/boost/ build/debug/test/boost/
5.5G build/release/test/boost/
19G build/debug/test/boost/
It reduces executable sizes 5.5GB on release, and 10GB on debug.
Closes#9155Closesscylladb/scylladb#21443
these unused includes are identified by clang-include-cleaner. after
auditing the source files, all of the reports have been confirmed.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21836
Commit f2ff701489 introduced
a yield in update_effective_replication_map that might
cause the storage_group manager to be inconsistent with the
new effective_replication_map (e.g. if yielding right
before calling `handle_tablet_split_completion`.
Also, yielding inside storage_service::replicate_to_all_cores
update loop means that base tables and their views
aren't updated atomically, that caused scylladb/scylladb#17786
This change essentially reverts f2ff701489
and makes handle_tablet_split_completion synchronous too.
The stopped compaction groups future is kept as a memebr and
storage_group_manager::stop() consumes this future during table::stop().
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
"
This rather large patch series moves storage proxy and some adjacent
services (like migration manager) to use host ids to identify nodes rather
than ips. Messaging service gains a capability to address nodes by host
ids (which allows dropping translations from topology coordinator code
that worked on host ids already) and also makes sure that a node with
incorrect host id will reject a message (can happen during address
changes).
The series gets rid of the raft address map completely and replaces it with
the gossiper address map which is managed by the gossiper since translation
is now done in the layer below raft.
Fixes: scylladb/scylladb#6403
perf-simple-query -- smp 1 -m 1G output
Before:
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
64336.82 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41291 insns/op, 24485 cycles/op, 0 errors)
62669.58 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41277 insns/op, 24695 cycles/op, 0 errors)
69172.12 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41326 insns/op, 24463 cycles/op, 0 errors)
56706.60 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41143 insns/op, 24513 cycles/op, 0 errors)
56416.65 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 41186 insns/op, 24851 cycles/op, 0 errors)
throughput: mean=61860.35 standard-deviation=5395.48 median=62669.58 median-absolute-deviation=5153.75 maximum=69172.12 minimum=56416.65
instructions_per_op: mean=41244.62 standard-deviation=76.90 median=41276.94 median-absolute-deviation=58.55 maximum=41326.19 minimum=41142.80
cpu_cycles_per_op: mean=24601.35 standard-deviation=167.39 median=24512.64 median-absolute-deviation=116.65 maximum=24851.45 minimum=24462.70
After:
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, frontend=cql, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
65237.35 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 40733 insns/op, 23145 cycles/op, 0 errors)
59283.09 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40624 insns/op, 23948 cycles/op, 0 errors)
70851.03 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40625 insns/op, 23027 cycles/op, 0 errors)
70549.61 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40650 insns/op, 23266 cycles/op, 0 errors)
68634.96 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 40622 insns/op, 22935 cycles/op, 0 errors)
throughput: mean=66911.21 standard-deviation=4814.60 median=68634.96 median-absolute-deviation=3638.40 maximum=70851.03 minimum=59283.09
instructions_per_op: mean=40650.89 standard-deviation=47.55 median=40624.60 median-absolute-deviation=27.11 maximum=40733.37 minimum=40622.33
cpu_cycles_per_op: mean=23264.16 standard-deviation=402.12 median=23145.29 median-absolute-deviation=237.63 maximum=23947.96 minimum=22934.59
CI: https://jenkins.scylladb.com/job/scylla-master/job/scylla-ci/13531/
SCT (longevity-100gb-4h with nemesis_selector: ['topology_changes']): https://jenkins.scylladb.com/view/staging/job/scylla-staging/job/gleb/job/move-to-host-id/3/
Tested mixed cluster manually.
"
* 'gleb/move-to-host-id-v2' of github.com:scylladb/scylla-dev: (55 commits)
group0: drop unused field from replace_info struct
test: rename raft_address_map_test to address_map_test and move if from raft tests
raft_address_map: remove raft address map
topology coordinator: do not modify expire state for left/new nodes any more in raft address map
topology coordinator: drop expiring entries in gossiper address map on error injections since raft one is no longer used
group0: drop raft address map dependency from raft_rpc
group0: move raft_ticker_type definition from raft_address_map.hh
storage_service: do not update raft address map on gossiper events
group0: drop raft address map dependency from raft_server_with_timeouts
group0: move group0 upgrade code to host ids
repair: drop raft address map dependency
group0: remove unused raft address map getter from raft_group0
group0: drop raft address map from group0_state_machine dependency since it is not used there any more
group0: remove dependency on raft address map from group0_state_id_handler
gossiper: add get_application_state_ptr that searches by host_id
gossiper: change get_live_token_owners to return host ids
view: move view building to host id
hints: use host id to send hints
storage_proxy: remove id_vector_to_addr since it is no longer used
db: consistency_level: change is_sufficient_live_nodes to work on host ids
...
Scrub compaction can pick up input sstables from maintenance sstable set
but on compaction completion, it doesn't update the maintenance set
leaving the original sstable in set after it has been scrubbed. To fix
this, on compaction completion has to update the maintenance sstable if
the input originated from there. This PR solves the issue by updating the
correct sstable_sets on compaction completion.
Fixes#20030
This issue has existed since the introduction of main and maintenance sstable sets into scrub compaction. It would be good to have the fix backported to versions 6.1 and 6.2.
Closesscylladb/scylladb#21582
* github.com:scylladb/scylladb:
compaction: remove unused `update_sstable_lists_on_off_strategy_completion`
compaction_group: replace `update_sstable_lists_on_off_strategy_completion`
compaction_group: rename `update_main_sstable_list_on_compaction_completion`
compaction_group: update maintenance sstable set on scrub compaction completion
compaction_group: store table::sstable_list_builder::result in replacement_desc
table::sstable_list_builder: remove old sstables only from current list
table::sstable_list_builder: return removed sstables from build_new_list
Updated the method table::sstable_list_builder::build_new_list() to
return the list of sstables that was removed along with the newly built
sstable set. This change will be used to unify the
`update_sstable_lists` variants in a following patch.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Stop taking snapshots of MVs and allow taking snapshot of individual tables, now one can take a snapshot of any base table, any view or index. Also add tests to cover new cases both boost test (using cc code) and pytest (using the API)
Also, update documentation to reflect the change
fixes: #21339fixes: #20760Closesscylladb/scylladb#21433
Modernize the codebase by replacing Boost range adaptors with C++23 standard library views,
reducing external dependencies and leveraging modern C++ language features.
Key Changes:
- Replace `boost::adaptors::filtered` with `std::views::filter`
- Remove `#include <boost/range/adaptor/filtered.hpp>`
- Utilize standard library range views
Motivation:
- Reduce project's external dependency footprint
- Leverage standard library's range and view capabilities
- Improve long-term code maintainability
- Align with modern C++ best practices
Implementation Challenges and Considerations:
1. Range Conversion and Move Semantics
- `std::ranges::to` adaptor requires rvalue references
- Necessitated updates to variable and parameter constness
- Example: `cql3/restrictions/statement_restrictions.cc` modified to remove `const`
from `common` to enable efficient range conversion
2. Range Iteration and Mutation
- Range views may mutate internal state during iteration
- Cannot pass ranges by const reference in some scenarios
- Solution: Pass ranges by rvalue reference to explicitly indicate
state invalidation
Limitations:
- One instance of `boost::adaptors::filtered` temporarily preserved
due to lack of a C++23 alternative for `boost::join()`
- A comprehensive replacement will be addressed in a follow-up change
This change is part of our ongoing effort to modernize the codebase,
reducing external dependencies and adopting modern C++ practices.
Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
Closesscylladb/scylladb#21648
Expose the functionality of `tablet_map::get_token_range_after_split()`
via the replica::table class.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>