Commit Graph

88 Commits

Author SHA1 Message Date
Raphael S. Carvalho
28056344ba replica: Fix take_storage_snapshot() running concurrently to merge completion
Some background:
When a merge happens, a background fiber wakes up to merge the
compaction groups of sibling tablets into the main one. This cannot
happen while the storage group list is being rebuilt, since the token
metadata update is not preemptable. So a storage group, post merge, has
the main compaction group plus two other groups still to be merged into
it. Once that background merge completes, those two groups are empty
and will be freed.

Consider this scenario:
1) merge happens, from 2 tablets to 1
2) it produces a single storage group, containing the main compaction
   group and two other compaction groups to be merged into main.
3) take_storage_snapshot() (t__s__s() below), triggered by migration
   post merge, gets a list of pointers to all compaction groups.
4) t__s__s() iterates first over the main group, then yields.
5) the background fiber wakes up, moves the data into main and frees
   the two other groups.
6) t__s__s() advances to the other groups, which were freed in step 5.
7) segmentation fault

In addition to memory corruption, there's also a potential for data to
escape the iteration in take_storage_snapshot(), since data can be
moved across compaction groups in the background, all belonging to the
same storage group. That could result in data loss.

Readers should all operate at the storage group level, since it can
provide a view of all the data owned by a tablet replica.
The movement of an sstable from group A to B is atomic, but iterating
first over A and then over B might miss data that was moved from B to
A after A was visited but before the iteration reached B.
By switching to storage groups in the interface that retrieves groups
by token range, we guarantee that all data of a given replica can be
found regardless of which compaction group it sits in.
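
A minimal sketch of the fixed shape, in plain C++ with hypothetical
types standing in for the real replica classes: the snapshot goes
through the owning storage group, so every sstable of the replica is
reachable no matter which compaction group currently holds it, and the
shared ownership survives yields.

```cpp
#include <memory>
#include <vector>

struct sstable {};
struct compaction_group { std::vector<std::shared_ptr<sstable>> ssts; };
struct storage_group {
    compaction_group main;
    std::vector<std::unique_ptr<compaction_group>> merging; // emptied in background
};

std::vector<std::shared_ptr<sstable>> take_storage_snapshot(storage_group& sg) {
    std::vector<std::shared_ptr<sstable>> out = sg.main.ssts;
    for (auto& cg : sg.merging) {
        out.insert(out.end(), cg->ssts.begin(), cg->ssts.end());
    }
    return out; // shared_ptrs keep the sstables alive across yields
}
```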

Fixes #23162.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#24058
2025-05-09 14:07:06 +03:00
Raphael S. Carvalho
21d1e78457 compaction: Wire table_state into make_sstable_set()
This will be useful for feeding the token range owned by a compaction
group into the sstable set.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-04-29 15:47:33 -03:00
Benny Halevy
52e1ce7f0d replica: compaction_group, storage_group: use named gate
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-04-12 11:28:49 +03:00
Raphael S. Carvalho
fedd838b9d replica: Fix race of some operations like cleanup with snapshot
There are two semaphores in table for synchronizing changes to the sstable list:

sstable_set_mutation_sem: used to serialize two concurrent operations updating
the list, to prevent them from racing with each other.

sstable_deletion_sem: a deletion guard, used to serialize deletion and
iteration over the list, to prevent iteration from finding deleted files on
disk.

They're always taken in this order to avoid deadlocks:
sstable_set_mutation_sem -> sstable_deletion_sem.

The problem:

A = tablet cleanup
B = take_snapshot()

1) A acquires sstable_set_mutation_sem for updating the list
2) A acquires sstable_deletion_sem, then deletes the sstable before updating the list
3) A releases sstable_deletion_sem, then yields
4) B acquires sstable_deletion_sem
5) B iterates through the list and bumps into the sstable deleted in step 2
6) B fails since it cannot find the file on disk

The initial reaction is to say that no procedure may delete an sstable
before updating the list, and that's true.

But we also want an iteration running concurrently with cleanup to not
find sstables being removed from the system. Otherwise, e.g., snapshot
works with sstables of a tablet that was just cleaned up. That's
achieved by serializing iteration with list updates.
Since sstable_deletion_sem is used within the scope of deletion only,
it's useless for achieving this. Cleanup could acquire the deletion sem
when preparing list updates, and then pass the "permit" to the deletion
function, but then sstable_deletion_sem would essentially become
sstable_set_mutation_sem, which was created exactly to protect the list
update.

That being said, it makes sense to merge both semaphores. Things also
become easier to reason about, and we no longer have to worry about
deadlocks.

Deletion now goes through sstable_list_builder, which holds a permit
throughout its lifetime, guaranteeing that list updates and deletion
are atomic with respect to other concurrent operations. The interface
becomes less error-prone with that. It allowed us to find that
discard_sstables() was deleting without any permit, meaning another
race could happen between truncate and snapshot.

So we're fixing the race of (truncate|cleanup) with take_snapshot, as
far as we know. It's possible that other unknown races are fixed as
well.
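
A minimal sketch of the resulting design, with std::mutex standing in
for the Seastar semaphore and all names illustrative: the builder holds
the single permit for its entire lifetime, so list updates and file
deletion appear atomic to any concurrent snapshot or cleanup.

```cpp
#include <mutex>
#include <vector>

struct sstable {};

class sstable_list_builder {
    std::unique_lock<std::mutex> _permit; // held until the builder dies
    std::vector<sstable>& _list;
public:
    sstable_list_builder(std::mutex& sem, std::vector<sstable>& list)
        : _permit(sem), _list(list) {}
    void update_list(std::vector<sstable> new_list) { _list = std::move(new_list); }
    void delete_files() { /* unlink files under the same permit */ }
};
```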

Fixes #23049.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#23117
2025-03-06 11:00:48 +02:00
Kefu Chai
57b14220ce tree: remove unused "#include"s
These unused includes were identified by clang-include-cleaner. After
auditing the source files, all of the reports have been confirmed.

Among them, `readers/mutation_reader.hh` used `using seastar::future`
to bring `future` into the global namespace instead of including
`seastarx.hh`. This made `readers/mutation_reader.hh` a header exposing
`future<>`, which is not good practice: unlike `seastarx.hh` or
`seastar/core/future.hh`, `readers/mutation_reader.hh` is not
responsible for exposing seastar declarations. So we trade the using
statement for `#include "seastarx.hh"` in that file, to decouple the
source files including it from that statement.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#22439
2025-01-28 14:12:06 +03:00
Botond Dénes
47989b1503 Merge 'tasks: add tablet resize virtual task' from Aleksandra Martyniuk
In this change, tablet_virtual_task starts supporting tablet
resize (i.e. split and merge).

Users can see running resize tasks; finished tasks are not
presented by the task manager API.

A new task state, "suspended", is added. If a resize was revoked,
it will appear to users as suspended. We assume that the resize was
revoked when the tablet count didn't change.

Fixes: #21366.
Fixes: #21367.

No backport, new feature

Closes scylladb/scylladb#21891

* github.com:scylladb/scylladb:
  test: boost: check resize_task_info in tablet_test.cc
  test: add tests to check revoked resize virtual tasks
  test: add tests to check the list of resize virtual tasks
  test: add tests to check split and merge virtual tasks status
  test: test_tablet_tasks: generalize functions
  replica: service: add split virtual task's children
  replica: service: pass parent info down to storage_group::split
  tasks: children of virtual tasks aren't internal by default
  tasks: initialize shard in task_info ctor
  service: extend tablet_virtual_task::abort
  service: return status_helper struct from tablet_virtual_task::get_status_helper
  service: extend tablet_virtual_task::wait
  tasks: add suspended task state
  service: extend tablet_virtual_task::get_status
  service: extend tablet_virtual_task::contains
  service: extend tablet_virtual_task::get_stats
  service: add service::task_manager_module::get_nodes
  tasks: add task_manager::get_nodes
  tasks: drop noexcept from module::get_nodes
  replica: service: add resize_task_info static column to system.tablets
  locator: extend tablet_task_info to cover resize tasks
2025-01-17 14:24:07 +02:00
Botond Dénes
55963f8f79 replica: remove noexcept from token -> tablet resolution path
The methods that resolve a key/token/range to a table are all noexcept.
Yet the method beneath all of them, `storage_group_for_id()`, can throw.
This means that if, due to any mistake, a tablet without a local replica
is looked up, the node crashes, as the exception bubbles up into the
noexcept methods.
There is no value in pretending that looking up the tablet replica is
noexcept, so remove the noexcept specifiers; any bad lookup then only
fails the operation at hand and doesn't crash the node. This is
especially relevant to replace, which still has a window where writes
can arrive for tablets that don't (yet) have a local replica. Currently,
this results in a crash. After this patch, it will only fail the
writes and the replace can move on.
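
For illustration, a self-contained example (not ScyllaDB code) of why
the noexcept specifier is the problem: an exception escaping a noexcept
function calls std::terminate, while the same throw from a non-noexcept
function fails only the operation at hand.

```cpp
#include <cstdio>
#include <stdexcept>

int resolve(int id) noexcept {
    if (id < 0) throw std::out_of_range("no local replica"); // terminates
    return id;
}

int resolve_fixed(int id) {   // no noexcept: callers can handle the error
    if (id < 0) throw std::out_of_range("no local replica");
    return id;
}

int main() {
    try {
        resolve_fixed(-1);
    } catch (const std::exception& e) {
        std::puts(e.what());  // only this operation fails; the node lives
    }
    // resolve(-1);           // would call std::terminate and abort the process
}
```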

Fixes: #21480

Closes scylladb/scylladb#22251
2025-01-17 11:24:09 +03:00
Aleksandra Martyniuk
7ef6900837 replica: service: pass parent info down to storage_group::split
Pass task_info down to storage_group::split.

In the following patches, it will be used to set the parent
of offstrategy_compaction_task_executor and split_compaction_task_executor
running as a part of the split. The task_info param will contain task
info of a split virtual task.
2025-01-10 10:03:08 +01:00
Avi Kivity
4905b1bf76 Merge 'table: make update_effective_replication_map sync again' from Benny Halevy
Commit f2ff701489 introduced
a yield in update_effective_replication_map that might
cause the storage_group manager to be inconsistent with the
new effective_replication_map (e.g. if yielding right
before calling `handle_tablet_split_completion`).

Also, yielding inside the storage_service::replicate_to_all_cores
update loop means that base tables and their views
aren't updated atomically, which caused scylladb/scylladb#17786

This change essentially reverts f2ff701489
and makes handle_tablet_split_completion synchronous too.
The stopped compaction groups future is kept as a member and
storage_group_manager::stop() consumes this future during table::stop().

- storage_service: replicate_to_all_cores: update base and view tables atomically

Currently, the loop updating all tables (including views) with the
new effective_replication_map may yield, and therefore expose
a state where the base and view tables effective_replication_map
and topology are out of sync (as seen in scylladb/scylladb#17786)

To prevent that, loop over all base tables and for each table
update the base table and all views atomically, without yielding,
and so allow yielding only between base tables.
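
A sketch of the loop shape described above, with hypothetical types
standing in for storage_service and table:

```cpp
#include <string>
#include <vector>

struct table {
    std::string name;
    std::vector<table*> views;
    int erm_version = 0;
};

// Update a base table and all of its views in one non-yielding step;
// yielding is allowed only between base tables.
void replicate_to_all_cores(std::vector<table*>& base_tables, int new_version) {
    for (table* base : base_tables) {
        base->erm_version = new_version;      // base and views move together,
        for (table* view : base->views) {     // with no yield in between
            view->erm_version = new_version;
        }
        // maybe_yield();  // safe only here, between base tables
    }
}
```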

* Regression was introduced in f2ff701489, so backport is required to 6.x, 2024.2

Closes scylladb/scylladb#21781

* github.com:scylladb/scylladb:
  storage_service: replicate_to_all_cores: clear_gently pending erms
  test_mv_topology_change: drop delay_after_erm_update injection case
  storage_service: replicate_to_all_cores: update base and view tables atomically
  table: make update_effective_replication_map sync again
2024-12-30 23:42:06 +02:00
Avi Kivity
f3eade2f62 treewide: relicense to ScyllaDB-Source-Available-1.0
Drop the AGPL license in favor of a source-available license.
See the blog post [1] for details.

[1] https://www.scylladb.com/2024/12/18/why-were-moving-to-a-source-available-license/
2024-12-18 17:45:13 +02:00
Kefu Chai
e65fc35b5e replica: do not include unused headers
These unused includes were identified by clang-include-cleaner. After
auditing the source files, all of the reports have been confirmed.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>

Closes scylladb/scylladb#21836
2024-12-18 13:52:57 +02:00
Raphael S. Carvalho
013e0d53ff replica: Fix use-after-free due to a race between split and cleanup
There is an assumption that every destroyed compaction_group will be
stopped first. Otherwise, the group is still referenced by the
compaction manager, which can use it after it's freed. That's what
happened in issue #21867 in the context of merge.

The issue is pre-existing but was made more likely by merge.

One problem is a race between split and cleanup: if a split is emitted
while cleanup is stopping groups, split preparation can add new groups
that will never be closed, since cleanup is already past the
group-stopping step.

Another problem found is that the split completion handler does not
account for the possible existence of merging groups if a split happens
right after a merge. The split completion handler should stop all empty
groups that previously had data split from them.

The problems are fixed by guaranteeing that new groups will not be
added for a tablet being migrated away, and that empty groups are
properly closed when handling split completion.

A reproducer was added.
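
The invariant being enforced can be sketched in plain C++ (hypothetical
classes; the real stop() is asynchronous and waits for in-flight
compaction work before returning):

```cpp
#include <cassert>
#include <memory>

struct compaction_group {
    bool stopped = false;
    void stop() { stopped = true; }
    ~compaction_group() { assert(stopped && "destroying an unstopped group"); }
};

int main() {
    auto g = std::make_unique<compaction_group>();
    g->stop();   // cleanup / split-completion handling stops the group first
    g.reset();   // only then is destroying it safe
}
```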

Fixes #21867.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#21920
2024-12-16 13:19:26 +02:00
Benny Halevy
10c4cf930c table: make update_effective_replication_map sync again
Commit f2ff701489 introduced
a yield in update_effective_replication_map that might
cause the storage_group manager to be inconsistent with the
new effective_replication_map (e.g. if yielding right
before calling `handle_tablet_split_completion`).

Also, yielding inside the storage_service::replicate_to_all_cores
update loop means that base tables and their views
aren't updated atomically, which caused scylladb/scylladb#17786

This change essentially reverts f2ff701489
and makes handle_tablet_split_completion synchronous too.
The stopped compaction groups future is kept as a member and
storage_group_manager::stop() consumes this future during table::stop().

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-12-15 11:45:08 +02:00
Raphael S. Carvalho
70b3963b8d replica: Implement merging of compaction groups on merge completion
When handling merge completion, compaction groups that belonged to
sibling tablets are placed into the same storage group, since those
tablets become one after the merge.

In order to merge two groups, the source group needs its memtable
flushed first, so that all the data can be moved into the destination.

The handling happens in update_effective_replication_map(), which
cannot afford to wait for I/O, so the group merge happens in the
background. A fiber wakes up on merge completion, iterates through the
new set of storage groups (after merge), and works on merging the
additional compaction groups into the main one.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 21:47:01 -03:00
Raphael S. Carvalho
907739f3d1 replica: Handle tablet merge completion
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-12-03 21:47:01 -03:00
Lakshmi Narayanan Sreethar
91148e7747 compaction: remove unused update_sstable_lists_on_off_strategy_completion
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-11-28 11:25:11 +05:30
Lakshmi Narayanan Sreethar
4f7b1c0bcd compaction_group: rename update_main_sstable_list_on_compaction_completion
Rename `update_main_sstable_list_on_compaction_completion` to
`update_sstable_sets_on_compaction_completion` as the method updates
both main and maintenance sstable sets now.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-11-28 11:25:11 +05:30
Lakshmi Narayanan Sreethar
c4db4abcae replica/table: implement get_token_range_after_split() wrappers
Expose the functionality of `tablet_map::get_token_range_after_split()`
via the replica::table class.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-11-11 12:24:00 +05:30
Lakshmi Narayanan Sreethar
6a357b55e3 compaction_group: track maximum timestamp across all sstables
This will be used in a following patch to decide if the compacting
reader has to check the memtables before purging a tombstone.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2024-10-18 19:19:11 +05:30
Ferenc Szili
e472844a78 database::table: add tombstone_gc_enabled(locator::tablet_id)
This change adds the flag tombstone_gc_enabled to compaction_group.
The value of this flag will be set in
tablet_storage_group_manager::update_effective_replication_map().
2024-10-02 16:24:45 +02:00
Raphael S. Carvalho
cf58674029 replica: Fix schema change during migration cleanup
During migration cleanup, there's a small window in which the storage
group was stopped but not yet removed from the list, so concurrent
operations traversing the list could work with stopped groups.

During a test that emitted schema changes during migrations, a failure
happened when updating the compaction strategy of a table: since the
group was stopped, the compaction manager was unable to find the state
for that group.

To fix it, we'll skip stopped groups when traversing the list, since
they're unused at this stage of migration and going away soon.
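
A sketch of the traversal with the skip, using hypothetical types:

```cpp
#include <functional>
#include <map>
#include <memory>

struct storage_group {
    bool stopped = false;
};

// A group that was stopped by migration cleanup but not yet removed
// from the list is simply skipped by concurrent traversals.
void for_each_storage_group(std::map<int, std::unique_ptr<storage_group>>& groups,
                            const std::function<void(storage_group&)>& f) {
    for (auto& [id, sg] : groups) {
        if (sg->stopped) {
            continue;   // unused at this stage of migration, going away soon
        }
        f(*sg);
    }
}
```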

Fixes #20699.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#20798
2024-09-30 17:30:38 +03:00
Botond Dénes
ea29fe579b Merge 'replica: ignore cleanup of deallocated storage group' from Aleksandra Martyniuk
Cleanup of a deallocated tablet throws an exception.
Since failed cleanup is retried, we end up in an infinite loop.

Ignore cleanup of deallocated storage groups.

Fixes: #19752.

Needs to be backported to all branches with tablets (6.0 and later)

Closes scylladb/scylladb#20584

* github.com:scylladb/scylladb:
  test: check if cleanup of deallocated sg is ignored
  replica: ignore cleanup of deallocated storage group
2024-09-16 09:22:56 +03:00
Aleksandra Martyniuk
20d6cf55f2 replica: ignore cleanup of deallocated storage group
Currently, an attempt to clean up a deallocated storage group throws
an exception. Since failed tablet cleanup is retried, it gets stuck in
an endless loop.

Ignore cleanup of deallocated storage groups.
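
Conceptually, the fix makes cleanup of a missing group a no-op; a
sketch with an illustrative signature:

```cpp
#include <map>
#include <memory>

struct storage_group {};

// Cleanup of an already-deallocated storage group returns instead of
// throwing, so the retry loop terminates.
void cleanup_tablet(std::map<int, std::unique_ptr<storage_group>>& groups,
                    int tablet_id) {
    auto it = groups.find(tablet_id);
    if (it == groups.end()) {
        return;          // already deallocated: nothing to clean up
    }
    // the real code stops the group before freeing it
    groups.erase(it);
}
```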
2024-09-13 13:00:53 +02:00
Benny Halevy
6f202cf48b compaction_group, storage_group, table_state: add extended timestamp stats getters
To return the minimum live timestamp and live row-marker
timestamp across a compaction_group, storage_group, or
table_state.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-09-10 19:05:57 +03:00
Aleksandra Martyniuk
5005e19de7 compaction: fix compaction group stop reason
compaction_manager::remove passes "table removal" as the reason for
stopping ongoing compactions, but the remove method is currently also
called when a tablet is migrated or split.

Pass the actual reason for stopping the compaction, so that logs aren't
misleading.
2024-08-21 12:42:09 +02:00
Raphael S. Carvalho
74612ad358 tablets: Fix race between repair and split
Consider the following:

T
0   split prepare starts
1                               repair starts
2   split prepare finishes
3                               repair adds unsplit sstables
4                               repair ends
5   split executes

If repair produces an sstable after the split prepare phase, the
replica will not split that sstable later, as the prepare phase is
considered completed already. That causes split execution to fail, as
the replicas weren't really prepared. This can also be triggered by
load-and-stream, which shares the same write (consumer) path.

The approach to fix this is the same one employed to prevent a race
between split and migration. If migration happens during the prepare
phase, the source may miss the split request, but the tablet will
still be split on the destination (if needed). Similarly, the repair
writer becomes responsible for splitting the data if the underlying
table is in split mode. That's implemented in replica::table for
correctness, so if the node crashes, a new sstable that missed
splitting is still split before being added to the set.
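
The writer-side splitting can be pictured like this (purely
illustrative types; the real code partitions by the token ranges of the
two sibling tablets):

```cpp
#include <utility>
#include <vector>

struct row { long token; };

// In split mode the writer partitions incoming data at the tablet's
// split point, so each sealed sstable belongs entirely to one side and
// needs no splitting later.
std::pair<std::vector<row>, std::vector<row>>
split_for_write(const std::vector<row>& incoming, long split_token) {
    std::pair<std::vector<row>, std::vector<row>> halves;
    for (const row& r : incoming) {
        (r.token < split_token ? halves.first : halves.second).push_back(r);
    }
    return halves;   // each half is written as its own sstable
}
```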

Fixes #19378.
Fixes #19416.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-08-12 17:28:51 -03:00
Raphael S. Carvalho
518677d7f9 replica: Fix comment about compaction group
There's not a 1:1 relationship between compaction group count and
tablet count: a tablet replica has a storage group instance, which may
map to multiple compaction groups during split mode.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-07-12 16:24:51 -03:00
Raphael S. Carvalho
f139aa1df6 replica: remove unused compaction_group_vector
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-07-12 16:16:47 -03:00
Raphael S. Carvalho
c539b7c861 replica: remove rwlock for protecting iteration over storage group map
The rwlock was added to protect iterations against concurrent updates
to the map.

The updates can happen when allocating a new tablet replica or removing
an old one (tablet cleanup).

The rwlock is very problematic because it can block topology changes:
updating token metadata takes the exclusive lock, which is serialized
with table-wide ops like split / major / explicit flush (and those can
take a long time).

To get rid of the lock, we can copy the storage group map and guard
individual groups with a gate (not a problem, since the map is expected
to have a maximum of ~100 elements). Cleanup can then close that gate
(carefully, after stopping individual groups, so that migrations aren't
blocked by long-running ops like major), and ongoing iterations (e.g.
triggered by nodetool flush) can skip a group that was closed, as such
a group is being migrated out.

See the documentation added to compaction_group.hh to understand how
concurrent iterations and updates to the map work without the rwlock.

Yielding variants that iterate over groups no longer return the group
id, since id stability can no longer be guaranteed without serializing
split finalization and iteration.
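
A compact sketch of the scheme (a toy, synchronous gate; Seastar's gate
is asynchronous):

```cpp
#include <map>
#include <memory>

struct gate {
    bool closed = false;
    bool try_enter() { return !closed; }
    void leave() {}
};

struct storage_group { gate g; };

template <typename Func>
void for_each_group(const std::map<int, std::shared_ptr<storage_group>>& groups,
                    Func f) {
    auto copy = groups;               // small map (~100 entries): copying is cheap
    for (auto& [id, sg] : copy) {     // safe against concurrent map updates
        if (!sg->g.try_enter()) {
            continue;                 // gate closed by cleanup: group migrating out
        }
        f(*sg);
        sg->g.leave();
    }
}
```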

Fixes #18821.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-07-09 16:59:24 -03:00
Raphael S. Carvalho
ad5c5bca5f replica: get rid of fragile compaction group intrusive list
It was added to make the integration of storage groups easier, but it's
complicated, since it's another source of truth and we could have
problems if it becomes inconsistent with the group map.

Fixes #18506.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-07-09 16:53:35 -03:00
Kefu Chai
e87b64b7bb compaction: not include unused headers
These unused includes were identified by clangd. See
https://clangd.llvm.org/guides/include-cleaner#unused-include-warning
for more details on the "Unused include" warning.

Signed-off-by: Kefu Chai <kefu.chai@scylladb.com>
2024-07-02 14:06:42 +08:00
Avi Kivity
9ecf4ada49 compaction: compaction_group: define destructor out-of-line
Define compaction_group::~compaction_group() out-of-line to prevent
problems instantiating compaction_group::_table_state, which is an
std::unique_ptr. In C++23, std::unique_ptr is constexpr, which means
its methods (in this case the destructor) require seeing the definition
of the class at the point of instantiation.
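
A distilled example of the pattern, with hypothetical names and both
"files" shown in one listing:

```cpp
#include <memory>

// --- widget.hh (sketch) ---
struct impl;                        // incomplete type in the header
struct widget {
    widget();
    ~widget();                      // declared here, defined out-of-line
    std::unique_ptr<impl> _impl;    // OK: destructor not instantiated yet
};

// --- widget.cc ---
struct impl { int x = 0; };         // impl is complete here...
widget::widget() : _impl(std::make_unique<impl>()) {}
widget::~widget() = default;        // ...so ~unique_ptr<impl>() can be instantiated
```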
2024-06-27 17:54:12 +03:00
Raphael S. Carvalho
6214dda506 replica: return storage_group by reference on storage_group_for*()
Those functions cannot return nullptr and will throw when the group is
not found, so it's better to return a reference instead.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-06-12 11:53:06 -03:00
Raphael S. Carvalho
9c1d3bcc02 replica: devirtualize storage_group_of()
It can be made private to tablet_storage_group_manager.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-06-12 11:29:49 -03:00
Raphael S. Carvalho
7b41630299 replica: Refresh mutation source when allocating tablet replicas
Consider the following:

1) table A has N tablets and views
2) migration starts for a tablet of A from node 1 to node 2
3) migration is at the write_both_read_old stage
4) the coordinator pushes writes to both nodes (pending and leaving)
5) A has views, so writes to it also result in reads (table::push_view_replica_updates())
6) the tablet's update_effective_replication_map() is not refreshing the tablet sstable set (for the new tablet migrating in)
7) so the read in step 5 cannot find the sstable set for the tablet migrating in

This causes the following error:
"tablets - SSTable set wasn't found for tablet 21 of table mview.users"

which means loss of a write on the pending replica.

The fix refreshes the table's sstable set (tablet_sstable_set) and the cache's snapshot.
It's not a problem to refresh the cache snapshot as long as the logical
state of the data hasn't changed, which is true when allocating new
tablet replicas. That's also done in the context of compactions, for example.

Fixes #19052.
Fixes #19033.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#19099
2024-06-11 06:59:04 +03:00
Raphael S. Carvalho
abcc68dbe7 storage_service: Fix race between tablet split and stats retrieval
If tablet split is finalized while stats are being retrieved, the saved
erm, used by all shards, will be invalidated. That can either cause
incorrect behavior or a crash if an id is not available.

It's worked around by feeding the local tablet map into the
"coordinator" collecting stats from all shards. We also no longer have
a snapshot of the erm shared between shards to help intra-node
migration. This is simplified by serializing token metadata changes
with the retrieval of the stats (the latter should complete pretty
fast, so it shouldn't block the former for any significant time).

Fixes #18085.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-05-22 09:25:29 -03:00
Raphael S. Carvalho
715ae689c0 Implement fast streaming for intra-node migration
With intra-node migration, all the movement is local, so we can make
streaming faster by just cloning the sstable set of the leaving replica
and loading it into the pending one.

This cloning is specific to the underlying storage, and s3 doesn't
support snapshot() yet (the sstables::storage procedure that clone is
built upon). It's only supported by the file system, with the help of
hard links. A new generation is picked for each cloned sstable, and it
will live in the same directory as the original.

A challenge I bumped into was understanding why the table refused to
load the sstables at the pending replica, considering them foreign.
Later I realized that the sharder (for reads) at this stage of
migration points only to the leaving replica. This didn't fail with
mutation-based streaming, because the sstable writer considers the
shard that the sstable was written on as its owner, regardless of what
the sharder says. That was fixed by mimicking this behavior during
loading at the pending replica.
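
At the filesystem level, the clone amounts to a hard link under a new
generation; a sketch with illustrative naming (real sstable file names
carry more components):

```cpp
#include <filesystem>
#include <string>

void clone_sstable(const std::filesystem::path& dir,
                   const std::string& original, long new_generation) {
    auto src = dir / (original + "-Data.db");
    auto dst = dir / (std::to_string(new_generation) + "-Data.db");
    std::filesystem::create_hard_link(src, dst);   // instant: no data copied
}
```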

test:
./test.py --mode=dev intranode --repeat=100 passes.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-05-16 00:28:47 +02:00
Botond Dénes
afa870a387 Merge 'Some sstable set related improvements' from Raphael "Raph" Carvalho
Closes scylladb/scylladb#18616

* github.com:scylladb/scylladb:
  replica: Make it explicit table's sstable set is immutable
  replica: avoid reallocations in tablet_sstable_set
  replica: Avoid compound set if only one sstable set is filled
2024-05-13 14:17:24 +03:00
Raphael S. Carvalho
7faba69f28 replica: Make it explicit table's sstable set is immutable
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-05-10 11:58:08 -03:00
Raphael S. Carvalho
35a0d47408 replica: Avoid compound set if only one sstable set is filled
Most of the time only the main set is filled, so we can avoid one layer
of indirection (the compound set) when the maintenance set is empty.
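
A sketch of the idea with stand-in types and an assumed helper name:

```cpp
#include <memory>
#include <vector>

struct sstable_set {
    virtual ~sstable_set() = default;
};

struct compound_sstable_set : sstable_set {
    std::vector<std::shared_ptr<sstable_set>> sets;
};

// Build the compound wrapper only when the maintenance set has content;
// otherwise hand back the main set directly and save one layer of
// indirection on every read.
std::shared_ptr<sstable_set> make_set(std::shared_ptr<sstable_set> main,
                                      std::shared_ptr<sstable_set> maintenance,
                                      bool maintenance_empty) {
    if (maintenance_empty) {
        return main;
    }
    auto compound = std::make_shared<compound_sstable_set>();
    compound->sets = {std::move(main), std::move(maintenance)};
    return compound;
}
```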

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-05-10 10:44:34 -03:00
Aleksandra Martyniuk
b4371a0ea0 replica: allocate storage groups dynamically
Currently, empty storage_groups are allocated even for tablets that are
not on this shard.

Allocate storage groups dynamically, i.e.:
- on table creation allocate only storage groups that are on this
  shard;
- allocate a storage group for tablet that is moved to this shard;
- deallocate storage group for tablet that is cleaned up.

Stop the compaction group before it's deallocated.

Add a flag to table::cleanup_tablet deciding whether to deallocate
storage groups, and use it in commitlog tests.
2024-05-10 15:08:21 +02:00
Aleksandra Martyniuk
c283746b32 replica: add rwlock to storage_group_manager
Add an rwlock which prevents storage groups from being added/deleted
while some other layer iterates over them (or their compaction
groups).

Add methods to iterate over storage groups with the lock held.
2024-05-10 14:56:38 +02:00
Aleksandra Martyniuk
8505389963 replica: drop single_compaction_group_if_available
Drop single_compaction_group_if_available as it's unused.
2024-05-10 14:56:38 +02:00
Raphael S. Carvalho
9f93dd9fa3 replica: Use flat_hash_map for tablet storage
The reason we want to switch to flat_hash_map is that only a small
subset of tablets will be allocated on any given shard, so a sparse
array is wasteful and iterations over it are slow.
Also, the map gives greater development flexibility, as one doesn't
have to worry about empty entries.
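
Assuming Abseil's absl::flat_hash_map (which the name suggests), the
shape of the change can be sketched as keying by tablet id, so only
locally present tablets occupy entries:

```cpp
#include <absl/container/flat_hash_map.h>
#include <memory>

struct storage_group {};
using tablet_id = unsigned long;

// Memory use is proportional to what's actually on this shard, and
// iteration never walks empty slots the way a sparse array indexed by
// tablet id would.
using storage_group_map =
    absl::flat_hash_map<tablet_id, std::unique_ptr<storage_group>>;

int main() {
    storage_group_map groups;
    groups.emplace(42, std::make_unique<storage_group>());
    for (auto& [id, sg] : groups) { (void)id; (void)sg; }
}
```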

perf result:

-- reads

scylla_with_chunked_vector-read-no-tablets.txt
median 73223.28 tps ( 62.3 allocs/op,  13.3 tasks/op,   41932 insns/op,        0 errors)
median 74952.87 tps ( 62.3 allocs/op,  13.3 tasks/op,   41969 insns/op,        0 errors)
median 73016.37 tps ( 62.3 allocs/op,  13.3 tasks/op,   41934 insns/op,        0 errors)
median 74078.14 tps ( 62.3 allocs/op,  13.3 tasks/op,   41938 insns/op,        0 errors)
median 75323.07 tps ( 62.3 allocs/op,  13.3 tasks/op,   41944 insns/op,        0 errors)

scylla_with_hash_map-read-no-tablets.txt
median 74963.30 tps ( 62.3 allocs/op,  13.3 tasks/op,   41926 insns/op,        0 errors)
median 74032.09 tps ( 62.3 allocs/op,  13.3 tasks/op,   41918 insns/op,        0 errors)
median 74850.09 tps ( 62.3 allocs/op,  13.3 tasks/op,   41937 insns/op,        0 errors)
median 74239.37 tps ( 62.3 allocs/op,  13.3 tasks/op,   41921 insns/op,        0 errors)
median 74798.14 tps ( 62.3 allocs/op,  13.3 tasks/op,   41925 insns/op,        0 errors)

scylla_with_chunked_vector-read-tablets-1.txt
median 74234.27 tps ( 62.1 allocs/op,  13.3 tasks/op,   41903 insns/op,        0 errors)
median 75775.98 tps ( 62.1 allocs/op,  13.3 tasks/op,   41910 insns/op,        0 errors)
median 76481.56 tps ( 62.1 allocs/op,  13.2 tasks/op,   41874 insns/op,        0 errors)
median 74056.67 tps ( 62.1 allocs/op,  13.3 tasks/op,   41894 insns/op,        0 errors)
median 75287.68 tps ( 62.1 allocs/op,  13.3 tasks/op,   41894 insns/op,        0 errors)

scylla_with_hash_map-read-tablets-1.txt
median 75613.63 tps ( 62.1 allocs/op,  13.2 tasks/op,   41990 insns/op,        0 errors)
median 74819.51 tps ( 62.1 allocs/op,  13.2 tasks/op,   41973 insns/op,        0 errors)
median 75648.41 tps ( 62.1 allocs/op,  13.3 tasks/op,   42025 insns/op,        0 errors)
median 74170.89 tps ( 62.1 allocs/op,  13.2 tasks/op,   42002 insns/op,        0 errors)
median 75447.72 tps ( 62.1 allocs/op,  13.3 tasks/op,   41952 insns/op,        0 errors)

scylla_with_chunked_vector-read-tablets-128.txt
median 73788.57 tps ( 62.1 allocs/op,  13.2 tasks/op,   41956 insns/op,        0 errors)
median 76563.63 tps ( 62.1 allocs/op,  13.3 tasks/op,   42006 insns/op,        0 errors)
median 75536.12 tps ( 62.1 allocs/op,  13.2 tasks/op,   42005 insns/op,        0 errors)
median 74679.17 tps ( 62.1 allocs/op,  13.3 tasks/op,   41958 insns/op,        0 errors)
median 75380.95 tps ( 62.1 allocs/op,  13.2 tasks/op,   41946 insns/op,        0 errors)

scylla_with_hash_map-read-tablets-128.txt
median 75459.99 tps ( 62.1 allocs/op,  13.3 tasks/op,   42055 insns/op,        0 errors)
median 74280.11 tps ( 62.1 allocs/op,  13.3 tasks/op,   42085 insns/op,        0 errors)
median 74502.61 tps ( 62.1 allocs/op,  13.3 tasks/op,   42063 insns/op,        0 errors)
median 74692.27 tps ( 62.1 allocs/op,  13.3 tasks/op,   41994 insns/op,        0 errors)
median 75402.64 tps ( 62.1 allocs/op,  13.3 tasks/op,   42015 insns/op,        0 errors)

-- writes

scylla_with_chunked_vector-write-no-tablets.txt
median 68635.17 tps ( 58.4 allocs/op,  13.3 tasks/op,   52709 insns/op,        0 errors)
median 68716.36 tps ( 58.4 allocs/op,  13.3 tasks/op,   52691 insns/op,        0 errors)
median 68512.76 tps ( 58.4 allocs/op,  13.3 tasks/op,   52721 insns/op,        0 errors)
median 68606.14 tps ( 58.4 allocs/op,  13.3 tasks/op,   52696 insns/op,        0 errors)
median 68619.25 tps ( 58.4 allocs/op,  13.3 tasks/op,   52697 insns/op,        0 errors)

scylla_with_hash_map-write-no-tablets.txt
median 67678.10 tps ( 58.4 allocs/op,  13.3 tasks/op,   52723 insns/op,        0 errors)
median 67966.06 tps ( 58.4 allocs/op,  13.3 tasks/op,   52736 insns/op,        0 errors)
median 67881.47 tps ( 58.4 allocs/op,  13.3 tasks/op,   52743 insns/op,        0 errors)
median 67856.81 tps ( 58.4 allocs/op,  13.3 tasks/op,   52730 insns/op,        0 errors)
median 67812.58 tps ( 58.4 allocs/op,  13.3 tasks/op,   52740 insns/op,        0 errors)

scylla_with_chunked_vector-write-tablets-1.txt
median 67741.83 tps ( 58.4 allocs/op,  13.3 tasks/op,   53425 insns/op,        0 errors)
median 68014.20 tps ( 58.4 allocs/op,  13.3 tasks/op,   53455 insns/op,        0 errors)
median 68228.48 tps ( 58.4 allocs/op,  13.3 tasks/op,   53447 insns/op,        0 errors)
median 67950.96 tps ( 58.4 allocs/op,  13.3 tasks/op,   53443 insns/op,        0 errors)
median 67832.69 tps ( 58.4 allocs/op,  13.3 tasks/op,   53462 insns/op,        0 errors)

scylla_with_hash_map-write-tablets-1.txt
median 66873.70 tps ( 58.4 allocs/op,  13.3 tasks/op,   53548 insns/op,        0 errors)
median 67568.23 tps ( 58.4 allocs/op,  13.3 tasks/op,   53547 insns/op,        0 errors)
median 67653.70 tps ( 58.4 allocs/op,  13.3 tasks/op,   53525 insns/op,        0 errors)
median 67389.21 tps ( 58.4 allocs/op,  13.3 tasks/op,   53536 insns/op,        0 errors)
median 67437.91 tps ( 58.4 allocs/op,  13.3 tasks/op,   53537 insns/op,        0 errors)

scylla_with_chunked_vector-write-tablets-128.txt
median 67115.41 tps ( 58.3 allocs/op,  13.3 tasks/op,   53341 insns/op,        0 errors)
median 66836.07 tps ( 58.3 allocs/op,  13.3 tasks/op,   53342 insns/op,        0 errors)
median 67214.07 tps ( 58.3 allocs/op,  13.3 tasks/op,   53303 insns/op,        0 errors)
median 67198.25 tps ( 58.3 allocs/op,  13.3 tasks/op,   53347 insns/op,        0 errors)
median 67368.78 tps ( 58.3 allocs/op,  13.3 tasks/op,   53374 insns/op,        0 errors)

scylla_with_hash_map-write-tablets-128.txt
median 66273.50 tps ( 58.3 allocs/op,  13.3 tasks/op,   53400 insns/op,        0 errors)
median 66564.89 tps ( 58.3 allocs/op,  13.3 tasks/op,   53432 insns/op,        0 errors)
median 66568.52 tps ( 58.3 allocs/op,  13.3 tasks/op,   53408 insns/op,        0 errors)
median 66368.00 tps ( 58.3 allocs/op,  13.3 tasks/op,   53441 insns/op,        0 errors)
median 66293.55 tps ( 58.3 allocs/op,  13.3 tasks/op,   53408 insns/op,        0 errors)

Fixes #18010.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#18093
2024-04-04 16:25:48 +03:00
Raphael S. Carvalho
29f9f7594f replica: Kill table::storage_group_id_for_token()
storage_group_id_for_token() was only needed from within
tablet_storage_group_manager, so we can kill
table::storage_group_id_for_token().

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#18134
2024-04-02 09:23:23 +03:00
Raphael S. Carvalho
2c9b13d2d1 compaction: Check for key presence in memtable when calculating max purgeable timestamp
It was observed that some use cases might constantly append old data to
the memtable, blocking GC of expired tombstones.

That's because the memtable's timestamp is unconditionally used for
calculating the max purgeable timestamp, even when the memtable doesn't
contain the key of the tombstone we're trying to GC.

The idea is to treat the memtable as we treat L0 sstables, i.e. it will
only prevent GC if it contains data that is possibly shadowed by the
expired tombstone (after checking for key presence and timestamp).

The memtable will usually hold a small subset of the keys in the
largest tier, so after this change, a large fraction of keys containing
expired tombstones can be GCed even when the memtable contains old
data.
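
A simplified model of the new rule (toy types; the real check consults
the memtable's contents and timestamps through the table_state
interface):

```cpp
#include <cstdint>
#include <limits>
#include <set>
#include <string>

struct memtable {
    std::set<std::string> keys;
    int64_t min_timestamp;
    bool contains(const std::string& key) const { return keys.count(key) != 0; }
};

// The memtable caps the max purgeable timestamp only when it may hold
// data shadowed by the tombstone; otherwise it no longer blocks GC.
int64_t max_purgeable(const memtable& mt, const std::string& tombstone_key) {
    if (!mt.contains(tombstone_key)) {
        return std::numeric_limits<int64_t>::max();
    }
    return mt.min_timestamp;
}
```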

Fixes #17599.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#17835
2024-03-18 13:37:44 +02:00
Benny Halevy
0745865914 storage_group_manager: add make_sstable_set
Move the responsibility for preparing the sstable_set
covering all sstables in the table to the storage_group_manager
so it can specialize the sstable_set.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-03-06 10:35:36 +02:00
Benny Halevy
7f203f0551 table: move compaction_group_list and storage_group_vector to storage_group_manager
So the storage_group_manager can be used later by table_sstable_set.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-03-06 10:35:33 +02:00
Benny Halevy
bfe13daed4 compaction_group, table: make_compound_sstable_set: declare as const
It does not modify the compaction_group/table respectively.
This is required by the next patch.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-03-06 10:15:34 +02:00
Benny Halevy
d7b1851449 tablet_storage_group_manager: precalculate my_host_id and _tablet_map
The node host_id never changes, so get it once,
when the object is constructed.

A pointer to the tablet_map is taken at construction,
using the effective_replication_map, and it is
updated whenever the e_r_m changes, via a newly added
`update_effective_replication_map` method.
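
A sketch of the caching, with hypothetical class shapes:

```cpp
#include <string>
#include <utility>

struct tablet_map {};
struct effective_replication_map {
    tablet_map tmap;
    const tablet_map& get_tablet_map() const { return tmap; }
};

// The immutable host id is resolved once at construction, and the
// tablet_map pointer is repointed on each e_r_m change instead of
// being re-resolved on every lookup.
class tablet_storage_group_manager {
    std::string _my_host_id;         // never changes for the node's lifetime
    const tablet_map* _tablet_map;
public:
    tablet_storage_group_manager(std::string host_id,
                                 const effective_replication_map& erm)
        : _my_host_id(std::move(host_id))
        , _tablet_map(&erm.get_tablet_map()) {}

    void update_effective_replication_map(const effective_replication_map& erm) {
        _tablet_map = &erm.get_tablet_map();
    }
};
```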

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2024-03-06 10:15:34 +02:00