Commit Graph

477 Commits

Author SHA1 Message Date
Raphael S. Carvalho
721f0ff46d replica: Fix race between split compaction and migration
After removal of rwlock (53a6ec05ed), the race was introduced because the order that
compaction groups of a tablet are closed, is no longer deterministic.

Some background first:
Split compaction runs in main (unsplit) group, and adds sstable to left and right groups
on completion.

The race works as follow:
1) split compaction starts on main group of tablet X
2) tablet X reaches cleanup stage, so its compaction groups are closed in parallel
3) left or right group are closed before main (more likely when only main has flush work to do)
4) split compaction completes, and adds sstable to left and right
5) if e.g left is closed, adjusting backlog tracker will trigger an exception, and since that
happens in row cache update's execute(), node crashes.

The problem manifested as follow:
[shard 0: gms] raft_topology - Initiating tablet cleanup of 5739b9b0-49d4-11ef-828f-770894013415:15 on 102a904a-0b15-4661-ba3f-f9085a5ad03c:0
...
[shard 0:strm] compaction - [Split keyspace1.standard1 009e2f80-49e5-11ef-85e3-7161200fb137] Splitting [/var/lib/scylla/data/keyspace1/...]
...
[shard 0:strm] cache - Fatal error during cache update: std::out_of_range (Compaction state for table [0x600007772740] not found),
at: ...
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<void>, row_cache::do_update(...
   --------
   seastar::internal::do_with_state<std::tuple<row_cache::external_updater, std::function<seastar::future<void> ()> >, seastar::future<void> >
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::internal::coroutine_traits_base<void>::promise_type
   --------
   seastar::(anonymous namespace)::thread_wake_task
   --------
   seastar::continuation<seastar::internal::promise_base_with_type<sstables::compaction_result>, seastar::async<sstables::compaction::run(...
   seastar::continuation<seastar::internal::promise_base_with_type<sstables::compaction_result>, seastar::future<sstables::compaction_resu...

From the log above, it can be seen cache update failure happens under streaming sched group and
during compaction completion, which was good evidence to the cause.
Problem was reproduced locally with the help of tablet shuffling.

Fixes: #19873.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 5af1f41ecd)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-08-13 12:26:13 -03:00
Raphael S. Carvalho
31451ec2a0 replica: remove rwlock for protecting iteration over storage group map
rwlock was added to protect iterations against concurrent updates to the map.

the updates can happen when allocating a new tablet replica or removing an old one (tablet cleanup).

the rwlock is very problematic because it can result in topology changes blocked, as updating
token metadata takes the exclusive lock, which is serialized with table wide ops like
split / major / explicit flush (and those can take a long time).

to get rid of the lock, we can copy the storage group map and guard individual groups with a gate
(not a problem since map is expected to have a maximum of ~100 elements).
so cleanup can close that gate (carefully closed after stopping individual groups such that
migrations aren't blocked by long-running ops like major), and ongoing iterations (e.g. triggered
by nodetool flush) can skip a group that was closed, as such a group is being migrated out.

Check documentation added to compaction_group.hh to understand how
concurrent iterations and updates to the map work without the rwlock.

Yielding variants that iterate over groups are no longer returning group
id since id stability can no longer be guaranteed without serializing split
finalization and iteration.

Fixes #18821.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit c539b7c861)
2024-08-13 12:26:13 -03:00
Raphael S. Carvalho
f6da439ac6 replica: remove linear search when picking memtable_list for range scan with tablets
with tablets, we're expected to have a worst of ~100 tablets in a given
table and shard, so let's avoid linear search when looking for the
memtable_list in a range scan. we're bounded by ~100 elements, so
shouldn't be a big problem, but it's an inefficiency we can easily
get rid of.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#19286

(cherry picked from commit f143f5b90d)
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-08-13 12:26:13 -03:00
Raphael S. Carvalho
39eb44dfa0 replica: get rid of fragile compaction group intrusive list
It was added to make integration of storage groups easier, but it's
complicated since it's another source of truth and we could have
problems if it becomes inconsistent with the group map.

Fixes #18506.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit ad5c5bca5f)
2024-08-13 12:26:11 -03:00
Botond Dénes
b18d9e5d0d Merge '[Backport 6.0] make enable_compacting_data_for_streaming_and_repair truly live-update' from ScyllaDB
This config item is propagated to the table object via table::config. Although the field in `table::config`, used to propagate the value, was `utils::updateable_value<T>`, it was assigned a constant and so the live-update chain was broken.
This series fixes this and adds a test which fails before the patch and passes after. The test needed new test infrastructure, around the failure injection api, namely the ability to exfiltrate the value of internal variable. This infrastructure is also added in this series.

Fixes: https://github.com/scylladb/scylladb/issues/18674

- [x] This patch has to be backported because it fixes broken functionality

(cherry picked from commit dbccb61636)

(cherry picked from commit 4590026b38)

(cherry picked from commit feea609e37)

(cherry picked from commit 0c61b1822c)

(cherry picked from commit 8ef4fbdb87)

 Refs #18705

Closes scylladb/scylladb#19240

* github.com:scylladb/scylladb:
  test/topology_custom: add test for enable_compacting_data_for_streaming_and_repair live-update
  test/pylib: rest_client: add get_injection()
  api/error_injection: add getter for error_injection
  utils/error_injection: add set_parameter()
  replica/database: fix live-update enable_compacting_data_for_streaming_and_repair
2024-06-13 12:45:23 +03:00
Botond Dénes
0d13c51dd4 test/topology_custom: add test for enable_compacting_data_for_streaming_and_repair live-update
Avoid this the live-update feature of this config item breaking
silently.

(cherry picked from commit 8ef4fbdb87)
2024-06-11 17:32:37 +00:00
Raphael S. Carvalho
d4c3a43b34 replica: Refresh mutation source when allocating tablet replicas
Consider the following:

1) table A has N tablets and views
2) migration starts for a tablet of A from node 1 to 2.
3) migration is at write_both_read_old stage
4) coordinator will push writes to both nodes (pending and leaving)
5) A has view, so writes to it will also result in reads (table::push_view_replica_updates())
6) tablet's update_effective_replication_map() is not refreshing tablet sstable set (for new tablet migrating in)
7) so read on step 5 is not being able to find sstable set for tablet migrating in

Causes the following error:
"tablets - SSTable set wasn't found for tablet 21 of table mview.users"

which means loss of write on pending replica.

The fix will refresh the table's sstable set (tablet_sstable_set) and cache's snapshot.
It's not a problem to refresh the cache snapshot as long as the logical
state of the data hasn't changed, which is true when allocating new
tablet replicas. That's also done in the context of compactions for example.

Fixes #19052.
Fixes #19033.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 7b41630299)

Closes scylladb/scylladb#19229
2024-06-11 18:12:43 +03:00
Botond Dénes
341c29bd74 Merge '[Backport 6.0] storage_service: Fix race between tablet split and stats retrieval' from Raphael "Raph" Carvalho
Retrieval of tablet stats must be serialized with mutation to token metadata, as the former requires tablet id stability.
If tablet split is finalized while retrieving stats, the saved erm, used by all shards, can have a lower tablet count than the one in a particular shard, causing an abort as tablet map requires that any id feeded into it is lower than its current tablet count.

Fixes https://github.com/scylladb/scylladb/issues/18085.

(cherry picked from commit abcc68dbe7)

(cherry picked from commit 551bf9dd58)

(cherry picked from commit e7246751b6)

Refs https://github.com/scylladb/scylladb/pull/18287

Closes scylladb/scylladb#19095

* github.com:scylladb/scylladb:
  topology_experimental_raft/test_tablets: restore usage of check_with_down
  test: Fix flakiness in topology_experimental_raft/test_tablets
  service: Use tablet read selector to determine which replica to account table stats
  storage_service: Fix race between tablet split and stats retrieval
2024-06-05 13:06:32 +03:00
Raphael S. Carvalho
9a341a65af replica: Only consume memtable of the tablet intersecting with range read
storage_proxy is responsible for intersecting the range of the read
with tablets, and calling replica with a single tablet range, therefore
it makes sense to avoid touching memtables of tablets that don't
intersect with a particular range.

Note this is a performance issue, not correctness one, as memtable
readers that don't intersect with current range won't produce any
data, but cpu is wasted until that's realized (they're added to list
of readers in mutation_reader_merger, more allocations, more data
sources to peek into, etc).

That's also important for streaming e.g. after decommission, that
will consume one tablet at a time through a reader, so we don't want
memtables of streamed tablets (that weren't cleaned up yet) to
be consumed.

Refs #18904.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 832fb43fb4)

Closes scylladb/scylladb#18983
2024-06-04 16:17:47 +03:00
Raphael S. Carvalho
3cb71c5b88 replica: Fix race of tablet snapshot with compaction
tablet snapshot, used by migration, can race with compaction and
can find files deleted. That won't cause data loss because the
error is propagated back into the coordinator that decides to
retry streaming stage. So the consequence is delayed migration,
which might in turn reduce node operation throughput (e.g.
when decommissioning a node). It should be rare though, so
shouldn't have drastic consequences.

Fixes #18977.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit b396b05e20)

Closes scylladb/scylladb#19008
2024-06-03 12:21:52 +03:00
Raphael S. Carvalho
55a45e3486 storage_service: Fix race between tablet split and stats retrieval
If tablet split is finalized while retrieving stats, the saved erm, used by all
shards, will be invalidated. It can either cause incorrect behavior or
crash if id is not available.

It's worked by feeding local tablet map into the "coordinator"
collecting stats from all shards. We will also no longer have a snapshot
of erm shared between shards to help intra-node migration. This is
simplified by serializing token metadata changes and the retrieval of
the stats (latter should complete pretty fast, so it shouldn't block
the former for any significant time).

Fixes #18085.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit abcc68dbe7)
2024-05-27 18:21:21 +00:00
Raphael S. Carvalho
13f8486cd7 replica: Fix tablet's compaction_groups_for_token_range() with unowned range
File-based tablet streaming calls every shard to return data of every
group that intersects with a given range.
After dynamic group allocation, that breaks as the tablet range will
only be present in a single shard, so an exception is thrown causing
migration to halt during streaming phase.
Ideally, only one shard is invoked, but that's out of the scope of this
fix and compaction_groups_for_token_range() should return empty result
if none of the local groups intersect with the range.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit eb8ef38543)

Closes scylladb/scylladb#18859
2024-05-27 15:20:04 +03:00
Avi Kivity
52fe351c31 Merge 'Balance tablets within nodes (intra-node migration)' from Tomasz Grabiec
This is needed to avoid severe imbalance between shards which can
happen when some table grows and is split. The inter-node balance can
be equal, so inter-node migration cannot fix the imbalance. Also, if RF=N
then there is not even a possibility of moving tablets around to fix the imbalance.
The only way to bring the system to balance is to move tablets within the nodes.

The system is not prepared for intra-node migration currently. Request coordination
is host-based, while for intra-node migration it should be (also) shard-based.
The solution employed here is to keep the coordination between nodes as-is,
and for intra-node migration storage_proxy-level coordinator is not aware of
the migration (no pending host). The replica-side request handler will be a
second-level coordinator which routes requests to shards, similar to how
the first-level coordinator routes them to hosts.

Tablet sharder is adjusted to handle intra-migration where a tablet
can have two replicas on the same host. For reads, sharder uses the
read selector to resolve the conflict. For writes, the write selector
is used.

The old shard_of() API is kept to represent shard for reads, and new
method is introduced to query the shards for writing:
shard_for_writes(). All writers should be switched to that API, which
is not done in this patch yet.

The request handler on replica side acts as a second-level
coordinator, using sharder to determine routing to shards. A given
sharder has a scope of a single topology version, a single
effective_replication_map_ptr, which should be kept alive during
writes.

perf-simple-query test results show no signs of regression:

Command: perf-simple-query -c1 -m1G --write --tablets --duration=10

Before:

> 83294.81 tps ( 59.5 allocs/op,  14.3 tasks/op,   53725 insns/op,        0 errors)
> 87756.72 tps ( 59.5 allocs/op,  14.3 tasks/op,   54049 insns/op,        0 errors)
> 86428.47 tps ( 59.6 allocs/op,  14.3 tasks/op,   54208 insns/op,        0 errors)
> 86211.38 tps ( 59.7 allocs/op,  14.3 tasks/op,   54219 insns/op,        0 errors)
> 86559.89 tps ( 59.6 allocs/op,  14.3 tasks/op,   54188 insns/op,        0 errors)
> 86609.39 tps ( 59.6 allocs/op,  14.3 tasks/op,   54117 insns/op,        0 errors)
> 87464.06 tps ( 59.5 allocs/op,  14.3 tasks/op,   54039 insns/op,        0 errors)
> 86185.43 tps ( 59.6 allocs/op,  14.3 tasks/op,   54169 insns/op,        0 errors)
> 86254.71 tps ( 59.6 allocs/op,  14.3 tasks/op,   54139 insns/op,        0 errors)
> 83395.35 tps ( 60.2 allocs/op,  14.4 tasks/op,   54693 insns/op,        0 errors)
>
> median 86428.47 tps ( 59.6 allocs/op,  14.3 tasks/op,   54208 insns/op,        0 errors)
> median absolute deviation: 243.04
> maximum: 87756.72
> minimum: 83294.81
>

After:

> 85523.06 tps ( 59.5 allocs/op,  14.3 tasks/op,   53872 insns/op,        0 errors)
> 89362.47 tps ( 59.6 allocs/op,  14.3 tasks/op,   54226 insns/op,        0 errors)
> 88167.55 tps ( 59.7 allocs/op,  14.3 tasks/op,   54400 insns/op,        0 errors)
> 87044.40 tps ( 59.7 allocs/op,  14.3 tasks/op,   54310 insns/op,        0 errors)
> 88344.50 tps ( 59.6 allocs/op,  14.3 tasks/op,   54289 insns/op,        0 errors)
> 88355.06 tps ( 59.6 allocs/op,  14.3 tasks/op,   54242 insns/op,        0 errors)
> 88725.46 tps ( 59.6 allocs/op,  14.3 tasks/op,   54230 insns/op,        0 errors)
> 88640.08 tps ( 59.6 allocs/op,  14.3 tasks/op,   54210 insns/op,        0 errors)
> 90306.31 tps ( 59.4 allocs/op,  14.3 tasks/op,   54043 insns/op,        0 errors)
> 87343.62 tps ( 59.8 allocs/op,  14.3 tasks/op,   54496 insns/op,        0 errors)
>
> median 88355.06 tps ( 59.6 allocs/op,  14.3 tasks/op,   54242 insns/op,        0 errors)
> median absolute deviation: 1007.41
> maximum: 90306.31
> minimum: 85523.06

Command (reads): perf-simple-query -c1 -m1G  --tablets --duration=10

Before:

> 95860.18 tps ( 63.1 allocs/op,  14.1 tasks/op,   42476 insns/op,        0 errors)
> 97537.69 tps ( 63.1 allocs/op,  14.1 tasks/op,   42454 insns/op,        0 errors)
> 97549.23 tps ( 63.1 allocs/op,  14.1 tasks/op,   42470 insns/op,        0 errors)
> 97511.29 tps ( 63.1 allocs/op,  14.1 tasks/op,   42470 insns/op,        0 errors)
> 97227.32 tps ( 63.1 allocs/op,  14.1 tasks/op,   42471 insns/op,        0 errors)
> 94031.94 tps ( 63.1 allocs/op,  14.1 tasks/op,   42441 insns/op,        0 errors)
> 96978.04 tps ( 63.1 allocs/op,  14.1 tasks/op,   42462 insns/op,        0 errors)
> 96401.70 tps ( 63.1 allocs/op,  14.1 tasks/op,   42473 insns/op,        0 errors)
> 96573.77 tps ( 63.1 allocs/op,  14.1 tasks/op,   42440 insns/op,        0 errors)
> 96340.54 tps ( 63.1 allocs/op,  14.1 tasks/op,   42468 insns/op,        0 errors)
>
> median 96978.04 tps ( 63.1 allocs/op,  14.1 tasks/op,   42462 insns/op,        0 errors)
> median absolute deviation: 571.20
> maximum: 97549.23
> minimum: 94031.94
>

After:

> 99794.67 tps ( 63.1 allocs/op,  14.1 tasks/op,   42471 insns/op,        0 errors)
> 101244.99 tps ( 63.1 allocs/op,  14.1 tasks/op,   42472 insns/op,        0 errors)
> 101128.37 tps ( 63.1 allocs/op,  14.1 tasks/op,   42485 insns/op,        0 errors)
> 101065.27 tps ( 63.1 allocs/op,  14.1 tasks/op,   42465 insns/op,        0 errors)
> 101212.98 tps ( 63.1 allocs/op,  14.1 tasks/op,   42456 insns/op,        0 errors)
> 101413.31 tps ( 63.1 allocs/op,  14.1 tasks/op,   42463 insns/op,        0 errors)
> 101464.92 tps ( 63.1 allocs/op,  14.1 tasks/op,   42466 insns/op,        0 errors)
> 101086.74 tps ( 63.1 allocs/op,  14.1 tasks/op,   42488 insns/op,        0 errors)
> 101559.09 tps ( 63.1 allocs/op,  14.1 tasks/op,   42468 insns/op,        0 errors)
> 100742.58 tps ( 63.1 allocs/op,  14.1 tasks/op,   42491 insns/op,        0 errors)
>
> median 101212.98 tps ( 63.1 allocs/op,  14.1 tasks/op,   42456 insns/op,        0 errors)
> median absolute deviation: 200.33
> maximum: 101559.09
> minimum: 99794.67
>

Fixes #16594

Closes scylladb/scylladb#18026

* github.com:scylladb/scylladb:
  Implement fast streaming for intra-node migration
  test: tablets_test: Test sharding during intra-node migration
  test: tablets_test: Check sharding also on the pending host
  test: py: tablets: Test writes concurrent with migration
  test: py: tablets: Test crash during intra-node migration
  api, storage_service: Introduce API to wait for topology to quiesce
  dht, replica: Remove deprecated sharder APIs
  test: Avoid using deprecated sharded API
  db: do_apply_many() avoid deprecated sharded API
  replica: mutation_dump: Avoid deprecated sharder API
  repair: Avoid deprecated sharder API
  table: Remove optimization which returns empty reader when key is not owned by the shard
  dht: is_single_shard: Avoid deprecated sharder API
  dht: split_range_to_single_shard: Work with static_sharder only
  dht: ring_position_range_sharder: Avoid deprecated sharder APIs
  dht: token: Avoid use of deprecated sharder API by switching to static_sharder
  selective_token_sharder: Avoid use of deprecated sharder API
  docs: Document tablet sharding vs tablet replica placement
  readers/multishard.cc: use shard_for_reads() instead of shard_of()
  multishard_mutation_query.cc: use shard_for_reads() instead of shard_of()
  storage_proxy: Extract common code to apply mutations on many shards according to sharder
  storage_proxy: Prepare per-partition rate-limiting for intra-node migration
  storage_proxy: Avoid shard_of() use in mutate_counter_on_leader_and_replicate()
  storage_proxy: Prepare mutate_hint() for intra-node tablet migration
  commitlog_replayer: Avoid deprecated sharder::shard_of()
  lwt: Avoid deprecated sharder::shard_of()
  compaction: Avoid deprecated sharder::shard_of()
  dht: Extract dht::static_sharder
  replica: Deprecate table::shard_of()
  locator: Deprecate effective_replication_map::shard_of()
  dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard
  tests: tablets: py: Add intra-node migration test
  tests: tablets: Test that drained nodes are not balanced internally
  tests: tablets: Add checks of replica set validity to test_load_balancing_with_random_load
  tests: tablets: Verify that disabling balancing results in no intra-node migrations
  tests: tablets: Check that nodes are internally balanced
  tests: tablets: Improve debuggability by showing which rows are missing
  tablets, storage_service: Support intra-node migration in move_tablet() API
  tablet_allocator: Generate intra-node migration plan
  tablet_allocator: Extract make_internode_plan()
  tablet_allocator: Maintain candidate list and shard tablet count for target nodes
  tablet_allocator: Lift apply_load/can_accept_load lambdas to member functions
  tablets, streaming: Implement tablet streaming for intra-node migration
  dht, auto_refreshing_sharder: Allow overriding write selector
  multishard_writer: Handle intra-node migration
  storage_proxy: Handle intra-node tablet migration for writes
  tablets: Get rid of tablet_map::get_shard()
  tablets: Avoid tablet_map::get_shard in cleanup
  tablets: test: Use sharder instead of tablet_map::get_shard()
  tablets: tablet_sharder: Allow working with non-local host
  sharding: Prepare for intra-node-migration
  docs: Document sharder use for tablets
  tablets: Introduce tablet transition kind for intra-node migration
  tests: tablets: Fix use-after-move of skiplist in rebalance_tablets()
  sstables, gdb: Track readers in a linked list
  raft topology: Fix global token metadata barrier to not fence ahead of what is drained
2024-05-20 16:13:01 +03:00
Piotr Dulikowski
68eca3778c Merge 'mv: throttle view update generation for large queries' from Wojciech Mitros
This series is a reupload of #13792 with a few modifications, namely a test is added and the conflicts with recent tablet related changes are fixed.

See https://github.com/scylladb/scylladb/issues/12379 and https://github.com/scylladb/scylladb/pull/13583 for a detailed description of the problem and discussions.

This PR aims to extend the existing throttling mechanism to work with requests that internally generate a large amount of view updates, as suggested by @nyh.

The existing mechanism works in the following way:

* Client sends a request, we generate the view updates corresponding to the request and spawn background tasks which will send these updates to remote nodes
* Each background task consumes some units from the `view_update_concurrency_semaphore`, but doesn't wait for these units, it's just for tracking
* We keep track of the percent of consumed units on each node, this is called `view update backlog`.
* Before sending a response to the client we sleep for a short amount of time. The amount of time to sleep for is based on the fullness of this `view update backlog`. For a well behaved client with limited concurrency this will limit the amount of incoming requests to a manageable level.

This mechanism doesn't handle large DELETE queries. Deleting a partition is fast for the base table, but it requires us to generate a view update for every single deleted row. The number of deleted rows per single client request can be in the millions. Delaying response to the request doesn't help when a single request can generate millions of updates.

To deal with this we could treat the view update generator just like any other client and force it to wait a bit of time before sending the next batch of updates. The amount of time to wait for is calculated just like in the existing throttling code, it's based on the fullness of `view update backlogs`.

The new algorithm of view update generation looks something like this:
```c++
for(;;) {
    auto updates = generate_updates_batch_with_max_100_rows();
    co_await seastar::sleep(calculate_sleep_time_from_backlogs());
    spawn_background_tasks_for_updates(updates);
}
```
Fixes: https://github.com/scylladb/scylladb/issues/12379

Closes scylladb/scylladb#16819

* github.com:scylladb/scylladb:
  test: add test for bad_allocs during large mv queries
  mv: throttle view update generation for large queries
  exceptions: add read_write_timeout_exception, a subclass of request_timeout_exception
  db/view: extract view throttling delay calculation to a global function
  view_update_generator: add get_storage_proxy()
  storage_proxy: make view backlog getters public
2024-05-16 08:22:54 +02:00
Raphael S. Carvalho
715ae689c0 Implement fast streaming for intra-node migration
With intra-node migration, all the movement is local, so we can make
streaming faster by just cloning the sstable set of leaving replica
and loading it into the pending one.

This cloning is underlying storage specific, but s3 doesn't support
snapshot() yet (th sstables::storage procedure which clone is built
upon). It's only supported by file system, with help of hard links.
A new generation is picked for new cloned sstable, and it will
live in the same directory as the original.

A challenge I bumped into was to understand why table refused to
load the sstable at pending replica, as it considered them foreign.
Later I realized that sharder (for reads) at this stage of migration
will point only to leaving replica. It didn't fail with mutation
based streaming, because the sstable writer considers the shard --
that the sstable was written into -- as its owner, regardless of what
sharder says. That was fixed by mimicking this behavior during
loading at pending.

test:
./test.py --mode=dev intranode --repeat=100 passes.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
679baff25a dht, replica: Remove deprecated sharder APIs 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
7c03646f99 table: Remove optimization which returns empty reader when key is not owned by the shard
This check would lead to correctness issues with intra-node migration
because the shard may switch during read, from "read old" to "read
new". If the coordinator used "read old" for shard routing, but table
on the old shard is already using "read new" erm, such a read would
observe empty result, which is wrong.

Drop the optimization. In the scenario above, read will observe all
past writes because:

  1) writes are still using "write both"

  2) writes are switched to "write new" only after all requests which
  might be using "read old" are done

Replica-side coordinators should already route single-key requests to
the correct shard, so it's not important as an optimization.

This issue shows how assumptions about static sharding are embedded in
the current code base and how intra-node migration, by violating those
assumptions, can lead to correctness issues.
2024-05-16 00:28:47 +02:00
Tomasz Grabiec
dbca598e99 replica: Deprecate table::shard_of() 2024-05-16 00:28:47 +02:00
Tomasz Grabiec
6c6ce2d928 tablets: Get rid of tablet_map::get_shard()
Its semantics do not fit well with intra-node migration which allow
two owning shards. Replace uses with the new has_replica() API.
2024-05-16 00:28:46 +02:00
Amnon Heiman
0c84692c97 replica/table.cc: Add metrics per-table-per-node
This patch adds metrics that will be reported per-table per-node.
The added metrics (that are part of the per-table per-shard metrics)
are:
scylla_column_family_cache_hit_rate
scylla_column_family_read_latency
scylla_column_family_write_latency
scylla_column_family_live_disk_space

Fixes #18642

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes scylladb/scylladb#18645
2024-05-14 07:54:34 +03:00
Jan Ciolek
e0442d7bfa mv: throttle view update generation for large queries
For every mutation applied to the base table we have to
generate the corresponding materialized view table updates.

In case of simple requests, like INSERT or UPDATE, the number
of view updates generated per base table mutation is limited
to at most a few view table updates per base table update.

The situation is different for DELETE queries, which can delete
the whole partitions or clustering ranges. Range deletions are
fast on the base table, but for the view table the situation
is different. Deleting a single partition in the base table
will generate as many singular view updates as there are rows
in the deleted partition, which could potentially be in the millions.

To prevent OOM view updates are generated in batches of at most 100 rows.
There is a loop which generates the next batch of updates, spawns tasks
to send them to remote nodes, generates another batch and so on.

The problem is that there is no concurrency control - each batch is scheduled
to be sent in the background, but the following batch is generated without
waiting for the previously generated updates to be sent. This can lead to
unbounded concurrency and OOM.

To protect against this view update generation should be limited somehow.

There is an existing mechanism for limiting view updates - throttling.
We keep track of how many pending view updates there are, in the view backlog,
and delay responses to the client based on this backlog's fullness.
For a well behaved client with limited concurrency this will slow down
the amount of incoming requests until it reaches an optimal point.

This works for simple queries (INSERT, UPDATE, ...), but it doesn't do anything
for range DELETEs. A DELETE is a single request that generates millions of view
updates, delaying client response doesn't help.

The throttling mechanism could be extend to cover this case - we could treat the
DELETE request like any other client and force it to wait before sending more updates.

This commit implements this approach - before sending the next batch of updates
the generator is forced to sleep for a bit of time, calculated using the exisiting
throttling equation.
The more full the backlog gets the more the generator will have to sleep for,
and hopefully this will prevent overloading the system with view updates.

Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>
2024-05-13 18:16:23 +02:00
Botond Dénes
afa870a387 Merge 'Some sstable set related improvements' from Raphael "Raph" Carvalho
Closes scylladb/scylladb#18616

* github.com:scylladb/scylladb:
  replica: Make it explicit table's sstable set is immutable
  replica: avoid reallocations in tablet_sstable_set
  replica: Avoid compound set if only one sstable set is filled
2024-05-13 14:17:24 +03:00
Pavel Emelyanov
2ce643d06b table: Directly compare std::optional<shard_id> with shard_id
There's a loop that calculates the number of shard matches over a tablet
map. The check of the given shard against optional<shard> can be made
shorter.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18592
2024-05-13 13:25:05 +03:00
Raphael S. Carvalho
7faba69f28 replica: Make it explicit table's sstable set is immutable
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-05-10 11:58:08 -03:00
Raphael S. Carvalho
35a0d47408 replica: Avoid compound set if only one sstable set is filled
Most of the time only main set is filled, so we can avoid one layer
of indirection (= compound set) when maintenance set is empty.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2024-05-10 10:44:34 -03:00
Aleksandra Martyniuk
b4371a0ea0 replica: allocate storage groups dynamically
Currently empty storage_groups are allocated for tablets that are
not on this shard.

Allocate storage groups dynamically, i.e.:
- on table creation allocate only storage groups that are on this
  shard;
- allocate a storage group for tablet that is moved to this shard;
- deallocate storage group for tablet that is cleaned up.

Stop compaction group before it's deallocated.

Add a flag to table::cleanup_tablet deciding whether to deallocate
sgs and use it in commitlog tests.
2024-05-10 15:08:21 +02:00
Aleksandra Martyniuk
6e1e082e8c replica: refresh snapshot in compaction_group::cleanup
During compaction_group::cleanup sstables set is updated, but
row_cache::_underlaying still keeps a shared ptr to the old set.
Due to that descriptors to deleted sstables aren't closed.

Refresh snapshot in order to store new sstables set in _underlying
mutation source.
2024-05-10 14:56:38 +02:00
Aleksandra Martyniuk
c283746b32 replica: add rwlock to storage_group_manager
Add rwlock which prevents storage groups from being added/deleted
while some other layers itereates over them (or their compaction
groups).

Add methods to iterate over storage groups with the lock held.
2024-05-10 14:56:38 +02:00
Aleksandra Martyniuk
532653f118 replica: replace table::as_table_state
Replace table::as_table_state with table::try_get_table_state_with_static_sharding
which throws if a table does not use static sharding.
2024-05-10 14:56:38 +02:00
Aleksandra Martyniuk
cf9913b0b7 compaction: pass compaction group id to reshape_compaction_group
Pass compaction group id to
shard_reshaping_compaction_task_impl::reshape_compaction_group.
Modify table::as_table_state to return table_state of the given
compaction group.
2024-05-10 14:56:38 +02:00
Aleksandra Martyniuk
90d618d8c9 replica: open code get_compaction_group in perform_cleanup_compaction
Open code get_compaction_group in table::perform_cleanup_compaction
as its definition won't be relevant once storage groups are allocated
dynamically.
2024-05-10 14:56:38 +02:00
Aleksandra Martyniuk
8505389963 replica: drop single_compaction_group_if_available
Drop single_compaction_group_if_available as it's unused.
2024-05-10 14:56:38 +02:00
Avi Kivity
37d32a5f8b Merge 'Cleanup inactive reads on tablet migration' from Botond Dénes
When a tablet is migrated away, any inactive read which might be reading from said tablet, has to be dropped. Otherwise these inactive reads can prevent sstables from being removed and these sstables can potentially survive until the tablet is migrated back and resurrect data.
This series introduces the fix as well as a reproducer test.

Fixes: https://github.com/scylladb/scylladb/issues/18110

Closes scylladb/scylladb#18179

* github.com:scylladb/scylladb:
  test: add test for cleaning up cached querier on tablet migration
  querier: allow injecting cache entry ttl by error injector
  replica/table: cleanup_tablet(): clear inactive reads for the tablet
  replica/database: introduce clear_inactive_reads_for_tablet()
  replica/database: introduce foreach_reader_concurrency_semaphore
  reader_concurrency_semaphore: add range param to evict_inactive_reads_for_table()
  reader_concurrency_semaphore: allow storing a range with the inactive reader
  reader_concurrency_semaphore: avoid detach() in inactive_read_handle::abandon()
2024-05-09 17:34:49 +03:00
Pavel Emelyanov
677e80a4d5 table: Coroutinize table::delete_sstables_atomically()
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18499
2024-05-07 17:10:28 +02:00
Raphael S. Carvalho
b980634ff2 test: Verify tablet cleanup is properly retried on failure
Doesn't test only coordinator ability to retry on failure, but also
that replica will be able to properly continue cleanup of a storage
group from where it left off (when failure happened), not leave any
sstables behind.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#18426
2024-04-30 19:27:17 +02:00
Botond Dénes
03995d9397 replica/table: cleanup_tablet(): clear inactive reads for the tablet
To avoid any resource surviving the cleanup, via some inactive read
pinning it. This can cause data resurrection if the tablet is later
migrated back and the pinned data source is added back to the tablet.
2024-04-30 01:47:16 -04:00
Botond Dénes
044fd7a3ec Merge 'Move some view updating methods from table to view_update_generator' from Pavel Emelyanov
The populate_views() and generate_and_propagate_view_updates() both naturally belong to view_update_generator -- they don't need anything special from table itself, but rather depend on some internals of the v.u.generator itself.

Moving them there lets removing the view concurrency semaphore from keyspace and table, thus reducing the cross-components dependencies.

Closes scylladb/scylladb#18421

* github.com:scylladb/scylladb:
  replica: Do not carry view concurrency semaphore pointer around
  view: Get concurrency semaphore via database, not table
  view_update_generator: Mark mutate_MV() private
  view: Move view_update_generator methods' code
  view: Move table::generate_and_propagate_view_updates into view code
  view: Move table::populate_views() into view_update_generator class
2024-04-26 10:55:38 +03:00
Raphael S. Carvalho
4a5fdc5814 table: Remove outdated FIXME about sstable spanning multiple tablets
The FIXME was added back then because we thought the interface of
compaction_group_for_sstable might have to be adjusted if a sstable
were allowed to temporarily span multiple tablets until it's split,
but we have gone a different path.
If a sstable's key range incorrectly spans more than one tablet,
that will be considered a bug and an exception is thrown.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#18410
2024-04-25 17:21:11 +03:00
Pavel Emelyanov
18cc2cfa31 replica: Generalize snapshot details for single table/snapshot dir
There are two places that get total:live stats for a table snapshot --
database::get_snapshot_details() and table::get_snapshot_details(). Both
do pretty similar thing -- walk the table/snapshots/ directory, then
each of the found sub-directory and accumulate the found files' sizes
into snapshot details structure.

Both try to tell total from live sizes by checking whether an sstable
component found in snapshots is present in the table datadir. The
database code does it in a more correct way -- not just checks the file
presense, but also compares if it's a hardlink on the snapshot file,
while the table code just checks if the file of the same name exists.

This patch does both -- makes both database and table call the same
helper method for a single snapshot details, and makes the generalized
version use more elaborated collision check, thus fixing the per-table
details getting behavior.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18347
2024-04-25 17:12:42 +03:00
Pavel Emelyanov
bc4552740f view: Move view_update_generator methods' code
Now when the two methods belong to another class, move the code itself
to db/view , where the class itself resides.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-25 14:24:20 +03:00
Pavel Emelyanov
c2bf6b43b2 view: Move table::generate_and_propagate_view_updates into view code
Similarly to populate_views() method, this one also naturally belongs to
view_update_generator class.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-25 14:20:06 +03:00
Pavel Emelyanov
670c7c925c view: Move table::populate_views() into view_update_generator class
The method in question has little to do with table, effectively it only
needs stats and consurrency semaphore. And the semaphore in question is
obtained from table indirectly, it really resides on database. On the
other hand, the method carries lots of bits from db::view, e.g. the
view_update_builder class, memory_usage_of() helper and a bit more.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-25 14:17:20 +03:00
Pavel Emelyanov
f5f57dc817 table: No need to open directory in snapshot_exists()
In order to check if a snapshot of a certain name exists the checking
method opens directory. It can be made with more lightweight call.

Also, though not critical, is that it fogets to close it.

Coroutinuze the method while at it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18365
2024-04-23 17:19:24 +03:00
Pavel Emelyanov
be5bc38cde table: Use directory semaphore from sstables manager
It's natural for a table to itarate over its sstables, get the semaphore
from the manager of sstables, not from database.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-19 13:53:57 +03:00
Pavel Emelyanov
7e7dd2649b table: Indentation fix after previous patch
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-19 13:53:57 +03:00
Pavel Emelyanov
2fced3c557 table: Use directory_semaphore for rate-limited snapshot taking
The table::take_snapshot() limits its parallelizm with the help of
direcoty semaphore already, but implements it "by hand". There's already
parallel_for_each() method on the dir.sem. class that does exactly that.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-19 13:53:57 +03:00
Pavel Emelyanov
0d2178202d table: Use smp::all_cpus() to iterate over all CPUs locally
Currently it uses irange(0, smp::count0), but seastar provides
convenient helper call for the very same range object.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2024-04-19 13:53:57 +03:00
Pavel Emelyanov
5cf53e670d replica: Remove unused ex variable from table::take_snapshot
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#18215
2024-04-16 07:16:38 +03:00
Raphael S. Carvalho
9f93dd9fa3 replica: Use flat_hash_map for tablet storage
The reason that we want to switch to flat_hash_map is that only a small
subset of tablets will be allocated on any given shard, therefore it's
wasteful to use a sparse array, and iterations are slow.
Also, the map gives greater development flexibility as one doesn't have
to worry about empty entries.

perf result:

-- reads

scylla_with_chunked_vector-read-no-tablets.txt
median 73223.28 tps ( 62.3 allocs/op,  13.3 tasks/op,   41932 insns/op,        0 errors)
median 74952.87 tps ( 62.3 allocs/op,  13.3 tasks/op,   41969 insns/op,        0 errors)
median 73016.37 tps ( 62.3 allocs/op,  13.3 tasks/op,   41934 insns/op,        0 errors)
median 74078.14 tps ( 62.3 allocs/op,  13.3 tasks/op,   41938 insns/op,        0 errors)
median 75323.07 tps ( 62.3 allocs/op,  13.3 tasks/op,   41944 insns/op,        0 errors)

scylla_with_hash_map-read-no-tablets.txt
median 74963.30 tps ( 62.3 allocs/op,  13.3 tasks/op,   41926 insns/op,        0 errors)
median 74032.09 tps ( 62.3 allocs/op,  13.3 tasks/op,   41918 insns/op,        0 errors)
median 74850.09 tps ( 62.3 allocs/op,  13.3 tasks/op,   41937 insns/op,        0 errors)
median 74239.37 tps ( 62.3 allocs/op,  13.3 tasks/op,   41921 insns/op,        0 errors)
median 74798.14 tps ( 62.3 allocs/op,  13.3 tasks/op,   41925 insns/op,        0 errors)

scylla_with_chunked_vector-read-tablets-1.txt
median 74234.27 tps ( 62.1 allocs/op,  13.3 tasks/op,   41903 insns/op,        0 errors)
median 75775.98 tps ( 62.1 allocs/op,  13.3 tasks/op,   41910 insns/op,        0 errors)
median 76481.56 tps ( 62.1 allocs/op,  13.2 tasks/op,   41874 insns/op,        0 errors)
median 74056.67 tps ( 62.1 allocs/op,  13.3 tasks/op,   41894 insns/op,        0 errors)
median 75287.68 tps ( 62.1 allocs/op,  13.3 tasks/op,   41894 insns/op,        0 errors)

scylla_with_hash_map-read-tablets-1.txt
median 75613.63 tps ( 62.1 allocs/op,  13.2 tasks/op,   41990 insns/op,        0 errors)
median 74819.51 tps ( 62.1 allocs/op,  13.2 tasks/op,   41973 insns/op,        0 errors)
median 75648.41 tps ( 62.1 allocs/op,  13.3 tasks/op,   42025 insns/op,        0 errors)
median 74170.89 tps ( 62.1 allocs/op,  13.2 tasks/op,   42002 insns/op,        0 errors)
median 75447.72 tps ( 62.1 allocs/op,  13.3 tasks/op,   41952 insns/op,        0 errors)

scylla_with_chunked_vector-read-tablets-128.txt
median 73788.57 tps ( 62.1 allocs/op,  13.2 tasks/op,   41956 insns/op,        0 errors)
median 76563.63 tps ( 62.1 allocs/op,  13.3 tasks/op,   42006 insns/op,        0 errors)
median 75536.12 tps ( 62.1 allocs/op,  13.2 tasks/op,   42005 insns/op,        0 errors)
median 74679.17 tps ( 62.1 allocs/op,  13.3 tasks/op,   41958 insns/op,        0 errors)
median 75380.95 tps ( 62.1 allocs/op,  13.2 tasks/op,   41946 insns/op,        0 errors)

scylla_with_hash_map-read-tablets-128.txt
median 75459.99 tps ( 62.1 allocs/op,  13.3 tasks/op,   42055 insns/op,        0 errors)
median 74280.11 tps ( 62.1 allocs/op,  13.3 tasks/op,   42085 insns/op,        0 errors)
median 74502.61 tps ( 62.1 allocs/op,  13.3 tasks/op,   42063 insns/op,        0 errors)
median 74692.27 tps ( 62.1 allocs/op,  13.3 tasks/op,   41994 insns/op,        0 errors)
median 75402.64 tps ( 62.1 allocs/op,  13.3 tasks/op,   42015 insns/op,        0 errors)

-- writes

scylla_with_chunked_vector-write-no-tablets.txt
median 68635.17 tps ( 58.4 allocs/op,  13.3 tasks/op,   52709 insns/op,        0 errors)
median 68716.36 tps ( 58.4 allocs/op,  13.3 tasks/op,   52691 insns/op,        0 errors)
median 68512.76 tps ( 58.4 allocs/op,  13.3 tasks/op,   52721 insns/op,        0 errors)
median 68606.14 tps ( 58.4 allocs/op,  13.3 tasks/op,   52696 insns/op,        0 errors)
median 68619.25 tps ( 58.4 allocs/op,  13.3 tasks/op,   52697 insns/op,        0 errors)

scylla_with_hash_map-write-no-tablets.txt
median 67678.10 tps ( 58.4 allocs/op,  13.3 tasks/op,   52723 insns/op,        0 errors)
median 67966.06 tps ( 58.4 allocs/op,  13.3 tasks/op,   52736 insns/op,        0 errors)
median 67881.47 tps ( 58.4 allocs/op,  13.3 tasks/op,   52743 insns/op,        0 errors)
median 67856.81 tps ( 58.4 allocs/op,  13.3 tasks/op,   52730 insns/op,        0 errors)
median 67812.58 tps ( 58.4 allocs/op,  13.3 tasks/op,   52740 insns/op,        0 errors)

scylla_with_chunked_vector-write-tablets-1.txt
median 67741.83 tps ( 58.4 allocs/op,  13.3 tasks/op,   53425 insns/op,        0 errors)
median 68014.20 tps ( 58.4 allocs/op,  13.3 tasks/op,   53455 insns/op,        0 errors)
median 68228.48 tps ( 58.4 allocs/op,  13.3 tasks/op,   53447 insns/op,        0 errors)
median 67950.96 tps ( 58.4 allocs/op,  13.3 tasks/op,   53443 insns/op,        0 errors)
median 67832.69 tps ( 58.4 allocs/op,  13.3 tasks/op,   53462 insns/op,        0 errors)

scylla_with_hash_map-write-tablets-1.txt
median 66873.70 tps ( 58.4 allocs/op,  13.3 tasks/op,   53548 insns/op,        0 errors)
median 67568.23 tps ( 58.4 allocs/op,  13.3 tasks/op,   53547 insns/op,        0 errors)
median 67653.70 tps ( 58.4 allocs/op,  13.3 tasks/op,   53525 insns/op,        0 errors)
median 67389.21 tps ( 58.4 allocs/op,  13.3 tasks/op,   53536 insns/op,        0 errors)
median 67437.91 tps ( 58.4 allocs/op,  13.3 tasks/op,   53537 insns/op,        0 errors)

scylla_with_chunked_vector-write-tablets-128.txt
median 67115.41 tps ( 58.3 allocs/op,  13.3 tasks/op,   53341 insns/op,        0 errors)
median 66836.07 tps ( 58.3 allocs/op,  13.3 tasks/op,   53342 insns/op,        0 errors)
median 67214.07 tps ( 58.3 allocs/op,  13.3 tasks/op,   53303 insns/op,        0 errors)
median 67198.25 tps ( 58.3 allocs/op,  13.3 tasks/op,   53347 insns/op,        0 errors)
median 67368.78 tps ( 58.3 allocs/op,  13.3 tasks/op,   53374 insns/op,        0 errors)

scylla_with_hash_map-write-tablets-128.txt
median 66273.50 tps ( 58.3 allocs/op,  13.3 tasks/op,   53400 insns/op,        0 errors)
median 66564.89 tps ( 58.3 allocs/op,  13.3 tasks/op,   53432 insns/op,        0 errors)
median 66568.52 tps ( 58.3 allocs/op,  13.3 tasks/op,   53408 insns/op,        0 errors)
median 66368.00 tps ( 58.3 allocs/op,  13.3 tasks/op,   53441 insns/op,        0 errors)
median 66293.55 tps ( 58.3 allocs/op,  13.3 tasks/op,   53408 insns/op,        0 errors)

Fixes #18010.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#18093
2024-04-04 16:25:48 +03:00
Raphael S. Carvalho
29f9f7594f replica: Kill table::storage_group_id_for_token()
storage_group_id_for_token() was only needed from within
tablet_storage_group_manager, so we can kill
table::storage_group_id_for_token().

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#18134
2024-04-02 09:23:23 +03:00