scylladb

Author	SHA1	Message	Date
Raphael S. Carvalho	721f0ff46d	replica: Fix race between split compaction and migration After removal of rwlock (`53a6ec05ed`), the race was introduced because the order that compaction groups of a tablet are closed, is no longer deterministic. Some background first: Split compaction runs in main (unsplit) group, and adds sstable to left and right groups on completion. The race works as follow: 1) split compaction starts on main group of tablet X 2) tablet X reaches cleanup stage, so its compaction groups are closed in parallel 3) left or right group are closed before main (more likely when only main has flush work to do) 4) split compaction completes, and adds sstable to left and right 5) if e.g left is closed, adjusting backlog tracker will trigger an exception, and since that happens in row cache update's execute(), node crashes. The problem manifested as follow: [shard 0: gms] raft_topology - Initiating tablet cleanup of 5739b9b0-49d4-11ef-828f-770894013415:15 on 102a904a-0b15-4661-ba3f-f9085a5ad03c:0 ... [shard 0:strm] compaction - [Split keyspace1.standard1 009e2f80-49e5-11ef-85e3-7161200fb137] Splitting [/var/lib/scylla/data/keyspace1/...] ... [shard 0:strm] cache - Fatal error during cache update: std::out_of_range (Compaction state for table [0x600007772740] not found), at: ... -------- seastar::continuation<seastar::internal::promise_base_with_type<void>, row_cache::do_update(... -------- seastar::internal::do_with_state<std::tuple<row_cache::external_updater, std::function<seastar::future<void> ()> >, seastar::future<void> > -------- seastar::internal::coroutine_traits_base<void>::promise_type -------- seastar::internal::coroutine_traits_base<void>::promise_type -------- seastar::(anonymous namespace)::thread_wake_task -------- seastar::continuation<seastar::internal::promise_base_with_type<sstables::compaction_result>, seastar::async<sstables::compaction::run(... seastar::continuation<seastar::internal::promise_base_with_type<sstables::compaction_result>, seastar::future<sstables::compaction_resu... From the log above, it can be seen cache update failure happens under streaming sched group and during compaction completion, which was good evidence to the cause. Problem was reproduced locally with the help of tablet shuffling. Fixes: #19873. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `5af1f41ecd`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-08-13 12:26:13 -03:00
Raphael S. Carvalho	31451ec2a0	replica: remove rwlock for protecting iteration over storage group map rwlock was added to protect iterations against concurrent updates to the map. the updates can happen when allocating a new tablet replica or removing an old one (tablet cleanup). the rwlock is very problematic because it can result in topology changes blocked, as updating token metadata takes the exclusive lock, which is serialized with table wide ops like split / major / explicit flush (and those can take a long time). to get rid of the lock, we can copy the storage group map and guard individual groups with a gate (not a problem since map is expected to have a maximum of ~100 elements). so cleanup can close that gate (carefully closed after stopping individual groups such that migrations aren't blocked by long-running ops like major), and ongoing iterations (e.g. triggered by nodetool flush) can skip a group that was closed, as such a group is being migrated out. Check documentation added to compaction_group.hh to understand how concurrent iterations and updates to the map work without the rwlock. Yielding variants that iterate over groups are no longer returning group id since id stability can no longer be guaranteed without serializing split finalization and iteration. Fixes #18821. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `c539b7c861`)	2024-08-13 12:26:13 -03:00
Raphael S. Carvalho	f6da439ac6	replica: remove linear search when picking memtable_list for range scan with tablets with tablets, we're expected to have a worst of ~100 tablets in a given table and shard, so let's avoid linear search when looking for the memtable_list in a range scan. we're bounded by ~100 elements, so shouldn't be a big problem, but it's an inefficiency we can easily get rid of. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#19286 (cherry picked from commit `f143f5b90d`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-08-13 12:26:13 -03:00
Raphael S. Carvalho	39eb44dfa0	replica: get rid of fragile compaction group intrusive list It was added to make integration of storage groups easier, but it's complicated since it's another source of truth and we could have problems if it becomes inconsistent with the group map. Fixes #18506. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `ad5c5bca5f`)	2024-08-13 12:26:11 -03:00
Botond Dénes	b18d9e5d0d	Merge '[Backport 6.0] make enable_compacting_data_for_streaming_and_repair truly live-update' from ScyllaDB This config item is propagated to the table object via table::config. Although the field in `table::config`, used to propagate the value, was `utils::updateable_value<T>`, it was assigned a constant and so the live-update chain was broken. This series fixes this and adds a test which fails before the patch and passes after. The test needed new test infrastructure, around the failure injection api, namely the ability to exfiltrate the value of internal variable. This infrastructure is also added in this series. Fixes: https://github.com/scylladb/scylladb/issues/18674 - [x] This patch has to be backported because it fixes broken functionality (cherry picked from commit `dbccb61636`) (cherry picked from commit `4590026b38`) (cherry picked from commit `feea609e37`) (cherry picked from commit `0c61b1822c`) (cherry picked from commit `8ef4fbdb87`) Refs #18705 Closes scylladb/scylladb#19240 * github.com:scylladb/scylladb: test/topology_custom: add test for enable_compacting_data_for_streaming_and_repair live-update test/pylib: rest_client: add get_injection() api/error_injection: add getter for error_injection utils/error_injection: add set_parameter() replica/database: fix live-update enable_compacting_data_for_streaming_and_repair	2024-06-13 12:45:23 +03:00
Botond Dénes	0d13c51dd4	test/topology_custom: add test for enable_compacting_data_for_streaming_and_repair live-update Avoid this the live-update feature of this config item breaking silently. (cherry picked from commit `8ef4fbdb87`)	2024-06-11 17:32:37 +00:00
Raphael S. Carvalho	d4c3a43b34	replica: Refresh mutation source when allocating tablet replicas Consider the following: 1) table A has N tablets and views 2) migration starts for a tablet of A from node 1 to 2. 3) migration is at write_both_read_old stage 4) coordinator will push writes to both nodes (pending and leaving) 5) A has view, so writes to it will also result in reads (table::push_view_replica_updates()) 6) tablet's update_effective_replication_map() is not refreshing tablet sstable set (for new tablet migrating in) 7) so read on step 5 is not being able to find sstable set for tablet migrating in Causes the following error: "tablets - SSTable set wasn't found for tablet 21 of table mview.users" which means loss of write on pending replica. The fix will refresh the table's sstable set (tablet_sstable_set) and cache's snapshot. It's not a problem to refresh the cache snapshot as long as the logical state of the data hasn't changed, which is true when allocating new tablet replicas. That's also done in the context of compactions for example. Fixes #19052. Fixes #19033. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `7b41630299`) Closes scylladb/scylladb#19229	2024-06-11 18:12:43 +03:00
Botond Dénes	341c29bd74	Merge '[Backport 6.0] storage_service: Fix race between tablet split and stats retrieval' from Raphael "Raph" Carvalho Retrieval of tablet stats must be serialized with mutation to token metadata, as the former requires tablet id stability. If tablet split is finalized while retrieving stats, the saved erm, used by all shards, can have a lower tablet count than the one in a particular shard, causing an abort as tablet map requires that any id feeded into it is lower than its current tablet count. Fixes https://github.com/scylladb/scylladb/issues/18085. (cherry picked from commit `abcc68dbe7`) (cherry picked from commit `551bf9dd58`) (cherry picked from commit `e7246751b6`) Refs https://github.com/scylladb/scylladb/pull/18287 Closes scylladb/scylladb#19095 * github.com:scylladb/scylladb: topology_experimental_raft/test_tablets: restore usage of check_with_down test: Fix flakiness in topology_experimental_raft/test_tablets service: Use tablet read selector to determine which replica to account table stats storage_service: Fix race between tablet split and stats retrieval	2024-06-05 13:06:32 +03:00
Raphael S. Carvalho	9a341a65af	replica: Only consume memtable of the tablet intersecting with range read storage_proxy is responsible for intersecting the range of the read with tablets, and calling replica with a single tablet range, therefore it makes sense to avoid touching memtables of tablets that don't intersect with a particular range. Note this is a performance issue, not correctness one, as memtable readers that don't intersect with current range won't produce any data, but cpu is wasted until that's realized (they're added to list of readers in mutation_reader_merger, more allocations, more data sources to peek into, etc). That's also important for streaming e.g. after decommission, that will consume one tablet at a time through a reader, so we don't want memtables of streamed tablets (that weren't cleaned up yet) to be consumed. Refs #18904. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `832fb43fb4`) Closes scylladb/scylladb#18983	2024-06-04 16:17:47 +03:00
Raphael S. Carvalho	3cb71c5b88	replica: Fix race of tablet snapshot with compaction tablet snapshot, used by migration, can race with compaction and can find files deleted. That won't cause data loss because the error is propagated back into the coordinator that decides to retry streaming stage. So the consequence is delayed migration, which might in turn reduce node operation throughput (e.g. when decommissioning a node). It should be rare though, so shouldn't have drastic consequences. Fixes #18977. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `b396b05e20`) Closes scylladb/scylladb#19008	2024-06-03 12:21:52 +03:00
Raphael S. Carvalho	55a45e3486	storage_service: Fix race between tablet split and stats retrieval If tablet split is finalized while retrieving stats, the saved erm, used by all shards, will be invalidated. It can either cause incorrect behavior or crash if id is not available. It's worked by feeding local tablet map into the "coordinator" collecting stats from all shards. We will also no longer have a snapshot of erm shared between shards to help intra-node migration. This is simplified by serializing token metadata changes and the retrieval of the stats (latter should complete pretty fast, so it shouldn't block the former for any significant time). Fixes #18085. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `abcc68dbe7`)	2024-05-27 18:21:21 +00:00
Raphael S. Carvalho	13f8486cd7	replica: Fix tablet's compaction_groups_for_token_range() with unowned range File-based tablet streaming calls every shard to return data of every group that intersects with a given range. After dynamic group allocation, that breaks as the tablet range will only be present in a single shard, so an exception is thrown causing migration to halt during streaming phase. Ideally, only one shard is invoked, but that's out of the scope of this fix and compaction_groups_for_token_range() should return empty result if none of the local groups intersect with the range. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `eb8ef38543`) Closes scylladb/scylladb#18859	2024-05-27 15:20:04 +03:00
Avi Kivity	52fe351c31	Merge 'Balance tablets within nodes (intra-node migration)' from Tomasz Grabiec This is needed to avoid severe imbalance between shards which can happen when some table grows and is split. The inter-node balance can be equal, so inter-node migration cannot fix the imbalance. Also, if RF=N then there is not even a possibility of moving tablets around to fix the imbalance. The only way to bring the system to balance is to move tablets within the nodes. The system is not prepared for intra-node migration currently. Request coordination is host-based, while for intra-node migration it should be (also) shard-based. The solution employed here is to keep the coordination between nodes as-is, and for intra-node migration storage_proxy-level coordinator is not aware of the migration (no pending host). The replica-side request handler will be a second-level coordinator which routes requests to shards, similar to how the first-level coordinator routes them to hosts. Tablet sharder is adjusted to handle intra-migration where a tablet can have two replicas on the same host. For reads, sharder uses the read selector to resolve the conflict. For writes, the write selector is used. The old shard_of() API is kept to represent shard for reads, and new method is introduced to query the shards for writing: shard_for_writes(). All writers should be switched to that API, which is not done in this patch yet. The request handler on replica side acts as a second-level coordinator, using sharder to determine routing to shards. A given sharder has a scope of a single topology version, a single effective_replication_map_ptr, which should be kept alive during writes. perf-simple-query test results show no signs of regression: Command: perf-simple-query -c1 -m1G --write --tablets --duration=10 Before: > 83294.81 tps ( 59.5 allocs/op, 14.3 tasks/op, 53725 insns/op, 0 errors) > 87756.72 tps ( 59.5 allocs/op, 14.3 tasks/op, 54049 insns/op, 0 errors) > 86428.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54208 insns/op, 0 errors) > 86211.38 tps ( 59.7 allocs/op, 14.3 tasks/op, 54219 insns/op, 0 errors) > 86559.89 tps ( 59.6 allocs/op, 14.3 tasks/op, 54188 insns/op, 0 errors) > 86609.39 tps ( 59.6 allocs/op, 14.3 tasks/op, 54117 insns/op, 0 errors) > 87464.06 tps ( 59.5 allocs/op, 14.3 tasks/op, 54039 insns/op, 0 errors) > 86185.43 tps ( 59.6 allocs/op, 14.3 tasks/op, 54169 insns/op, 0 errors) > 86254.71 tps ( 59.6 allocs/op, 14.3 tasks/op, 54139 insns/op, 0 errors) > 83395.35 tps ( 60.2 allocs/op, 14.4 tasks/op, 54693 insns/op, 0 errors) > > median 86428.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54208 insns/op, 0 errors) > median absolute deviation: 243.04 > maximum: 87756.72 > minimum: 83294.81 > After: > 85523.06 tps ( 59.5 allocs/op, 14.3 tasks/op, 53872 insns/op, 0 errors) > 89362.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54226 insns/op, 0 errors) > 88167.55 tps ( 59.7 allocs/op, 14.3 tasks/op, 54400 insns/op, 0 errors) > 87044.40 tps ( 59.7 allocs/op, 14.3 tasks/op, 54310 insns/op, 0 errors) > 88344.50 tps ( 59.6 allocs/op, 14.3 tasks/op, 54289 insns/op, 0 errors) > 88355.06 tps ( 59.6 allocs/op, 14.3 tasks/op, 54242 insns/op, 0 errors) > 88725.46 tps ( 59.6 allocs/op, 14.3 tasks/op, 54230 insns/op, 0 errors) > 88640.08 tps ( 59.6 allocs/op, 14.3 tasks/op, 54210 insns/op, 0 errors) > 90306.31 tps ( 59.4 allocs/op, 14.3 tasks/op, 54043 insns/op, 0 errors) > 87343.62 tps ( 59.8 allocs/op, 14.3 tasks/op, 54496 insns/op, 0 errors) > > median 88355.06 tps ( 59.6 allocs/op, 14.3 tasks/op, 54242 insns/op, 0 errors) > median absolute deviation: 1007.41 > maximum: 90306.31 > minimum: 85523.06 Command (reads): perf-simple-query -c1 -m1G --tablets --duration=10 Before: > 95860.18 tps ( 63.1 allocs/op, 14.1 tasks/op, 42476 insns/op, 0 errors) > 97537.69 tps ( 63.1 allocs/op, 14.1 tasks/op, 42454 insns/op, 0 errors) > 97549.23 tps ( 63.1 allocs/op, 14.1 tasks/op, 42470 insns/op, 0 errors) > 97511.29 tps ( 63.1 allocs/op, 14.1 tasks/op, 42470 insns/op, 0 errors) > 97227.32 tps ( 63.1 allocs/op, 14.1 tasks/op, 42471 insns/op, 0 errors) > 94031.94 tps ( 63.1 allocs/op, 14.1 tasks/op, 42441 insns/op, 0 errors) > 96978.04 tps ( 63.1 allocs/op, 14.1 tasks/op, 42462 insns/op, 0 errors) > 96401.70 tps ( 63.1 allocs/op, 14.1 tasks/op, 42473 insns/op, 0 errors) > 96573.77 tps ( 63.1 allocs/op, 14.1 tasks/op, 42440 insns/op, 0 errors) > 96340.54 tps ( 63.1 allocs/op, 14.1 tasks/op, 42468 insns/op, 0 errors) > > median 96978.04 tps ( 63.1 allocs/op, 14.1 tasks/op, 42462 insns/op, 0 errors) > median absolute deviation: 571.20 > maximum: 97549.23 > minimum: 94031.94 > After: > 99794.67 tps ( 63.1 allocs/op, 14.1 tasks/op, 42471 insns/op, 0 errors) > 101244.99 tps ( 63.1 allocs/op, 14.1 tasks/op, 42472 insns/op, 0 errors) > 101128.37 tps ( 63.1 allocs/op, 14.1 tasks/op, 42485 insns/op, 0 errors) > 101065.27 tps ( 63.1 allocs/op, 14.1 tasks/op, 42465 insns/op, 0 errors) > 101212.98 tps ( 63.1 allocs/op, 14.1 tasks/op, 42456 insns/op, 0 errors) > 101413.31 tps ( 63.1 allocs/op, 14.1 tasks/op, 42463 insns/op, 0 errors) > 101464.92 tps ( 63.1 allocs/op, 14.1 tasks/op, 42466 insns/op, 0 errors) > 101086.74 tps ( 63.1 allocs/op, 14.1 tasks/op, 42488 insns/op, 0 errors) > 101559.09 tps ( 63.1 allocs/op, 14.1 tasks/op, 42468 insns/op, 0 errors) > 100742.58 tps ( 63.1 allocs/op, 14.1 tasks/op, 42491 insns/op, 0 errors) > > median 101212.98 tps ( 63.1 allocs/op, 14.1 tasks/op, 42456 insns/op, 0 errors) > median absolute deviation: 200.33 > maximum: 101559.09 > minimum: 99794.67 > Fixes #16594 Closes scylladb/scylladb#18026 * github.com:scylladb/scylladb: Implement fast streaming for intra-node migration test: tablets_test: Test sharding during intra-node migration test: tablets_test: Check sharding also on the pending host test: py: tablets: Test writes concurrent with migration test: py: tablets: Test crash during intra-node migration api, storage_service: Introduce API to wait for topology to quiesce dht, replica: Remove deprecated sharder APIs test: Avoid using deprecated sharded API db: do_apply_many() avoid deprecated sharded API replica: mutation_dump: Avoid deprecated sharder API repair: Avoid deprecated sharder API table: Remove optimization which returns empty reader when key is not owned by the shard dht: is_single_shard: Avoid deprecated sharder API dht: split_range_to_single_shard: Work with static_sharder only dht: ring_position_range_sharder: Avoid deprecated sharder APIs dht: token: Avoid use of deprecated sharder API by switching to static_sharder selective_token_sharder: Avoid use of deprecated sharder API docs: Document tablet sharding vs tablet replica placement readers/multishard.cc: use shard_for_reads() instead of shard_of() multishard_mutation_query.cc: use shard_for_reads() instead of shard_of() storage_proxy: Extract common code to apply mutations on many shards according to sharder storage_proxy: Prepare per-partition rate-limiting for intra-node migration storage_proxy: Avoid shard_of() use in mutate_counter_on_leader_and_replicate() storage_proxy: Prepare mutate_hint() for intra-node tablet migration commitlog_replayer: Avoid deprecated sharder::shard_of() lwt: Avoid deprecated sharder::shard_of() compaction: Avoid deprecated sharder::shard_of() dht: Extract dht::static_sharder replica: Deprecate table::shard_of() locator: Deprecate effective_replication_map::shard_of() dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard tests: tablets: py: Add intra-node migration test tests: tablets: Test that drained nodes are not balanced internally tests: tablets: Add checks of replica set validity to test_load_balancing_with_random_load tests: tablets: Verify that disabling balancing results in no intra-node migrations tests: tablets: Check that nodes are internally balanced tests: tablets: Improve debuggability by showing which rows are missing tablets, storage_service: Support intra-node migration in move_tablet() API tablet_allocator: Generate intra-node migration plan tablet_allocator: Extract make_internode_plan() tablet_allocator: Maintain candidate list and shard tablet count for target nodes tablet_allocator: Lift apply_load/can_accept_load lambdas to member functions tablets, streaming: Implement tablet streaming for intra-node migration dht, auto_refreshing_sharder: Allow overriding write selector multishard_writer: Handle intra-node migration storage_proxy: Handle intra-node tablet migration for writes tablets: Get rid of tablet_map::get_shard() tablets: Avoid tablet_map::get_shard in cleanup tablets: test: Use sharder instead of tablet_map::get_shard() tablets: tablet_sharder: Allow working with non-local host sharding: Prepare for intra-node-migration docs: Document sharder use for tablets tablets: Introduce tablet transition kind for intra-node migration tests: tablets: Fix use-after-move of skiplist in rebalance_tablets() sstables, gdb: Track readers in a linked list raft topology: Fix global token metadata barrier to not fence ahead of what is drained	2024-05-20 16:13:01 +03:00
Piotr Dulikowski	68eca3778c	Merge 'mv: throttle view update generation for large queries' from Wojciech Mitros This series is a reupload of #13792 with a few modifications, namely a test is added and the conflicts with recent tablet related changes are fixed. See https://github.com/scylladb/scylladb/issues/12379 and https://github.com/scylladb/scylladb/pull/13583 for a detailed description of the problem and discussions. This PR aims to extend the existing throttling mechanism to work with requests that internally generate a large amount of view updates, as suggested by @nyh. The existing mechanism works in the following way: * Client sends a request, we generate the view updates corresponding to the request and spawn background tasks which will send these updates to remote nodes * Each background task consumes some units from the `view_update_concurrency_semaphore`, but doesn't wait for these units, it's just for tracking * We keep track of the percent of consumed units on each node, this is called `view update backlog`. * Before sending a response to the client we sleep for a short amount of time. The amount of time to sleep for is based on the fullness of this `view update backlog`. For a well behaved client with limited concurrency this will limit the amount of incoming requests to a manageable level. This mechanism doesn't handle large DELETE queries. Deleting a partition is fast for the base table, but it requires us to generate a view update for every single deleted row. The number of deleted rows per single client request can be in the millions. Delaying response to the request doesn't help when a single request can generate millions of updates. To deal with this we could treat the view update generator just like any other client and force it to wait a bit of time before sending the next batch of updates. The amount of time to wait for is calculated just like in the existing throttling code, it's based on the fullness of `view update backlogs`. The new algorithm of view update generation looks something like this: ```c++ for(;;) { auto updates = generate_updates_batch_with_max_100_rows(); co_await seastar::sleep(calculate_sleep_time_from_backlogs()); spawn_background_tasks_for_updates(updates); } ``` Fixes: https://github.com/scylladb/scylladb/issues/12379 Closes scylladb/scylladb#16819 * github.com:scylladb/scylladb: test: add test for bad_allocs during large mv queries mv: throttle view update generation for large queries exceptions: add read_write_timeout_exception, a subclass of request_timeout_exception db/view: extract view throttling delay calculation to a global function view_update_generator: add get_storage_proxy() storage_proxy: make view backlog getters public	2024-05-16 08:22:54 +02:00
Raphael S. Carvalho	715ae689c0	Implement fast streaming for intra-node migration With intra-node migration, all the movement is local, so we can make streaming faster by just cloning the sstable set of leaving replica and loading it into the pending one. This cloning is underlying storage specific, but s3 doesn't support snapshot() yet (th sstables::storage procedure which clone is built upon). It's only supported by file system, with help of hard links. A new generation is picked for new cloned sstable, and it will live in the same directory as the original. A challenge I bumped into was to understand why table refused to load the sstable at pending replica, as it considered them foreign. Later I realized that sharder (for reads) at this stage of migration will point only to leaving replica. It didn't fail with mutation based streaming, because the sstable writer considers the shard -- that the sstable was written into -- as its owner, regardless of what sharder says. That was fixed by mimicking this behavior during loading at pending. test: ./test.py --mode=dev intranode --repeat=100 passes. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	679baff25a	dht, replica: Remove deprecated sharder APIs	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	7c03646f99	table: Remove optimization which returns empty reader when key is not owned by the shard This check would lead to correctness issues with intra-node migration because the shard may switch during read, from "read old" to "read new". If the coordinator used "read old" for shard routing, but table on the old shard is already using "read new" erm, such a read would observe empty result, which is wrong. Drop the optimization. In the scenario above, read will observe all past writes because: 1) writes are still using "write both" 2) writes are switched to "write new" only after all requests which might be using "read old" are done Replica-side coordinators should already route single-key requests to the correct shard, so it's not important as an optimization. This issue shows how assumptions about static sharding are embedded in the current code base and how intra-node migration, by violating those assumptions, can lead to correctness issues.	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	dbca598e99	replica: Deprecate table::shard_of()	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	6c6ce2d928	tablets: Get rid of tablet_map::get_shard() Its semantics do not fit well with intra-node migration which allow two owning shards. Replace uses with the new has_replica() API.	2024-05-16 00:28:46 +02:00
Amnon Heiman	0c84692c97	replica/table.cc: Add metrics per-table-per-node This patch adds metrics that will be reported per-table per-node. The added metrics (that are part of the per-table per-shard metrics) are: scylla_column_family_cache_hit_rate scylla_column_family_read_latency scylla_column_family_write_latency scylla_column_family_live_disk_space Fixes #18642 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Closes scylladb/scylladb#18645	2024-05-14 07:54:34 +03:00
Jan Ciolek	e0442d7bfa	mv: throttle view update generation for large queries For every mutation applied to the base table we have to generate the corresponding materialized view table updates. In case of simple requests, like INSERT or UPDATE, the number of view updates generated per base table mutation is limited to at most a few view table updates per base table update. The situation is different for DELETE queries, which can delete the whole partitions or clustering ranges. Range deletions are fast on the base table, but for the view table the situation is different. Deleting a single partition in the base table will generate as many singular view updates as there are rows in the deleted partition, which could potentially be in the millions. To prevent OOM view updates are generated in batches of at most 100 rows. There is a loop which generates the next batch of updates, spawns tasks to send them to remote nodes, generates another batch and so on. The problem is that there is no concurrency control - each batch is scheduled to be sent in the background, but the following batch is generated without waiting for the previously generated updates to be sent. This can lead to unbounded concurrency and OOM. To protect against this view update generation should be limited somehow. There is an existing mechanism for limiting view updates - throttling. We keep track of how many pending view updates there are, in the view backlog, and delay responses to the client based on this backlog's fullness. For a well behaved client with limited concurrency this will slow down the amount of incoming requests until it reaches an optimal point. This works for simple queries (INSERT, UPDATE, ...), but it doesn't do anything for range DELETEs. A DELETE is a single request that generates millions of view updates, delaying client response doesn't help. The throttling mechanism could be extend to cover this case - we could treat the DELETE request like any other client and force it to wait before sending more updates. This commit implements this approach - before sending the next batch of updates the generator is forced to sleep for a bit of time, calculated using the exisiting throttling equation. The more full the backlog gets the more the generator will have to sleep for, and hopefully this will prevent overloading the system with view updates. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2024-05-13 18:16:23 +02:00
Botond Dénes	afa870a387	Merge 'Some sstable set related improvements' from Raphael "Raph" Carvalho Closes scylladb/scylladb#18616 * github.com:scylladb/scylladb: replica: Make it explicit table's sstable set is immutable replica: avoid reallocations in tablet_sstable_set replica: Avoid compound set if only one sstable set is filled	2024-05-13 14:17:24 +03:00
Pavel Emelyanov	2ce643d06b	table: Directly compare std::optional<shard_id> with shard_id There's a loop that calculates the number of shard matches over a tablet map. The check of the given shard against optional<shard> can be made shorter. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#18592	2024-05-13 13:25:05 +03:00
Raphael S. Carvalho	7faba69f28	replica: Make it explicit table's sstable set is immutable Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-05-10 11:58:08 -03:00
Raphael S. Carvalho	35a0d47408	replica: Avoid compound set if only one sstable set is filled Most of the time only main set is filled, so we can avoid one layer of indirection (= compound set) when maintenance set is empty. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-05-10 10:44:34 -03:00
Aleksandra Martyniuk	b4371a0ea0	replica: allocate storage groups dynamically Currently empty storage_groups are allocated for tablets that are not on this shard. Allocate storage groups dynamically, i.e.: - on table creation allocate only storage groups that are on this shard; - allocate a storage group for tablet that is moved to this shard; - deallocate storage group for tablet that is cleaned up. Stop compaction group before it's deallocated. Add a flag to table::cleanup_tablet deciding whether to deallocate sgs and use it in commitlog tests.	2024-05-10 15:08:21 +02:00
Aleksandra Martyniuk	6e1e082e8c	replica: refresh snapshot in compaction_group::cleanup During compaction_group::cleanup sstables set is updated, but row_cache::_underlaying still keeps a shared ptr to the old set. Due to that descriptors to deleted sstables aren't closed. Refresh snapshot in order to store new sstables set in _underlying mutation source.	2024-05-10 14:56:38 +02:00
Aleksandra Martyniuk	c283746b32	replica: add rwlock to storage_group_manager Add rwlock which prevents storage groups from being added/deleted while some other layers itereates over them (or their compaction groups). Add methods to iterate over storage groups with the lock held.	2024-05-10 14:56:38 +02:00
Aleksandra Martyniuk	532653f118	replica: replace table::as_table_state Replace table::as_table_state with table::try_get_table_state_with_static_sharding which throws if a table does not use static sharding.	2024-05-10 14:56:38 +02:00
Aleksandra Martyniuk	cf9913b0b7	compaction: pass compaction group id to reshape_compaction_group Pass compaction group id to shard_reshaping_compaction_task_impl::reshape_compaction_group. Modify table::as_table_state to return table_state of the given compaction group.	2024-05-10 14:56:38 +02:00
Aleksandra Martyniuk	90d618d8c9	replica: open code get_compaction_group in perform_cleanup_compaction Open code get_compaction_group in table::perform_cleanup_compaction as its definition won't be relevant once storage groups are allocated dynamically.	2024-05-10 14:56:38 +02:00
Aleksandra Martyniuk	8505389963	replica: drop single_compaction_group_if_available Drop single_compaction_group_if_available as it's unused.	2024-05-10 14:56:38 +02:00
Avi Kivity	37d32a5f8b	Merge 'Cleanup inactive reads on tablet migration' from Botond Dénes When a tablet is migrated away, any inactive read which might be reading from said tablet, has to be dropped. Otherwise these inactive reads can prevent sstables from being removed and these sstables can potentially survive until the tablet is migrated back and resurrect data. This series introduces the fix as well as a reproducer test. Fixes: https://github.com/scylladb/scylladb/issues/18110 Closes scylladb/scylladb#18179 * github.com:scylladb/scylladb: test: add test for cleaning up cached querier on tablet migration querier: allow injecting cache entry ttl by error injector replica/table: cleanup_tablet(): clear inactive reads for the tablet replica/database: introduce clear_inactive_reads_for_tablet() replica/database: introduce foreach_reader_concurrency_semaphore reader_concurrency_semaphore: add range param to evict_inactive_reads_for_table() reader_concurrency_semaphore: allow storing a range with the inactive reader reader_concurrency_semaphore: avoid detach() in inactive_read_handle::abandon()	2024-05-09 17:34:49 +03:00
Pavel Emelyanov	677e80a4d5	table: Coroutinize table::delete_sstables_atomically() Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#18499	2024-05-07 17:10:28 +02:00
Raphael S. Carvalho	b980634ff2	test: Verify tablet cleanup is properly retried on failure Doesn't test only coordinator ability to retry on failure, but also that replica will be able to properly continue cleanup of a storage group from where it left off (when failure happened), not leave any sstables behind. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#18426	2024-04-30 19:27:17 +02:00
Botond Dénes	03995d9397	replica/table: cleanup_tablet(): clear inactive reads for the tablet To avoid any resource surviving the cleanup, via some inactive read pinning it. This can cause data resurrection if the tablet is later migrated back and the pinned data source is added back to the tablet.	2024-04-30 01:47:16 -04:00
Botond Dénes	044fd7a3ec	Merge 'Move some view updating methods from table to view_update_generator' from Pavel Emelyanov The populate_views() and generate_and_propagate_view_updates() both naturally belong to view_update_generator -- they don't need anything special from table itself, but rather depend on some internals of the v.u.generator itself. Moving them there lets removing the view concurrency semaphore from keyspace and table, thus reducing the cross-components dependencies. Closes scylladb/scylladb#18421 * github.com:scylladb/scylladb: replica: Do not carry view concurrency semaphore pointer around view: Get concurrency semaphore via database, not table view_update_generator: Mark mutate_MV() private view: Move view_update_generator methods' code view: Move table::generate_and_propagate_view_updates into view code view: Move table::populate_views() into view_update_generator class	2024-04-26 10:55:38 +03:00
Raphael S. Carvalho	4a5fdc5814	table: Remove outdated FIXME about sstable spanning multiple tablets The FIXME was added back then because we thought the interface of compaction_group_for_sstable might have to be adjusted if a sstable were allowed to temporarily span multiple tablets until it's split, but we have gone a different path. If a sstable's key range incorrectly spans more than one tablet, that will be considered a bug and an exception is thrown. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#18410	2024-04-25 17:21:11 +03:00
Pavel Emelyanov	18cc2cfa31	replica: Generalize snapshot details for single table/snapshot dir There are two places that get total:live stats for a table snapshot -- database::get_snapshot_details() and table::get_snapshot_details(). Both do pretty similar thing -- walk the table/snapshots/ directory, then each of the found sub-directory and accumulate the found files' sizes into snapshot details structure. Both try to tell total from live sizes by checking whether an sstable component found in snapshots is present in the table datadir. The database code does it in a more correct way -- not just checks the file presense, but also compares if it's a hardlink on the snapshot file, while the table code just checks if the file of the same name exists. This patch does both -- makes both database and table call the same helper method for a single snapshot details, and makes the generalized version use more elaborated collision check, thus fixing the per-table details getting behavior. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#18347	2024-04-25 17:12:42 +03:00
Pavel Emelyanov	bc4552740f	view: Move view_update_generator methods' code Now when the two methods belong to another class, move the code itself to db/view , where the class itself resides. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-25 14:24:20 +03:00
Pavel Emelyanov	c2bf6b43b2	view: Move table::generate_and_propagate_view_updates into view code Similarly to populate_views() method, this one also naturally belongs to view_update_generator class. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-25 14:20:06 +03:00
Pavel Emelyanov	670c7c925c	view: Move table::populate_views() into view_update_generator class The method in question has little to do with table, effectively it only needs stats and consurrency semaphore. And the semaphore in question is obtained from table indirectly, it really resides on database. On the other hand, the method carries lots of bits from db::view, e.g. the view_update_builder class, memory_usage_of() helper and a bit more. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-25 14:17:20 +03:00
Pavel Emelyanov	f5f57dc817	table: No need to open directory in snapshot_exists() In order to check if a snapshot of a certain name exists the checking method opens directory. It can be made with more lightweight call. Also, though not critical, is that it fogets to close it. Coroutinuze the method while at it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#18365	2024-04-23 17:19:24 +03:00
Pavel Emelyanov	be5bc38cde	table: Use directory semaphore from sstables manager It's natural for a table to itarate over its sstables, get the semaphore from the manager of sstables, not from database. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-19 13:53:57 +03:00
Pavel Emelyanov	7e7dd2649b	table: Indentation fix after previous patch Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-19 13:53:57 +03:00
Pavel Emelyanov	2fced3c557	table: Use directory_semaphore for rate-limited snapshot taking The table::take_snapshot() limits its parallelizm with the help of direcoty semaphore already, but implements it "by hand". There's already parallel_for_each() method on the dir.sem. class that does exactly that. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-19 13:53:57 +03:00
Pavel Emelyanov	0d2178202d	table: Use smp::all_cpus() to iterate over all CPUs locally Currently it uses irange(0, smp::count0), but seastar provides convenient helper call for the very same range object. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-19 13:53:57 +03:00
Pavel Emelyanov	5cf53e670d	replica: Remove unused ex variable from table::take_snapshot Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#18215	2024-04-16 07:16:38 +03:00
Raphael S. Carvalho	9f93dd9fa3	replica: Use flat_hash_map for tablet storage The reason that we want to switch to flat_hash_map is that only a small subset of tablets will be allocated on any given shard, therefore it's wasteful to use a sparse array, and iterations are slow. Also, the map gives greater development flexibility as one doesn't have to worry about empty entries. perf result: -- reads scylla_with_chunked_vector-read-no-tablets.txt median 73223.28 tps ( 62.3 allocs/op, 13.3 tasks/op, 41932 insns/op, 0 errors) median 74952.87 tps ( 62.3 allocs/op, 13.3 tasks/op, 41969 insns/op, 0 errors) median 73016.37 tps ( 62.3 allocs/op, 13.3 tasks/op, 41934 insns/op, 0 errors) median 74078.14 tps ( 62.3 allocs/op, 13.3 tasks/op, 41938 insns/op, 0 errors) median 75323.07 tps ( 62.3 allocs/op, 13.3 tasks/op, 41944 insns/op, 0 errors) scylla_with_hash_map-read-no-tablets.txt median 74963.30 tps ( 62.3 allocs/op, 13.3 tasks/op, 41926 insns/op, 0 errors) median 74032.09 tps ( 62.3 allocs/op, 13.3 tasks/op, 41918 insns/op, 0 errors) median 74850.09 tps ( 62.3 allocs/op, 13.3 tasks/op, 41937 insns/op, 0 errors) median 74239.37 tps ( 62.3 allocs/op, 13.3 tasks/op, 41921 insns/op, 0 errors) median 74798.14 tps ( 62.3 allocs/op, 13.3 tasks/op, 41925 insns/op, 0 errors) scylla_with_chunked_vector-read-tablets-1.txt median 74234.27 tps ( 62.1 allocs/op, 13.3 tasks/op, 41903 insns/op, 0 errors) median 75775.98 tps ( 62.1 allocs/op, 13.3 tasks/op, 41910 insns/op, 0 errors) median 76481.56 tps ( 62.1 allocs/op, 13.2 tasks/op, 41874 insns/op, 0 errors) median 74056.67 tps ( 62.1 allocs/op, 13.3 tasks/op, 41894 insns/op, 0 errors) median 75287.68 tps ( 62.1 allocs/op, 13.3 tasks/op, 41894 insns/op, 0 errors) scylla_with_hash_map-read-tablets-1.txt median 75613.63 tps ( 62.1 allocs/op, 13.2 tasks/op, 41990 insns/op, 0 errors) median 74819.51 tps ( 62.1 allocs/op, 13.2 tasks/op, 41973 insns/op, 0 errors) median 75648.41 tps ( 62.1 allocs/op, 13.3 tasks/op, 42025 insns/op, 0 errors) median 74170.89 tps ( 62.1 allocs/op, 13.2 tasks/op, 42002 insns/op, 0 errors) median 75447.72 tps ( 62.1 allocs/op, 13.3 tasks/op, 41952 insns/op, 0 errors) scylla_with_chunked_vector-read-tablets-128.txt median 73788.57 tps ( 62.1 allocs/op, 13.2 tasks/op, 41956 insns/op, 0 errors) median 76563.63 tps ( 62.1 allocs/op, 13.3 tasks/op, 42006 insns/op, 0 errors) median 75536.12 tps ( 62.1 allocs/op, 13.2 tasks/op, 42005 insns/op, 0 errors) median 74679.17 tps ( 62.1 allocs/op, 13.3 tasks/op, 41958 insns/op, 0 errors) median 75380.95 tps ( 62.1 allocs/op, 13.2 tasks/op, 41946 insns/op, 0 errors) scylla_with_hash_map-read-tablets-128.txt median 75459.99 tps ( 62.1 allocs/op, 13.3 tasks/op, 42055 insns/op, 0 errors) median 74280.11 tps ( 62.1 allocs/op, 13.3 tasks/op, 42085 insns/op, 0 errors) median 74502.61 tps ( 62.1 allocs/op, 13.3 tasks/op, 42063 insns/op, 0 errors) median 74692.27 tps ( 62.1 allocs/op, 13.3 tasks/op, 41994 insns/op, 0 errors) median 75402.64 tps ( 62.1 allocs/op, 13.3 tasks/op, 42015 insns/op, 0 errors) -- writes scylla_with_chunked_vector-write-no-tablets.txt median 68635.17 tps ( 58.4 allocs/op, 13.3 tasks/op, 52709 insns/op, 0 errors) median 68716.36 tps ( 58.4 allocs/op, 13.3 tasks/op, 52691 insns/op, 0 errors) median 68512.76 tps ( 58.4 allocs/op, 13.3 tasks/op, 52721 insns/op, 0 errors) median 68606.14 tps ( 58.4 allocs/op, 13.3 tasks/op, 52696 insns/op, 0 errors) median 68619.25 tps ( 58.4 allocs/op, 13.3 tasks/op, 52697 insns/op, 0 errors) scylla_with_hash_map-write-no-tablets.txt median 67678.10 tps ( 58.4 allocs/op, 13.3 tasks/op, 52723 insns/op, 0 errors) median 67966.06 tps ( 58.4 allocs/op, 13.3 tasks/op, 52736 insns/op, 0 errors) median 67881.47 tps ( 58.4 allocs/op, 13.3 tasks/op, 52743 insns/op, 0 errors) median 67856.81 tps ( 58.4 allocs/op, 13.3 tasks/op, 52730 insns/op, 0 errors) median 67812.58 tps ( 58.4 allocs/op, 13.3 tasks/op, 52740 insns/op, 0 errors) scylla_with_chunked_vector-write-tablets-1.txt median 67741.83 tps ( 58.4 allocs/op, 13.3 tasks/op, 53425 insns/op, 0 errors) median 68014.20 tps ( 58.4 allocs/op, 13.3 tasks/op, 53455 insns/op, 0 errors) median 68228.48 tps ( 58.4 allocs/op, 13.3 tasks/op, 53447 insns/op, 0 errors) median 67950.96 tps ( 58.4 allocs/op, 13.3 tasks/op, 53443 insns/op, 0 errors) median 67832.69 tps ( 58.4 allocs/op, 13.3 tasks/op, 53462 insns/op, 0 errors) scylla_with_hash_map-write-tablets-1.txt median 66873.70 tps ( 58.4 allocs/op, 13.3 tasks/op, 53548 insns/op, 0 errors) median 67568.23 tps ( 58.4 allocs/op, 13.3 tasks/op, 53547 insns/op, 0 errors) median 67653.70 tps ( 58.4 allocs/op, 13.3 tasks/op, 53525 insns/op, 0 errors) median 67389.21 tps ( 58.4 allocs/op, 13.3 tasks/op, 53536 insns/op, 0 errors) median 67437.91 tps ( 58.4 allocs/op, 13.3 tasks/op, 53537 insns/op, 0 errors) scylla_with_chunked_vector-write-tablets-128.txt median 67115.41 tps ( 58.3 allocs/op, 13.3 tasks/op, 53341 insns/op, 0 errors) median 66836.07 tps ( 58.3 allocs/op, 13.3 tasks/op, 53342 insns/op, 0 errors) median 67214.07 tps ( 58.3 allocs/op, 13.3 tasks/op, 53303 insns/op, 0 errors) median 67198.25 tps ( 58.3 allocs/op, 13.3 tasks/op, 53347 insns/op, 0 errors) median 67368.78 tps ( 58.3 allocs/op, 13.3 tasks/op, 53374 insns/op, 0 errors) scylla_with_hash_map-write-tablets-128.txt median 66273.50 tps ( 58.3 allocs/op, 13.3 tasks/op, 53400 insns/op, 0 errors) median 66564.89 tps ( 58.3 allocs/op, 13.3 tasks/op, 53432 insns/op, 0 errors) median 66568.52 tps ( 58.3 allocs/op, 13.3 tasks/op, 53408 insns/op, 0 errors) median 66368.00 tps ( 58.3 allocs/op, 13.3 tasks/op, 53441 insns/op, 0 errors) median 66293.55 tps ( 58.3 allocs/op, 13.3 tasks/op, 53408 insns/op, 0 errors) Fixes #18010. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#18093	2024-04-04 16:25:48 +03:00
Raphael S. Carvalho	29f9f7594f	replica: Kill table::storage_group_id_for_token() storage_group_id_for_token() was only needed from within tablet_storage_group_manager, so we can kill table::storage_group_id_for_token(). Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#18134	2024-04-02 09:23:23 +03:00

1 2 3 4 5 ...

477 Commits