scylladb

Author	SHA1	Message	Date
Lakshmi Narayanan Sreethar	5d8543221b	replica: fix copy constructor of tablet_sstable_set Remove the existing copy constructor to enable the use of the implicit copy constructor. This fixes the issue of `_sstable_set_ids` not being copied in the current copy constructor. Fixes #19519 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> (cherry picked from commit `44583eed9e`)	2024-08-19 12:13:10 +00:00
Raphael S. Carvalho	721f0ff46d	replica: Fix race between split compaction and migration After removal of rwlock (`53a6ec05ed`), the race was introduced because the order that compaction groups of a tablet are closed, is no longer deterministic. Some background first: Split compaction runs in main (unsplit) group, and adds sstable to left and right groups on completion. The race works as follow: 1) split compaction starts on main group of tablet X 2) tablet X reaches cleanup stage, so its compaction groups are closed in parallel 3) left or right group are closed before main (more likely when only main has flush work to do) 4) split compaction completes, and adds sstable to left and right 5) if e.g left is closed, adjusting backlog tracker will trigger an exception, and since that happens in row cache update's execute(), node crashes. The problem manifested as follow: [shard 0: gms] raft_topology - Initiating tablet cleanup of 5739b9b0-49d4-11ef-828f-770894013415:15 on 102a904a-0b15-4661-ba3f-f9085a5ad03c:0 ... [shard 0:strm] compaction - [Split keyspace1.standard1 009e2f80-49e5-11ef-85e3-7161200fb137] Splitting [/var/lib/scylla/data/keyspace1/...] ... [shard 0:strm] cache - Fatal error during cache update: std::out_of_range (Compaction state for table [0x600007772740] not found), at: ... -------- seastar::continuation<seastar::internal::promise_base_with_type<void>, row_cache::do_update(... -------- seastar::internal::do_with_state<std::tuple<row_cache::external_updater, std::function<seastar::future<void> ()> >, seastar::future<void> > -------- seastar::internal::coroutine_traits_base<void>::promise_type -------- seastar::internal::coroutine_traits_base<void>::promise_type -------- seastar::(anonymous namespace)::thread_wake_task -------- seastar::continuation<seastar::internal::promise_base_with_type<sstables::compaction_result>, seastar::async<sstables::compaction::run(... seastar::continuation<seastar::internal::promise_base_with_type<sstables::compaction_result>, seastar::future<sstables::compaction_resu... From the log above, it can be seen cache update failure happens under streaming sched group and during compaction completion, which was good evidence to the cause. Problem was reproduced locally with the help of tablet shuffling. Fixes: #19873. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `5af1f41ecd`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-08-13 12:26:13 -03:00
Raphael S. Carvalho	31451ec2a0	replica: remove rwlock for protecting iteration over storage group map rwlock was added to protect iterations against concurrent updates to the map. the updates can happen when allocating a new tablet replica or removing an old one (tablet cleanup). the rwlock is very problematic because it can result in topology changes blocked, as updating token metadata takes the exclusive lock, which is serialized with table wide ops like split / major / explicit flush (and those can take a long time). to get rid of the lock, we can copy the storage group map and guard individual groups with a gate (not a problem since map is expected to have a maximum of ~100 elements). so cleanup can close that gate (carefully closed after stopping individual groups such that migrations aren't blocked by long-running ops like major), and ongoing iterations (e.g. triggered by nodetool flush) can skip a group that was closed, as such a group is being migrated out. Check documentation added to compaction_group.hh to understand how concurrent iterations and updates to the map work without the rwlock. Yielding variants that iterate over groups are no longer returning group id since id stability can no longer be guaranteed without serializing split finalization and iteration. Fixes #18821. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `c539b7c861`)	2024-08-13 12:26:13 -03:00
Raphael S. Carvalho	f6da439ac6	replica: remove linear search when picking memtable_list for range scan with tablets with tablets, we're expected to have a worst of ~100 tablets in a given table and shard, so let's avoid linear search when looking for the memtable_list in a range scan. we're bounded by ~100 elements, so shouldn't be a big problem, but it's an inefficiency we can easily get rid of. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#19286 (cherry picked from commit `f143f5b90d`) Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-08-13 12:26:13 -03:00
Raphael S. Carvalho	39eb44dfa0	replica: get rid of fragile compaction group intrusive list It was added to make integration of storage groups easier, but it's complicated since it's another source of truth and we could have problems if it becomes inconsistent with the group map. Fixes #18506. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `ad5c5bca5f`)	2024-08-13 12:26:11 -03:00
Michael Litvak	ccd01caed8	db: fix waiting for counter update operations on table stop When a table is dropped it should wait for all pending operations in the table before the table is destroyed, because the operations may use the table's resources. With counter update operations, currently this is not the case. The table may be destroyed while there is a counter update operation in progress, causing an assert to be triggered due to a resource being destroyed while it's in use. The reason the operation is not waited for is a mistake in the lifetime management of the object representing the write in progress. The commit fixes it so the object lives for the duration of the entire counter update operation, by moving it to the `do_with` list. Fixes scylladb/scylla-enterprise#4475 Closes scylladb/scylladb#20017	2024-08-05 12:52:32 +02:00
Botond Dénes	052cef2621	reader_concurrency_semaphore: add cpu_concurrency constructor parameter In the case of the user semaphore, this receives the new reader_concurrency_semaphore_cpu_limit config item. Not used yet. (cherry picked from commit `59faa6d4ff`)	2024-07-08 08:12:20 +03:00
Raphael S. Carvalho	56f551f740	replica: don't expose compaction_group to reshape task compaction_group sits in replica layer and compaction layer is supposed to talk to it through compaction::table_state only. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `b8bd4c51c2`)	2024-06-20 20:41:41 +00:00
Gleb Natapov	0e49180cef	topology coordinator: add more trace level logging for debugging Add more logging that provide more visibility into what happens during topology loading. Message-ID: <ZnE5OAmUbExVZMWA@scylladb.com> (cherry picked from commit `fb764720d3`)	2024-06-18 16:38:51 +02:00
Botond Dénes	b18d9e5d0d	Merge '[Backport 6.0] make enable_compacting_data_for_streaming_and_repair truly live-update' from ScyllaDB This config item is propagated to the table object via table::config. Although the field in `table::config`, used to propagate the value, was `utils::updateable_value<T>`, it was assigned a constant and so the live-update chain was broken. This series fixes this and adds a test which fails before the patch and passes after. The test needed new test infrastructure, around the failure injection api, namely the ability to exfiltrate the value of internal variable. This infrastructure is also added in this series. Fixes: https://github.com/scylladb/scylladb/issues/18674 - [x] This patch has to be backported because it fixes broken functionality (cherry picked from commit `dbccb61636`) (cherry picked from commit `4590026b38`) (cherry picked from commit `feea609e37`) (cherry picked from commit `0c61b1822c`) (cherry picked from commit `8ef4fbdb87`) Refs #18705 Closes scylladb/scylladb#19240 * github.com:scylladb/scylladb: test/topology_custom: add test for enable_compacting_data_for_streaming_and_repair live-update test/pylib: rest_client: add get_injection() api/error_injection: add getter for error_injection utils/error_injection: add set_parameter() replica/database: fix live-update enable_compacting_data_for_streaming_and_repair	2024-06-13 12:45:23 +03:00
Botond Dénes	0d13c51dd4	test/topology_custom: add test for enable_compacting_data_for_streaming_and_repair live-update Avoid this the live-update feature of this config item breaking silently. (cherry picked from commit `8ef4fbdb87`)	2024-06-11 17:32:37 +00:00
Botond Dénes	d121fc1264	replica/database: fix live-update enable_compacting_data_for_streaming_and_repair This config item is propagated to the table object via table::config. Although the field in table::config, used to propagate the value, was utils::updateable_value<T>, it was assigned a constant and so the live-update chain was broken. This patch fixes this. (cherry picked from commit `dbccb61636`)	2024-06-11 17:32:36 +00:00
Raphael S. Carvalho	d4c3a43b34	replica: Refresh mutation source when allocating tablet replicas Consider the following: 1) table A has N tablets and views 2) migration starts for a tablet of A from node 1 to 2. 3) migration is at write_both_read_old stage 4) coordinator will push writes to both nodes (pending and leaving) 5) A has view, so writes to it will also result in reads (table::push_view_replica_updates()) 6) tablet's update_effective_replication_map() is not refreshing tablet sstable set (for new tablet migrating in) 7) so read on step 5 is not being able to find sstable set for tablet migrating in Causes the following error: "tablets - SSTable set wasn't found for tablet 21 of table mview.users" which means loss of write on pending replica. The fix will refresh the table's sstable set (tablet_sstable_set) and cache's snapshot. It's not a problem to refresh the cache snapshot as long as the logical state of the data hasn't changed, which is true when allocating new tablet replicas. That's also done in the context of compactions for example. Fixes #19052. Fixes #19033. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `7b41630299`) Closes scylladb/scylladb#19229	2024-06-11 18:12:43 +03:00
Botond Dénes	341c29bd74	Merge '[Backport 6.0] storage_service: Fix race between tablet split and stats retrieval' from Raphael "Raph" Carvalho Retrieval of tablet stats must be serialized with mutation to token metadata, as the former requires tablet id stability. If tablet split is finalized while retrieving stats, the saved erm, used by all shards, can have a lower tablet count than the one in a particular shard, causing an abort as tablet map requires that any id feeded into it is lower than its current tablet count. Fixes https://github.com/scylladb/scylladb/issues/18085. (cherry picked from commit `abcc68dbe7`) (cherry picked from commit `551bf9dd58`) (cherry picked from commit `e7246751b6`) Refs https://github.com/scylladb/scylladb/pull/18287 Closes scylladb/scylladb#19095 * github.com:scylladb/scylladb: topology_experimental_raft/test_tablets: restore usage of check_with_down test: Fix flakiness in topology_experimental_raft/test_tablets service: Use tablet read selector to determine which replica to account table stats storage_service: Fix race between tablet split and stats retrieval	2024-06-05 13:06:32 +03:00
Raphael S. Carvalho	9a341a65af	replica: Only consume memtable of the tablet intersecting with range read storage_proxy is responsible for intersecting the range of the read with tablets, and calling replica with a single tablet range, therefore it makes sense to avoid touching memtables of tablets that don't intersect with a particular range. Note this is a performance issue, not correctness one, as memtable readers that don't intersect with current range won't produce any data, but cpu is wasted until that's realized (they're added to list of readers in mutation_reader_merger, more allocations, more data sources to peek into, etc). That's also important for streaming e.g. after decommission, that will consume one tablet at a time through a reader, so we don't want memtables of streamed tablets (that weren't cleaned up yet) to be consumed. Refs #18904. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `832fb43fb4`) Closes scylladb/scylladb#18983	2024-06-04 16:17:47 +03:00
Raphael S. Carvalho	3cb71c5b88	replica: Fix race of tablet snapshot with compaction tablet snapshot, used by migration, can race with compaction and can find files deleted. That won't cause data loss because the error is propagated back into the coordinator that decides to retry streaming stage. So the consequence is delayed migration, which might in turn reduce node operation throughput (e.g. when decommissioning a node). It should be rare though, so shouldn't have drastic consequences. Fixes #18977. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `b396b05e20`) Closes scylladb/scylladb#19008	2024-06-03 12:21:52 +03:00
Marcin Maliszkiewicz	cbf47319c1	db: auth: move auth tables to system keyspace Separate keyspace which also behaves as system brings little benefit while creating some compatibility problems like schema digest mismatch during rollback. So we decided to move auth tables into system keyspace. Fixes https://github.com/scylladb/scylladb/issues/18098 Closes scylladb/scylladb#18769 (cherry picked from commit `2ab143fb40`) [avi: adjust test/alternator/suite.yaml to reflect new keyspace]	2024-06-02 21:41:14 +03:00
Raphael S. Carvalho	55a45e3486	storage_service: Fix race between tablet split and stats retrieval If tablet split is finalized while retrieving stats, the saved erm, used by all shards, will be invalidated. It can either cause incorrect behavior or crash if id is not available. It's worked by feeding local tablet map into the "coordinator" collecting stats from all shards. We will also no longer have a snapshot of erm shared between shards to help intra-node migration. This is simplified by serializing token metadata changes and the retrieval of the stats (latter should complete pretty fast, so it shouldn't block the former for any significant time). Fixes #18085. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `abcc68dbe7`)	2024-05-27 18:21:21 +00:00
Raphael S. Carvalho	13f8486cd7	replica: Fix tablet's compaction_groups_for_token_range() with unowned range File-based tablet streaming calls every shard to return data of every group that intersects with a given range. After dynamic group allocation, that breaks as the tablet range will only be present in a single shard, so an exception is thrown causing migration to halt during streaming phase. Ideally, only one shard is invoked, but that's out of the scope of this fix and compaction_groups_for_token_range() should return empty result if none of the local groups intersect with the range. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> (cherry picked from commit `eb8ef38543`) Closes scylladb/scylladb#18859	2024-05-27 15:20:04 +03:00
Piotr Dulikowski	9820472277	main: introduce schema commitlog scheduling group Currently, we do not explicitly set a scheduling group for the schema commitlog which causes it to run in the default scheduling group (called "main"). However: - It is important and significant enough that it should run in a scheduling group that is separate from the main one, - It should not run in the existing "commitlog" group as user writes may sometimes need to wait for schema commitlog writes (e.g. read barrier done to learn the schema necessary to interpret the user write) and we want to avoid priority inversion issues. Therefore, introduce a new scheduling group dedicated to the schema commitlog. Fixes: scylladb/scylladb#15566 Closes scylladb/scylladb#18715	2024-05-21 11:29:57 +02:00
Botond Dénes	f239339a29	Merge 'Improve modularity of some per-table API endpoints' from Pavel Emelyanov There's a set of API endpoints that toggle per-table auto-compaction and tombstone-gc booleans. They all live in two different .cc files under api/ directory and duplicate code of each other. This PR generalizes those handlers, places them next to each other, fixes leak on stop and, as a nice side effect, enlightens database.hh header. Closes scylladb/scylladb#18703 * github.com:scylladb/scylladb: api,database: Move auto-compaction toggle guard api: Move some table manipulation helpers from storage_service api: Move table-related calls from storage_service domain api: Reimplement some endpoints using existing helpers api: Lost unset of tombstone-gc endpoints	2024-05-20 18:01:54 +03:00
Avi Kivity	61505d057e	Merge 'Sort user-defined types in describe statements' from Michał Jadwiszczak User-defined types can depend on each other, creating directed acyclic graph. In order to support restoring schema from `DESC SCHEMA`, UDTs should be ordered topologically, not alphabetically as it was till now. This patch changes the way UDTs are ordered in `DESC SCHEMA`/`DESC KEYSPACE <ks>` statements, so the output can be safely copy-pasted to restore the schema. Fixes #18539 Closes scylladb/scylladb#18302 * github.com:scylladb/scylladb: test/cql-pytest/test_describe: add test for UDTs ordering cql3/statements/describe_statement: UDTs topological sorting cql3/statements/describe_statement: allow to skip alphabetical sorting types: add a method to get all referenced user types db/cql_type_parser: use generic topological sorting db/cql_type_parses: futurize raw_builder::build() test/boost: add test for topological sorting utils: introduce generic topological sorting algorithm	2024-05-20 16:58:17 +03:00
Botond Dénes	e1c4e6c151	Merge 'sstables_manager: use maintenance scheduling group to run components reload fiber' from Lakshmi Narayanan Sreethar PR https://github.com/scylladb/scylladb/pull/18186 introduced a fiber that reloads reclaimed bloom filters when memory becomes available. Use maintenance scheduling group to run that fiber instead of running it in the main scheduling group. Fixes #18675 Closes scylladb/scylladb#18721 * github.com:scylladb/scylladb: sstables_manager: use maintenance scheduling group to run components reload fiber sstables_manager: add member to store maintenance scheduling group	2024-05-20 16:38:42 +03:00
Avi Kivity	52fe351c31	Merge 'Balance tablets within nodes (intra-node migration)' from Tomasz Grabiec This is needed to avoid severe imbalance between shards which can happen when some table grows and is split. The inter-node balance can be equal, so inter-node migration cannot fix the imbalance. Also, if RF=N then there is not even a possibility of moving tablets around to fix the imbalance. The only way to bring the system to balance is to move tablets within the nodes. The system is not prepared for intra-node migration currently. Request coordination is host-based, while for intra-node migration it should be (also) shard-based. The solution employed here is to keep the coordination between nodes as-is, and for intra-node migration storage_proxy-level coordinator is not aware of the migration (no pending host). The replica-side request handler will be a second-level coordinator which routes requests to shards, similar to how the first-level coordinator routes them to hosts. Tablet sharder is adjusted to handle intra-migration where a tablet can have two replicas on the same host. For reads, sharder uses the read selector to resolve the conflict. For writes, the write selector is used. The old shard_of() API is kept to represent shard for reads, and new method is introduced to query the shards for writing: shard_for_writes(). All writers should be switched to that API, which is not done in this patch yet. The request handler on replica side acts as a second-level coordinator, using sharder to determine routing to shards. A given sharder has a scope of a single topology version, a single effective_replication_map_ptr, which should be kept alive during writes. perf-simple-query test results show no signs of regression: Command: perf-simple-query -c1 -m1G --write --tablets --duration=10 Before: > 83294.81 tps ( 59.5 allocs/op, 14.3 tasks/op, 53725 insns/op, 0 errors) > 87756.72 tps ( 59.5 allocs/op, 14.3 tasks/op, 54049 insns/op, 0 errors) > 86428.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54208 insns/op, 0 errors) > 86211.38 tps ( 59.7 allocs/op, 14.3 tasks/op, 54219 insns/op, 0 errors) > 86559.89 tps ( 59.6 allocs/op, 14.3 tasks/op, 54188 insns/op, 0 errors) > 86609.39 tps ( 59.6 allocs/op, 14.3 tasks/op, 54117 insns/op, 0 errors) > 87464.06 tps ( 59.5 allocs/op, 14.3 tasks/op, 54039 insns/op, 0 errors) > 86185.43 tps ( 59.6 allocs/op, 14.3 tasks/op, 54169 insns/op, 0 errors) > 86254.71 tps ( 59.6 allocs/op, 14.3 tasks/op, 54139 insns/op, 0 errors) > 83395.35 tps ( 60.2 allocs/op, 14.4 tasks/op, 54693 insns/op, 0 errors) > > median 86428.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54208 insns/op, 0 errors) > median absolute deviation: 243.04 > maximum: 87756.72 > minimum: 83294.81 > After: > 85523.06 tps ( 59.5 allocs/op, 14.3 tasks/op, 53872 insns/op, 0 errors) > 89362.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54226 insns/op, 0 errors) > 88167.55 tps ( 59.7 allocs/op, 14.3 tasks/op, 54400 insns/op, 0 errors) > 87044.40 tps ( 59.7 allocs/op, 14.3 tasks/op, 54310 insns/op, 0 errors) > 88344.50 tps ( 59.6 allocs/op, 14.3 tasks/op, 54289 insns/op, 0 errors) > 88355.06 tps ( 59.6 allocs/op, 14.3 tasks/op, 54242 insns/op, 0 errors) > 88725.46 tps ( 59.6 allocs/op, 14.3 tasks/op, 54230 insns/op, 0 errors) > 88640.08 tps ( 59.6 allocs/op, 14.3 tasks/op, 54210 insns/op, 0 errors) > 90306.31 tps ( 59.4 allocs/op, 14.3 tasks/op, 54043 insns/op, 0 errors) > 87343.62 tps ( 59.8 allocs/op, 14.3 tasks/op, 54496 insns/op, 0 errors) > > median 88355.06 tps ( 59.6 allocs/op, 14.3 tasks/op, 54242 insns/op, 0 errors) > median absolute deviation: 1007.41 > maximum: 90306.31 > minimum: 85523.06 Command (reads): perf-simple-query -c1 -m1G --tablets --duration=10 Before: > 95860.18 tps ( 63.1 allocs/op, 14.1 tasks/op, 42476 insns/op, 0 errors) > 97537.69 tps ( 63.1 allocs/op, 14.1 tasks/op, 42454 insns/op, 0 errors) > 97549.23 tps ( 63.1 allocs/op, 14.1 tasks/op, 42470 insns/op, 0 errors) > 97511.29 tps ( 63.1 allocs/op, 14.1 tasks/op, 42470 insns/op, 0 errors) > 97227.32 tps ( 63.1 allocs/op, 14.1 tasks/op, 42471 insns/op, 0 errors) > 94031.94 tps ( 63.1 allocs/op, 14.1 tasks/op, 42441 insns/op, 0 errors) > 96978.04 tps ( 63.1 allocs/op, 14.1 tasks/op, 42462 insns/op, 0 errors) > 96401.70 tps ( 63.1 allocs/op, 14.1 tasks/op, 42473 insns/op, 0 errors) > 96573.77 tps ( 63.1 allocs/op, 14.1 tasks/op, 42440 insns/op, 0 errors) > 96340.54 tps ( 63.1 allocs/op, 14.1 tasks/op, 42468 insns/op, 0 errors) > > median 96978.04 tps ( 63.1 allocs/op, 14.1 tasks/op, 42462 insns/op, 0 errors) > median absolute deviation: 571.20 > maximum: 97549.23 > minimum: 94031.94 > After: > 99794.67 tps ( 63.1 allocs/op, 14.1 tasks/op, 42471 insns/op, 0 errors) > 101244.99 tps ( 63.1 allocs/op, 14.1 tasks/op, 42472 insns/op, 0 errors) > 101128.37 tps ( 63.1 allocs/op, 14.1 tasks/op, 42485 insns/op, 0 errors) > 101065.27 tps ( 63.1 allocs/op, 14.1 tasks/op, 42465 insns/op, 0 errors) > 101212.98 tps ( 63.1 allocs/op, 14.1 tasks/op, 42456 insns/op, 0 errors) > 101413.31 tps ( 63.1 allocs/op, 14.1 tasks/op, 42463 insns/op, 0 errors) > 101464.92 tps ( 63.1 allocs/op, 14.1 tasks/op, 42466 insns/op, 0 errors) > 101086.74 tps ( 63.1 allocs/op, 14.1 tasks/op, 42488 insns/op, 0 errors) > 101559.09 tps ( 63.1 allocs/op, 14.1 tasks/op, 42468 insns/op, 0 errors) > 100742.58 tps ( 63.1 allocs/op, 14.1 tasks/op, 42491 insns/op, 0 errors) > > median 101212.98 tps ( 63.1 allocs/op, 14.1 tasks/op, 42456 insns/op, 0 errors) > median absolute deviation: 200.33 > maximum: 101559.09 > minimum: 99794.67 > Fixes #16594 Closes scylladb/scylladb#18026 * github.com:scylladb/scylladb: Implement fast streaming for intra-node migration test: tablets_test: Test sharding during intra-node migration test: tablets_test: Check sharding also on the pending host test: py: tablets: Test writes concurrent with migration test: py: tablets: Test crash during intra-node migration api, storage_service: Introduce API to wait for topology to quiesce dht, replica: Remove deprecated sharder APIs test: Avoid using deprecated sharded API db: do_apply_many() avoid deprecated sharded API replica: mutation_dump: Avoid deprecated sharder API repair: Avoid deprecated sharder API table: Remove optimization which returns empty reader when key is not owned by the shard dht: is_single_shard: Avoid deprecated sharder API dht: split_range_to_single_shard: Work with static_sharder only dht: ring_position_range_sharder: Avoid deprecated sharder APIs dht: token: Avoid use of deprecated sharder API by switching to static_sharder selective_token_sharder: Avoid use of deprecated sharder API docs: Document tablet sharding vs tablet replica placement readers/multishard.cc: use shard_for_reads() instead of shard_of() multishard_mutation_query.cc: use shard_for_reads() instead of shard_of() storage_proxy: Extract common code to apply mutations on many shards according to sharder storage_proxy: Prepare per-partition rate-limiting for intra-node migration storage_proxy: Avoid shard_of() use in mutate_counter_on_leader_and_replicate() storage_proxy: Prepare mutate_hint() for intra-node tablet migration commitlog_replayer: Avoid deprecated sharder::shard_of() lwt: Avoid deprecated sharder::shard_of() compaction: Avoid deprecated sharder::shard_of() dht: Extract dht::static_sharder replica: Deprecate table::shard_of() locator: Deprecate effective_replication_map::shard_of() dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard tests: tablets: py: Add intra-node migration test tests: tablets: Test that drained nodes are not balanced internally tests: tablets: Add checks of replica set validity to test_load_balancing_with_random_load tests: tablets: Verify that disabling balancing results in no intra-node migrations tests: tablets: Check that nodes are internally balanced tests: tablets: Improve debuggability by showing which rows are missing tablets, storage_service: Support intra-node migration in move_tablet() API tablet_allocator: Generate intra-node migration plan tablet_allocator: Extract make_internode_plan() tablet_allocator: Maintain candidate list and shard tablet count for target nodes tablet_allocator: Lift apply_load/can_accept_load lambdas to member functions tablets, streaming: Implement tablet streaming for intra-node migration dht, auto_refreshing_sharder: Allow overriding write selector multishard_writer: Handle intra-node migration storage_proxy: Handle intra-node tablet migration for writes tablets: Get rid of tablet_map::get_shard() tablets: Avoid tablet_map::get_shard in cleanup tablets: test: Use sharder instead of tablet_map::get_shard() tablets: tablet_sharder: Allow working with non-local host sharding: Prepare for intra-node-migration docs: Document sharder use for tablets tablets: Introduce tablet transition kind for intra-node migration tests: tablets: Fix use-after-move of skiplist in rebalance_tablets() sstables, gdb: Track readers in a linked list raft topology: Fix global token metadata barrier to not fence ahead of what is drained	2024-05-20 16:13:01 +03:00
Lakshmi Narayanan Sreethar	79f6746298	sstables_manager: add member to store maintenance scheduling group Store that maintenance scheduling group inside the sstables_manager. The next patch will use this to run the components reloader fiber. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2024-05-19 15:23:45 +05:30
Avi Kivity	2fbd78c769	feature: grandfather DIGEST_FOR_NULL_VALUES The DIGEST_FOR_NULL_VALUES feature was added in `21a77612b3` (2020; 4.4) and can now be assumed to be always present. The hasher which it invoked is removed.	2024-05-18 00:24:00 +03:00
Pavel Emelyanov	31d05925cc	api,database: Move auto-compaction toggle guard Toggling per-table auto-compaction enabling bit is guarded with on-database boolean and raii guard. It's only used by a single api/column_family.cc file, so it can live there. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-05-16 14:42:51 +03:00
Michał Jadwiszczak	3830f3bd23	db/cql_type_parses: futurize raw_builder::build() In order to use generic topological sort, build() method needs to return future.	2024-05-16 13:30:03 +02:00
Piotr Dulikowski	68eca3778c	Merge 'mv: throttle view update generation for large queries' from Wojciech Mitros This series is a reupload of #13792 with a few modifications, namely a test is added and the conflicts with recent tablet related changes are fixed. See https://github.com/scylladb/scylladb/issues/12379 and https://github.com/scylladb/scylladb/pull/13583 for a detailed description of the problem and discussions. This PR aims to extend the existing throttling mechanism to work with requests that internally generate a large amount of view updates, as suggested by @nyh. The existing mechanism works in the following way: * Client sends a request, we generate the view updates corresponding to the request and spawn background tasks which will send these updates to remote nodes * Each background task consumes some units from the `view_update_concurrency_semaphore`, but doesn't wait for these units, it's just for tracking * We keep track of the percent of consumed units on each node, this is called `view update backlog`. * Before sending a response to the client we sleep for a short amount of time. The amount of time to sleep for is based on the fullness of this `view update backlog`. For a well behaved client with limited concurrency this will limit the amount of incoming requests to a manageable level. This mechanism doesn't handle large DELETE queries. Deleting a partition is fast for the base table, but it requires us to generate a view update for every single deleted row. The number of deleted rows per single client request can be in the millions. Delaying response to the request doesn't help when a single request can generate millions of updates. To deal with this we could treat the view update generator just like any other client and force it to wait a bit of time before sending the next batch of updates. The amount of time to wait for is calculated just like in the existing throttling code, it's based on the fullness of `view update backlogs`. The new algorithm of view update generation looks something like this: ```c++ for(;;) { auto updates = generate_updates_batch_with_max_100_rows(); co_await seastar::sleep(calculate_sleep_time_from_backlogs()); spawn_background_tasks_for_updates(updates); } ``` Fixes: https://github.com/scylladb/scylladb/issues/12379 Closes scylladb/scylladb#16819 * github.com:scylladb/scylladb: test: add test for bad_allocs during large mv queries mv: throttle view update generation for large queries exceptions: add read_write_timeout_exception, a subclass of request_timeout_exception db/view: extract view throttling delay calculation to a global function view_update_generator: add get_storage_proxy() storage_proxy: make view backlog getters public	2024-05-16 08:22:54 +02:00
Raphael S. Carvalho	715ae689c0	Implement fast streaming for intra-node migration With intra-node migration, all the movement is local, so we can make streaming faster by just cloning the sstable set of leaving replica and loading it into the pending one. This cloning is underlying storage specific, but s3 doesn't support snapshot() yet (th sstables::storage procedure which clone is built upon). It's only supported by file system, with help of hard links. A new generation is picked for new cloned sstable, and it will live in the same directory as the original. A challenge I bumped into was to understand why table refused to load the sstable at pending replica, as it considered them foreign. Later I realized that sharder (for reads) at this stage of migration will point only to leaving replica. It didn't fail with mutation based streaming, because the sstable writer considers the shard -- that the sstable was written into -- as its owner, regardless of what sharder says. That was fixed by mimicking this behavior during loading at pending. test: ./test.py --mode=dev intranode --repeat=100 passes. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	679baff25a	dht, replica: Remove deprecated sharder APIs	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	539460dd71	db: do_apply_many() avoid deprecated sharded API	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	0f50504c39	replica: mutation_dump: Avoid deprecated sharder API	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	7c03646f99	table: Remove optimization which returns empty reader when key is not owned by the shard This check would lead to correctness issues with intra-node migration because the shard may switch during read, from "read old" to "read new". If the coordinator used "read old" for shard routing, but table on the old shard is already using "read new" erm, such a read would observe empty result, which is wrong. Drop the optimization. In the scenario above, read will observe all past writes because: 1) writes are still using "write both" 2) writes are switched to "write new" only after all requests which might be using "read old" are done Replica-side coordinators should already route single-key requests to the correct shard, so it's not important as an optimization. This issue shows how assumptions about static sharding are embedded in the current code base and how intra-node migration, by violating those assumptions, can lead to correctness issues.	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	dbca598e99	replica: Deprecate table::shard_of()	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	10a4903d0c	dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard Require users to specify whether we want shard for reads or for writes by switching to appropriate non-deprecated variant. For example, shard_of() can be replaced with shard_for_reads() or shard_for_writes(). The next_shard/token_for_next_shard APIs have only for-reads variant, and the act of switching will be a testimony to the fact that the code is valid for intra-node migration.	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	6c6ce2d928	tablets: Get rid of tablet_map::get_shard() Its semantics do not fit well with intra-node migration which allow two owning shards. Replace uses with the new has_replica() API.	2024-05-16 00:28:46 +02:00
Pavel Emelyanov	59aec1f300	database: Don't break namespace withexternal alias The namespace replica is broken in the middle with sstable_list alias, while the latter can be declared earlier Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#18664	2024-05-14 16:45:20 +03:00
Amnon Heiman	0c84692c97	replica/table.cc: Add metrics per-table-per-node This patch adds metrics that will be reported per-table per-node. The added metrics (that are part of the per-table per-shard metrics) are: scylla_column_family_cache_hit_rate scylla_column_family_read_latency scylla_column_family_write_latency scylla_column_family_live_disk_space Fixes #18642 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Closes scylladb/scylladb#18645	2024-05-14 07:54:34 +03:00
Raphael S. Carvalho	0b2ec3063c	sstables: Fix incremental_reader_selector (for range reads) with tablets incremental_reader_selector is the mechanism for incremental comsumption of disjoint sstables on range reads. tablet_sstable_set was implemented, such that selector is efficient with tablets. The problem is selector is vnode addicted and will only consider a given set exhausted when maximum token is reached. With tablets, that means a range read on first tablet of a given shard will also consume other tablets living in the same shard. That results in combined reader having to work with empty sstable readers of tablets that don't intersect with the range of the read. It won't cause extra I/O because the underlying sstables don't intersect with the range of the read. It's only unnecessary CPU work, as it involves creating readers (= allocation), feeding them into combined reader, which will in turn invoke the sstable readers only to realize they don't have any data for that range. With 100k tablets (ranges), and 100 tablets per shard, and ~5 sstables per tablet, there will be this amount of readers (empty or not): (100k * ((100^2 + 100) / 2) * avg_sstable_per_tablet=5) = ~2.5 billions. ~5000 times more readers, it can be quite significant additional cpu work, even though I/O dominates the most in scans. It's an inefficiency that we rather get rid of. The behavior can be observed from logs (there's 1 sstable for each of 4 tablets, but note how readers are created for every single one of them when reading only 1 tablet range): ``` table - make_reader_v2 - range=(-inf, {-4611686018427387905, end}] incremental_reader_selector - create_new_readers(null): selecting on pos {minimum token, w=-1} sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._34qn42... that has range [{-9151620220812943033, start},{-4813568684827439727, end}] incremental_reader_selector - create_new_readers(null): selecting on pos {-4611686018427387904, w=-1} sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._368nk2... that has range [{-4599560452460784857, start},{-78043747517466964, end}] incremental_reader_selector - create_new_readers(null): selecting on pos {0, w=-1} sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._38lj42... that has range [{851021166589397842, start},{3516631334339266977, end}] incremental_reader_selector - create_new_readers(null): selecting on pos {4611686018427387904, w=-1} sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._3dba82... that has range [{5065088566032249228, start},{9215673076482556375, end}] ``` Fix is about making sure the tablet set won't select past the supplied range of the read. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#18556	2024-05-14 07:43:22 +03:00
Wojciech Mitros	485eb7a64c	test: add test for bad_allocs during large mv queries This patch adds a test for reproducing issue #12379, which is being fixed in #16819. The test case works by creating a table with a materialized view, and then performing a partition delete query on it. At the same time, it uses injections to limit the memory to a level lower than usual, in order to increase the consistency of the test, and to limit its runtime. Before #16819, the test would exceed the limit and fail, and now the next allocation is throttled using a sleep.	2024-05-13 18:16:39 +02:00
Jan Ciolek	e0442d7bfa	mv: throttle view update generation for large queries For every mutation applied to the base table we have to generate the corresponding materialized view table updates. In case of simple requests, like INSERT or UPDATE, the number of view updates generated per base table mutation is limited to at most a few view table updates per base table update. The situation is different for DELETE queries, which can delete the whole partitions or clustering ranges. Range deletions are fast on the base table, but for the view table the situation is different. Deleting a single partition in the base table will generate as many singular view updates as there are rows in the deleted partition, which could potentially be in the millions. To prevent OOM view updates are generated in batches of at most 100 rows. There is a loop which generates the next batch of updates, spawns tasks to send them to remote nodes, generates another batch and so on. The problem is that there is no concurrency control - each batch is scheduled to be sent in the background, but the following batch is generated without waiting for the previously generated updates to be sent. This can lead to unbounded concurrency and OOM. To protect against this view update generation should be limited somehow. There is an existing mechanism for limiting view updates - throttling. We keep track of how many pending view updates there are, in the view backlog, and delay responses to the client based on this backlog's fullness. For a well behaved client with limited concurrency this will slow down the amount of incoming requests until it reaches an optimal point. This works for simple queries (INSERT, UPDATE, ...), but it doesn't do anything for range DELETEs. A DELETE is a single request that generates millions of view updates, delaying client response doesn't help. The throttling mechanism could be extend to cover this case - we could treat the DELETE request like any other client and force it to wait before sending more updates. This commit implements this approach - before sending the next batch of updates the generator is forced to sleep for a bit of time, calculated using the exisiting throttling equation. The more full the backlog gets the more the generator will have to sleep for, and hopefully this will prevent overloading the system with view updates. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2024-05-13 18:16:23 +02:00
Botond Dénes	afa870a387	Merge 'Some sstable set related improvements' from Raphael "Raph" Carvalho Closes scylladb/scylladb#18616 * github.com:scylladb/scylladb: replica: Make it explicit table's sstable set is immutable replica: avoid reallocations in tablet_sstable_set replica: Avoid compound set if only one sstable set is filled	2024-05-13 14:17:24 +03:00
Pavel Emelyanov	2ce643d06b	table: Directly compare std::optional<shard_id> with shard_id There's a loop that calculates the number of shard matches over a tablet map. The check of the given shard against optional<shard> can be made shorter. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#18592	2024-05-13 13:25:05 +03:00
Raphael S. Carvalho	7faba69f28	replica: Make it explicit table's sstable set is immutable Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-05-10 11:58:08 -03:00
Raphael S. Carvalho	55c0272b68	replica: avoid reallocations in tablet_sstable_set reserve upfront wherever possible to avoid reallocations. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-05-10 10:44:39 -03:00
Raphael S. Carvalho	35a0d47408	replica: Avoid compound set if only one sstable set is filled Most of the time only main set is filled, so we can avoid one layer of indirection (= compound set) when maintenance set is empty. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-05-10 10:44:34 -03:00
Aleksandra Martyniuk	b4371a0ea0	replica: allocate storage groups dynamically Currently empty storage_groups are allocated for tablets that are not on this shard. Allocate storage groups dynamically, i.e.: - on table creation allocate only storage groups that are on this shard; - allocate a storage group for tablet that is moved to this shard; - deallocate storage group for tablet that is cleaned up. Stop compaction group before it's deallocated. Add a flag to table::cleanup_tablet deciding whether to deallocate sgs and use it in commitlog tests.	2024-05-10 15:08:21 +02:00
Aleksandra Martyniuk	6e1e082e8c	replica: refresh snapshot in compaction_group::cleanup During compaction_group::cleanup sstables set is updated, but row_cache::_underlaying still keeps a shared ptr to the old set. Due to that descriptors to deleted sstables aren't closed. Refresh snapshot in order to store new sstables set in _underlying mutation source.	2024-05-10 14:56:38 +02:00
Aleksandra Martyniuk	c283746b32	replica: add rwlock to storage_group_manager Add rwlock which prevents storage groups from being added/deleted while some other layers itereates over them (or their compaction groups). Add methods to iterate over storage groups with the lock held.	2024-05-10 14:56:38 +02:00

1 2 3 4 5 ...

1175 Commits