It was added to make integration of storage groups easier, but it's
complicated since it's another source of truth and we could have
problems if it becomes inconsistent with the group map.
Fixes #18506.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit ad5c5bca5f)
compaction_group sits in the replica layer, and the compaction layer is
supposed to talk to it through compaction::table_state only.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit b8bd4c51c2)
Retrieval of tablet stats must be serialized with mutations to token metadata, as the former requires tablet id stability.
If a tablet split is finalized while stats are being retrieved, the saved erm, used by all shards, can have a lower tablet count than the one in a particular shard, causing an abort, since the tablet map requires that any id fed into it is lower than its current tablet count.
Fixes https://github.com/scylladb/scylladb/issues/18085.
(cherry picked from commit abcc68dbe7)
(cherry picked from commit 551bf9dd58)
(cherry picked from commit e7246751b6)
Refs https://github.com/scylladb/scylladb/pull/18287

Closes scylladb/scylladb#19095
* github.com:scylladb/scylladb:
topology_experimental_raft/test_tablets: restore usage of check_with_down
test: Fix flakiness in topology_experimental_raft/test_tablets
service: Use tablet read selector to determine which replica to account table stats
storage_service: Fix race between tablet split and stats retrieval
The tablet snapshot, used by migration, can race with compaction and
find files that were already deleted. That won't cause data loss,
because the error is propagated back to the coordinator, which decides
to retry the streaming stage. So the consequence is a delayed migration,
which might in turn reduce node operation throughput (e.g.
when decommissioning a node). It should be rare, though, so it
shouldn't have drastic consequences.
Fixes #18977.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit b396b05e20)
Closes scylladb/scylladb#19008
If tablet split is finalized while retrieving stats, the saved erm, used by all
shards, will be invalidated. That can cause either incorrect behavior or
a crash if an id is not available.
This is worked around by feeding the local tablet map into the "coordinator"
collecting stats from all shards. We also no longer have a snapshot
of erm shared between shards to help intra-node migration. This is
simplified by serializing token metadata changes and the retrieval of
the stats (the latter should complete pretty fast, so it shouldn't block
the former for any significant time).
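The serialization idea can be sketched in a few lines. This is an illustrative model with hypothetical names, not the real code (which serializes against Scylla's token metadata machinery, not a plain std::mutex): the stats "coordinator" and split finalization take the same lock, so the tablet count cannot change mid-collection.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical model: stats retrieval and token-metadata mutation
// (tablet split finalization) take the same lock, so every tablet id
// used during collection stays below the observed tablet count.
struct token_metadata_model {
    std::mutex barrier;  // serializes topology changes vs. stats retrieval
    size_t tablet_count = 8;
    std::vector<long> per_tablet_size = std::vector<long>(8, 0);

    void finalize_split() {  // topology change: doubles the tablet count
        std::lock_guard<std::mutex> g(barrier);
        tablet_count *= 2;
        per_tablet_size.resize(tablet_count, 0);
    }

    long collect_stats() {   // sees a stable tablet count throughout
        std::lock_guard<std::mutex> g(barrier);
        long total = 0;
        for (size_t id = 0; id < tablet_count; ++id) {
            total += per_tablet_size[id];  // id < tablet_count is guaranteed
        }
        return total;
    }
};
```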
Fixes #18085.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit abcc68dbe7)
Currently, we do not explicitly set a scheduling group for the schema
commitlog, which causes it to run in the default scheduling group (called
"main"). However:
- It is important and significant enough that it should run in a
scheduling group that is separate from the main one,
- It should not run in the existing "commitlog" group as user writes may
sometimes need to wait for schema commitlog writes (e.g. read barrier
done to learn the schema necessary to interpret the user write) and we
want to avoid priority inversion issues.
Therefore, introduce a new scheduling group dedicated to the schema
commitlog.
Fixes: scylladb/scylladb#15566

Closes scylladb/scylladb#18715
There's a set of API endpoints that toggle per-table auto-compaction and tombstone-gc booleans. They all live in two different .cc files under the api/ directory and duplicate each other's code. This PR generalizes those handlers, places them next to each other, fixes a leak on stop and, as a nice side effect, lightens the database.hh header.
Closes scylladb/scylladb#18703
* github.com:scylladb/scylladb:
api,database: Move auto-compaction toggle guard
api: Move some table manipulation helpers from storage_service
api: Move table-related calls from storage_service domain
api: Reimplement some endpoints using existing helpers
api: Lost unset of tombstone-gc endpoints
Toggling the per-table auto-compaction enabling bit is guarded with an
on-database boolean and an RAII guard. It's only used by a single file,
api/column_family.cc, so it can live there.
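A minimal sketch of the RAII-guard idea, with hypothetical names (not the real replica::table interface): the per-table "auto-compaction disabled" bit is flipped for the duration of a scope and restored on exit, so an exception cannot leave the flag stuck.

```cpp
#include <cassert>

// Hypothetical stand-in for the per-table flag.
struct table_model {
    bool auto_compaction_disabled = false;
};

// RAII guard: disables auto-compaction on construction, restores the
// previous value on destruction (even when the scope unwinds early).
class auto_compaction_toggle_guard {
    table_model& _t;
    bool _prev;
public:
    explicit auto_compaction_toggle_guard(table_model& t)
        : _t(t), _prev(t.auto_compaction_disabled) {
        _t.auto_compaction_disabled = true;
    }
    ~auto_compaction_toggle_guard() {
        _t.auto_compaction_disabled = _prev;  // restored on scope exit
    }
};
```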
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This series is a reupload of #13792 with a few modifications, namely a test is added and the conflicts with recent tablet related changes are fixed.
See https://github.com/scylladb/scylladb/issues/12379 and https://github.com/scylladb/scylladb/pull/13583 for a detailed description of the problem and discussions.
This PR aims to extend the existing throttling mechanism to work with requests that internally generate a large amount of view updates, as suggested by @nyh.
The existing mechanism works in the following way:
* Client sends a request, we generate the view updates corresponding to the request and spawn background tasks which will send these updates to remote nodes
* Each background task consumes some units from the `view_update_concurrency_semaphore`, but doesn't wait for these units, it's just for tracking
* We keep track of the percent of consumed units on each node, this is called `view update backlog`.
* Before sending a response to the client we sleep for a short amount of time. The amount of time to sleep for is based on the fullness of this `view update backlog`. For a well behaved client with limited concurrency this will limit the amount of incoming requests to a manageable level.
This mechanism doesn't handle large DELETE queries. Deleting a partition is fast for the base table, but it requires us to generate a view update for every single deleted row. The number of deleted rows per single client request can be in the millions. Delaying response to the request doesn't help when a single request can generate millions of updates.
To deal with this we could treat the view update generator just like any other client and force it to wait a bit of time before sending the next batch of updates. The amount of time to wait for is calculated just like in the existing throttling code, it's based on the fullness of `view update backlogs`.
The new algorithm of view update generation looks something like this:
```c++
for (;;) {
    auto updates = generate_updates_batch_with_max_100_rows();
    co_await seastar::sleep(calculate_sleep_time_from_backlogs());
    spawn_background_tasks_for_updates(updates);
}
```
Fixes: https://github.com/scylladb/scylladb/issues/12379

Closes scylladb/scylladb#16819
* github.com:scylladb/scylladb:
test: add test for bad_allocs during large mv queries
mv: throttle view update generation for large queries
exceptions: add read_write_timeout_exception, a subclass of request_timeout_exception
db/view: extract view throttling delay calculation to a global function
view_update_generator: add get_storage_proxy()
storage_proxy: make view backlog getters public
With intra-node migration, all the movement is local, so we can make
streaming faster by just cloning the sstable set of the leaving replica
and loading it into the pending one.
This cloning is specific to the underlying storage, but S3 doesn't
support snapshot() yet (the sstables::storage procedure that cloning is
built upon). It's only supported by the filesystem storage, with the
help of hard links.
A new generation is picked for each cloned sstable, and it will
live in the same directory as the original.
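The hard-link trick can be sketched with std::filesystem. The file-naming scheme below is purely illustrative (not the real sstable component naming): each component of the old generation gets a hard link under a new generation in the same directory, so no data is copied.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>

namespace fs = std::filesystem;

// Illustrative sketch: clone one sstable component by hard-linking it
// under a freshly picked generation, in the same directory. The clone
// shares the inode with the original, so the "copy" is O(1).
void clone_component(const fs::path& dir, const std::string& component,
                     long old_gen, long new_gen) {
    auto src = dir / (std::to_string(old_gen) + "-" + component);
    auto dst = dir / (std::to_string(new_gen) + "-" + component);
    fs::create_hard_link(src, dst);  // same inode: no data is copied
}
```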
A challenge I bumped into was understanding why the table refused to
load the sstables at the pending replica, considering them foreign.
Later I realized that the sharder (for reads) at this stage of migration
points only to the leaving replica. It didn't fail with mutation-based
streaming, because the sstable writer considers the shard that the
sstable was written on as its owner, regardless of what the sharder
says. That was fixed by mimicking this behavior during loading at the
pending replica.
test:
./test.py --mode=dev intranode --repeat=100 passes.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Require users to specify whether they want the shard for reads or for writes
by switching to the appropriate non-deprecated variant.
For example, shard_of() can be replaced with shard_for_reads() or
shard_for_writes().
The next_shard/token_for_next_shard APIs have only for-reads variant,
and the act of switching will be a testimony to the fact that the code
is valid for intra-node migration.
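The distinction can be modeled in a few lines. This is a hypothetical illustration, not the real ScyllaDB sharder API: during intra-node tablet migration, reads must keep hitting the leaving shard, while writes must land on both the leaving and the pending shard.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of the for-reads / for-writes split during
// intra-node tablet migration.
struct tablet_replica_model {
    unsigned leaving_shard;
    unsigned pending_shard;  // equals leaving_shard when not migrating
    bool migrating;

    unsigned shard_for_reads() const {
        return leaving_shard;  // reads stay on the old replica
    }

    std::vector<unsigned> shard_for_writes() const {
        if (migrating && pending_shard != leaving_shard) {
            return {leaving_shard, pending_shard};  // dual-write while migrating
        }
        return {leaving_shard};
    }
};
```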
The replica namespace is broken in the middle by the sstable_list alias,
while the latter can be declared earlier.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#18664
This patch adds a test for reproducing issue #12379, which is
being fixed in #16819.
The test case works by creating a table with a materialized
view, and then performing a partition delete query on it.
At the same time, it uses injections to limit the memory
to a level lower than usual, in order to increase the
consistency of the test, and to limit its runtime.
Before #16819, the test would exceed the limit and fail,
and now the next allocation is throttled using a sleep.
Closes scylladb/scylladb#18616
* github.com:scylladb/scylladb:
replica: Make it explicit table's sstable set is immutable
replica: avoid reallocations in tablet_sstable_set
replica: Avoid compound set if only one sstable set is filled
Currently empty storage_groups are allocated for tablets that are
not on this shard.
Allocate storage groups dynamically, i.e.:
- on table creation allocate only storage groups that are on this
shard;
- allocate a storage group for tablet that is moved to this shard;
- deallocate storage group for tablet that is cleaned up.
Stop compaction group before it's deallocated.
Add a flag to table::cleanup_tablet deciding whether to deallocate
sgs and use it in commitlog tests.
Add an rwlock which prevents storage groups from being added/deleted
while some other layer iterates over them (or their compaction
groups).
Add methods to iterate over storage groups with the lock held.
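The locking scheme above can be sketched with a std::shared_mutex (the real code uses seastar primitives; all names here are hypothetical): allocation/deallocation takes the lock exclusively, while iteration takes it shared, so the set of groups cannot change under an iterating layer.

```cpp
#include <cassert>
#include <map>
#include <mutex>
#include <shared_mutex>

// Hypothetical model of the storage-group manager's rwlock.
struct storage_group_manager_model {
    std::shared_mutex _lock;
    std::map<unsigned, int> _groups;  // tablet id -> storage group (stand-in)

    void allocate_group(unsigned id) {
        std::unique_lock l(_lock);    // exclusive: mutates the set of groups
        _groups.emplace(id, 0);
    }
    void deallocate_group(unsigned id) {
        std::unique_lock l(_lock);
        _groups.erase(id);
    }
    template <typename Func>
    void for_each_storage_group(Func f) {
        std::shared_lock l(_lock);    // shared: iteration excludes add/remove
        for (auto& [id, g] : _groups) {
            f(id, g);
        }
    }
};
```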
Pass compaction group id to
shard_reshaping_compaction_task_impl::reshape_compaction_group.
Modify table::as_table_state to return table_state of the given
compaction group.
Currently we have a single method -- detach_column_family() -- which
does something with each semaphore. Soon there will be another one.
Introduce a method to do something with all semaphores, to make this
smoother. Enterprise has a different set of semaphores, and this will
reduce friction.
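The refactoring shape can be sketched as follows, with hypothetical names: a single for_each helper walks all semaphores, so detach_column_family() and future per-semaphore operations (and a differing enterprise semaphore set) share one traversal.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for a reader concurrency semaphore.
struct semaphore_model {
    bool detached = false;
};

// One traversal helper, many callers: new per-semaphore operations plug
// in without duplicating the loop.
struct semaphore_group_model {
    std::vector<semaphore_model> _semaphores{3};  // stand-in set

    void foreach_semaphore(std::function<void(semaphore_model&)> f) {
        for (auto& s : _semaphores) {
            f(s);
        }
    }

    void detach_column_family() {
        foreach_semaphore([](semaphore_model& s) { s.detached = true; });
    }
};
```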
The populate_views() and generate_and_propagate_view_updates() methods both naturally belong to view_update_generator -- they don't need anything special from the table itself, but rather depend on some internals of the v.u.generator.
Moving them there allows removing the view concurrency semaphore from keyspace and table, thus reducing cross-component dependencies.
Closes scylladb/scylladb#18421
* github.com:scylladb/scylladb:
replica: Do not carry view concurrency semaphore pointer around
view: Get concurrency semaphore via database, not table
view_update_generator: Mark mutate_MV() private
view: Move view_update_generator methods' code
view: Move table::generate_and_propagate_view_updates into view code
view: Move table::populate_views() into view_update_generator class
There are two places that get total:live stats for a table snapshot --
database::get_snapshot_details() and table::get_snapshot_details(). Both
do pretty similar thing -- walk the table/snapshots/ directory, then
each of the found sub-directory and accumulate the found files' sizes
into snapshot details structure.
Both try to tell total from live sizes by checking whether an sstable
component found in snapshots is present in the table datadir. The
database code does it in a more correct way -- it not only checks for
the file's presence, but also checks whether it's a hard link to the
snapshot file, while the table code just checks if a file of the same
name exists.
This patch does both -- it makes database and table call the same
helper method for a single snapshot's details, and makes the generalized
version use the more elaborate collision check, thus fixing the
per-table details retrieval.
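The stricter collision check can be sketched with std::filesystem (illustrative paths, not the real helper): a snapshot component only counts as "live" if the datadir file is the same inode, i.e. a hard link, not merely a file with the same name. fs::equivalent compares device+inode, which is exactly the hard-link test.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>

namespace fs = std::filesystem;

// Illustrative sketch: true only when the datadir holds a hard link of
// the snapshot component (same inode), not just a same-named file.
bool is_live_component(const fs::path& snapshot_file, const fs::path& datadir) {
    auto live = datadir / snapshot_file.filename();
    std::error_code ec;
    return fs::exists(live, ec) && fs::equivalent(snapshot_file, live, ec);
}
```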
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#18347
The _view_update_concurrency_sem field on database propagates itself via
keyspace config down to table config, and view_update_generator then
grabs it via a table:: helper. That's overkill: view_update_generator
has a direct reference to the database and can get this semaphore from
there.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Similarly to populate_views() method, this one also naturally belongs to
view_update_generator class.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The method in question has little to do with table; effectively it only
needs stats and a concurrency semaphore. And the semaphore in question is
obtained from table indirectly -- it really resides on the database. On the
other hand, the method carries lots of bits from db::view, e.g. the
view_update_builder class, the memory_usage_of() helper and a bit more.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
There's a database::get_snapshot_details() method that returns a collection of all snapshots for all ks.cf out there, and there are several *snapshot_details* aux structures around it. This PR keeps only one "details" structure and cleans up the way it propagates from database up to the respective API calls.
Closes scylladb/scylladb#18317
* github.com:scylladb/scylladb:
snapshot_ctl: Brush up true_snapshots_size() internals
snapshot_ctl: Remove unused details struct
snapshot_ctl: No double recoding of details
database,snapshots: Move database::snapshot_details into snapshot_ctl
database,snapshots: Make database::get_snapshot_details() return map, not vector
table,snapshots: Move table::snapshot_details into snapshot_ctl
So that it's in-sync with table::get_snapshot_details(). Next patches
will improve this place even further.
Also, there can be many snapshots and the vector can grow large, but that's
less of an issue here.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Now database is constructed with sharded<directory_semaphore>, but it no
longer needs sharded, local is enough.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It was only used by the table snapshot-taking code. Now that code uses the
sstables manager's reference, and database can stop carrying it around.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
storage_group_id_for_token() was only needed from within
tablet_storage_group_manager, so we can kill
table::storage_group_id_for_token().
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#18134
Make a specialized sstable_set for tablets
via tablet_storage_group_manager::make_sstable_set.
This sstable set takes a snapshot of the storage_groups
(compound) sstable_sets and maps the selected tokens
directly into the tablet compound_sstable_set.
This sstable_set provides much more efficient access
to the table's sstable sets, as it takes advantage of the disjointness
of sstable sets between tablets/storage_groups, and making it is cheaper
than rebuilding a complete partitioned_sstable_set from all sstables in the table.
Fixes #16876
Cassandra-stress setup:
```
$ sudo cpupower frequency-set -g userspace
$ build/release/scylla (developer-mode options) --smp=16 --memory=8G --experimental-features=consistent-topology-changes --experimental-features=tablets
cqlsh> CREATE KEYSPACE keyspace1 WITH replication={'class':'NetworkTopologyStrategy', 'replication_factor':1} AND tablets={'initial':2048};
$ ./tools/java/tools/bin/cassandra-stress write no-warmup n=10000000 -pop 'seq=1...10000000' -rate threads=128
$ scylla-api-client system drop_sstable_caches POST
$ ./tools/java/tools/bin/cassandra-stress read no-warmup duration=60s -pop 'dist=uniform(1..10000000)' -rate threads=128
$ scylla-api-client system drop_sstable_caches POST
$ ./tools/java/tools/bin/cassandra-stress mixed no-warmup duration=60s -pop 'dist=uniform(1..10000000)' -rate threads=128
```
Baseline (0a7854ea4d) vs. fix (0c2c00f01b)
Throughput (op/s):
workload | baseline | fix
---------|----------|----------
write | 76,806 | 100,787
read | 34,330 | 106,099
mixed | 32,195 | 79,246
Closes scylladb/scylladb#17149
* github.com:scylladb/scylladb:
table: tablet_storage_group_manager: make tablet_sstable_set
storage_group_manager: add make_sstable_set
tablet_storage_group_manager: handle_tablet_split_completion: pre-calc new_tablet_count
table: tablet_storage_group_manager: storage_group_of: do not validate in release build mode
table: move compaction_group_list and storage_group_vector to storage_group_manager
compaction_group::table_state: get_group_id: become self-sufficient
compaction_group, table: make_compound_sstable_set: declare as const
tablet_storage_group_manager: precalculate my_host_id and _tablet_map
table: coroutinize update_effective_replication_map
Make a specialized sstable_set for tablets
via tablet_storage_group_manager::make_sstable_set.
This sstable set takes a snapshot of the storage_groups
(compound) sstable_sets and maps the selected tokens
directly into the tablet compound_sstable_set.
Refs #16876
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
For tables using tablet based replication strategies, the sstables should be reshaped only within the compaction groups they belong to. The shard_reshaping_compaction_task_impl now groups the sstables based on their compaction groups before reshaping them.
Fixes https://github.com/scylladb/scylladb/issues/16966

Closes scylladb/scylladb#17395
* github.com:scylladb/scylladb:
test/topology_custom: add testcase to verify reshape with tablets
test/pylib/rest_client: add get_sstable_info, enable/disable_autocompaction
replica/distributed_loader: enable reshape for sstables
compaction: reshape sstables within compaction groups
replica/table : add method to get compaction group id for an sstable
compaction: reshape: update total reshaped size only on success
compaction: simplify exception handling in shard_reshaping_compaction_task_impl::run
It's better to wait for deregistering the
old main compaction_groups in handle_tablet_split_completion
rather than leaving that work in the background,
especially since their respective storage_groups
are being destroyed by handle_tablet_split_completion.
handle_tablet_split_completion keeps a continuation chain
for all non-ready compaction_group stop fibers
and returns it so that update_effective_replication_map
can await it, leaving no cleanup work in the background.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
This patch changes get_unlimited_query_max_result_size():
* Also sets the page-size field, not just the soft/hard limits
* Renames it to get_query_max_result_size()
* Updates callers, specifically storage_proxy::get_max_result_size(),
which now has a much simpler common return path and has to drop the
page size on one rare return path.
This is a purely mechanical change, no behaviour is changed.
Tables in keyspaces governed by a replication strategy that uses tablets have separate effective_replication_maps. Update the upgrade compaction task to handle this when getting owned key ranges for a keyspace.
Fixes #16848

Closes scylladb/scylladb#17335
* github.com:scylladb/scylladb:
compaction: upgrade: handle keyspaces that use tablets
replica/database: add an optional variant to get_keyspace_local_ranges
Add a new method database::maybe_get_keyspace_local_ranges that
optionally returns the owned ranges for the given keyspace if it has an
effective_replication_map for the entire keyspace.
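The optional-return shape can be sketched as follows, with hypothetical types and names: keyspace-wide local ranges exist only when a single effective_replication_map covers the whole keyspace (the vnode case); with tablets each table has its own, so the getter returns nullopt.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical stand-ins for a token range and a keyspace.
struct range_model { long start, end; };

struct keyspace_model {
    bool uses_tablets;
    std::vector<range_model> ranges;  // keyspace-wide ranges, vnode case only
};

// Returns the keyspace-wide owned ranges, or nullopt when no single
// erm covers the keyspace (i.e. it uses tablets).
std::optional<std::vector<range_model>>
maybe_get_keyspace_local_ranges(const keyspace_model& ks) {
    if (ks.uses_tablets) {
        return std::nullopt;  // no single keyspace-wide erm with tablets
    }
    return ks.ranges;
}
```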
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Mirroring table::uses_tablets(), this provides a convenient and -- more
importantly -- easily discoverable way to determine whether the keyspace
uses tablets or not.
This information is of course already available via the abstract
replication strategy, but as seen in a few examples, this is not easily
discoverable and sometimes people resorted to enumerating the keyspace's
tables to be able to invoke table::uses_tablets().