scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-22 17:40:34 +00:00

Author	SHA1	Message	Date
Lakshmi Narayanan Sreethar	e2142974f8	replica/database: pass abort_source to database constructor This is in preparation for the following patch that adds abort_source variable to the sstables_manager. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2024-07-16 20:36:06 +05:30
Botond Dénes	53a6ec05ed	Merge 'replica: remove rwlock for protecting iteration over storage group map' from Raphael "Raph" Carvalho rwlock was added to protect iterations against concurrent updates to the map. the updates can happen when allocating a new tablet replica or removing an old one (tablet cleanup). the rwlock is very problematic because it can result in topology changes blocked, as updating token metadata takes the exclusive lock, which is serialized with table wide ops like split / major / explicit flush (and those can take a long time). to get rid of the lock, we can copy the storage group map and guard individual groups with a gate (not a problem since map is expected to have a maximum of ~100 elements). so cleanup can close that gate (carefully closed after stopping individual groups such that migrations aren't blocked by long-running ops like major), and ongoing iterations (e.g. triggered by nodetool flush) can skip a group that was closed, as such a group is being migrated out. Fixes #18821. 
``` WRITE ===== ./build/release/scylla perf-simple-query --smp 1 --memory 2G --initial-tablets 10 --tablets --write - BEFORE 65559.52 tps ( 59.6 allocs/op, 16.4 logallocs/op, 14.3 tasks/op, 52841 insns/op, 30946 cycles/op, 0 errors) 67408.05 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 53018 insns/op, 30874 cycles/op, 0 errors) 67714.72 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 53026 insns/op, 30881 cycles/op, 0 errors) 67825.57 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 53015 insns/op, 30821 cycles/op, 0 errors) 67810.74 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 53009 insns/op, 30828 cycles/op, 0 errors) throughput: mean=67263.72 standard-deviation=967.40 median=67714.72 median-absolute-deviation=547.02 maximum=67825.57 minimum=65559.52 instructions_per_op: mean=52981.61 standard-deviation=79.09 median=53014.96 median-absolute-deviation=36.54 maximum=53025.79 minimum=52840.56 cpu_cycles_per_op: mean=30869.90 standard-deviation=50.23 median=30874.06 median-absolute-deviation=42.11 maximum=30945.94 minimum=30820.89 - AFTER 65448.76 tps ( 59.5 allocs/op, 16.4 logallocs/op, 14.3 tasks/op, 52788 insns/op, 31013 cycles/op, 0 errors) 67290.83 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 53025 insns/op, 30950 cycles/op, 0 errors) 67646.81 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 53025 insns/op, 30909 cycles/op, 0 errors) 67565.90 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 53058 insns/op, 30951 cycles/op, 0 errors) 67537.32 tps ( 59.3 allocs/op, 16.0 logallocs/op, 14.3 tasks/op, 52983 insns/op, 30963 cycles/op, 0 errors) throughput: mean=67097.93 standard-deviation=931.44 median=67537.32 median-absolute-deviation=467.97 maximum=67646.81 minimum=65448.76 instructions_per_op: mean=52975.85 standard-deviation=108.07 median=53024.55 median-absolute-deviation=49.45 maximum=53057.99 minimum=52788.49 cpu_cycles_per_op: mean=30957.17 standard-deviation=37.43 median=30951.31 
median-absolute-deviation=7.51 maximum=31013.01 minimum=30908.62 READ ===== ./build/release/scylla perf-simple-query --smp 1 --memory 2G --initial-tablets 10 --tablets - BEFORE 79423.36 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41840 insns/op, 26820 cycles/op, 0 errors) 81076.70 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41837 insns/op, 26583 cycles/op, 0 errors) 80927.36 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41829 insns/op, 26629 cycles/op, 0 errors) 80539.44 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41841 insns/op, 26735 cycles/op, 0 errors) 80793.10 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41864 insns/op, 26662 cycles/op, 0 errors) throughput: mean=80551.99 standard-deviation=661.12 median=80793.10 median-absolute-deviation=375.37 maximum=81076.70 minimum=79423.36 instructions_per_op: mean=41842.20 standard-deviation=13.26 median=41840.14 median-absolute-deviation=5.68 maximum=41864.50 minimum=41829.29 cpu_cycles_per_op: mean=26685.88 standard-deviation=93.31 median=26662.18 median-absolute-deviation=56.47 maximum=26820.08 minimum=26582.68 - AFTER 79464.70 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41799 insns/op, 26761 cycles/op, 0 errors) 80954.58 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41803 insns/op, 26605 cycles/op, 0 errors) 81160.90 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41811 insns/op, 26555 cycles/op, 0 errors) 81263.10 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41814 insns/op, 26527 cycles/op, 0 errors) 81162.97 tps ( 63.1 allocs/op, 0.0 logallocs/op, 14.2 tasks/op, 41806 insns/op, 26549 cycles/op, 0 errors) throughput: mean=80801.25 standard-deviation=755.54 median=81160.90 median-absolute-deviation=361.72 maximum=81263.10 minimum=79464.70 instructions_per_op: mean=41806.47 standard-deviation=5.85 median=41806.05 median-absolute-deviation=4.05 maximum=41813.86 minimum=41799.36 cpu_cycles_per_op: mean=26599.22 
standard-deviation=94.84 median=26554.54 median-absolute-deviation=50.51 maximum=26761.06 minimum=26527.05 ``` Closes scylladb/scylladb#19469 * github.com:scylladb/scylladb: replica: remove rwlock for protecting iteration over storage group map replica: get rid of fragile compaction group intrusive list	2024-07-12 15:45:36 +03:00
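The lock-free scheme described in the merge above (copy the storage group map, guard each group with a gate, skip groups whose gate is closed) can be sketched in plain C++. This is a minimal model under stated assumptions, not the ScyllaDB implementation: `simple_gate`, `storage_group_model`, `flush_all` and `cleanup_group` are hypothetical names, and the gate here is only a boolean, whereas a real gate also tracks in-flight holders and makes close wait for them.

```cpp
#include <cstddef>
#include <map>
#include <memory>

// Hypothetical stand-in for a per-group gate: once closed, nobody may enter.
// A real gate would also count holders and make close() wait for them.
class simple_gate {
    bool _closed = false;
public:
    bool try_enter() const { return !_closed; }
    void close() { _closed = true; }
};

// Hypothetical model of a storage group; `flushed` counts flushes for the demo.
struct storage_group_model {
    simple_gate gate;
    int flushed = 0;
};

using group_map = std::map<std::size_t, std::shared_ptr<storage_group_model>>;

// Iterates over a copy of the map, so concurrent insertions or removals in the
// original cannot invalidate the iteration. Groups whose gate is closed are
// skipped, because they are being migrated out. Returns the number flushed.
std::size_t flush_all(const group_map& groups) {
    group_map snapshot = groups; // cheap: the map is expected to hold ~100 entries
    std::size_t visited = 0;
    for (auto& entry : snapshot) {
        auto& group = entry.second;
        if (!group->gate.try_enter()) {
            continue;
        }
        ++group->flushed;
        ++visited;
    }
    return visited;
}

// Tablet cleanup: close the gate first, then drop the group from the map.
void cleanup_group(group_map& groups, std::size_t id) {
    auto it = groups.find(id);
    if (it == groups.end()) {
        return;
    }
    it->second->gate.close();
    groups.erase(it);
}
```

Because iteration works on a snapshot, a table-wide operation never needs to hold a lock that would block token metadata updates; the cost is one small map copy per iteration.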
Raphael S. Carvalho	ad5c5bca5f	replica: get rid of fragile compaction group intrusive list It was added to make integration of storage groups easier, but it's complicated since it's another source of truth and we could have problems if it becomes inconsistent with the group map. Fixes #18506. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-07-09 16:53:35 -03:00
Avi Kivity	f31d5e3204	Merge 'repair/streaming: enable toggling tombstone gc with a config item' from Botond Dénes We currently disable tombstone GC for compaction done on the read path of streaming and repair, because those expired tombstones can still prevent data resurrection. With time-based tombstone GC, missing a repair for long enough can cause data resurrection because a tombstone is potentially GC'd before it could be spread to every node by repair. So repair disseminating these expired tombstones helps clusters which missed repair for long enough. It is not a guarantee because compaction could have done the GC itself, but it is better than nothing. This last resort is getting less important with repair-based tombstone GC. Furthermore, we have seen this cause huge repair amplification in a cluster, where expired tombstones triggered repair replicating otherwise identical rows. This series makes tombstone GC on the streaming/repair compaction path configurable with a config item. This new config item defaults to `false` (current behaviour), setting it to `true`, will enable tombstone GC. Fixes: https://github.com/scylladb/scylladb/issues/19015 Not a regression, no backport needed Closes scylladb/scylladb#19016 * github.com:scylladb/scylladb: test/topology_custom/test_repair: add test for enable_tombstone_gc_for_streaming_and_repair replica/table: maybe_compact_for_streaming(): toggle tombstone GC based on the control flag replica: propagate enable_tombstone_gc_for_streaming_and_repair to maybe_compact_for_streaming() db/config: introduce enable_tombstone_gc_for_streaming_and_repair	2024-07-09 19:04:11 +03:00
Michael Litvak	08b29460fc	mv: skip building view updates on a pending replica Currently, a pending replica that applies a write on a table that has materialized views, will build all the view updates as a normal replica, only to realize at a late point, in db::view::get_view_natural_endpoint(), that it doesn't have a paired view replica to send the updates to. It will then either drop the view updates, or send them to a pending view replica, if such exists. This work is unnecessary since it may be dropped, and even if there is a pending view replica to send the updates to, the updates that are built by the pending replica may be wrong since it may have incomplete information. This commit fixes the inefficiency by skipping the view update building step when applying an update on a pending replica. The metric total_view_updates_on_wrong_node is added to count the cases that a view update is determined to be unnecessary. The test reproduces the scenario of writing to a table and applying the update on a pending replica, and verifies that the pending replica doesn't try to build view updates. Fixes scylladb/scylladb#19152 Closes scylladb/scylladb#19488	2024-07-02 13:10:18 +02:00
Botond Dénes	415457be2b	replica: propagate enable_tombstone_gc_for_streaming_and_repair to maybe_compact_for_streaming() Just wiring, the new flag will be used in the next patch.	2024-06-26 04:05:17 -04:00
Avi Kivity	fdc1449392	treewide: rename flat_mutation_reader_v2 to mutation_reader flat_mutation_reader_v2 was introduced in a pair of commits in 2021: `e3309322c3` "Clone flat_mutation_reader related classes into v2 variants" `08b5773c12` "Adapt flat_mutation_reader_v2 to the new version of the API" as a replacement for flat_mutation_reader, using range_tombstone_change instead of range_tombstone to represent range tombstones. See those commits for more information. The transition was incremental; the last use of the original flat_mutation_reader was removed in 2022 in commit `026f8cc1e7` "db: Use mutation_partition_v2 in mvcc" In turn, flat_mutation_reader was introduced in 2017 in commit `748205ca75` "Introduce flat_mutation_reader" to transition from a mutation_reader that nested rows within a partition in a separate stream, to a flat reader that streamed partitions and rows in the same stream. Here, we reclaim the original name and rename the awkward flat_mutation_reader_v2 to mutation_reader. Note that mutation_fragment_v2 remains since we still use the original for compatibility, sometimes. Some notes about the transition: - files were also renamed. In one case (flat_mutation_reader_test.cc), the rename target already existed, so we rename to mutation_reader_another_test.cc. - a namespace 'mutation_reader' with two definitions existed (in mutation_reader_fwd.hh). Its contents were folded into the mutation_reader class. As a result, a few #includes had to be adjusted. Closes scylladb/scylladb#19356	2024-06-21 07:12:06 +03:00
Avi Kivity	185338c8cf	Merge 'Reduce TWCS off-strategy space overhead' from Raphael "Raph" Carvalho Normally, the space overhead for TWCS is 1/N, where N is the number of windows. But during off-strategy, the overhead is 100% because input sstables cannot be released earlier. Reshaping a TWCS table that takes ~50% of available space can result in the system running out of space. That's fixed by restricting every TWCS off-strategy job to 10% of free space on disk. Tables that aren't big will not be penalized with increased write amplification, as all input (disjoint) sstables can still be compacted in a single round. Fixes #16514. Closes scylladb/scylladb#18137 * github.com:scylladb/scylladb: compaction: Reduce twcs off-strategy space overhead to 10% of free space compaction: wire storage free space into reshape procedure sstables: Allow to get free space from underlying storage replica: don't expose compaction_group to reshape task	2024-06-20 18:51:25 +03:00
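The 10% rule from the merge above can be illustrated with a small planner that groups input sstables into jobs, each capped at one tenth of the free space. This is a sketch only: `plan_reshape_jobs` is a hypothetical helper, and the rule that an sstable larger than the budget gets a job of its own is an assumption made here, not something the commit message states.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Groups sstable sizes (in bytes) into consecutive jobs whose total input size
// stays within 10% of the free space. Assumption of this sketch: an sstable
// that alone exceeds the budget is still scheduled, in a job of its own.
std::vector<std::vector<std::uint64_t>> plan_reshape_jobs(
        const std::vector<std::uint64_t>& sstable_sizes,
        std::uint64_t free_space) {
    const std::uint64_t budget = free_space / 10;
    std::vector<std::vector<std::uint64_t>> jobs;
    std::vector<std::uint64_t> current;
    std::uint64_t current_bytes = 0;
    for (std::uint64_t size : sstable_sizes) {
        // Start a new job when adding this sstable would exceed the budget.
        if (!current.empty() && current_bytes + size > budget) {
            jobs.push_back(std::move(current));
            current.clear();
            current_bytes = 0;
        }
        current.push_back(size);
        current_bytes += size;
    }
    if (!current.empty()) {
        jobs.push_back(std::move(current));
    }
    return jobs;
}
```

A small table whose inputs fit in the budget yields a single job, which matches the note that such tables see no extra write amplification.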
Raphael S. Carvalho	b8bd4c51c2	replica: don't expose compaction_group to reshape task compaction_group sits in replica layer and compaction layer is supposed to talk to it through compaction::table_state only. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-06-13 12:43:14 -03:00
Raphael S. Carvalho	f3a1f5df83	replica: simplify perform_cleanup_compaction() Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-06-12 12:44:21 -03:00
Raphael S. Carvalho	6214dda506	replica: return storage_group by reference on storage_group_for*() those functions cannot return nullptr, will throw when group is not found, so better return ref instead. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-06-12 11:53:06 -03:00
Raphael S. Carvalho	9c1d3bcc02	replica: devirtualize storage_group_of() can be made private to tablet_storage_group_manager. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-06-12 11:29:49 -03:00
Pavel Emelyanov	882b2f4e9f	cql3, schema_tables: Generalize function creation When a function is created with the CREATE FUNCTION statement, the statement handler does all the necessary preparations on its own. The very same code exists in schema_tables, when the function is loaded on boot. This patch generalizes both and keeps function language-specific context creation inside lang/ code. The creation function returns context via argument reference. It would have been nicer if it was returned via future<>, but it's not suitable for future<T> type :( Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-06-07 13:07:05 +03:00
Pavel Emelyanov	f950469af5	lang: Move manager to lang namespace And, while at it, rename local variable to refer to it to as "manager" not "wasm". Query processor and database also have getters named "wasm()", these are not renamed yet to keep patch smaller (and those getters are going to be reworked further anyway). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-06-07 12:35:57 +03:00
Pavel Emelyanov	ad0e6b79fc	replica: Remove all_datadir from keyspace config This vector of paths is only used to generate the same vector of paths for table config, but the latter already has all the needed info. It's the part of the plan to stop using paths/directories in keyspaces and tables, because with storage-options tables no longer keep their data in "files on disk", so this information goes to sstables storage manager (refs #12707) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#19119	2024-06-06 08:30:34 +03:00
Raphael S. Carvalho	b396b05e20	replica: Fix race of tablet snapshot with compaction tablet snapshot, used by migration, can race with compaction and can find files deleted. That won't cause data loss because the error is propagated back into the coordinator that decides to retry streaming stage. So the consequence is delayed migration, which might in turn reduce node operation throughput (e.g. when decommissioning a node). It should be rare though, so shouldn't have drastic consequences. Fixes #18977. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#18979	2024-05-31 09:58:49 +03:00
Botond Dénes	2d79b0106c	Merge 'storage_service: Fix race between tablet split and stats retrieval' from Raphael "Raph" Carvalho Retrieval of tablet stats must be serialized with mutation to token metadata, as the former requires tablet id stability. If tablet split is finalized while retrieving stats, the saved erm, used by all shards, can have a lower tablet count than the one in a particular shard, causing an abort as tablet map requires that any id fed into it is lower than its current tablet count. Fixes #18085. Closes scylladb/scylladb#18287 * github.com:scylladb/scylladb: test: Fix flakiness in topology_experimental_raft/test_tablets service: Use tablet read selector to determine which replica to account table stats storage_service: Fix race between tablet split and stats retrieval	2024-05-27 16:32:54 +03:00
Pavel Emelyanov	31edab277a	database: Don't export statement scheduling group Now all the code gets this group from elsewhere and the method can be removed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-05-24 18:00:01 +03:00
Raphael S. Carvalho	abcc68dbe7	storage_service: Fix race between tablet split and stats retrieval If tablet split is finalized while retrieving stats, the saved erm, used by all shards, will be invalidated. It can either cause incorrect behavior or crash if id is not available. It's worked by feeding local tablet map into the "coordinator" collecting stats from all shards. We will also no longer have a snapshot of erm shared between shards to help intra-node migration. This is simplified by serializing token metadata changes and the retrieval of the stats (latter should complete pretty fast, so it shouldn't block the former for any significant time). Fixes #18085. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-05-22 09:25:29 -03:00
Piotr Dulikowski	9820472277	main: introduce schema commitlog scheduling group Currently, we do not explicitly set a scheduling group for the schema commitlog which causes it to run in the default scheduling group (called "main"). However: - It is important and significant enough that it should run in a scheduling group that is separate from the main one, - It should not run in the existing "commitlog" group as user writes may sometimes need to wait for schema commitlog writes (e.g. read barrier done to learn the schema necessary to interpret the user write) and we want to avoid priority inversion issues. Therefore, introduce a new scheduling group dedicated to the schema commitlog. Fixes: scylladb/scylladb#15566 Closes scylladb/scylladb#18715	2024-05-21 11:29:57 +02:00
Botond Dénes	f239339a29	Merge 'Improve modularity of some per-table API endpoints' from Pavel Emelyanov There's a set of API endpoints that toggle per-table auto-compaction and tombstone-gc booleans. They all live in two different .cc files under api/ directory and duplicate code of each other. This PR generalizes those handlers, places them next to each other, fixes leak on stop and, as a nice side effect, enlightens database.hh header. Closes scylladb/scylladb#18703 * github.com:scylladb/scylladb: api,database: Move auto-compaction toggle guard api: Move some table manipulation helpers from storage_service api: Move table-related calls from storage_service domain api: Reimplement some endpoints using existing helpers api: Lost unset of tombstone-gc endpoints	2024-05-20 18:01:54 +03:00
Avi Kivity	52fe351c31	Merge 'Balance tablets within nodes (intra-node migration)' from Tomasz Grabiec This is needed to avoid severe imbalance between shards which can happen when some table grows and is split. The inter-node balance can be equal, so inter-node migration cannot fix the imbalance. Also, if RF=N then there is not even a possibility of moving tablets around to fix the imbalance. The only way to bring the system to balance is to move tablets within the nodes. The system is not prepared for intra-node migration currently. Request coordination is host-based, while for intra-node migration it should be (also) shard-based. The solution employed here is to keep the coordination between nodes as-is, and for intra-node migration storage_proxy-level coordinator is not aware of the migration (no pending host). The replica-side request handler will be a second-level coordinator which routes requests to shards, similar to how the first-level coordinator routes them to hosts. Tablet sharder is adjusted to handle intra-migration where a tablet can have two replicas on the same host. For reads, sharder uses the read selector to resolve the conflict. For writes, the write selector is used. The old shard_of() API is kept to represent shard for reads, and new method is introduced to query the shards for writing: shard_for_writes(). All writers should be switched to that API, which is not done in this patch yet. The request handler on replica side acts as a second-level coordinator, using sharder to determine routing to shards. A given sharder has a scope of a single topology version, a single effective_replication_map_ptr, which should be kept alive during writes. 
perf-simple-query test results show no signs of regression: Command: perf-simple-query -c1 -m1G --write --tablets --duration=10 Before: > 83294.81 tps ( 59.5 allocs/op, 14.3 tasks/op, 53725 insns/op, 0 errors) > 87756.72 tps ( 59.5 allocs/op, 14.3 tasks/op, 54049 insns/op, 0 errors) > 86428.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54208 insns/op, 0 errors) > 86211.38 tps ( 59.7 allocs/op, 14.3 tasks/op, 54219 insns/op, 0 errors) > 86559.89 tps ( 59.6 allocs/op, 14.3 tasks/op, 54188 insns/op, 0 errors) > 86609.39 tps ( 59.6 allocs/op, 14.3 tasks/op, 54117 insns/op, 0 errors) > 87464.06 tps ( 59.5 allocs/op, 14.3 tasks/op, 54039 insns/op, 0 errors) > 86185.43 tps ( 59.6 allocs/op, 14.3 tasks/op, 54169 insns/op, 0 errors) > 86254.71 tps ( 59.6 allocs/op, 14.3 tasks/op, 54139 insns/op, 0 errors) > 83395.35 tps ( 60.2 allocs/op, 14.4 tasks/op, 54693 insns/op, 0 errors) > > median 86428.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54208 insns/op, 0 errors) > median absolute deviation: 243.04 > maximum: 87756.72 > minimum: 83294.81 > After: > 85523.06 tps ( 59.5 allocs/op, 14.3 tasks/op, 53872 insns/op, 0 errors) > 89362.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54226 insns/op, 0 errors) > 88167.55 tps ( 59.7 allocs/op, 14.3 tasks/op, 54400 insns/op, 0 errors) > 87044.40 tps ( 59.7 allocs/op, 14.3 tasks/op, 54310 insns/op, 0 errors) > 88344.50 tps ( 59.6 allocs/op, 14.3 tasks/op, 54289 insns/op, 0 errors) > 88355.06 tps ( 59.6 allocs/op, 14.3 tasks/op, 54242 insns/op, 0 errors) > 88725.46 tps ( 59.6 allocs/op, 14.3 tasks/op, 54230 insns/op, 0 errors) > 88640.08 tps ( 59.6 allocs/op, 14.3 tasks/op, 54210 insns/op, 0 errors) > 90306.31 tps ( 59.4 allocs/op, 14.3 tasks/op, 54043 insns/op, 0 errors) > 87343.62 tps ( 59.8 allocs/op, 14.3 tasks/op, 54496 insns/op, 0 errors) > > median 88355.06 tps ( 59.6 allocs/op, 14.3 tasks/op, 54242 insns/op, 0 errors) > median absolute deviation: 1007.41 > maximum: 90306.31 > minimum: 85523.06 Command (reads): perf-simple-query -c1 -m1G --tablets 
--duration=10 Before: > 95860.18 tps ( 63.1 allocs/op, 14.1 tasks/op, 42476 insns/op, 0 errors) > 97537.69 tps ( 63.1 allocs/op, 14.1 tasks/op, 42454 insns/op, 0 errors) > 97549.23 tps ( 63.1 allocs/op, 14.1 tasks/op, 42470 insns/op, 0 errors) > 97511.29 tps ( 63.1 allocs/op, 14.1 tasks/op, 42470 insns/op, 0 errors) > 97227.32 tps ( 63.1 allocs/op, 14.1 tasks/op, 42471 insns/op, 0 errors) > 94031.94 tps ( 63.1 allocs/op, 14.1 tasks/op, 42441 insns/op, 0 errors) > 96978.04 tps ( 63.1 allocs/op, 14.1 tasks/op, 42462 insns/op, 0 errors) > 96401.70 tps ( 63.1 allocs/op, 14.1 tasks/op, 42473 insns/op, 0 errors) > 96573.77 tps ( 63.1 allocs/op, 14.1 tasks/op, 42440 insns/op, 0 errors) > 96340.54 tps ( 63.1 allocs/op, 14.1 tasks/op, 42468 insns/op, 0 errors) > > median 96978.04 tps ( 63.1 allocs/op, 14.1 tasks/op, 42462 insns/op, 0 errors) > median absolute deviation: 571.20 > maximum: 97549.23 > minimum: 94031.94 > After: > 99794.67 tps ( 63.1 allocs/op, 14.1 tasks/op, 42471 insns/op, 0 errors) > 101244.99 tps ( 63.1 allocs/op, 14.1 tasks/op, 42472 insns/op, 0 errors) > 101128.37 tps ( 63.1 allocs/op, 14.1 tasks/op, 42485 insns/op, 0 errors) > 101065.27 tps ( 63.1 allocs/op, 14.1 tasks/op, 42465 insns/op, 0 errors) > 101212.98 tps ( 63.1 allocs/op, 14.1 tasks/op, 42456 insns/op, 0 errors) > 101413.31 tps ( 63.1 allocs/op, 14.1 tasks/op, 42463 insns/op, 0 errors) > 101464.92 tps ( 63.1 allocs/op, 14.1 tasks/op, 42466 insns/op, 0 errors) > 101086.74 tps ( 63.1 allocs/op, 14.1 tasks/op, 42488 insns/op, 0 errors) > 101559.09 tps ( 63.1 allocs/op, 14.1 tasks/op, 42468 insns/op, 0 errors) > 100742.58 tps ( 63.1 allocs/op, 14.1 tasks/op, 42491 insns/op, 0 errors) > > median 101212.98 tps ( 63.1 allocs/op, 14.1 tasks/op, 42456 insns/op, 0 errors) > median absolute deviation: 200.33 > maximum: 101559.09 > minimum: 99794.67 > Fixes #16594 Closes scylladb/scylladb#18026 * github.com:scylladb/scylladb: Implement fast streaming for intra-node migration test: tablets_test: Test 
sharding during intra-node migration test: tablets_test: Check sharding also on the pending host test: py: tablets: Test writes concurrent with migration test: py: tablets: Test crash during intra-node migration api, storage_service: Introduce API to wait for topology to quiesce dht, replica: Remove deprecated sharder APIs test: Avoid using deprecated sharded API db: do_apply_many() avoid deprecated sharded API replica: mutation_dump: Avoid deprecated sharder API repair: Avoid deprecated sharder API table: Remove optimization which returns empty reader when key is not owned by the shard dht: is_single_shard: Avoid deprecated sharder API dht: split_range_to_single_shard: Work with static_sharder only dht: ring_position_range_sharder: Avoid deprecated sharder APIs dht: token: Avoid use of deprecated sharder API by switching to static_sharder selective_token_sharder: Avoid use of deprecated sharder API docs: Document tablet sharding vs tablet replica placement readers/multishard.cc: use shard_for_reads() instead of shard_of() multishard_mutation_query.cc: use shard_for_reads() instead of shard_of() storage_proxy: Extract common code to apply mutations on many shards according to sharder storage_proxy: Prepare per-partition rate-limiting for intra-node migration storage_proxy: Avoid shard_of() use in mutate_counter_on_leader_and_replicate() storage_proxy: Prepare mutate_hint() for intra-node tablet migration commitlog_replayer: Avoid deprecated sharder::shard_of() lwt: Avoid deprecated sharder::shard_of() compaction: Avoid deprecated sharder::shard_of() dht: Extract dht::static_sharder replica: Deprecate table::shard_of() locator: Deprecate effective_replication_map::shard_of() dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard tests: tablets: py: Add intra-node migration test tests: tablets: Test that drained nodes are not balanced internally tests: tablets: Add checks of replica set validity to test_load_balancing_with_random_load tests: 
tablets: Verify that disabling balancing results in no intra-node migrations tests: tablets: Check that nodes are internally balanced tests: tablets: Improve debuggability by showing which rows are missing tablets, storage_service: Support intra-node migration in move_tablet() API tablet_allocator: Generate intra-node migration plan tablet_allocator: Extract make_internode_plan() tablet_allocator: Maintain candidate list and shard tablet count for target nodes tablet_allocator: Lift apply_load/can_accept_load lambdas to member functions tablets, streaming: Implement tablet streaming for intra-node migration dht, auto_refreshing_sharder: Allow overriding write selector multishard_writer: Handle intra-node migration storage_proxy: Handle intra-node tablet migration for writes tablets: Get rid of tablet_map::get_shard() tablets: Avoid tablet_map::get_shard in cleanup tablets: test: Use sharder instead of tablet_map::get_shard() tablets: tablet_sharder: Allow working with non-local host sharding: Prepare for intra-node-migration docs: Document sharder use for tablets tablets: Introduce tablet transition kind for intra-node migration tests: tablets: Fix use-after-move of skiplist in rebalance_tablets() sstables, gdb: Track readers in a linked list raft topology: Fix global token metadata barrier to not fence ahead of what is drained	2024-05-20 16:13:01 +03:00
Pavel Emelyanov	31d05925cc	api,database: Move auto-compaction toggle guard Toggling per-table auto-compaction enabling bit is guarded with on-database boolean and raii guard. It's only used by a single api/column_family.cc file, so it can live there. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-05-16 14:42:51 +03:00
Piotr Dulikowski	68eca3778c	Merge 'mv: throttle view update generation for large queries' from Wojciech Mitros This series is a reupload of #13792 with a few modifications, namely a test is added and the conflicts with recent tablet related changes are fixed. See https://github.com/scylladb/scylladb/issues/12379 and https://github.com/scylladb/scylladb/pull/13583 for a detailed description of the problem and discussions. This PR aims to extend the existing throttling mechanism to work with requests that internally generate a large amount of view updates, as suggested by @nyh. The existing mechanism works in the following way: * Client sends a request, we generate the view updates corresponding to the request and spawn background tasks which will send these updates to remote nodes * Each background task consumes some units from the `view_update_concurrency_semaphore`, but doesn't wait for these units, it's just for tracking * We keep track of the percent of consumed units on each node, this is called `view update backlog`. * Before sending a response to the client we sleep for a short amount of time. The amount of time to sleep for is based on the fullness of this `view update backlog`. For a well behaved client with limited concurrency this will limit the amount of incoming requests to a manageable level. This mechanism doesn't handle large DELETE queries. Deleting a partition is fast for the base table, but it requires us to generate a view update for every single deleted row. The number of deleted rows per single client request can be in the millions. Delaying response to the request doesn't help when a single request can generate millions of updates. To deal with this we could treat the view update generator just like any other client and force it to wait a bit of time before sending the next batch of updates. The amount of time to wait for is calculated just like in the existing throttling code, it's based on the fullness of `view update backlogs`. 
The new algorithm of view update generation looks something like this:
```c++
for(;;) {
    auto updates = generate_updates_batch_with_max_100_rows();
    co_await seastar::sleep(calculate_sleep_time_from_backlogs());
    spawn_background_tasks_for_updates(updates);
}
```
Fixes: https://github.com/scylladb/scylladb/issues/12379 Closes scylladb/scylladb#16819 * github.com:scylladb/scylladb: test: add test for bad_allocs during large mv queries mv: throttle view update generation for large queries exceptions: add read_write_timeout_exception, a subclass of request_timeout_exception db/view: extract view throttling delay calculation to a global function view_update_generator: add get_storage_proxy() storage_proxy: make view backlog getters public	2024-05-16 08:22:54 +02:00
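The merge above says the sleep time is derived from the fullness of the view update backlog, but does not give the formula. The sketch below shows one way such a delay could be computed. Everything in it is an assumption for illustration: `backlog_model`, `fullness` and `throttle_delay` are hypothetical names, and the cubic curve is chosen only to show a delay that stays small at low fullness and grows steeply near capacity.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>

// Hypothetical view of a backlog: units consumed out of a fixed capacity.
struct backlog_model {
    std::uint64_t used;
    std::uint64_t capacity;
};

// Fraction of the capacity in use, clamped to [0, 1].
// A zero-capacity backlog is treated as completely full.
double fullness(backlog_model b) {
    if (b.capacity == 0) {
        return 1.0;
    }
    return std::min(1.0, static_cast<double>(b.used) / static_cast<double>(b.capacity));
}

// Illustrative delay curve (an assumption of this sketch): scale max_delay by
// the cube of the fullness, so a nearly empty backlog adds almost no delay.
std::chrono::microseconds throttle_delay(backlog_model b,
                                         std::chrono::microseconds max_delay) {
    const double f = fullness(b);
    return std::chrono::microseconds(
        static_cast<std::int64_t>(max_delay.count() * f * f * f));
}
```

In the loop quoted in the commit message, a function of this shape would play the role of `calculate_sleep_time_from_backlogs()`, called once per generated batch.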
Raphael S. Carvalho	715ae689c0	Implement fast streaming for intra-node migration With intra-node migration, all the movement is local, so we can make streaming faster by just cloning the sstable set of leaving replica and loading it into the pending one. This cloning is underlying storage specific, but s3 doesn't support snapshot() yet (the sstables::storage procedure which clone is built upon). It's only supported by file system, with help of hard links. A new generation is picked for new cloned sstable, and it will live in the same directory as the original. A challenge I bumped into was to understand why table refused to load the sstable at pending replica, as it considered them foreign. Later I realized that sharder (for reads) at this stage of migration will point only to leaving replica. It didn't fail with mutation based streaming, because the sstable writer considers the shard -- that the sstable was written into -- as its owner, regardless of what sharder says. That was fixed by mimicking this behavior during loading at pending. test: ./test.py --mode=dev intranode --repeat=100 passes. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	679baff25a	dht, replica: Remove deprecated sharder APIs	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	dbca598e99	replica: Deprecate table::shard_of()	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	10a4903d0c	dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard Require users to specify whether we want shard for reads or for writes by switching to appropriate non-deprecated variant. For example, shard_of() can be replaced with shard_for_reads() or shard_for_writes(). The next_shard/token_for_next_shard APIs have only for-reads variant, and the act of switching will be a testimony to the fact that the code is valid for intra-node migration.	2024-05-16 00:28:47 +02:00
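The split between `shard_for_reads()` and `shard_for_writes()` in the commit above can be modelled for a tablet that has two replicas on the same host during intra-node migration. This is a simplified sketch, not the dht sharder interface: only the two function names come from the commit messages, while `tablet_shards`, the selector enums and the vector return type are assumptions introduced here.

```cpp
#include <optional>
#include <vector>

// Hypothetical selectors: which replica serves traffic at a migration stage.
enum class read_selector { leaving, pending };
enum class write_selector { leaving_only, both, pending_only };

// Hypothetical per-tablet view: the shard of the leaving replica, plus the
// shard of the pending replica while an intra-node migration is in progress.
struct tablet_shards {
    unsigned leaving;
    std::optional<unsigned> pending;
    read_selector reads = read_selector::leaving;
    write_selector writes = write_selector::leaving_only;
};

// Reads always resolve to exactly one shard.
unsigned shard_for_reads(const tablet_shards& t) {
    if (t.pending && t.reads == read_selector::pending) {
        return *t.pending;
    }
    return t.leaving;
}

// Writes may have to reach both shards while the migration is in flight.
std::vector<unsigned> shard_for_writes(const tablet_shards& t) {
    if (!t.pending || t.writes == write_selector::leaving_only) {
        return {t.leaving};
    }
    if (t.writes == write_selector::both) {
        return {t.leaving, *t.pending};
    }
    return {*t.pending};
}
```

The asymmetry is the reason the old `shard_of()` could not simply be kept: a caller has to state whether it reads or writes, because only the write path can legitimately target two shards.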
Pavel Emelyanov	59aec1f300	database: Don't break namespace with external alias The namespace replica is broken in the middle with the sstable_list alias, while the latter can be declared earlier Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#18664	2024-05-14 16:45:20 +03:00
Wojciech Mitros	485eb7a64c	test: add test for bad_allocs during large mv queries This patch adds a test for reproducing issue #12379, which is being fixed in #16819. The test case works by creating a table with a materialized view, and then performing a partition delete query on it. At the same time, it uses injections to limit the memory to a level lower than usual, in order to increase the consistency of the test, and to limit its runtime. Before #16819, the test would exceed the limit and fail, and now the next allocation is throttled using a sleep.	2024-05-13 18:16:39 +02:00
Botond Dénes	afa870a387	Merge 'Some sstable set related improvements' from Raphael "Raph" Carvalho Closes scylladb/scylladb#18616 * github.com:scylladb/scylladb: replica: Make it explicit table's sstable set is immutable replica: avoid reallocations in tablet_sstable_set replica: Avoid compound set if only one sstable set is filled	2024-05-13 14:17:24 +03:00
Raphael S. Carvalho	7faba69f28	replica: Make it explicit table's sstable set is immutable Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-05-10 11:58:08 -03:00
Aleksandra Martyniuk	b4371a0ea0	replica: allocate storage groups dynamically Currently empty storage_groups are allocated for tablets that are not on this shard. Allocate storage groups dynamically, i.e.: - on table creation allocate only storage groups that are on this shard; - allocate a storage group for tablet that is moved to this shard; - deallocate storage group for tablet that is cleaned up. Stop compaction group before it's deallocated. Add a flag to table::cleanup_tablet deciding whether to deallocate sgs and use it in commitlog tests.	2024-05-10 15:08:21 +02:00
Aleksandra Martyniuk	c283746b32	replica: add rwlock to storage_group_manager Add rwlock which prevents storage groups from being added/deleted while some other layer iterates over them (or their compaction groups). Add methods to iterate over storage groups with the lock held.	2024-05-10 14:56:38 +02:00
Aleksandra Martyniuk	532653f118	replica: replace table::as_table_state Replace table::as_table_state with table::try_get_table_state_with_static_sharding which throws if a table does not use static sharding.	2024-05-10 14:56:38 +02:00
Aleksandra Martyniuk	cf9913b0b7	compaction: pass compaction group id to reshape_compaction_group Pass compaction group id to shard_reshaping_compaction_task_impl::reshape_compaction_group. Modify table::as_table_state to return table_state of the given compaction group.	2024-05-10 14:56:38 +02:00
Aleksandra Martyniuk	8505389963	replica: drop single_compaction_group_if_available Drop single_compaction_group_if_available as it's unused.	2024-05-10 14:56:38 +02:00
Botond Dénes	a062e3f650	replica/database: introduce clear_inactive_reads_for_tablet() To be used on the tablet cleanup path, to clear any inactive read which might be related to the cleaned-up tablet.	2024-04-30 01:44:03 -04:00
Botond Dénes	338af5055c	replica/database: introduce foreach_reader_concurrency_semaphore Currently we have a single method -- detach_column_family() -- which does something with each semaphore. Soon there will be another one. Introduce a method to do something with all semaphores, to make this smoother. Enterprise has a different set of semaphores, and this will reduce friction.	2024-04-30 01:43:56 -04:00
Botond Dénes	044fd7a3ec	Merge 'Move some view updating methods from table to view_update_generator' from Pavel Emelyanov The populate_views() and generate_and_propagate_view_updates() both naturally belong to view_update_generator -- they don't need anything special from table itself, but rather depend on some internals of the v.u.generator itself. Moving them there lets removing the view concurrency semaphore from keyspace and table, thus reducing the cross-components dependencies. Closes scylladb/scylladb#18421 * github.com:scylladb/scylladb: replica: Do not carry view concurrency semaphore pointer around view: Get concurrency semaphore via database, not table view_update_generator: Mark mutate_MV() private view: Move view_update_generator methods' code view: Move table::generate_and_propagate_view_updates into view code view: Move table::populate_views() into view_update_generator class	2024-04-26 10:55:38 +03:00
Pavel Emelyanov	18cc2cfa31	replica: Generalize snapshot details for single table/snapshot dir There are two places that get total:live stats for a table snapshot -- database::get_snapshot_details() and table::get_snapshot_details(). Both do a pretty similar thing -- walk the table/snapshots/ directory, then each found sub-directory, and accumulate the found files' sizes into the snapshot details structure. Both try to tell total from live sizes by checking whether an sstable component found in snapshots is present in the table datadir. The database code does it in a more correct way -- it not only checks the file's presence, but also checks whether it is a hardlink to the snapshot file, while the table code just checks if a file of the same name exists. This patch does both -- makes both database and table call the same helper method for a single snapshot's details, and makes the generalized version use the more elaborate collision check, thus fixing the per-table details getting behavior. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#18347	2024-04-25 17:12:42 +03:00
Pavel Emelyanov	8aaa09ee97	replica: Do not carry view concurrency semaphore pointer around Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-25 14:27:43 +03:00
Pavel Emelyanov	2ee7c41139	view: Get concurrency semaphore via database, not table The _view_update_concurrency_sem field on database propagates itself via keyspace config down to table config, and view_update_generator then grabs one via a table:: helper. That's overkill: view_update_generator has a direct reference to the database and can get this semaphore from there. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-25 14:25:57 +03:00
Pavel Emelyanov	c2bf6b43b2	view: Move table::generate_and_propagate_view_updates into view code Similarly to populate_views() method, this one also naturally belongs to view_update_generator class. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-25 14:20:06 +03:00
Pavel Emelyanov	670c7c925c	view: Move table::populate_views() into view_update_generator class The method in question has little to do with table; effectively it only needs stats and the concurrency semaphore. And the semaphore in question is obtained from table indirectly, it really resides on database. On the other hand, the method carries lots of bits from db::view, e.g. the view_update_builder class, the memory_usage_of() helper and a bit more. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-25 14:17:20 +03:00
Botond Dénes	572003c469	Merge 'Cleanup the way snapshot details are propagated via API' from Pavel Emelyanov There's a database::get_snapshot_details() method that returns collection of all snapshots for all ks.cf out there and there are several snapshot_details aux structures around it. This PR keeps only one "details" and cleans up the way it propagates from database up to the respective API calls. Closes scylladb/scylladb#18317 * github.com:scylladb/scylladb: snapshot_ctl: Brush up true_snapshots_size() internals snapshot_ctl: Remove unused details struct snapshot_ctl: No double recoding of details database,snapshots: Move database::snapshot_details into snapshot_ctl database,snapshots: Make database::get_snapshot_details() return map, not vector table,snapshots: Move table::snapshot_details into snapshot_ctl	2024-04-23 16:28:25 +03:00
Pavel Emelyanov	8ec3f057a8	database,snapshots: Move database::snapshot_details into snapshot_ctl Similarly to how it looks like for table::snapshot_details Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-19 20:04:29 +03:00
Pavel Emelyanov	f6bc283bbb	database,snapshots: Make database::get_snapshot_details() return map, not vector So that it's in-sync with table::get_snapshot_details(). Next patches will improve this place even further. Also, there can be many snapshots and vector can grow large, but that's less of an issue here. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-19 20:04:25 +03:00
Pavel Emelyanov	a36c13beb3	table,snapshots: Move table::snapshot_details into snapshot_ctl Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-19 19:59:34 +03:00
Pavel Emelyanov	ba58b71eea	database: Keep local directory_semaphore to initialize sstables managers Now database is constructed with sharded<directory_semaphore>, but it no longer needs sharded, local is enough. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-04-19 13:53:57 +03:00


400 Commits