scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-21 00:50:35 +00:00

Author	SHA1	Message	Date
Botond Dénes	ea40567bbc	Merge 'Some cleanups for replica table' from Raphael "Raph" Carvalho backport not needed, these are just cleanups. Closes scylladb/scylladb#19260 * github.com:scylladb/scylladb: replica: simplify perform_cleanup_compaction() replica: return storage_group by reference on storage_group_for*() replica: devirtualize storage_group_of()	2024-06-14 08:14:58 +03:00
Raphael S. Carvalho	f143f5b90d	replica: remove linear search when picking memtable_list for range scan with tablets with tablets, we're expected to have a worst of ~100 tablets in a given table and shard, so let's avoid linear search when looking for the memtable_list in a range scan. we're bounded by ~100 elements, so shouldn't be a big problem, but it's an inefficiency we can easily get rid of. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#19286	2024-06-14 08:00:17 +03:00
Raphael S. Carvalho	f3a1f5df83	replica: simplify perform_cleanup_compaction() Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-06-12 12:44:21 -03:00
Raphael S. Carvalho	6214dda506	replica: return storage_group by reference on storage_group_for*() those functions cannot return nullptr, will throw when group is not found, so better return ref instead. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-06-12 11:53:06 -03:00
Raphael S. Carvalho	9c1d3bcc02	replica: devirtualize storage_group_of() can be made private to tablet_storage_group_manager. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-06-12 11:29:49 -03:00
Avi Kivity	6608f49718	Merge 'make enable_compacting_data_for_streaming_and_repair truly live-update' from Botond Dénes This config item is propagated to the table object via table::config. Although the field in `table::config`, used to propagate the value, was `utils::updateable_value<T>`, it was assigned a constant and so the live-update chain was broken. This series fixes this and adds a test which fails before the patch and passes after. The test needed new test infrastructure, around the failure injection api, namely the ability to exfiltrate the value of internal variable. This infrastructure is also added in this series. Fixes: https://github.com/scylladb/scylladb/issues/18674 - [x] This patch has to be backported because it fixes broken functionality Closes scylladb/scylladb#18705 * github.com:scylladb/scylladb: test/topology_custom: add test for enable_compacting_data_for_streaming_and_repair live-update test/pylib: rest_client: add get_injection() api/error_injection: add getter for error_injection utils/error_injection: add set_parameter() replica/database: fix live-update enable_compacting_data_for_streaming_and_repair	2024-06-11 15:53:19 +03:00
Botond Dénes	8ef4fbdb87	test/topology_custom: add test for enable_compacting_data_for_streaming_and_repair live-update Prevent the live-update feature of this config item from breaking silently.	2024-06-11 04:17:48 -04:00
Botond Dénes	dbccb61636	replica/database: fix live-update enable_compacting_data_for_streaming_and_repair This config item is propagated to the table object via table::config. Although the field in table::config, used to propagate the value, was utils::updateable_value<T>, it was assigned a constant and so the live-update chain was broken. This patch fixes this.	2024-06-11 01:15:20 -04:00
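The broken live-update chain described in the commit above can be illustrated with a minimal sketch. The `live_value` class here is a hypothetical stand-in written for this example; it is not `utils::updateable_value<T>` and makes no claim about how that class is implemented. It only shows the general failure mode: a field whose type supports live updates still goes stale if it is initialized from a constant snapshot instead of being linked to the source value.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for a live-updateable config value.
// Copies share the same underlying cell, so they observe later updates.
template <typename T>
class live_value {
    std::shared_ptr<T> _cell;
public:
    explicit live_value(T v) : _cell(std::make_shared<T>(std::move(v))) {}
    const T& get() const { return *_cell; }
    void set(T v) { *_cell = std::move(v); }
};

// Hypothetical per-table config carrying the propagated option.
struct table_config {
    live_value<bool> enable_compacting;
};

// Broken propagation: reads the current value and wraps the constant in a
// fresh cell, so later updates to the source never reach the table.
table_config make_config_snapshot(const live_value<bool>& src) {
    return table_config{live_value<bool>(src.get())};
}

// Working propagation: copies the handle, so source and table share one cell.
table_config make_config_linked(const live_value<bool>& src) {
    return table_config{src};
}
```

With this sketch, flipping the source after both configs are built changes only the linked config; the snapshot keeps the value it was created with.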
Raphael S. Carvalho	7b41630299	replica: Refresh mutation source when allocating tablet replicas Consider the following: 1) table A has N tablets and views 2) migration starts for a tablet of A from node 1 to 2. 3) migration is at write_both_read_old stage 4) coordinator will push writes to both nodes (pending and leaving) 5) A has view, so writes to it will also result in reads (table::push_view_replica_updates()) 6) tablet's update_effective_replication_map() is not refreshing tablet sstable set (for new tablet migrating in) 7) so read on step 5 is not being able to find sstable set for tablet migrating in Causes the following error: "tablets - SSTable set wasn't found for tablet 21 of table mview.users" which means loss of write on pending replica. The fix will refresh the table's sstable set (tablet_sstable_set) and cache's snapshot. It's not a problem to refresh the cache snapshot as long as the logical state of the data hasn't changed, which is true when allocating new tablet replicas. That's also done in the context of compactions for example. Fixes #19052. Fixes #19033. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#19099	2024-06-11 06:59:04 +03:00
Pavel Emelyanov	882b2f4e9f	cql3, schema_tables: Generalize function creation When a function is created with the CREATE FUNCTION statement, the statement handler does all the necessary preparations on its own. The very same code exists in schema_tables, when the function is loaded on boot. This patch generalizes both and keeps function language-specific context creation inside lang/ code. The creation function returns context via argument reference. It would have been nicer if it was returned via future<>, but it's not suitable for future<T> type :( Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-06-07 13:07:05 +03:00
Pavel Emelyanov	f950469af5	lang: Move manager to lang namespace And, while at it, rename the local variable referring to it to "manager" instead of "wasm". Query processor and database also have getters named "wasm()", these are not renamed yet to keep the patch smaller (and those getters are going to be reworked further anyway). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-06-07 12:35:57 +03:00
Pavel Emelyanov	4606302ead	distributed_loader: Remove base_path from populator It's unused, populator uses it to print debugging messages, but it can as well use table->dir() for it, just as sstable_directory does. One message looks useless and is removed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#19113	2024-06-06 08:49:41 +03:00
Pavel Emelyanov	ad0e6b79fc	replica: Remove all_datadir from keyspace config This vector of paths is only used to generate the same vector of paths for table config, but the latter already has all the needed info. It's the part of the plan to stop using paths/directories in keyspaces and tables, because with storage-options tables no longer keep their data in "files on disk", so this information goes to sstables storage manager (refs #12707) Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#19119	2024-06-06 08:30:34 +03:00
Raphael S. Carvalho	b396b05e20	replica: Fix race of tablet snapshot with compaction tablet snapshot, used by migration, can race with compaction and can find files deleted. That won't cause data loss because the error is propagated back into the coordinator that decides to retry streaming stage. So the consequence is delayed migration, which might in turn reduce node operation throughput (e.g. when decommissioning a node). It should be rare though, so shouldn't have drastic consequences. Fixes #18977. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#18979	2024-05-31 09:58:49 +03:00
Botond Dénes	0ead3570b4	Merge 'Run sstables loader in scheduling group' from Pavel Emelyanov Currently the loader is called via API, which inherits the maintenance scheduling group from API http server. The loader then can either do load_and_stream() or call (legacy) distributed_loader::upload_new_sstables(). The latter first switches into streaming scheduling group, but the former doesn't and continues running in the maintenance one. All this is not really a problem, because streaming sched group and maintenance sched group is one group under two different variable names. However, it's messy and worth delegating the sched group switch (even if it's a no-op) to the sstables-loader. As a nice side effect, this patch removes one place that uses database as proxy object to get configuration parameters. Closes scylladb/scylladb#18928 * github.com:scylladb/scylladb: sstables-loader: Run loading in its scheduling group sstables-loader: Add scheduling group to constructor	2024-05-30 18:03:51 +03:00
Raphael S. Carvalho	578a6c1e07	replica: Only consume memtable of the tablet intersecting with range read storage_proxy is responsible for intersecting the range of the read with tablets, and calling replica with a single tablet range, therefore it makes sense to avoid touching memtables of tablets that don't intersect with a particular range. Note this is a performance issue, not correctness one, as memtable readers that don't intersect with current range won't produce any data, but cpu is wasted until that's realized (they're added to list of readers in mutation_reader_merger, more allocations, more data sources to peek into, etc). That's also important for streaming e.g. after decommission, that will consume one tablet at a time through a reader, so we don't want memtables of streamed tablets (that weren't cleaned up yet) to be consumed. Refs #18904. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#18907	2024-05-29 15:58:33 +03:00
Pavel Emelyanov	ae622d711e	sstables-loader: Run loading in its scheduling group Now the loading code has two different paths, and only one of them switches sched group. It's cleaner and more natural to switch the sched group in the loader itself, so that all code paths run in it and don't have to care about switching. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-05-28 11:07:58 +03:00
Botond Dénes	2d79b0106c	Merge 'storage_service: Fix race between tablet split and stats retrieval' from Raphael "Raph" Carvalho Retrieval of tablet stats must be serialized with mutation to token metadata, as the former requires tablet id stability. If tablet split is finalized while retrieving stats, the saved erm, used by all shards, can have a lower tablet count than the one in a particular shard, causing an abort as tablet map requires that any id fed into it is lower than its current tablet count. Fixes #18085. Closes scylladb/scylladb#18287 * github.com:scylladb/scylladb: test: Fix flakiness in topology_experimental_raft/test_tablets service: Use tablet read selector to determine which replica to account table stats storage_service: Fix race between tablet split and stats retrieval	2024-05-27 16:32:54 +03:00
Botond Dénes	47dbf23773	Merge 'Rework view services and system-distributed-keyspace dependencies' from Pavel Emelyanov The system-distributed-keyspace and view-update-generator often go in pair, because streaming, repair and sstables-loader (via distributed-loader) need them both to check if sstable is staging and register it if it's such. The check is performed by messing directly with system_distributed.view_build_status table, and the registration happens via view-update-generator. That's not nice, other services shouldn't know that view status is kept in system table. Also view-update-generator is a service to generate and push view updates, the fact that it keeps staging sstables list is an implementation detail. This PR replaces dependencies on the mentioned pair of services with the single dependency on view-builder (repair, sstables-loader and stream-manager are enlightened) and hides the view building-vs-staging details inside the view_builder. Along the way, some simplification of repair_writer_impl class is done. 
Closes scylladb/scylladb#18706 * github.com:scylladb/scylladb: stream_manager: Remove system_distributed_keyspace and view_update_generator repair: Remove system_distributed_keyspace and view_update_generator streaming: Remove system_distributed_keyspace and view_update_generator sstables_loader: Remove system_distributed_keyspace and view_update_generator distributed_loader: Remove system_distributed_keyspace and view_update_generator view: Make register_staging_sstable() a method of view_builder view: Make check_view_build_ongoing() helper a method of view_builder streaming: Proparage view_builder& down to make_streaming_consumer() repair: Keep view_builder& on repair_writer_impl distributed_loader: Propagate view_builder& via process_upload_dir() stream_manager: Add view builder dependency repair_service: Add view builder dependency sstables_loader: Add view_bulder dependency main: Start sstables loader later repair: Remove unwanted local references from repair_meta	2024-05-27 10:51:11 +03:00
Botond Dénes	e0f4d79f3b	Merge 'Do not export statement scheduling group from database' from Pavel Emelyanov Database used to be (and still is in many ways) an object used to get configuration from. Part of the configuration is the set of pre-configured scheduling groups. That's not nice, services should use each other for some real need, not as proxies to configuration. This patch patches the places that explicitly switch to statement group _not_ to use database to get the group itself. fixes: #17643 Closes scylladb/scylladb#18799 * github.com:scylladb/scylladb: database: Don't export statement scheduling group test: Use async attrs and cql-test-env scheduling groups test: Use get_scheduling_groups() to get scheduling groups api: Don't switch sched group to start/stop protocol servers main: Don't switch sched group to start protocol servers code: Switch to sched group in request_stop_server() code: Switch to server sched group in start() protocol_server: Keep scheduling group on board code: Add scheduling group to controllers redis: Coroutinize start() method	2024-05-27 10:48:33 +03:00
Marcin Maliszkiewicz	2ab143fb40	db: auth: move auth tables to system keyspace Separate keyspace which also behaves as system brings little benefit while creating some compatibility problems like schema digest mismatch during rollback. So we decided to move auth tables into system keyspace. Fixes https://github.com/scylladb/scylladb/issues/18098 Closes scylladb/scylladb#18769	2024-05-26 22:30:42 +03:00
Kefu Chai	4e9596a5a9	treewide: replace std::result_of_t with std::invoke_result_t in theory, std::result_of_t should have been removed in C++20. and std::invoke_result_t is available since C++17. thanks to libstdc++, the tree is compiling. but we should not rely on this. so, in this change, we replace all `std::result_of_t` with `std::invoke_result_t`. actually, clang + libstdc++ is already warning us like: ``` In file included from /home/runner/work/scylladb/scylladb/multishard_mutation_query.cc:9: In file included from /home/runner/work/scylladb/scylladb/schema/schema_registry.hh:11: In file included from /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/unordered_map:38: Warning: /usr/bin/../lib/gcc/x86_64-linux-gnu/12/../../../../include/c++/12/type_traits:2624:5: warning: 'result_of<void (noop_compacted_fragments_consumer::*(noop_compacted_fragments_consumer &))()>' is deprecated: use 'std::invoke_result' instead [-Wdeprecated-declarations] 2624 \| using result_of_t = typename result_of<_Tp>::type; \| ^ /home/runner/work/scylladb/scylladb/mutation/mutation_compactor.hh:518:43: note: in instantiation of template type alias 'result_of_t' requested here 518 \| if constexpr (std::is_same_v<std::result_of_t<decltype(&GCConsumer::consume_end_of_stream)(GCConsumer&)>, void>) { \| ``` Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#18835	2024-05-26 16:45:42 +03:00
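A small sketch of the replacement described in the commit above. The two consumer types are hypothetical stand-ins for the `GCConsumer` template parameter seen in the quoted warning; only `std::invoke_result_t`, `std::is_same_v` and the shape of the `if constexpr` check come from the standard library and the commit message. The key difference is syntactic: `std::result_of_t` takes a single function-type argument `F(Args...)`, while `std::invoke_result_t` takes the callable and its argument types as separate template arguments.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical consumers: one returns void at end of stream, one returns a value.
struct void_consumer {
    void consume_end_of_stream() {}
};

struct int_consumer {
    int consume_end_of_stream() { return 42; }
};

// Old spelling (deprecated in C++17, removed in C++20):
//   std::result_of_t<decltype(&Consumer::consume_end_of_stream)(Consumer&)>
// New spelling: callable and argument types are separate template arguments.
template <typename Consumer>
using end_of_stream_result_t =
    std::invoke_result_t<decltype(&Consumer::consume_end_of_stream), Consumer&>;

template <typename Consumer>
constexpr bool returns_void_v =
    std::is_same_v<end_of_stream_result_t<Consumer>, void>;

static_assert(returns_void_v<void_consumer>);
static_assert(!returns_void_v<int_consumer>);

// Dispatches on the return type at compile time, mirroring the `if constexpr`
// pattern in the quoted warning. Returns 0 for void-returning consumers.
template <typename Consumer>
int finish(Consumer& c) {
    if constexpr (returns_void_v<Consumer>) {
        c.consume_end_of_stream();
        return 0;
    } else {
        return c.consume_end_of_stream();
    }
}
```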
Pavel Emelyanov	31edab277a	database: Don't export statement scheduling group Now all the code gets this group from elsewhere and the method can be removed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-05-24 18:00:01 +03:00
Pavel Emelyanov	b728857954	distributed_loader: Remove system_distributed_keyspace and view_update_generator Now all the code is happy with view_builder and can be shortened Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-05-23 13:41:47 +03:00
Pavel Emelyanov	66a8035b64	view: Make register_staging_sstable() a method of view_builder Callers of it had just checked if an sstable still has some views building, so they should talk to view-builder to register the sstable that's now considered to be staging. Effectively, this is to hide the view-update-generator from other services and make them communicate with the builder only. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-05-23 13:41:47 +03:00
Pavel Emelyanov	92ff0d3fc3	view: Make check_view_build_ongoing() helper a method of view_builder This helper checks if there's an ongoing build of a view, and it's in fact internal to view-builder, who keeps its status in one of its system tables. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-05-23 13:41:47 +03:00
Pavel Emelyanov	0d946a5fdf	distributed_loader: Propagate view_builder& via process_upload_dir() Preparation to next patches, they'll make use of this new argument Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-05-23 13:32:28 +03:00
Raphael S. Carvalho	eb8ef38543	replica: Fix tablet's compaction_groups_for_token_range() with unowned range File-based tablet streaming calls every shard to return data of every group that intersects with a given range. After dynamic group allocation, that breaks as the tablet range will only be present in a single shard, so an exception is thrown causing migration to halt during streaming phase. Ideally, only one shard is invoked, but that's out of the scope of this fix and compaction_groups_for_token_range() should return empty result if none of the local groups intersect with the range. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#18798	2024-05-22 20:15:33 +03:00
Raphael S. Carvalho	abcc68dbe7	storage_service: Fix race between tablet split and stats retrieval If tablet split is finalized while retrieving stats, the saved erm, used by all shards, will be invalidated. It can either cause incorrect behavior or crash if id is not available. It's worked by feeding local tablet map into the "coordinator" collecting stats from all shards. We will also no longer have a snapshot of erm shared between shards to help intra-node migration. This is simplified by serializing token metadata changes and the retrieval of the stats (latter should complete pretty fast, so it shouldn't block the former for any significant time). Fixes #18085. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-05-22 09:25:29 -03:00
Piotr Dulikowski	9820472277	main: introduce schema commitlog scheduling group Currently, we do not explicitly set a scheduling group for the schema commitlog which causes it to run in the default scheduling group (called "main"). However: - It is important and significant enough that it should run in a scheduling group that is separate from the main one, - It should not run in the existing "commitlog" group as user writes may sometimes need to wait for schema commitlog writes (e.g. read barrier done to learn the schema necessary to interpret the user write) and we want to avoid priority inversion issues. Therefore, introduce a new scheduling group dedicated to the schema commitlog. Fixes: scylladb/scylladb#15566 Closes scylladb/scylladb#18715	2024-05-21 11:29:57 +02:00
Botond Dénes	f239339a29	Merge 'Improve modularity of some per-table API endpoints' from Pavel Emelyanov There's a set of API endpoints that toggle per-table auto-compaction and tombstone-gc booleans. They all live in two different .cc files under api/ directory and duplicate code of each other. This PR generalizes those handlers, places them next to each other, fixes leak on stop and, as a nice side effect, enlightens database.hh header. Closes scylladb/scylladb#18703 * github.com:scylladb/scylladb: api,database: Move auto-compaction toggle guard api: Move some table manipulation helpers from storage_service api: Move table-related calls from storage_service domain api: Reimplement some endpoints using existing helpers api: Lost unset of tombstone-gc endpoints	2024-05-20 18:01:54 +03:00
Avi Kivity	61505d057e	Merge 'Sort user-defined types in describe statements' from Michał Jadwiszczak User-defined types can depend on each other, creating directed acyclic graph. In order to support restoring schema from `DESC SCHEMA`, UDTs should be ordered topologically, not alphabetically as it was till now. This patch changes the way UDTs are ordered in `DESC SCHEMA`/`DESC KEYSPACE <ks>` statements, so the output can be safely copy-pasted to restore the schema. Fixes #18539 Closes scylladb/scylladb#18302 * github.com:scylladb/scylladb: test/cql-pytest/test_describe: add test for UDTs ordering cql3/statements/describe_statement: UDTs topological sorting cql3/statements/describe_statement: allow to skip alphabetical sorting types: add a method to get all referenced user types db/cql_type_parser: use generic topological sorting db/cql_type_parses: futurize raw_builder::build() test/boost: add test for topological sorting utils: introduce generic topological sorting algorithm	2024-05-20 16:58:17 +03:00
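To illustrate why topological rather than alphabetical ordering matters for the UDT change above, here is a minimal depth-first sketch. It is not the generic algorithm added under `utils`; the `dependency_map` alias, the function names and the example type names are all assumptions made for this illustration. Each type is emitted only after every type it references, so the resulting order can be replayed as a sequence of `CREATE TYPE` statements.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical input: type name -> names of the user types it references.
using dependency_map = std::map<std::string, std::vector<std::string>>;

// Depth-first post-order visit: a type is appended only after its dependencies.
static void visit(const std::string& name, const dependency_map& deps,
                  std::set<std::string>& done, std::set<std::string>& in_progress,
                  std::vector<std::string>& out) {
    if (done.count(name)) {
        return;
    }
    if (!in_progress.insert(name).second) {
        // UDT dependencies form a DAG, so this should not happen for valid input.
        throw std::runtime_error("dependency cycle at type " + name);
    }
    auto it = deps.find(name);
    if (it != deps.end()) {
        for (const auto& dep : it->second) {
            visit(dep, deps, done, in_progress, out);
        }
    }
    in_progress.erase(name);
    done.insert(name);
    out.push_back(name);
}

// Returns the type names in an order where dependencies come first.
std::vector<std::string> creation_order(const dependency_map& deps) {
    std::set<std::string> done;
    std::set<std::string> in_progress;
    std::vector<std::string> out;
    for (const auto& entry : deps) {
        visit(entry.first, deps, done, in_progress, out);
    }
    return out;
}
```

For a hypothetical schema where `person` references `address` and `address` references `zipcode`, alphabetical output would list `address` before the `zipcode` type it needs, while this ordering lists `zipcode`, `address`, `person`.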
Botond Dénes	e1c4e6c151	Merge 'sstables_manager: use maintenance scheduling group to run components reload fiber' from Lakshmi Narayanan Sreethar PR https://github.com/scylladb/scylladb/pull/18186 introduced a fiber that reloads reclaimed bloom filters when memory becomes available. Use maintenance scheduling group to run that fiber instead of running it in the main scheduling group. Fixes #18675 Closes scylladb/scylladb#18721 * github.com:scylladb/scylladb: sstables_manager: use maintenance scheduling group to run components reload fiber sstables_manager: add member to store maintenance scheduling group	2024-05-20 16:38:42 +03:00
Avi Kivity	52fe351c31	Merge 'Balance tablets within nodes (intra-node migration)' from Tomasz Grabiec This is needed to avoid severe imbalance between shards which can happen when some table grows and is split. The inter-node balance can be equal, so inter-node migration cannot fix the imbalance. Also, if RF=N then there is not even a possibility of moving tablets around to fix the imbalance. The only way to bring the system to balance is to move tablets within the nodes. The system is not prepared for intra-node migration currently. Request coordination is host-based, while for intra-node migration it should be (also) shard-based. The solution employed here is to keep the coordination between nodes as-is, and for intra-node migration storage_proxy-level coordinator is not aware of the migration (no pending host). The replica-side request handler will be a second-level coordinator which routes requests to shards, similar to how the first-level coordinator routes them to hosts. Tablet sharder is adjusted to handle intra-migration where a tablet can have two replicas on the same host. For reads, sharder uses the read selector to resolve the conflict. For writes, the write selector is used. The old shard_of() API is kept to represent shard for reads, and new method is introduced to query the shards for writing: shard_for_writes(). All writers should be switched to that API, which is not done in this patch yet. The request handler on replica side acts as a second-level coordinator, using sharder to determine routing to shards. A given sharder has a scope of a single topology version, a single effective_replication_map_ptr, which should be kept alive during writes. 
perf-simple-query test results show no signs of regression: Command: perf-simple-query -c1 -m1G --write --tablets --duration=10 Before: > 83294.81 tps ( 59.5 allocs/op, 14.3 tasks/op, 53725 insns/op, 0 errors) > 87756.72 tps ( 59.5 allocs/op, 14.3 tasks/op, 54049 insns/op, 0 errors) > 86428.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54208 insns/op, 0 errors) > 86211.38 tps ( 59.7 allocs/op, 14.3 tasks/op, 54219 insns/op, 0 errors) > 86559.89 tps ( 59.6 allocs/op, 14.3 tasks/op, 54188 insns/op, 0 errors) > 86609.39 tps ( 59.6 allocs/op, 14.3 tasks/op, 54117 insns/op, 0 errors) > 87464.06 tps ( 59.5 allocs/op, 14.3 tasks/op, 54039 insns/op, 0 errors) > 86185.43 tps ( 59.6 allocs/op, 14.3 tasks/op, 54169 insns/op, 0 errors) > 86254.71 tps ( 59.6 allocs/op, 14.3 tasks/op, 54139 insns/op, 0 errors) > 83395.35 tps ( 60.2 allocs/op, 14.4 tasks/op, 54693 insns/op, 0 errors) > > median 86428.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54208 insns/op, 0 errors) > median absolute deviation: 243.04 > maximum: 87756.72 > minimum: 83294.81 > After: > 85523.06 tps ( 59.5 allocs/op, 14.3 tasks/op, 53872 insns/op, 0 errors) > 89362.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54226 insns/op, 0 errors) > 88167.55 tps ( 59.7 allocs/op, 14.3 tasks/op, 54400 insns/op, 0 errors) > 87044.40 tps ( 59.7 allocs/op, 14.3 tasks/op, 54310 insns/op, 0 errors) > 88344.50 tps ( 59.6 allocs/op, 14.3 tasks/op, 54289 insns/op, 0 errors) > 88355.06 tps ( 59.6 allocs/op, 14.3 tasks/op, 54242 insns/op, 0 errors) > 88725.46 tps ( 59.6 allocs/op, 14.3 tasks/op, 54230 insns/op, 0 errors) > 88640.08 tps ( 59.6 allocs/op, 14.3 tasks/op, 54210 insns/op, 0 errors) > 90306.31 tps ( 59.4 allocs/op, 14.3 tasks/op, 54043 insns/op, 0 errors) > 87343.62 tps ( 59.8 allocs/op, 14.3 tasks/op, 54496 insns/op, 0 errors) > > median 88355.06 tps ( 59.6 allocs/op, 14.3 tasks/op, 54242 insns/op, 0 errors) > median absolute deviation: 1007.41 > maximum: 90306.31 > minimum: 85523.06 Command (reads): perf-simple-query -c1 -m1G --tablets 
--duration=10 Before: > 95860.18 tps ( 63.1 allocs/op, 14.1 tasks/op, 42476 insns/op, 0 errors) > 97537.69 tps ( 63.1 allocs/op, 14.1 tasks/op, 42454 insns/op, 0 errors) > 97549.23 tps ( 63.1 allocs/op, 14.1 tasks/op, 42470 insns/op, 0 errors) > 97511.29 tps ( 63.1 allocs/op, 14.1 tasks/op, 42470 insns/op, 0 errors) > 97227.32 tps ( 63.1 allocs/op, 14.1 tasks/op, 42471 insns/op, 0 errors) > 94031.94 tps ( 63.1 allocs/op, 14.1 tasks/op, 42441 insns/op, 0 errors) > 96978.04 tps ( 63.1 allocs/op, 14.1 tasks/op, 42462 insns/op, 0 errors) > 96401.70 tps ( 63.1 allocs/op, 14.1 tasks/op, 42473 insns/op, 0 errors) > 96573.77 tps ( 63.1 allocs/op, 14.1 tasks/op, 42440 insns/op, 0 errors) > 96340.54 tps ( 63.1 allocs/op, 14.1 tasks/op, 42468 insns/op, 0 errors) > > median 96978.04 tps ( 63.1 allocs/op, 14.1 tasks/op, 42462 insns/op, 0 errors) > median absolute deviation: 571.20 > maximum: 97549.23 > minimum: 94031.94 > After: > 99794.67 tps ( 63.1 allocs/op, 14.1 tasks/op, 42471 insns/op, 0 errors) > 101244.99 tps ( 63.1 allocs/op, 14.1 tasks/op, 42472 insns/op, 0 errors) > 101128.37 tps ( 63.1 allocs/op, 14.1 tasks/op, 42485 insns/op, 0 errors) > 101065.27 tps ( 63.1 allocs/op, 14.1 tasks/op, 42465 insns/op, 0 errors) > 101212.98 tps ( 63.1 allocs/op, 14.1 tasks/op, 42456 insns/op, 0 errors) > 101413.31 tps ( 63.1 allocs/op, 14.1 tasks/op, 42463 insns/op, 0 errors) > 101464.92 tps ( 63.1 allocs/op, 14.1 tasks/op, 42466 insns/op, 0 errors) > 101086.74 tps ( 63.1 allocs/op, 14.1 tasks/op, 42488 insns/op, 0 errors) > 101559.09 tps ( 63.1 allocs/op, 14.1 tasks/op, 42468 insns/op, 0 errors) > 100742.58 tps ( 63.1 allocs/op, 14.1 tasks/op, 42491 insns/op, 0 errors) > > median 101212.98 tps ( 63.1 allocs/op, 14.1 tasks/op, 42456 insns/op, 0 errors) > median absolute deviation: 200.33 > maximum: 101559.09 > minimum: 99794.67 > Fixes #16594 Closes scylladb/scylladb#18026 * github.com:scylladb/scylladb: Implement fast streaming for intra-node migration test: tablets_test: Test 
sharding during intra-node migration test: tablets_test: Check sharding also on the pending host test: py: tablets: Test writes concurrent with migration test: py: tablets: Test crash during intra-node migration api, storage_service: Introduce API to wait for topology to quiesce dht, replica: Remove deprecated sharder APIs test: Avoid using deprecated sharded API db: do_apply_many() avoid deprecated sharded API replica: mutation_dump: Avoid deprecated sharder API repair: Avoid deprecated sharder API table: Remove optimization which returns empty reader when key is not owned by the shard dht: is_single_shard: Avoid deprecated sharder API dht: split_range_to_single_shard: Work with static_sharder only dht: ring_position_range_sharder: Avoid deprecated sharder APIs dht: token: Avoid use of deprecated sharder API by switching to static_sharder selective_token_sharder: Avoid use of deprecated sharder API docs: Document tablet sharding vs tablet replica placement readers/multishard.cc: use shard_for_reads() instead of shard_of() multishard_mutation_query.cc: use shard_for_reads() instead of shard_of() storage_proxy: Extract common code to apply mutations on many shards according to sharder storage_proxy: Prepare per-partition rate-limiting for intra-node migration storage_proxy: Avoid shard_of() use in mutate_counter_on_leader_and_replicate() storage_proxy: Prepare mutate_hint() for intra-node tablet migration commitlog_replayer: Avoid deprecated sharder::shard_of() lwt: Avoid deprecated sharder::shard_of() compaction: Avoid deprecated sharder::shard_of() dht: Extract dht::static_sharder replica: Deprecate table::shard_of() locator: Deprecate effective_replication_map::shard_of() dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard tests: tablets: py: Add intra-node migration test tests: tablets: Test that drained nodes are not balanced internally tests: tablets: Add checks of replica set validity to test_load_balancing_with_random_load tests: 
tablets: Verify that disabling balancing results in no intra-node migrations tests: tablets: Check that nodes are internally balanced tests: tablets: Improve debuggability by showing which rows are missing tablets, storage_service: Support intra-node migration in move_tablet() API tablet_allocator: Generate intra-node migration plan tablet_allocator: Extract make_internode_plan() tablet_allocator: Maintain candidate list and shard tablet count for target nodes tablet_allocator: Lift apply_load/can_accept_load lambdas to member functions tablets, streaming: Implement tablet streaming for intra-node migration dht, auto_refreshing_sharder: Allow overriding write selector multishard_writer: Handle intra-node migration storage_proxy: Handle intra-node tablet migration for writes tablets: Get rid of tablet_map::get_shard() tablets: Avoid tablet_map::get_shard in cleanup tablets: test: Use sharder instead of tablet_map::get_shard() tablets: tablet_sharder: Allow working with non-local host sharding: Prepare for intra-node-migration docs: Document sharder use for tablets tablets: Introduce tablet transition kind for intra-node migration tests: tablets: Fix use-after-move of skiplist in rebalance_tablets() sstables, gdb: Track readers in a linked list raft topology: Fix global token metadata barrier to not fence ahead of what is drained	2024-05-20 16:13:01 +03:00
Lakshmi Narayanan Sreethar	79f6746298	sstables_manager: add member to store maintenance scheduling group Store that maintenance scheduling group inside the sstables_manager. The next patch will use this to run the components reloader fiber. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>	2024-05-19 15:23:45 +05:30
Avi Kivity	2fbd78c769	feature: grandfather DIGEST_FOR_NULL_VALUES The DIGEST_FOR_NULL_VALUES feature was added in `21a77612b3` (2020; 4.4) and can now be assumed to be always present. The hasher which it invoked is removed.	2024-05-18 00:24:00 +03:00
Pavel Emelyanov	31d05925cc	api,database: Move auto-compaction toggle guard Toggling per-table auto-compaction enabling bit is guarded with on-database boolean and raii guard. It's only used by a single api/column_family.cc file, so it can live there. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-05-16 14:42:51 +03:00
Michał Jadwiszczak	3830f3bd23	db/cql_type_parses: futurize raw_builder::build() In order to use generic topological sort, build() method needs to return future.	2024-05-16 13:30:03 +02:00
Piotr Dulikowski	68eca3778c	Merge 'mv: throttle view update generation for large queries' from Wojciech Mitros This series is a reupload of #13792 with a few modifications, namely a test is added and the conflicts with recent tablet related changes are fixed. See https://github.com/scylladb/scylladb/issues/12379 and https://github.com/scylladb/scylladb/pull/13583 for a detailed description of the problem and discussions. This PR aims to extend the existing throttling mechanism to work with requests that internally generate a large amount of view updates, as suggested by @nyh. The existing mechanism works in the following way: * Client sends a request, we generate the view updates corresponding to the request and spawn background tasks which will send these updates to remote nodes * Each background task consumes some units from the `view_update_concurrency_semaphore`, but doesn't wait for these units, it's just for tracking * We keep track of the percent of consumed units on each node, this is called `view update backlog`. * Before sending a response to the client we sleep for a short amount of time. The amount of time to sleep for is based on the fullness of this `view update backlog`. For a well behaved client with limited concurrency this will limit the amount of incoming requests to a manageable level. This mechanism doesn't handle large DELETE queries. Deleting a partition is fast for the base table, but it requires us to generate a view update for every single deleted row. The number of deleted rows per single client request can be in the millions. Delaying response to the request doesn't help when a single request can generate millions of updates. To deal with this we could treat the view update generator just like any other client and force it to wait a bit of time before sending the next batch of updates. The amount of time to wait for is calculated just like in the existing throttling code, it's based on the fullness of `view update backlogs`. 
The new algorithm of view update generation looks something like this: ```c++ for(;;) { auto updates = generate_updates_batch_with_max_100_rows(); co_await seastar::sleep(calculate_sleep_time_from_backlogs()); spawn_background_tasks_for_updates(updates); } ``` Fixes: https://github.com/scylladb/scylladb/issues/12379 Closes scylladb/scylladb#16819 * github.com:scylladb/scylladb: test: add test for bad_allocs during large mv queries mv: throttle view update generation for large queries exceptions: add read_write_timeout_exception, a subclass of request_timeout_exception db/view: extract view throttling delay calculation to a global function view_update_generator: add get_storage_proxy() storage_proxy: make view backlog getters public	2024-05-16 08:22:54 +02:00
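The batch-and-sleep loop in the commit message above can be sketched without Seastar as follows. This is a minimal illustration, not ScyllaDB's implementation: `backlog`, `calculate_sleep_time`, `plan_generation`, and the cubic delay curve are all assumptions made for the example, and the delay is only accumulated rather than slept (real code would `co_await seastar::sleep(...)` before each batch).

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a view update backlog: units consumed out of a maximum.
struct backlog {
    std::size_t current;
    std::size_t max;
    double fullness() const {
        return max == 0 ? 0.0 : static_cast<double>(current) / static_cast<double>(max);
    }
};

// Assumed delay policy (NOT the real ScyllaDB formula): the delay grows with the
// cube of the fullest backlog's fullness and reaches max_delay when it is full.
std::chrono::microseconds calculate_sleep_time(const std::vector<backlog>& backlogs,
                                               std::chrono::microseconds max_delay) {
    double worst = 0.0;
    for (const auto& b : backlogs) {
        worst = std::max(worst, b.fullness());
    }
    worst = std::min(std::max(worst, 0.0), 1.0);
    return std::chrono::microseconds(
        static_cast<long long>(max_delay.count() * worst * worst * worst));
}

struct throttle_plan {
    std::size_t batches;
    std::chrono::microseconds total_delay;
};

// Mirrors the loop from the commit message: take a batch of at most batch_size
// rows, wait according to the backlogs, then hand the batch off. Here the wait
// is summed instead of slept so the function stays deterministic.
throttle_plan plan_generation(std::size_t total_rows, std::size_t batch_size,
                              const std::vector<backlog>& backlogs,
                              std::chrono::microseconds max_delay) {
    assert(batch_size > 0);
    throttle_plan plan{0, std::chrono::microseconds(0)};
    std::size_t remaining = total_rows;
    while (remaining > 0) {
        const std::size_t batch = std::min(remaining, batch_size);
        plan.total_delay += calculate_sleep_time(backlogs, max_delay);
        ++plan.batches;
        remaining -= batch;
    }
    return plan;
}
```

With a half-full backlog and a 1000 µs cap, deleting 250 rows in batches of 100 yields three batches, each preceded by a 125 µs delay under this assumed curve.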
Raphael S. Carvalho	715ae689c0	Implement fast streaming for intra-node migration With intra-node migration, all the movement is local, so we can make streaming faster by just cloning the sstable set of leaving replica and loading it into the pending one. This cloning is underlying storage specific, but s3 doesn't support snapshot() yet (the sstables::storage procedure which clone is built upon). It's only supported by file system, with help of hard links. A new generation is picked for new cloned sstable, and it will live in the same directory as the original. A challenge I bumped into was to understand why table refused to load the sstable at pending replica, as it considered them foreign. Later I realized that sharder (for reads) at this stage of migration will point only to leaving replica. It didn't fail with mutation based streaming, because the sstable writer considers the shard -- that the sstable was written into -- as its owner, regardless of what sharder says. That was fixed by mimicking this behavior during loading at pending. test: ./test.py --mode=dev intranode --repeat=100 passes. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	679baff25a	dht, replica: Remove deprecated sharder APIs	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	539460dd71	db: do_apply_many() avoid deprecated sharded API	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	0f50504c39	replica: mutation_dump: Avoid deprecated sharder API	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	7c03646f99	table: Remove optimization which returns empty reader when key is not owned by the shard This check would lead to correctness issues with intra-node migration because the shard may switch during read, from "read old" to "read new". If the coordinator used "read old" for shard routing, but table on the old shard is already using "read new" erm, such a read would observe empty result, which is wrong. Drop the optimization. In the scenario above, read will observe all past writes because: 1) writes are still using "write both" 2) writes are switched to "write new" only after all requests which might be using "read old" are done Replica-side coordinators should already route single-key requests to the correct shard, so it's not important as an optimization. This issue shows how assumptions about static sharding are embedded in the current code base and how intra-node migration, by violating those assumptions, can lead to correctness issues.	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	dbca598e99	replica: Deprecate table::shard_of()	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	10a4903d0c	dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard Require users to specify whether we want shard for reads or for writes by switching to appropriate non-deprecated variant. For example, shard_of() can be replaced with shard_for_reads() or shard_for_writes(). The next_shard/token_for_next_shard APIs have only for-reads variant, and the act of switching will be a testimony to the fact that the code is valid for intra-node migration.	2024-05-16 00:28:47 +02:00
Tomasz Grabiec	6c6ce2d928	tablets: Get rid of tablet_map::get_shard() Its semantics do not fit well with intra-node migration which allow two owning shards. Replace uses with the new has_replica() API.	2024-05-16 00:28:46 +02:00
Pavel Emelyanov	59aec1f300	database: Don't break namespace with external alias The namespace replica is broken in the middle with sstable_list alias, while the latter can be declared earlier Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#18664	2024-05-14 16:45:20 +03:00
Amnon Heiman	0c84692c97	replica/table.cc: Add metrics per-table-per-node This patch adds metrics that will be reported per-table per-node. The added metrics (that are part of the per-table per-shard metrics) are: scylla_column_family_cache_hit_rate scylla_column_family_read_latency scylla_column_family_write_latency scylla_column_family_live_disk_space Fixes #18642 Signed-off-by: Amnon Heiman <amnon@scylladb.com> Closes scylladb/scylladb#18645	2024-05-14 07:54:34 +03:00
Raphael S. Carvalho	0b2ec3063c	sstables: Fix incremental_reader_selector (for range reads) with tablets incremental_reader_selector is the mechanism for incremental consumption of disjoint sstables on range reads. tablet_sstable_set was implemented, such that selector is efficient with tablets. The problem is selector is vnode addicted and will only consider a given set exhausted when maximum token is reached. With tablets, that means a range read on first tablet of a given shard will also consume other tablets living in the same shard. That results in combined reader having to work with empty sstable readers of tablets that don't intersect with the range of the read. It won't cause extra I/O because the underlying sstables don't intersect with the range of the read. It's only unnecessary CPU work, as it involves creating readers (= allocation), feeding them into combined reader, which will in turn invoke the sstable readers only to realize they don't have any data for that range. With 100k tablets (ranges), and 100 tablets per shard, and ~5 sstables per tablet, there will be this amount of readers (empty or not): (100k * ((100^2 + 100) / 2) * avg_sstable_per_tablet=5) = ~2.5 billion. ~5000 times more readers, it can be quite significant additional cpu work, even though I/O dominates the most in scans. It's an inefficiency that we'd rather get rid of. The behavior can be observed from logs (there's 1 sstable for each of 4 tablets, but note how readers are created for every single one of them when reading only 1 tablet range): ``` table - make_reader_v2 - range=(-inf, {-4611686018427387905, end}] incremental_reader_selector - create_new_readers(null): selecting on pos {minimum token, w=-1} sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._34qn42... 
that has range [{-9151620220812943033, start},{-4813568684827439727, end}] incremental_reader_selector - create_new_readers(null): selecting on pos {-4611686018427387904, w=-1} sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._368nk2... that has range [{-4599560452460784857, start},{-78043747517466964, end}] incremental_reader_selector - create_new_readers(null): selecting on pos {0, w=-1} sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._38lj42... that has range [{851021166589397842, start},{3516631334339266977, end}] incremental_reader_selector - create_new_readers(null): selecting on pos {4611686018427387904, w=-1} sstable - make_reader - reader on (-inf, {-4611686018427387905, end}] for sst 3gfx_..._3dba82... that has range [{5065088566032249228, start},{9215673076482556375, end}] ``` Fix is about making sure the tablet set won't select past the supplied range of the read. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#18556	2024-05-14 07:43:22 +03:00
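The reader-count estimate in the commit message above can be checked by evaluating its expression as written. This is only a sketch: the function names are hypothetical, and the post-fix baseline of one reader per sstable per range is an assumption inferred from the "~5000 times" figure, not something the commit states.

```cpp
#include <cstdint>

// n + (n - 1) + ... + 1, written as in the commit message: (n^2 + n) / 2.
constexpr std::uint64_t triangular(std::uint64_t n) {
    return (n * n + n) / 2;
}

// Evaluates the commit message's expression for readers created before the fix:
// ranges * ((tablets_per_shard^2 + tablets_per_shard) / 2) * sstables_per_tablet.
constexpr std::uint64_t readers_before_fix(std::uint64_t ranges,
                                           std::uint64_t tablets_per_shard,
                                           std::uint64_t sstables_per_tablet) {
    return ranges * triangular(tablets_per_shard) * sstables_per_tablet;
}

// Assumed baseline: each range read only creates readers for its own tablet's sstables.
constexpr std::uint64_t readers_after_fix(std::uint64_t ranges,
                                          std::uint64_t sstables_per_tablet) {
    return ranges * sstables_per_tablet;
}

static_assert(triangular(100) == 5050, "sum of 1..100");
```

For 100k ranges, 100 tablets per shard, and 5 sstables per tablet, the expression gives 2,525,000,000 readers against an assumed baseline of 500,000, a factor of 5050, which agrees with the "~2.5 billion" and "~5000 times" figures quoted.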


1185 Commits