scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-29 11:10:40 +00:00

Author	SHA1	Message	Date
Pavel Emelyanov	b24fb8dc87	inet_address: Remove to_sstring() in favor of fmt::to_string The existing inet_address::to_string() calls fmt::format("{}", *this) anyway. However, the to_string() method is declared in .cc file, while form formatter is in the header and is equipeed with constexprs so that converting an address to string is done as much as possible compile-time. Also, though minor, fmt::to_string(foo) is believed to be even faster than fmt::format("{}", foo). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Closes scylladb/scylladb#18712	2024-05-21 09:43:08 +03:00
Avi Kivity	61505d057e	Merge 'Sort user-defined types in describe statements' from Michał Jadwiszczak User-defined types can depend on each other, creating directed acyclic graph. In order to support restoring schema from `DESC SCHEMA`, UDTs should be ordered topologically, not alphabetically as it was till now. This patch changes the way UDTs are ordered in `DESC SCHEMA`/`DESC KEYSPACE <ks>` statements, so the output can be safely copy-pasted to restore the schema. Fixes #18539 Closes scylladb/scylladb#18302 * github.com:scylladb/scylladb: test/cql-pytest/test_describe: add test for UDTs ordering cql3/statements/describe_statement: UDTs topological sorting cql3/statements/describe_statement: allow to skip alphabetical sorting types: add a method to get all referenced user types db/cql_type_parser: use generic topological sorting db/cql_type_parses: futurize raw_builder::build() test/boost: add test for topological sorting utils: introduce generic topological sorting algorithm	2024-05-20 16:58:17 +03:00
Avi Kivity	52fe351c31	Merge 'Balance tablets within nodes (intra-node migration)' from Tomasz Grabiec This is needed to avoid severe imbalance between shards which can happen when some table grows and is split. The inter-node balance can be equal, so inter-node migration cannot fix the imbalance. Also, if RF=N then there is not even a possibility of moving tablets around to fix the imbalance. The only way to bring the system to balance is to move tablets within the nodes. The system is not prepared for intra-node migration currently. Request coordination is host-based, while for intra-node migration it should be (also) shard-based. The solution employed here is to keep the coordination between nodes as-is, and for intra-node migration storage_proxy-level coordinator is not aware of the migration (no pending host). The replica-side request handler will be a second-level coordinator which routes requests to shards, similar to how the first-level coordinator routes them to hosts. Tablet sharder is adjusted to handle intra-migration where a tablet can have two replicas on the same host. For reads, sharder uses the read selector to resolve the conflict. For writes, the write selector is used. The old shard_of() API is kept to represent shard for reads, and new method is introduced to query the shards for writing: shard_for_writes(). All writers should be switched to that API, which is not done in this patch yet. The request handler on replica side acts as a second-level coordinator, using sharder to determine routing to shards. A given sharder has a scope of a single topology version, a single effective_replication_map_ptr, which should be kept alive during writes. perf-simple-query test results show no signs of regression: Command: perf-simple-query -c1 -m1G --write --tablets --duration=10 Before: > 83294.81 tps ( 59.5 allocs/op, 14.3 tasks/op, 53725 insns/op, 0 errors) > 87756.72 tps ( 59.5 allocs/op, 14.3 tasks/op, 54049 insns/op, 0 errors) > 86428.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54208 insns/op, 0 errors) > 86211.38 tps ( 59.7 allocs/op, 14.3 tasks/op, 54219 insns/op, 0 errors) > 86559.89 tps ( 59.6 allocs/op, 14.3 tasks/op, 54188 insns/op, 0 errors) > 86609.39 tps ( 59.6 allocs/op, 14.3 tasks/op, 54117 insns/op, 0 errors) > 87464.06 tps ( 59.5 allocs/op, 14.3 tasks/op, 54039 insns/op, 0 errors) > 86185.43 tps ( 59.6 allocs/op, 14.3 tasks/op, 54169 insns/op, 0 errors) > 86254.71 tps ( 59.6 allocs/op, 14.3 tasks/op, 54139 insns/op, 0 errors) > 83395.35 tps ( 60.2 allocs/op, 14.4 tasks/op, 54693 insns/op, 0 errors) > > median 86428.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54208 insns/op, 0 errors) > median absolute deviation: 243.04 > maximum: 87756.72 > minimum: 83294.81 > After: > 85523.06 tps ( 59.5 allocs/op, 14.3 tasks/op, 53872 insns/op, 0 errors) > 89362.47 tps ( 59.6 allocs/op, 14.3 tasks/op, 54226 insns/op, 0 errors) > 88167.55 tps ( 59.7 allocs/op, 14.3 tasks/op, 54400 insns/op, 0 errors) > 87044.40 tps ( 59.7 allocs/op, 14.3 tasks/op, 54310 insns/op, 0 errors) > 88344.50 tps ( 59.6 allocs/op, 14.3 tasks/op, 54289 insns/op, 0 errors) > 88355.06 tps ( 59.6 allocs/op, 14.3 tasks/op, 54242 insns/op, 0 errors) > 88725.46 tps ( 59.6 allocs/op, 14.3 tasks/op, 54230 insns/op, 0 errors) > 88640.08 tps ( 59.6 allocs/op, 14.3 tasks/op, 54210 insns/op, 0 errors) > 90306.31 tps ( 59.4 allocs/op, 14.3 tasks/op, 54043 insns/op, 0 errors) > 87343.62 tps ( 59.8 allocs/op, 14.3 tasks/op, 54496 insns/op, 0 errors) > > median 88355.06 tps ( 59.6 allocs/op, 14.3 tasks/op, 54242 insns/op, 0 errors) > median absolute deviation: 1007.41 > maximum: 90306.31 > minimum: 85523.06 Command (reads): perf-simple-query -c1 -m1G --tablets --duration=10 Before: > 95860.18 tps ( 63.1 allocs/op, 14.1 tasks/op, 42476 insns/op, 0 errors) > 97537.69 tps ( 63.1 allocs/op, 14.1 tasks/op, 42454 insns/op, 0 errors) > 97549.23 tps ( 63.1 allocs/op, 14.1 tasks/op, 42470 insns/op, 0 errors) > 97511.29 tps ( 63.1 allocs/op, 14.1 tasks/op, 42470 insns/op, 0 errors) > 97227.32 tps ( 63.1 allocs/op, 14.1 tasks/op, 42471 insns/op, 0 errors) > 94031.94 tps ( 63.1 allocs/op, 14.1 tasks/op, 42441 insns/op, 0 errors) > 96978.04 tps ( 63.1 allocs/op, 14.1 tasks/op, 42462 insns/op, 0 errors) > 96401.70 tps ( 63.1 allocs/op, 14.1 tasks/op, 42473 insns/op, 0 errors) > 96573.77 tps ( 63.1 allocs/op, 14.1 tasks/op, 42440 insns/op, 0 errors) > 96340.54 tps ( 63.1 allocs/op, 14.1 tasks/op, 42468 insns/op, 0 errors) > > median 96978.04 tps ( 63.1 allocs/op, 14.1 tasks/op, 42462 insns/op, 0 errors) > median absolute deviation: 571.20 > maximum: 97549.23 > minimum: 94031.94 > After: > 99794.67 tps ( 63.1 allocs/op, 14.1 tasks/op, 42471 insns/op, 0 errors) > 101244.99 tps ( 63.1 allocs/op, 14.1 tasks/op, 42472 insns/op, 0 errors) > 101128.37 tps ( 63.1 allocs/op, 14.1 tasks/op, 42485 insns/op, 0 errors) > 101065.27 tps ( 63.1 allocs/op, 14.1 tasks/op, 42465 insns/op, 0 errors) > 101212.98 tps ( 63.1 allocs/op, 14.1 tasks/op, 42456 insns/op, 0 errors) > 101413.31 tps ( 63.1 allocs/op, 14.1 tasks/op, 42463 insns/op, 0 errors) > 101464.92 tps ( 63.1 allocs/op, 14.1 tasks/op, 42466 insns/op, 0 errors) > 101086.74 tps ( 63.1 allocs/op, 14.1 tasks/op, 42488 insns/op, 0 errors) > 101559.09 tps ( 63.1 allocs/op, 14.1 tasks/op, 42468 insns/op, 0 errors) > 100742.58 tps ( 63.1 allocs/op, 14.1 tasks/op, 42491 insns/op, 0 errors) > > median 101212.98 tps ( 63.1 allocs/op, 14.1 tasks/op, 42456 insns/op, 0 errors) > median absolute deviation: 200.33 > maximum: 101559.09 > minimum: 99794.67 > Fixes #16594 Closes scylladb/scylladb#18026 * github.com:scylladb/scylladb: Implement fast streaming for intra-node migration test: tablets_test: Test sharding during intra-node migration test: tablets_test: Check sharding also on the pending host test: py: tablets: Test writes concurrent with migration test: py: tablets: Test crash during intra-node migration api, storage_service: Introduce API to wait for topology to quiesce dht, replica: Remove deprecated sharder APIs test: Avoid using deprecated sharded API db: do_apply_many() avoid deprecated sharded API replica: mutation_dump: Avoid deprecated sharder API repair: Avoid deprecated sharder API table: Remove optimization which returns empty reader when key is not owned by the shard dht: is_single_shard: Avoid deprecated sharder API dht: split_range_to_single_shard: Work with static_sharder only dht: ring_position_range_sharder: Avoid deprecated sharder APIs dht: token: Avoid use of deprecated sharder API by switching to static_sharder selective_token_sharder: Avoid use of deprecated sharder API docs: Document tablet sharding vs tablet replica placement readers/multishard.cc: use shard_for_reads() instead of shard_of() multishard_mutation_query.cc: use shard_for_reads() instead of shard_of() storage_proxy: Extract common code to apply mutations on many shards according to sharder storage_proxy: Prepare per-partition rate-limiting for intra-node migration storage_proxy: Avoid shard_of() use in mutate_counter_on_leader_and_replicate() storage_proxy: Prepare mutate_hint() for intra-node tablet migration commitlog_replayer: Avoid deprecated sharder::shard_of() lwt: Avoid deprecated sharder::shard_of() compaction: Avoid deprecated sharder::shard_of() dht: Extract dht::static_sharder replica: Deprecate table::shard_of() locator: Deprecate effective_replication_map::shard_of() dht: Deprecate old sharder API: shard_of/next_shard/token_for_next_shard tests: tablets: py: Add intra-node migration test tests: tablets: Test that drained nodes are not balanced internally tests: tablets: Add checks of replica set validity to test_load_balancing_with_random_load tests: tablets: Verify that disabling balancing results in no intra-node migrations tests: tablets: Check that nodes are internally balanced tests: tablets: Improve debuggability by showing which rows are missing tablets, storage_service: Support intra-node migration in move_tablet() API tablet_allocator: Generate intra-node migration plan tablet_allocator: Extract make_internode_plan() tablet_allocator: Maintain candidate list and shard tablet count for target nodes tablet_allocator: Lift apply_load/can_accept_load lambdas to member functions tablets, streaming: Implement tablet streaming for intra-node migration dht, auto_refreshing_sharder: Allow overriding write selector multishard_writer: Handle intra-node migration storage_proxy: Handle intra-node tablet migration for writes tablets: Get rid of tablet_map::get_shard() tablets: Avoid tablet_map::get_shard in cleanup tablets: test: Use sharder instead of tablet_map::get_shard() tablets: tablet_sharder: Allow working with non-local host sharding: Prepare for intra-node-migration docs: Document sharder use for tablets tablets: Introduce tablet transition kind for intra-node migration tests: tablets: Fix use-after-move of skiplist in rebalance_tablets() sstables, gdb: Track readers in a linked list raft topology: Fix global token metadata barrier to not fence ahead of what is drained	2024-05-20 16:13:01 +03:00
Avi Kivity	54a82fed6b	feature, index: grandfather CORRECT_IDX_TOKEN_IN_SECONDARY_INDEX This feature corrected how we store the token in secondary indexes. It was introduced in `7ff72b0ba5` (2020; 4.4) and can now be assumed present everywhere. Note that we still support indexes created with the old format.	2024-05-18 00:24:11 +03:00
Avi Kivity	3bead8cea0	feature: grandfather PER_TABLE_PARTITIONERS The PER_TABLE_PARTITIONERS feature was added in `90df9a44ce` (2020; 4.0) and can now be assumed to be always present. We also remove the associated schema_feature.	2024-05-18 00:15:07 +03:00
Avi Kivity	c7d7ca2c23	feature: grandfather CDC The CDC feature was made non-experimental in `e9072542c1` (2020; 4.4) and can now be assumed to be always present. We also remove the corresponding schema_feature.	2024-05-17 20:41:20 +03:00
Avi Kivity	b5f6021a6b	feature: grandfather VIEW_VIRTUAL_COLUMNS The VIEW_VIRTUAL_COLUMNS feature was added in `a108df09f9` (2019; 3.1) and can now be assumed to be always present. The corresponding schema_feature is removed. Note schema_features are not sent over the wire. A digest calculation without VIEW_VIRTUAL_COLUMNS is no longer tested.	2024-05-17 20:41:19 +03:00
Avi Kivity	7952200c8c	feature: grandfather ME_SSTABLE feature "me" format sstables were introduced in `d370558279` (Jan 2022; 5.1) and so can be assumed always present. The listener that checks when the cluster understands ME_SSTABLE was removed and in its place we default to sstable_version_types::me (and call on_enabled() immediately).	2024-05-17 20:41:19 +03:00
Kefu Chai	617e532859	db: config: drop operator<<() for error_injection_at_startup it is not used anymore, so let's drop it. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#18701	2024-05-16 15:10:57 +03:00
Michał Jadwiszczak	573e13e3f1	db/cql_type_parser: use generic topological sorting	2024-05-16 13:30:03 +02:00
Michał Jadwiszczak	3830f3bd23	db/cql_type_parses: futurize raw_builder::build() In order to use generic topological sort, build() method needs to return future.	2024-05-16 13:30:03 +02:00
Piotr Dulikowski	68eca3778c	Merge 'mv: throttle view update generation for large queries' from Wojciech Mitros This series is a reupload of #13792 with a few modifications, namely a test is added and the conflicts with recent tablet related changes are fixed. See https://github.com/scylladb/scylladb/issues/12379 and https://github.com/scylladb/scylladb/pull/13583 for a detailed description of the problem and discussions. This PR aims to extend the existing throttling mechanism to work with requests that internally generate a large amount of view updates, as suggested by @nyh. The existing mechanism works in the following way: * Client sends a request, we generate the view updates corresponding to the request and spawn background tasks which will send these updates to remote nodes * Each background task consumes some units from the `view_update_concurrency_semaphore`, but doesn't wait for these units, it's just for tracking * We keep track of the percent of consumed units on each node, this is called `view update backlog`. * Before sending a response to the client we sleep for a short amount of time. The amount of time to sleep for is based on the fullness of this `view update backlog`. For a well behaved client with limited concurrency this will limit the amount of incoming requests to a manageable level. This mechanism doesn't handle large DELETE queries. Deleting a partition is fast for the base table, but it requires us to generate a view update for every single deleted row. The number of deleted rows per single client request can be in the millions. Delaying response to the request doesn't help when a single request can generate millions of updates. To deal with this we could treat the view update generator just like any other client and force it to wait a bit of time before sending the next batch of updates. The amount of time to wait for is calculated just like in the existing throttling code, it's based on the fullness of `view update backlogs`. The new algorithm of view update generation looks something like this: ```c++ for(;;) { auto updates = generate_updates_batch_with_max_100_rows(); co_await seastar::sleep(calculate_sleep_time_from_backlogs()); spawn_background_tasks_for_updates(updates); } ``` Fixes: https://github.com/scylladb/scylladb/issues/12379 Closes scylladb/scylladb#16819 * github.com:scylladb/scylladb: test: add test for bad_allocs during large mv queries mv: throttle view update generation for large queries exceptions: add read_write_timeout_exception, a subclass of request_timeout_exception db/view: extract view throttling delay calculation to a global function view_update_generator: add get_storage_proxy() storage_proxy: make view backlog getters public	2024-05-16 08:22:54 +02:00
Tomasz Grabiec	feafe0f6a7	commitlog_replayer: Avoid deprecated sharder::shard_of() shard_for_writes() is appropriate, because we're writing. It can happen that the tablet was migrated away and no shard is the owner. In that case the mutation is dropped, as it should be, because "shards" is empty.	2024-05-16 00:28:47 +02:00
Piotr Dulikowski	448f651049	Merge 'hinted handoff: Prevent segmentation fault when initializing endpoint managers ' from Dawid Mędrek We don't attempt to create an endpoint manager for a hint directory if there is no mapping host ID–IP corresponding to the directory's name, an IP address. That prevents a segmentation fault. Fixes scylladb/scylladb#18649 Closes scylladb/scylladb#18650 * github.com:scylladb/scylladb: db/hints: Remove an unused header db/hints: Remove migrating flag before initializing endpoint managers db/hints: Prevent segmentation fault when initializing endpoint managers	2024-05-14 07:34:16 +02:00
Wojciech Mitros	485eb7a64c	test: add test for bad_allocs during large mv queries This patch adds a test for reproducing issue #12379, which is being fixed in #16819. The test case works by creating a table with a materialized view, and then performing a partition delete query on it. At the same time, it uses injections to limit the memory to a level lower than usual, in order to increase the consistency of the test, and to limit its runtime. Before #16819, the test would exceed the limit and fail, and now the next allocation is throttled using a sleep.	2024-05-13 18:16:39 +02:00
Jan Ciolek	e0442d7bfa	mv: throttle view update generation for large queries For every mutation applied to the base table we have to generate the corresponding materialized view table updates. In case of simple requests, like INSERT or UPDATE, the number of view updates generated per base table mutation is limited to at most a few view table updates per base table update. The situation is different for DELETE queries, which can delete the whole partitions or clustering ranges. Range deletions are fast on the base table, but for the view table the situation is different. Deleting a single partition in the base table will generate as many singular view updates as there are rows in the deleted partition, which could potentially be in the millions. To prevent OOM view updates are generated in batches of at most 100 rows. There is a loop which generates the next batch of updates, spawns tasks to send them to remote nodes, generates another batch and so on. The problem is that there is no concurrency control - each batch is scheduled to be sent in the background, but the following batch is generated without waiting for the previously generated updates to be sent. This can lead to unbounded concurrency and OOM. To protect against this view update generation should be limited somehow. There is an existing mechanism for limiting view updates - throttling. We keep track of how many pending view updates there are, in the view backlog, and delay responses to the client based on this backlog's fullness. For a well behaved client with limited concurrency this will slow down the amount of incoming requests until it reaches an optimal point. This works for simple queries (INSERT, UPDATE, ...), but it doesn't do anything for range DELETEs. A DELETE is a single request that generates millions of view updates, delaying client response doesn't help. The throttling mechanism could be extend to cover this case - we could treat the DELETE request like any other client and force it to wait before sending more updates. This commit implements this approach - before sending the next batch of updates the generator is forced to sleep for a bit of time, calculated using the exisiting throttling equation. The more full the backlog gets the more the generator will have to sleep for, and hopefully this will prevent overloading the system with view updates. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2024-05-13 18:16:23 +02:00
Jan Ciolek	ae28b8bdb7	db/view: extract view throttling delay calculation to a global function In order to prevent overload caused by too many view updates, their number is limited by delaying client responses. The amount of time to delay for is calculated based on the fullness of the view update backlog. Currently this is done in the function calculate_delay, used by abstract_write_response_handler. In the following commits I will introduce another throttling mechanism that uses the same equation to calculate wait time, so it would be good to reuse the exsiting function. Let's make the function globally accessible. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2024-05-13 18:14:56 +02:00
Dawid Medrek	ef8f14d44b	db/hints: Remove an unused header	2024-05-13 16:40:47 +02:00
Dawid Medrek	c9bbb92b1a	db/hints: Remove migrating flag before initializing endpoint managers Before these changes, if initializing endpoint managers after the migration of hinted handoff to host ID is done throws an exception, we don't remove the flag indicating the migration is still in progress. However, the migration has, in practice, finished -- all of the hint directories have been mapped to host IDs and all of the nodes in the cluster are host-ID-based. Because of that, it makes sense to remove the flag early on.	2024-05-13 16:40:47 +02:00
Dawid Medrek	bdcde0c210	db/hints: Prevent segmentation fault when initializing endpoint managers If hinted handoff is still IP-based and there is a hint directory representing an IP without a corresponding mapping to a host ID in `locator::token_metadata`, an attemp to initialize its endpoint manager will result in a segmentation fault. This commit prevents that.	2024-05-13 16:40:47 +02:00
Asias He	952dfc6157	repair: Introduce repair_partition_count_estimation_ratio config option In commit `642f9a1966` (repair: Improve estimated_partitions to reduce memory usage), a 10% hard coded estimation ratio is used. This patch introduces a new config option to specify the estimation ratio of partitions written by repair out of the total partitions. It is set to 0.1 by default. Fixes #18615 Closes scylladb/scylladb#18634	2024-05-13 15:16:55 +03:00
Avi Kivity	cc8b4e0630	batchlog_manager, test: initialize delay configuration In `b4e66ddf1d` (4.0) we added a new batchlog_manager configuration named delay, but forgot to initialize it in cql_test_env. This somehow worked, but doesn't with clang 18. Fix it by initializing to 0 (there isn't a good reason to delay it). Also provide a default to make it safer. Closes scylladb/scylladb#18572	2024-05-13 07:57:35 +03:00
Kefu Chai	2a9a874e19	db,service: fix typos in comments Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#18567	2024-05-09 08:26:44 +03:00
Calle Wilund	79d56ccaad	commitlog: Fix request_controller semaphore accounting. Fixes #18488 Due to the discrepancy between bytes added to CL and bytes written to disk (due to CRC sector overhead), we fail to account for the proper byte count when issuing account_memory_usage in allocate (using bytes added) and in cycle:s notify_memory_written (disk bytes written). This leads us to slowly, but surely, add to the semaphore all the time. Eventually rendering it useless. Also, terminate call would _not_ take any of this into account, and the chunk overhead there would cause a (smaller) discrepancy as well. Fix by simply ensuring that buffer alloc handles its byte usage, then accounting based on buffer position, not input byte size. Closes scylladb/scylladb#18489	2024-05-09 08:26:44 +03:00
Botond Dénes	155332ebf8	Merge 'Drain view_builder in generic drain (again)' from Pavel Emelyanov Some time ago #16558 was merged that moved view builder drain into generic drain. After this merge dtests started to fail from time to time, so the PR was reverted (see #18278). In #18295 the hang was found. View builder drain was moved from "before stopping messaging service to "after" it, and view update write handlers in proxy hanged for hard-coded timeout of 5 minutes without being aborted. Tests don't wait for 5 minutes and kill scylla, then complain about it and fail. This PR brings back the original PR as well as the necessary fix that cancels view update write handlers on stop. Closes scylladb/scylladb#18408 * github.com:scylladb/scylladb: Reapply "Merge 'Drain view_builder in generic drain' from ScyllaDB" view: Abort pending view updates when draining	2024-05-09 08:26:44 +03:00
Botond Dénes	96a7ed7efb	Merge 'sstables: add dead row count when issuing warning to system.large_partitions' from Ferenc Szili This is the second half of the fix for issue #13968. The first half is already merged with PR #18346 Scylla issues warnings for partitions containing more rows than a configured threshold. The warning is issued by inserting a row into the `system.large_partitions` table. This row contains the information about the partition for which the warning is issued: keyspace, table, sstable, partition key and size, compaction time and the number of rows in the partition. A previous PR #18346 also added range tombstone count to this row. This change adds a new counter for dead rows to the large_partitions table. This change also adds cluster feature protection for writing into these new counters. This is needed in case a cluster is in the process of being upgraded to this new version, after which an upgraded node writes data with the new schema into `system.large_partitions`, and finally a node is then rolled back to an old version. This node will then revert the schema to the old version, but the written sstables will still contain data with the new counters, causing any readers of this table to throw errors when they encounter these cells. This is an enhancement, and backporting is not needed. Fixes #13968 Closes scylladb/scylladb#18458 * github.com:scylladb/scylladb: sstable: added test for counting dead rows sstable: added docs for system.large_partitions.dead_rows sstable: added cluster feature for dead rows and range tombstones sstable: write dead_rows count to system.large_partitions sstable: added counter for dead rows	2024-05-09 08:26:43 +03:00
Kamil Braun	03818c4aa9	direct_failure_detector: increase ping timeout and make it tunable The direct failure detector design is simplistic. It sends pings sequentially and times out listeners that reached the threshold (i.e. didn't hear from a given endpoint for too long) in-between pings. Given the sequential nature, the previous ping must finish so the next ping can start. We timeout pings that take too long. The timeout was hardcoded and set to 300ms. This is too low for wide-area setups -- latencies across the Earth can indeed go up to 300ms. 3 subsequent timed out pings to a given node were sufficient for the Raft listener to "mark server as down" (the listener used a threshold of 1s). Increase the ping timeout to 600ms which should be enough even for pinging the opposite side of Earth, and make it tunable. Increase the Raft listener threshold from 1s to 2s. Without the increased threshold, one timed out ping would be enough to mark the server as down. Increasing it to 2s requires 3 timed out pings which makes it more robust in presence of transient network hiccups. In the future we'll most likely want to decrease the Raft listener threshold again, if we use Raft for data path -- so leader elections start quickly after leader failures. (Faster than 2s). To do that we'll have to improve the design of the direct failure detector. Ref: scylladb/scylladb#16410 Fixes: scylladb/scylladb#16607 --- I tested the change manually using `tc qdisc ... netem delay`, setting network delay on local setup to ~300ms with jitter. Without the change, the result is as observed in scylladb/scylladb#16410: interleaving ``` raft_group_registry - marking Raft server ... as dead for Raft groups raft_group_registry - marking Raft server ... as alive for Raft groups ``` happening once every few seconds. The "marking as dead" happens whenever we get 3 subsequent failed pings, which is happens with certain (high) probability depending on the latency jitter. Then as soon as we get a successful ping, we mark server back as alive. With the change, the phenomenon no longer appears. Closes scylladb/scylladb#18443	2024-05-07 23:40:23 +02:00
Piotr Dulikowski	64ba620dc2	Merge 'hinted handoff: Use host IDs instead of IPs in the module' from Dawid Mędrek This pull request introduces host ID in the Hinted Handoff module. Nodes are now identified by their host IDs instead of their IPs. The conversion occurs on the boundary between the module and `storage_proxy.hh`, but aside from that, IPs have been erased. The changes take into considerations that there might still be old hints, still identified by IPs, on disk – at start-up, we map them to host IDs if it's possible so that they're not lost. Refs scylladb/scylladb#6403 Fixes scylladb/scylladb#12278 Closes scylladb/scylladb#15567 * github.com:scylladb/scylladb: docs: Update Hinted Handoff documentation db/hints: Add endpoint_downtime_not_bigger_than() db/hints: Migrate hinted handoff when cluster feature is enabled db/hints: Handle arbitrary directories in resource manager db/hints: Start using hint_directory_manager db/hints: Enforce providing IP in get_ep_manager() db/hints: Introduce hint_directory_manager db/hints/resource_manager: Update function description db/hints: Coroutinize space_watchdog::scan_one_ep_dir() db/hints: Expose update lock of space watchdog db/hints: Add function for migrating hint directories to host ID db/hints: Take both IP and host ID when storing hints db/hints: Prepare initializing endpoint managers for migrating from IP to host ID db/hints: Migrate to locator::host_id db/hints: Remove noexcept in do_send_one_mutation() service: Add locator::host_id to on_leave_cluster service: Fix indentation db/hints: Fix indentation	2024-05-06 09:58:18 +02:00
Benny Halevy	ebff5f5d70	everywhere: include seastar headers using angle brackets seastar is an external library therefore it should use the system-include syntax. Closes scylladb/scylladb#18513	2024-05-06 10:00:31 +03:00
Kefu Chai	0b0e661a85	build: bring abseil submodule back because of https://bugzilla.redhat.com/show_bug.cgi?id=2278689, the rebuilt abseil package provided by fedora has different settings than the ones if the tree is built with the sanitizer enabled. this inconsistency leads to a crash. to address this problem, we have to reinstate the abseil submodule, so we can built it with the same compiler options with which we build the tree. in this change * Revert "build: drop abseil submodule, replace with distribution abseil" * update CMake building system with abseil header include settings * bump up the abseil submodule to the latest LTS branch of abseil: lts_2024_01_16 * update scylla-gdb.py to adapt to the new structure of flat_hash_map This reverts commit `8635d24424`. Signed-off-by: Kefu Chai <kefu.chai@scylladb.com> Closes scylladb/scylladb#18511	2024-05-05 23:31:09 +03:00
Benny Halevy	7f372dd9ae	schema_tables: convert_schema_to_mutations: make_canonical_mutation_gently To prevent stalls due to large schema mutations. While at it, reserve the result canonical_mutation vector. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-05-02 19:37:05 +03:00
Benny Halevy	61dea98185	schema_tables: redact_columns_for_missing_features: get input mutation using rvalue reference The function upgrades the input mutation only in certain cases. Currently it accepts the input mutation by value, which may cause and extraneous copy if the caller doesn't move the mutation, as done in `adjust_schema_for_schema_features`. Getting an rvalue reference instead makes the interface clearer. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-05-02 19:37:05 +03:00
Benny Halevy	a126160d7e	frozen_mutation: move unfreeze_gently to async_utils Unfreeze_gently doesn't have to be a method of frozen_mutation. It might as well be implemented as a free function reading from a frozen_mutation and preparing a mutation gently. The logic will be used in a later patch to make a canonical mutation directly from a frozen_mutation instead of unfreezing it and then converting it to a canonical_mutation. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2024-05-02 19:27:56 +03:00
Ferenc Szili	90634b419c	sstable: added cluster feature for dead rows and range tombstones Previously, writing into system.large_partitions was done by calling record_large_partition(). In order to write different data based on the cluster feature flag, another level of indirection was added by calling _record_large_partitions which is initialized to a lambda which calls internal_record_large_partitions(). This function does not record the values of the two new columns (dead_rows and range_tombstones). After the cluster feature flag becomes true, _record_large_partitions is set to a lambda which calls internal_record_large_partitions_all_data() which record the values of the two new columns.	2024-05-02 11:49:46 +02:00
Ferenc Szili	b06af5b2b9	sstable: write dead_rows count to system.large_partitions	2024-05-02 11:49:10 +02:00
Jan Ciolek	59b7920b0b	view_update_generator: add get_storage_proxy() During view generation we would like to be able to access information about the current state of view update backlogs, but this information is kept inside storage_proxy. A reference to storage_proxy is kept inside view_update_generator, so the easiest way to get access to it from the view update code is by adding a public getter there. There's already a similar getter for replica::database: get_db(), so it's in line with the rest of the code. Signed-off-by: Jan Ciolek <jan.ciolek@scylladb.com>	2024-05-02 10:59:55 +02:00
Pavel Emelyanov	67736b5cd3	Reapply "Merge 'Drain view_builder in generic drain' from ScyllaDB" This reverts commit `9c2a836607`.	2024-05-02 08:16:14 +03:00
Pavel Emelyanov	d47053266b	view: Abort pending view updates when draining When view builder is drained (it now happens very early, but next patch moves this into regular drain) it waits for all on-going view build steps to complete. This includes waiting for any outstanding proxy view writes to complete as well. View writes in proxy have very high timeout of 5 minutes but they are cancellable. However, canecelling of such writes happens in proxy's drain_on_shutdown() call which, in turn, happens pretty late on shutdown. Effectively, by the time it happens all view writes mush have completed already, so stop-time cancelling doesn't really work nowadays. Next patch makes view builder drain happen a bit later during shutdown, namely -- _after_ shutting down messaging service. When it happen that late, non-working view writes cancellation becomes critical, as view builder drain hangs for aforementioned 5 minutes. This patch explicitly cancels all view writes when view builder stops. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2024-05-02 08:16:12 +03:00
Botond Dénes	65a385f5d0	Merge 'Relax the way view builder code checks if a table exists' from Pavel Emelyanov There are two places that workaround db.column_family_exists() call with some fancy exceptions-catching lambda. This PR makes things simpler. Closes scylladb/scylladb#18441 * github.com:scylladb/scylladb: view: Open-code one line lambda checking if table exists view: Use non-throwoing check if a table exists	2024-05-01 10:14:58 +03:00
Dawid Medrek	46ab22f805	db/hints: Add endpoint_downtime_not_bigger_than() We add an auxiliary function checking if a node hasn't been down for too long. Although `gms::gossiper` provides already exposes a function responsible for that, it requires that its argument be an IP address. That's the reason we add a new function.	2024-04-28 01:22:59 +02:00
Dawid Medrek	0ef8d67d32	db/hints: Migrate hinted handoff when cluster feature is enabled These changes migrate hinted handoff to using host ID as soon as the corresponding cluster feature is enabled. When a node starts, it defaults to creating directories naming them after IP addresses. When the whole cluster has upgraded to a version of Scylla that can handle directories representing host IDs, we perform a migration of the IP folders, i.e. we try to rename them to host IDs. Invalid directories, i.e. those that represent neither an IP address, nor a host ID, are removed. During the migration, hinted handoff is disabled. It is necessary because we have to modify the disk's contents, so new hints cannot be saved until the migration finishes.	2024-04-28 01:22:57 +02:00
Dawid Medrek	58784cd8db	db/hints: Handle arbitrary directories in resource manager Before these changes, resource manager only handled the case when directories it browsed represented valid host IDs. However, since before migrating hinted handoff to using host IDs we still name directories after IP addresses, that would lead to exceptins that shouldn't happen. We make resource manager handle directories of arbitrary names correctly.	2024-04-27 22:31:07 +02:00
Dawid Medrek	ee84e810ca	db/hints: Start using hint_directory_manager We start keeping track of mappings IP - host ID. The mappings are between endpoint managers (identified by host IDs) and the hint directories managed by them (represented by IP addresses). This is a prelude to handling IP directories by the hint shard manager. The structure should only be used by the hint manager before it's migrated to using host IDs. The reason for that is that we rely on the information obtained from the structure, but it might not make sense later on. When we start creating directories named after host IDs and there are no longer directories representing IP addresses, there is no relation between host IDs and IPs -- just because the structure is supposed to keep track between endpoint managers and hint directories that represent IP addresses. If they represent host IDs, the connection between the two is lost. Still using the data structure could lead to bugs, e.g. if we tried to associate a given endpoint manager's host ID with its corresponding IP address from locator::token_metadata, it could happen that two different host IDs would be bound to the same IP address by the data structure: node A has IP I1, node A changes its IP to I2, node B changes its IP to I1. Though nodes A and B have different host IDs (because they are unique), the code would try to save hints towards node B in node A's hint directory, which should NOT happen. Relying on the data structure is thus only safe before migrating hinted handoff to using host IDs. It may happen that we save a hint in the hint directory of the wrong node indeed, but since migration to using host IDs is a process that only happens once, it's a price we are ready to pay. It's only imperative to prevent it from happening in normal circumstances.	2024-04-27 22:31:07 +02:00
Dawid Medrek	aa4b06a895	db/hints: Enforce providing IP in get_ep_manager() We drop the default argument in the function's signature. Also, we adjust the code of change_host_filter() to be able to perform calls to get_ep_manager().	2024-04-27 22:31:07 +02:00
Dawid Medrek	d0f58736c8	db/hints: Introduce hint_directory_manager This commit introduces a new class responsible for keeping track of mappings IP-host ID. Before hinted handoff is migrated to using host IDs, hint directories still have to represent IP addresses. However, since we identify endpoint managers by host IDs already, we need to be able to associate them with the directories they manage. This class serves this purpose.	2024-04-27 22:31:07 +02:00
Dawid Medrek	f9af01852d	db/hints/resource_manager: Update function description The current description of the function `space_watchdog::scan_one_ep_dir` is not up-to-date with the function's signature. This commit updates it.	2024-04-27 22:31:07 +02:00
Dawid Medrek	59d49c5219	db/hints: Coroutinize space_watchdog::scan_one_ep_dir()	2024-04-27 22:31:07 +02:00
Dawid Medrek	8fd9c80387	db/hints: Expose update lock of space watchdog We expose the update lock of space watchdog to be able to prevent it from scanning hint directories. It will be necessary in an upcoming commit when we will be renaming hint directories and possibly removing some of them. Race conditions are unacceptable, so resource manager cannot be able to access the directory during that time.	2024-04-27 22:31:07 +02:00
Dawid Medrek	934e4bb45e	db/hints: Add function for migrating hint directories to host ID We add a function that will be used while migrating hinted handoff to using host IDs. It iterates over existing hint directories and tries to rename them to the corresponding host IDs. In case of a failure, we remove it so that at the end of its execution the only remaining directories are those that represent host IDs.	2024-04-27 22:31:04 +02:00
Dawid Medrek	e36f853f9b	db/hints: Take both IP and host ID when storing hints The store_hint() method starts taking both an IP and a host ID as its arguments. The rationale for the change is depending on the stage of the cluster (before an upgrade to the host-ID-based hinted handdof and after it), we might need to create a directory representing either an IP address, or a host ID. Because locator::topology can change in the before obtaining the host ID we pass and when the function is being executed, we need to pass both parameters explicitly to ensure the consistency between them.	2024-04-27 20:35:58 +02:00

1 2 3 4 5 ...

3765 Commits