scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-04-29 04:37:00 +00:00

Author	SHA1	Message	Date
Wojciech Mitros	055a6c2cee	storage_proxy: send hints to pending replicas Consider the following scenario: - Current replica set is [A, B, C] - write succeeds on [A, B], and a hint is logged for node C - before the hint is replayed, D bootstraps and the token migrates from C to D - hint is replayed to node C while D is pending, but it's too late, since streaming for that token is already done - C is cleaned up, replayed data is lost, and D has a stale copy until next repair. In this scenario we effectively fail to send the hint. This scenario is also more likely to happen with tablets, as it can happen for every tablet migration. This issue is particularly detrimental to materialized views. View updates use hints by default and a specific view update may be sent to just one view replica (when a single base replica has a different row state due to reordering or missed writes). When we lose a hint for such a view update, we can generate a persistent inconsistency between the base and view - ghost rows can appear due to a lost tombstone and rows may be missing in the view due to a lost row update. Such inconsistencies can be fixed neither by repairing the view nor by repairing the base table. To handle this, in this patch we add the pending replicas to the list of targets of each hint, even if the original target is still alive. This will cause some updates to be redundant. These updates are probably unavoidable for now, but they shouldn't be too common either. The scenarios for them are: 1. managing to send the hint to the source of a migrating replica before streaming its token - the write will arrive on the pending replica anyway in streaming 2. the hint target not being the source of the migration - if we managed to apply the original write of the hint to the actual source of the migration, the pending replica will get it during streaming 3. sending the same hint to many targets at a similar time - while sending to each target, we'll see the same pending replica for the hint so we'll send it multiple times 4. possible retries where even though the hint was successfully sent to the main target, we failed to send it to the pending replica, so we need to retry the entire write This patch handles both tablet migrations and tablet rebuilds. In the future, for tablet migrations, we can avoid sending the hint to pending replicas if the hint target is not the source of the migration, which would allow us to avoid the redundant writes 2 and 3. For rack-aware RF, this will be as simple as checking whether the replicas are in the same rack. We also add a test case reproducing the issue. Co-Authored-By: Raphael S. Carvalho <raphaelsc@scylladb.com> Fixes https://github.com/scylladb/scylladb/issues/19835 Closes scylladb/scylladb#25590 (cherry picked from commit `10b8e1c51c`) Closes scylladb/scylladb#25882	2025-09-10 10:29:52 +03:00
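The target selection described in this commit can be illustrated with a minimal sketch. It is not the `storage_proxy` code: `host_id` and `hint_targets` are hypothetical names, and the real implementation works with Scylla's own replica and topology types. The sketch only shows the rule "original target plus every pending replica, without duplicates".

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for a host identifier (assumption, not Scylla's type).
using host_id = std::string;

// Build the target list for one hint: the original target first,
// then every pending replica that is not already in the list.
std::vector<host_id> hint_targets(const host_id& original_target,
                                  const std::vector<host_id>& pending_replicas) {
    std::vector<host_id> targets{original_target};
    for (const auto& p : pending_replicas) {
        if (std::find(targets.begin(), targets.end(), p) == targets.end()) {
            targets.push_back(p);
        }
    }
    return targets;
}
```

In the scenario from the message, a hint for C replayed while D is pending would go to both C and D, so D no longer depends on streaming having happened after the replay.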
Calle Wilund	2bbf3cf669	system_keyspace: Prune dropped tables from truncation on start/drop Fixes #25683 Once a table drop is complete, there should be no reason to retain truncation records for it, as any replay should skip mutations anyway (no CF), and if we somehow resurrect a dropped table, this replay-resurrected data is the least of our problems anyway. Adds a prune phase to the startup drop_truncation_rp_records run, which skips updating and instead deletes records for non-existent tables (which should patch any existing servers with lingering data as well). Also does an explicit delete of records on actual table DROP, to ensure we don't grow this table more than needed even on nodes with long uptime. Small unit test included. Closes scylladb/scylladb#25699 (cherry picked from commit `bc20861afb`) Closes scylladb/scylladb#25815	2025-09-05 19:02:39 +03:00
Pavel Emelyanov	d484837a2a	Merge '[Backport 2025.3] db/hints: Improve logs' from Scylladb[bot] Before these changes, the logs in hinted handoff often didn't provide crucial information like the identifier of the node that hints were being sent to. Also, some of the logs were misleading and referred to other places in the code than the one where an exception or some other situation really occurred. We modify those logs, extending them by more valuable information and fixing existing issues. What's more, all of the logs in `hint_endpoint_manager` and `hint_sender` follow a consistent format now: ``` <class_name>[<destination host ID>]:<function_name>: <message> ``` This way, we should always have AT LEAST the basic information. Fixes scylladb/scylladb#25466 Backport: There is no risk in backporting these changes. They only have impact on the logs. On the other hand, they might prove helpful when debugging an issue in hinted handoff. - (cherry picked from commit `2327d4dfa3`) - (cherry picked from commit `d7bc9edc6c`) - (cherry picked from commit `6f1fb7cfb5`) Parent PR: #25470 Closes scylladb/scylladb#25538 * github.com:scylladb/scylladb: db/hints: Add new logs db/hints: Adjust log levels db/hints: Improve logs	2025-09-04 11:36:30 +03:00
Pavel Emelyanov	ad6dbcfdc5	Merge '[Backport 2025.3] generic server: 2 step shutdown' from Scylladb[bot] This PR implements solution proposed in scylladb/scylladb#24481 Instead of terminating connections immediately, the shutdown now proceeds in two stages: first closing the receive (input) side to stop new requests, then waiting for all active requests to complete before fully closing the connections. The updated shutdown process is as follows: 1. Initial Shutdown Phase * Close the accept gate to block new incoming connections. * Abort all accept() calls. * For all active connections: * Close only the input side of the connection to prevent new requests. * Keep the output side open to allow responses to be sent. 2. Drain Phase * Wait for all in-progress requests to either complete or fail. 3. Final Shutdown Phase * Fully close all connections. Fixes scylladb/scylladb#24481 - (cherry picked from commit `122e940872`) - (cherry picked from commit `3848d10a8d`) - (cherry picked from commit `3610cf0bfd`) - (cherry picked from commit `27b3d5b415`) - (cherry picked from commit `061089389c`) - (cherry picked from commit `7334bf36a4`) - (cherry picked from commit `ea311be12b`) - (cherry picked from commit `4f63e1df58`) Parent PR: #24499 Closes scylladb/scylladb#25519 * github.com:scylladb/scylladb: test: Set `request_timeout_on_shutdown_in_seconds` to `request_timeout_in_ms`, decrease request timeout. generic_server: Two-step connection shutdown. transport: consmetic change, remove extra blanks. transport: Handle sleep aborted exception in sleep_until_timeout_passes generic_server: replace empty destructor with `= default` generic_server: refactor connection::shutdown to use `shutdown_input` and `shutdown_output` generic_server: add `shutdown_input` and `shutdown_output` functions to `connection` class. test: Add test for query execution during CQL server shutdown	2025-09-04 11:35:55 +03:00
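The three phases listed in this merge can be modelled with a small synchronous sketch. Everything here is an assumption made for illustration: the `connection` and `server` structs are hypothetical and much simpler than `generic_server`, which is asynchronous. Only the names `shutdown_input` and `shutdown_output` are taken from the commit titles.

```cpp
#include <cassert>
#include <vector>

// Hypothetical connection model; not the generic_server API.
struct connection {
    bool input_open = true;
    bool output_open = true;
    int in_flight = 0;  // requests accepted but not yet answered

    void shutdown_input() { input_open = false; }
    void shutdown_output() { output_open = false; }

    // A request is accepted only while the input side is open.
    bool try_accept_request() {
        if (!input_open) return false;
        ++in_flight;
        return true;
    }
    // A response can be sent only while the output side is open.
    bool complete_request() {
        if (!output_open || in_flight == 0) return false;
        --in_flight;
        return true;
    }
};

struct server {
    bool accepting = true;
    std::vector<connection> connections;

    // Phase 1: stop accepting and close only the input side.
    void begin_shutdown() {
        accepting = false;
        for (auto& c : connections) c.shutdown_input();
    }
    // Phase 2: true once no request is in flight any more.
    bool drained() const {
        for (const auto& c : connections) {
            if (c.in_flight > 0) return false;
        }
        return true;
    }
    // Phase 3: fully close every connection.
    void finish_shutdown() {
        for (auto& c : connections) c.shutdown_output();
    }
};
```

A caller would invoke `begin_shutdown()`, wait until `drained()` holds (or a timeout expires), and then call `finish_shutdown()`.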
Piotr Dulikowski	debc637ac1	Merge '[Backport 2025.3] system_keyspace: add peers cache to get_ip_from_peers_table' from Scylladb[bot] The gossiper can call `storage_service::on_change` frequently (see scylladb/scylla-enterprise#5613), which may cause high CPU load and even trigger OOMs or related issues. This PR adds a temporary cache for `system.peers` to resolve host_id -> ip without hitting storage on every call. The cache is short-lived to handle the unlikely case where `system.peers` is updated directly via CQL. This is a temporary fix; a more thorough solution is tracked in https://github.com/scylladb/scylladb/issues/25620. Fixes scylladb/scylladb#25660 backport: this patch needs to be backported to all supported versions (2025.1/2/3). - (cherry picked from commit `91c633371e`) - (cherry picked from commit `de5dc4c362`) - (cherry picked from commit `4b907c7711`) Parent PR: #25658 Closes scylladb/scylladb#25766 * github.com:scylladb/scylladb: storage_service: move get_host_id_to_ip_map to system_keyspace system_keyspace: use peers cache in get_ip_from_peers_table storage_service: move get_ip_from_peers_table to system_keyspace	2025-09-01 21:21:26 +02:00
Petr Gusev	c4386c2aa4	storage_service: move get_host_id_to_ip_map to system_keyspace Reimplemented the function to use the peers cache. It could be replaced with get_ip_from_peers_table, but that would create a coroutine frame for each call. (cherry picked from commit `4b907c7711`)	2025-09-01 11:22:55 +02:00
Petr Gusev	7ec3e166c6	system_keyspace: use peers cache in get_ip_from_peers_table The storage_service::on_change method can be called quite often by the gossiper, see scylladb/scylla-enterprise#5613. In this commit we introduce a temporal cache for system.peers so that we don't have to go to the storage each time we need to resolve host_id -> ip. We keep the cache only for a small amount of time to handle the (unlikely) scenario when the user wants to update system.peers table from CQL. Fixes scylladb/scylladb#25660 (cherry picked from commit `de5dc4c362`)	2025-09-01 11:22:05 +02:00
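A short-lived cache of this kind might look like the sketch below. It is an illustration under assumptions, not the `system_keyspace` code: `peers_cache`, the `loader` callback (standing in for the read of `system.peers`) and the explicit `now` parameter are all hypothetical, chosen so the expiry logic can be exercised without real time passing.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <optional>
#include <string>
#include <unordered_map>
#include <utility>

using clock_type = std::chrono::steady_clock;
using peers_map = std::unordered_map<std::string, std::string>;  // host_id -> ip

// Hypothetical short-lived cache in front of an expensive table read.
class peers_cache {
    std::function<peers_map()> _load;  // stands in for reading system.peers
    clock_type::duration _ttl;
    peers_map _map;
    std::optional<clock_type::time_point> _loaded_at;
public:
    peers_cache(std::function<peers_map()> load, clock_type::duration ttl)
        : _load(std::move(load)), _ttl(ttl) {}

    // Reload only when the cache is empty or older than the TTL.
    std::optional<std::string> get_ip(const std::string& host_id,
                                      clock_type::time_point now) {
        if (!_loaded_at || now - *_loaded_at >= _ttl) {
            _map = _load();
            _loaded_at = now;
        }
        auto it = _map.find(host_id);
        if (it == _map.end()) return std::nullopt;
        return it->second;
    }
};
```

Keeping the TTL small bounds how long a direct CQL update to the table can go unnoticed, which is the trade-off the commit message describes.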
Petr Gusev	5f8664757a	storage_service: move get_ip_from_peers_table to system_keyspace We plan to add a cache to get_ip_from_peers_table in upcoming commits. It's more convenient to do this from system_keyspace, since the only two methods that mutate system.peers (remove_endpoint and update_peers_info) are already there. (cherry picked from commit `91c633371e`)	2025-09-01 11:21:55 +02:00
Calle Wilund	2e08d651a8	system_keyspace: Limit parallelism in drop_truncation_records Fixes #25682 Refs scylla-enterprise#5580 If the truncation table is large in entries, we might create a huge parallel execution, quite possibly consuming loads of resources doing something quite trivial. Limit concurrency to a small-ish number Closes scylladb/scylladb#25678 (cherry picked from commit `2eccd17e70`) Closes scylladb/scylladb#25751	2025-09-01 09:13:44 +03:00
Dawid Mędrek	7f58681482	db/commitlog: Extend error messages for corrupted data We're providing additional information in error messages when throwing an exception related to data corruption: when a segment is truncated and when its content is invalid. That might prove helpful when debugging. Closes scylladb/scylladb#25190 (cherry picked from commit `408b45fa7e`) Closes scylladb/scylladb#25461	2025-09-01 09:08:29 +03:00
Calle Wilund	fe87af4674	commitlog: Ensure segment deletion is re-entrant Fixes #25709 If we have large allocations, spanning more than one segment, and the internal segment references from lead to secondary are the only thing keeping a segment alive, the implicit drop in discard_unused_segments and orphan_all can cause a recursive call to discard_unused_segments, which in turn can lead to vector corruption/crash, or even double free of segment (iterator confusion). Need to separate the modification of the vector (_segments) from actual releasing of objects. Using temporaries is the easiest solution. To further reduce recursion, we can also do an early clear of segment dependencies in callbacks from segment release (cf release). Closes scylladb/scylladb#25719 (cherry picked from commit `cc9eb321a1`) Closes scylladb/scylladb#25756	2025-08-30 18:50:47 +03:00
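The idea of separating the vector modification from the release of objects can be sketched as follows. This is a toy model, not the commitlog code: `segment`, `manager`, the `unused` flag and the `discard_calls` counter are hypothetical, and the destructor callback stands in for whatever re-enters `discard_unused_segments` in the real implementation.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

struct manager;

// Hypothetical segment: it may hold the last reference to a secondary
// segment, and it calls back into the manager when it is destroyed.
struct segment {
    manager& owner;
    bool unused = false;
    std::shared_ptr<segment> secondary;
    explicit segment(manager& m) : owner(m) {}
    ~segment();
};

struct manager {
    std::vector<std::shared_ptr<segment>> segments;
    int discard_calls = 0;

    void discard_unused_segments() {
        ++discard_calls;
        // Step 1: sort entries into temporaries; nothing is destroyed yet.
        std::vector<std::shared_ptr<segment>> keep, doomed;
        for (auto& s : segments) {
            (s->unused ? doomed : keep).push_back(std::move(s));
        }
        segments = std::move(keep);
        // Step 2: release. Destructors may re-enter this function,
        // but the vector is already in a consistent state.
        doomed.clear();
    }

    ~manager() {
        // Same technique on teardown: empty the member before releasing.
        auto tmp = std::move(segments);
        segments.clear();
    }
};

segment::~segment() {
    secondary.reset();                 // early clear of the dependency
    owner.discard_unused_segments();   // re-entrant callback
}
```

Erasing elements in place while iterating would run these destructors in the middle of the vector update, which is the kind of iterator confusion the commit message describes.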
Dawid Mędrek	d12fdcaa75	db/hints: Add new logs We're adding new logs in just a few places that may however prove important when debugging issues in hinted handoff in the future. (cherry picked from commit `6f1fb7cfb5`)	2025-08-18 16:02:01 +02:00
Dawid Mędrek	325831afad	db/hints: Adjust log levels Some of the logs could be clogging Scylla's logs, so we demote their level to a lower one. On the other hand, some of the logs would most likely not do that, and they could be useful when debugging -- we promote them to debug level. (cherry picked from commit `d7bc9edc6c`)	2025-08-18 16:02:00 +02:00
Dawid Mędrek	7b212edd0c	db/hints: Improve logs Before these changes, the logs in hinted handoff often didn't provide crucial information like the identifier of the node that hints were being sent to. Also, some of the logs were misleading and referred to other places in the code than the one where an exception or some other situation really occurred. We modify those logs, extending them by more valuable information and fixing existing issues. What's more, all of the logs in `hint_endpoint_manager` and `hint_sender` follow a consistent format now: ``` <class_name>[<destination host ID>]:<function_name>: <message> ``` This way, we should always have AT LEAST the basic information. (cherry picked from commit `2327d4dfa3`)	2025-08-18 16:01:57 +02:00
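The log format quoted in this commit can be shown with a tiny helper. The function name `hint_log_line` and the argument values in the usage note are hypothetical; only the layout `<class_name>[<destination host ID>]:<function_name>: <message>` comes from the commit message.

```cpp
#include <cassert>
#include <string>

// Compose "<class_name>[<destination host ID>]:<function_name>: <message>".
std::string hint_log_line(const std::string& class_name,
                          const std::string& host_id,
                          const std::string& function_name,
                          const std::string& message) {
    return class_name + "[" + host_id + "]:" + function_name + ": " + message;
}
```

For example, `hint_log_line("hint_sender", "abc", "send", "failed")` yields `hint_sender[abc]:send: failed`.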
Sergey Zolotukhin	9b7886ed71	generic_server: Two-step connection shutdown. When shutting down in `generic_server`, connections are now closed in two steps. First, only the RX (receive) side is shut down. Then, after all ongoing requests have completed or a timeout has occurred, the connections are fully closed. Fixes scylladb/scylladb#24481 (cherry picked from commit `ea311be12b`)	2025-08-18 15:46:46 +02:00
Patryk Jędrzejczak	4294669e72	db: system_keyspace: peers_table_read_fixup: remove rows with null host_id Currently, `peers_table_read_fixup` removes rows with no `host_id`, but not with null `host_id`. Null host IDs are known to appear in system tables, for example in `system.cluster_status` after a failed bootstrap. We better make sure we handle them properly if they ever appear in `system.peers`. This commit guarantees that null UUID cannot belong to `loaded_endpoints` in `storage_service::join_cluster`, which in particular ensures that we throw a runtime error when a user sets `recovery_leader` to null UUID during the recovery procedure. This is handled by the code verifying that `recovery_leader` belongs to `loaded_endpoints`. (cherry picked from commit `23f59483b6`)	2025-08-05 10:59:39 +00:00
Patryk Jędrzejczak	74cf95a675	db/config, gms/gossiper: change recovery_leader to UUID We change the type of the `recovery_leader` config parameter and `gossip_config::recovery_leader` from sstring to UUID. `recovery_leader` is supposed to store host ID, so UUID is a natural choice. After changing the type to UUID, if the user provides an incorrect UUID, parsing `recovery_leader` will fail early, but the start-up will continue. Outside the recovery procedure, `recovery_leader` will then be ignored. In the recovery procedure, the start-up will fail on: ``` throw std::runtime_error( "Cannot start - Raft-based topology has been enabled but persistent group 0 ID is not present. " "If you are trying to run the Raft-based recovery procedure, you must set recovery_leader."); ``` (cherry picked from commit `445a15ff45`)	2025-08-05 10:59:39 +00:00
Patryk Jędrzejczak	d18d2fa0cf	db/config, utils: allow using UUID as a config option We change the `recovery_leader` option to UUID in the following commit. (cherry picked from commit `ec69028907`)	2025-08-05 10:59:39 +00:00
Ran Regev	7aa7f50b3a	scylla.yaml: add recommended value for stream_io_throughput_mb_per_sec Fixes: #24758 Updated scylla.yaml and the help for scylla --help Closes scylladb/scylladb#24793 (cherry picked from commit `db4f301f0c`) Closes scylladb/scylladb#25280	2025-08-01 15:02:01 +03:00
Avi Kivity	f3297824e3	Revert "config: decrease default large allocation warning threshold to 128k" This reverts commit `04fb2c026d`. 2025.3 got the reduced threshold, but won't get many of the fixes the warning will generate, leaving it very noisy. Better to avoid the noise for this release. Fixes #24384.	2025-07-10 14:12:14 +03:00
Michael Litvak	7b30f487dd	test: test_batchlog_manager: test batch replay when a node is down Add a test of the batchlog manager replay loop applying failed batches while some replica is down. The test reproduces an issue where the batchlog manager tries to replay a failed batch, doesn't get a response from some replica, and becomes stuck. It verifies that the batchlog manager can eventually recover from this situation and continue applying failed batches. (cherry picked from commit `a9b476e057`)	2025-07-08 06:25:36 +00:00
Michael Litvak	c3c489d3d4	batchlog_manager: set timeout on writes Set a timeout on writes of replayed batches by the batchlog manager. We want to avoid having infinite timeout for the writes in case it gets stuck for some unexpected reason. The timeout is set to be high enough to allow any reasonable write to complete. (cherry picked from commit `74a3fa9671`)	2025-07-08 06:25:36 +00:00
Michael Litvak	6fb6bb8dc7	batchlog_manager: abort writes on shutdown On shutdown of batchlog manager, abort all writes of replayed batches by the batchlog manager. To achieve this we set the appropriate write_type to BATCH, and on shutdown cancel all write handlers with this type. (cherry picked from commit `7150632cf2`)	2025-07-08 06:25:36 +00:00
Michael Litvak	02c038efa8	batchlog_manager: create cancellable write response handler When replaying a batch mutation from the batchlog manager and sending it to all replicas, create the write response handler as cancellable. To achieve this we define a new wrapper type for batchlog mutations - batchlog_replay_mutation, and this allows us to overload create_write_response_handler for this type. This is similar to how it's done with hint_wrapper and read_repair_mutation. (cherry picked from commit `fc5ba4a1ea`)	2025-07-08 06:25:36 +00:00
Avi Kivity	b33dd2bd7d	Merge 'sstables/mx/writer: handle non-full prefix row keys' from Botond Dénes Although valid for compact tables, non-full (or empty) clustering key prefixes are not handled for row keys when writing sstables. Only the present components are written; consequently, if the key is empty, it is omitted entirely. When parsing sstables, the parsing code unconditionally parses a full prefix. This mismatch results in parsing failures, as the parser parses part of the row content as a key resulting in a garbage key and subsequent mis-parsing of the row content and maybe even subsequent partitions. Introduce a new system table: `system.corrupt_data` and infrastructure similar to `large_data_handler`: `corrupt_data_handler` which abstracts how corrupt data is handled. The sstable writer now passes rows with such corrupt keys to the corrupt data handler. This way, we avoid corrupting the sstables beyond parsing and the rows are also kept around in system.corrupt_data for later inspection and possible recovery. Add a full-stack test which checks that rows with bad keys are correctly handled. Fixes: https://github.com/scylladb/scylladb/issues/24489 The bug is present in all versions and has to be backported to all supported versions. 
Closes scylladb/scylladb#24492 * github.com:scylladb/scylladb: test/boost/sstable_datafile_test: add test for corrupt data sstables/mx/writer: handler rows with empty keys test/lib/cql_assertions: introduce columns_assertions sstables: add corrupt_data_handler to sstables::sstables tools/scylla-sstable: make large_data_handler a local db: introduce corrupt_data_handler mutation: introduce frozen_mutation_fragment_v2 mutation/mutation_partition_view: read_{clustering,static}_row(): return row type mutation/mutation_partition_view: extract de-ser of {clustering,static} row idl-compiler.py: generate skip() definition for enums serializers idl: extract full_position.idl from position_in_partition.idl db/system_keyspace: add apply_mutation() db/system_keyspace: introduce the corrupt_data table	2025-06-29 18:18:36 +03:00
Ferenc Szili	96267960f8	logging: Add row count to large partition warning message When writing large partitions, that is: partitions with size or row count above a configurable threshold, ScyllaDB outputs a warning to the log: WARN ... large_data - Writing large partition test/test: (1200031 bytes) to me-3glr_0xkd_54jip2i8oqnl7hk8mu-big-Data.db This warning contains the information about the size of the partition, but it does not contain the number of rows written. This can lead to confusion because in cases where the warning was written because of the row count being larger than the threshold, but the partition size is below the threshold, the warning will only contain the partition size in bytes, leading the user to believe the warning was output because of the partition size, when in reality it was the row count that triggered the warning. See #20125 This change adds a size_desc argument to cql_table_large_data_handler::try_record(), which will contain the description of the size of the object written. This method is used to output warnings for large partitions, row counts, row sizes and cell sizes. This change does not modify the warning message for row and cell sizes, only for partition size and row count. The warning for large partitions and row counts will now look like this: WARN ... large_data - Writing large partition test/test: (1200031 bytes/100001 rows) to me-3glr_0xkd_54jip2i8oqnl7hk8mu-big-Data.db Closes scylladb/scylladb#22010	2025-06-26 12:25:38 +02:00
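The new message layout can be reproduced with a short sketch. The helpers `partition_size_desc` and `large_partition_message` are hypothetical and are not the signature of `cql_table_large_data_handler::try_record()`; they only show how a size description carrying both bytes and rows slots into the warning text quoted above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Describe a partition by both size and row count, e.g. "1200031 bytes/100001 rows".
std::string partition_size_desc(std::uint64_t bytes, std::uint64_t rows) {
    return std::to_string(bytes) + " bytes/" + std::to_string(rows) + " rows";
}

// Assemble the warning body; `entity` is the "keyspace/table:" prefix
// exactly as it appears in the quoted log line (an assumption about its shape).
std::string large_partition_message(const std::string& entity,
                                    const std::string& size_desc,
                                    const std::string& sstable_name) {
    return "Writing large partition " + entity + " (" + size_desc + ") to " + sstable_name;
}
```

With `entity` set to `test/test:` and the sstable name from the commit message, the result matches the "after" warning text shown above, so a reader can tell whether the byte size or the row count crossed its threshold.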
Botond Dénes	3e1c50e9a7	db: introduce corrupt_data_handler Similar to large_data_handler, this interface allows sstable writers to delegate the handling of corrupt data. Two implementations are provided: * system_table_corrupt_data_handler - saves corrupt data in system.corrupt_data, with a TTL of 10 days (non-configurable for now) * nop_corrupt_data_handler - drops corrupt data	2025-06-24 14:57:00 +03:00
Botond Dénes	0753643606	db/system_keyspace: add apply_mutation() Allow applying writes in the form of mutations directly to the keyspace. Allows lower-level mutation API to build writes. Advantageous if writes can contain large cells that would otherwise possibly cause large allocation warnings if used via the internal CQL API.	2025-06-24 11:05:30 +03:00
Botond Dénes	92b5fe8983	db/system_keyspace: introduce the corrupt_data table To serve as a place to store corrupt mutation fragments. These fragments cannot be written to sstables, as they would be spread around by compaction and/or repair. They might even make parsing the sstable impossible. So they are stored in this special table instead, kept around to be inspected later and restored if possible.	2025-06-24 11:05:30 +03:00
Patryk Jędrzejczak	6489308ebc	Merge 'Introduce a queue of global topology requests.' from Gleb Natapov Currently only one global topology request (such as truncate, cdc repair, cleanup and alter table) can be pending. If one is already pending others will be rejected with an error. This is not very user friendly, so this series introduces a queue of global requests which allows queuing many global topology requests simultaneously. Fixes: #16822 No need to backport since this is a new feature. Closes scylladb/scylladb#24293 * https://github.com/scylladb/scylladb: topology coordinator: simplify truncate handling in case request queue feature is disable topology coordinator: fix indentation after the previous patch topology coordinator: allow running multiple global commands in parallel topology coordinator: Implement global topology request queue topology coordinator: Do not cancel global requests in cancel_all_requests topology coordinator: store request type for each global command topology request: make it possible to hold global request types in request_type field topology coordinator: move alter table global request parameters into topology_request table topology coordinator: move cleanup global command to report completion through topology_request table topology coordinator: no need to create updates vector explicitly topology coordinator: use topology_request_tracking_mutation_builder::done() instead of open code it topology coordinator: handle error during new_cdc_generation command processing topology coordinator: remove unneeded semicolon topology coordinator: fix indentation after the last commit topology coordinator: move new_cdc_generation topology request to use topology_request table for completion gms/feature_service: add TOPOLOGY_GLOBAL_REQUEST_QUEUE feature flag	2025-06-23 16:08:09 +03:00
Asias He	c5a136c3b5	storage_service: Use utils::chunked_vector to avoid big allocation The following was seen: ``` !WARNING \| scylla[6057]: [shard 12:strm] seastar_memory - oversized allocation: 212992 bytes. This is non-fatal, but could lead to latency and/or fragmentation issues. Please report: at [Backtrace #0] void seastar::backtrace<seastar::current_backtrace_tasklocal()::$_0>(seastar::current_backtrace_tasklocal()::$_0&&, bool) at ./build/release/seastar/./seastar/include/seastar/util/backtrace.hh:89 (inlined by) seastar::current_backtrace_tasklocal() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:99 seastar::current_tasktrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:136 seastar::current_backtrace() at ./build/release/seastar/./build/release/seastar/./seastar/src/util/backtrace.cc:169 seastar::memory::cpu_pages::warn_large_allocation(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:848 seastar::memory::allocate_slowpath(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:911 operator new(unsigned long) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/memory.cc:1706 std::allocator<dht::token_range_endpoints>::allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/allocator.h:196 (inlined by) std::allocator_traits<std::allocator<dht::token_range_endpoints> >::allocate(std::allocator<dht::token_range_endpoints>&, unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/alloc_traits.h:515 (inlined by) std::_Vector_base<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> >::_M_allocate(unsigned long) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:380 (inlined by) void std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> 
>::_M_realloc_append<dht::token_range_endpoints const&>(dht::token_range_endpoints const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/vector.tcc:596 locator::describe_ring(replica::database const&, gms::gossiper const&, seastar::basic_sstring<char, unsigned int, 15u, true> const&, bool) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/stl_vector.h:1294 std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242 (inlined by) seastar::internal::coroutine_traits_base<std::vector<dht::token_range_endpoints, std::allocator<dht::token_range_endpoints> > >::promise_type::run_and_dispose() at ././seastar/include/seastar/core/coroutine.hh:80 seastar::reactor::do_run() at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:2635 std::_Function_handler<void (), seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at ./build/release/seastar/./build/release/seastar/./seastar/src/core/reactor.cc:4684 ``` Fix by using chunked_vector. Fixes #24158 Closes scylladb/scylladb#24561	2025-06-19 16:51:01 +03:00
Avi Kivity	cd79a8fc25	Revert "Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz" This reverts commit `0b516da95b`, reversing changes made to `30199552ac`. It breaks cluster.random_failures.test_random_failures.test_random_failures in debug mode (at least). Fixes #24513	2025-06-16 22:38:12 +03:00
Tomasz Grabiec	cdb1499898	Merge 'interval: reduce memory footprint' from Avi Kivity The interval class's memory footprint isn't important for single objects, but intervals are frequently held in moderately sized collections. In #3335 this caused a stall. Therefore reducing interval's memory footprint and reduce allocation pressure. This series does this by consolidating badly-padded booleans in the object tree spanned by interval into 5 booleans that are consecutive in memory. This reduces the space required by these booleans from 40 bytes to 8 bytes. perf-simple-query report (with refresh-pgo-profiles.sh for each measurement): before: 252127.60 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37128 insns/op, 18147 cycles/op, 0 errors) INFO 2025-06-07 21:00:34,010 [shard 0:main] group0_tombstone_gc_handler - Setting reconcile time to 1749319231 (min id=4dbed2f4-43c9-11f0-cbc6-87d1a08b4ca4) 246492.37 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37153 insns/op, 18411 cycles/op, 0 errors) 253633.11 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37127 insns/op, 17941 cycles/op, 0 errors) 254029.93 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37155 insns/op, 17951 cycles/op, 0 errors) 254465.76 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37123 insns/op, 17906 cycles/op, 0 errors) throughput: mean= 252149.75 standard-deviation=3282.75 median= 253633.11 median-absolute-deviation=1880.17 maximum=254465.76 minimum=246492.37 instructions_per_op: mean= 37137.24 standard-deviation=15.71 median= 37127.54 median-absolute-deviation=14.45 maximum=37155.24 minimum=37122.79 cpu_cycles_per_op: mean= 18071.19 standard-deviation=212.25 median= 17950.62 median-absolute-deviation=130.10 maximum=18411.50 minimum=17906.13 after: 252561.26 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37039 insns/op, 18075 cycles/op, 0 errors) 256876.44 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37022 insns/op, 17785 cycles/op, 0 errors) 
257084.38 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37030 insns/op, 17840 cycles/op, 0 errors) 257305.35 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37042 insns/op, 17804 cycles/op, 0 errors) 258088.53 tps ( 66.1 allocs/op, 0.0 logallocs/op, 14.1 tasks/op, 37028 insns/op, 17778 cycles/op, 0 errors) throughput: mean= 256383.19 standard-deviation=2185.22 median= 257084.38 median-absolute-deviation=922.16 maximum=258088.53 minimum=252561.26 instructions_per_op: mean= 37032.17 standard-deviation=8.06 median= 37030.46 median-absolute-deviation=6.44 maximum=37041.83 minimum=37021.93 cpu_cycles_per_op: mean= 17856.60 standard-deviation=124.70 median= 17804.16 median-absolute-deviation=71.24 maximum=18075.50 minimum=17777.95 A small improvement is observed in instructions_per_op. It could be random fluctuations in the compiler performance, or maybe the default constructor/destructor of interval are meaningful even in this simple test. Small performance improvement, so not a backport candidate. Closes scylladb/scylladb#24232 * github.com:scylladb/scylladb: interval: reduce sizeof interval: change start()/end() not to return references to data members interval: rename start_ref() back to start() (and end_ref() etc). interval: rename start() to start_ref() (and end() etc). test: wrapping_interval_test: add more tests for intervals	2025-06-16 09:23:56 +02:00
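The padding effect described in this merge can be reproduced with two toy layouts. These structs are hypothetical and are not the real `interval` class; they assume a bound value that is 8-byte aligned, which is what makes each isolated `bool` cost a full 8 bytes on a typical 64-bit ABI.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Layout "before": every bool follows an 8-byte-aligned member,
// so each one is padded out to the alignment of its neighbour.
struct padded_bound { std::int64_t value; bool inclusive; };
struct padded_optional_bound { padded_bound bound; bool engaged; };
struct padded_interval {
    padded_optional_bound start;
    padded_optional_bound end;
    bool singular;
};

// Layout "after": the same five flags stored consecutively.
struct compact_interval {
    std::int64_t start_value;
    std::int64_t end_value;
    bool start_inclusive, start_engaged, end_inclusive, end_engaged, singular;
};

// Bytes spent on flags and padding, i.e. everything except the two values.
constexpr std::size_t flag_overhead(std::size_t total_size) {
    return total_size - 2 * sizeof(std::int64_t);
}
```

On common x86-64 compilers `flag_overhead(sizeof(padded_interval))` is 40 and `flag_overhead(sizeof(compact_interval))` is 8, matching the figures in the message, although exact sizes are implementation-defined.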
Botond Dénes	898ce98500	db/batchlog_manager: remove unused member _total_batches_replayed And its getter. There are no users for either. Closes scylladb/scylladb#24416	2025-06-16 09:37:00 +03:00
Avi Kivity	16fb68bb5e	interval: rename start_ref() back to start() (and end_ref() etc). To reduce noise, rename start_ref() back to its original name start(), after it was changed in the previous patch to force an audit of all calls.	2025-06-14 21:26:16 +03:00
Avi Kivity	3363bc41e2	interval: rename start() to start_ref() (and end() etc). We are about to change start() to return a proxy object rather than a `const interval_bound<T>&`. This is generally transparent, except in one case: `auto x = i.start()`. With the current implementation, we copy the object referred to and assign it to `x`. With the planned implementation, the proxy object will be assigned to `x`, but it will keep referring to `i`. To prevent such problems, rename start() to start_ref() and end() to end_ref(). This forces us to audit all calls, and redirect calls that will break to new start_copy() and end_copy() methods.	2025-06-14 21:26:16 +03:00
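The `auto x = i.start()` hazard can be demonstrated with a toy class. This is a sketch under assumptions: `toy_interval`, `bound_proxy` and `set_start` are hypothetical and far simpler than the real `interval`; only the method names `start_ref()` and `start_copy()` come from the commit message.

```cpp
#include <cassert>

// Hypothetical value type returned by the copying accessor.
struct bound_value { int value; };

class toy_interval {
    int _start;
    int _end;
public:
    toy_interval(int s, int e) : _start(s), _end(e) {}

    // Proxy that keeps referring to the interval it came from.
    class bound_proxy {
        const toy_interval* _owner;
    public:
        explicit bound_proxy(const toy_interval& owner) : _owner(&owner) {}
        int value() const { return _owner->_start; }
    };

    bound_proxy start_ref() const { return bound_proxy(*this); }   // view
    bound_value start_copy() const { return bound_value{_start}; } // snapshot
    void set_start(int s) { _start = s; }
};
```

After `auto p = i.start_ref(); auto c = i.start_copy(); i.set_start(3);` the proxy `p` observes the new value while the copy `c` keeps the old one, which is why call sites that relied on `auto` making a copy had to be audited.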
Gleb Natapov	a0a3a034e0	topology coordinator: Implement global topology request queue Requests, together with their parameters, are added to the topology_request tables and the queue of active global requests is kept in topology state. They are processed one by one by the topology state machine. Fixes: #16822	2025-06-11 11:29:33 +03:00
Tomasz Grabiec	0b516da95b	Merge 'Atomic in-memory schema changes application' from Marcin Maliszkiewicz This change prepares the ground for state update unification for raft bound subsystems. It introduces schema_applier, which in the future will become a generic interface for applying mutations in raft. Pulling `database::apply()` out of the schema merging code will allow batching changes to subsystems. Future generic code will first call `prepare()` on all implementations, then a single `database::apply()` and then `update()` on all implementations, then on each shard it will call `commit()` for all implementations, without preemption so that the change is observed as atomic across all subsystems, and then `post_commit()`. Backport: no, it's a new feature Fixes: https://github.com/scylladb/scylladb/issues/19649 Closes scylladb/scylladb#20853 * github.com:scylladb/scylladb: storage_service: always wake up load balancer on update tablet metadata db: schema_applier: call destroy also when exception occurs db: replica: simplify seeding ERM during schema change db: remove cleanup from add_column_family db: abort on exception during schema commit phase db: make user defined types changes atomic replica: db: make keyspace schema changes atomic db: atomically apply changes to tables and views replica: make truncate_table_on_all_shards get whole schema from table_shards service: split update_tablet_metadata into two phases service: pull out update_tablet_metadata from migration_listener db: service: add store_service dependency to schema_applier service: simplify load_tablet_metadata and update_tablet_metadata db: don't perform move on tablet_hint reference replica: split add_column_family_and_make_directory into steps replica: db: split drop_table into steps db: don't move map references in merge_tables_and_views() db: introduce commit_on_shard function db: access types during schema merge via special storage replica: make non-preemptive keyspace create/update/delete functions public replica: split update keyspace into two phases replica: split creating keyspace into two functions db: rename create_keyspace_from_schema_partition db: decouple functions and aggregates schema change notification from merging code db: store functions and aggregates change batch in schema_applier db: decouple tables and views schema change notifications from merging code db: store tables and views schema diff in schema_applier db: decouple user type schema change notifications from types merging code service: unify keyspace notification functions arguments db: replica: decouple keyspace schema change notifications to a separate function db: add class encapsulating schema merging	2025-06-10 13:45:32 +02:00
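The phase ordering described in the merge message (`prepare()`, a single `database::apply()`, `update()`, `commit()`, `post_commit()`) can be sketched as follows. This is a minimal single-shard illustration under stated assumptions: the `applier` interface, `named_applier`, `apply_all()` and the string log are all hypothetical and are not the actual schema_applier API. Only the phase names and their order come from the description above. The real code is asynchronous and runs `commit()` on each shard without preemption, which this sketch does not model.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>
#include <vector>

using call_log = std::vector<std::string>;

// Hypothetical interface; only the phase names follow the merge description.
struct applier {
    virtual ~applier() = default;
    virtual void prepare(call_log& log) = 0;
    virtual void update(call_log& log) = 0;
    virtual void commit(call_log& log) = 0;       // must not yield in the real design
    virtual void post_commit(call_log& log) = 0;
};

// Hypothetical implementation that only records which phase ran.
struct named_applier : applier {
    std::string name;
    explicit named_applier(std::string n) : name(std::move(n)) {}
    void prepare(call_log& log) override { log.push_back(name + ":prepare"); }
    void update(call_log& log) override { log.push_back(name + ":update"); }
    void commit(call_log& log) override { log.push_back(name + ":commit"); }
    void post_commit(call_log& log) override { log.push_back(name + ":post_commit"); }
};

// Runs every phase for all implementations before moving to the next phase.
inline call_log apply_all(std::vector<std::unique_ptr<applier>>& impls) {
    call_log log;
    for (auto& a : impls) { a->prepare(log); }
    log.push_back("database::apply");             // one batched apply for all subsystems
    for (auto& a : impls) { a->update(log); }
    for (auto& a : impls) { a->commit(log); }     // observed as one atomic step per shard
    for (auto& a : impls) { a->post_commit(log); }
    return log;
}

// Two hypothetical subsystems, to make the interleaving visible.
inline call_log demo_order() {
    std::vector<std::unique_ptr<applier>> impls;
    impls.push_back(std::make_unique<named_applier>("schema"));
    impls.push_back(std::make_unique<named_applier>("tablets"));
    return apply_all(impls);
}
```

Grouping calls by phase rather than by subsystem is what lets a single `database::apply()` sit between `prepare()` and `update()`, and what keeps all `commit()` calls adjacent so no other task can observe a partially applied change.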
Gleb Natapov	00fd427be0	topology request: make it possible to hold global request types in request_type field The topology_request table has a field to hold a request type, but currently it can hold only per-node requests. This patch makes it possible to store global request types there as well.	2025-06-09 13:38:49 +03:00
Gleb Natapov	3a496067c6	topology coordinator: move alter table global request parameters into topology_request table Currently, parameters of the alter table global topology command are stored in a static column in the topology table, so there can be only one outstanding alter table request. This patch moves the parameters to the topology_request table, where parameters are stored per request.	2025-06-09 13:38:49 +03:00
Michał Chojnowski	7d26d3c7cb	db/config: add an option that disables dict-aware sstable compressors in DDL statements For reasons, we want to be able to disallow dictionary-aware compressors in chosen deployments. This patch adds a knob for that. When the knob is disabled, dictionary-aware compressors will be rejected in the validation stage of CREATE and ALTER statements. Closes scylladb/scylladb#24355	2025-06-09 13:30:40 +03:00
Marcin Maliszkiewicz	ddc0656eb5	db: schema_applier: call destroy also when exception occurs Otherwise objects may be destroyed on wrong shard, and assert will trigger in ~sharded().	2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz	547bb1f663	db: replica: simplify seeding ERM during schema change We know that the caller is running on shard 0, so we can avoid some extra boilerplate.	2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz	97cdb72d4d	db: remove cleanup from add_column_family Since we abort now on failure during schema commit there is no need for cleanup as it only manages in-memory state. Explicit cf.stop was added to code paths outside of schema merging to avoid unnecessary regressions.	2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz	d5075c70ef	db: abort on exception during schema commit phase As we have no way to recover from partial commit.	2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz	858db822dc	db: make user defined types changes atomic The same order of creation/destruction is preserved as in the original code, from a single shard's point of view. create_types() is called on each shard separately, while in theory we should be able to reuse results, similarly to diff_rows(). But we don't introduce this optimization yet.	2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz	5b2e4140cc	replica: db: make keyspace schema changes atomic Now all keyspace related schema changes are observable on a given shard as if they were applied atomically. This is achieved by the commit_on_shard() function being non-preemptive (no futures, no co_awaits). In the future we'll extend this to the whole schema and also other subsystems.	2025-06-06 08:50:34 +02:00
Marcin Maliszkiewicz	556e89bc9d	db: atomically apply changes to tables and views In this commit we make use of the split functions introduced before. The pattern is as follows: - in merge_tables_and_views we call some preparatory functions - in schema_applier::update we call the non-yielding step - in schema_applier::post_commit we call cleanups and other finalizing async functions Additionally we introduce frozen_schema_diff, because converting schema_ptr to global_schema_ptr triggers schema registration, and with atomic changes we need to place registration only in the commit phase. Schema freezing is the same method global_schema_ptr uses to transport schema across shards (via schema_registry cache).	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	21a5a3c01f	service: pull out update_tablet_metadata from migration_listener It's not a good usage as there is only one non-empty implementation. Also we need to change it further in the following commit which makes it incompatible with listener code.	2025-06-06 08:50:33 +02:00
Marcin Maliszkiewicz	92e3d69f79	db: service: add store_service dependency to schema_applier There is already implicit logical dependency via migration_notifier but in the next commits we'll be moving store_service out from it as we need better control (i.e. return a value from the call).	2025-06-06 08:50:33 +02:00

4362 Commits