scylladb

Author	SHA1	Message	Date
Botond Dénes	846a6e700b	Merge 'get_snapshot_details: process also staging directory' from Benny Halevy Currently, we determine the live vs. total snapshot size by listing all files in the snapshot directory, and for each name, look it up in the base table directory and see if it exists there, and if so, if it's the same file as in the snapshot by looking at the fstat data for the dev id and inode number. However, we do not look up the names in the staging directory, so staging sstables would skew the results as they will falsely contribute to the live size, since they wouldn't be found in the base directory. This change processes both the staging directory and base table directory and keeps the file capacity in a map, indexed by the file's inode number, allowing us to easily detect hard links and be resilient against concurrent moves of files from the staging sub-directory back into the base table directory. Fixes #27635 * Minor issue, no backport required Closes scylladb/scylladb#27636 * github.com:scylladb/scylladb: table: get_snapshot_details: add FIXME comments table: get_snapshot_details: lookup entries also in the staging directory table: get_snapshot_details: optimize using the entry number_of_links table: get_snapshot_details: continue loop for manifest and schema entries table: get_snapshot_details: use directory_lister	2025-12-22 20:02:40 +02:00
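The accounting described in the merge above can be sketched as a small in-memory model. This is only an illustration under stated assumptions: `file_entry`, `snapshot_sizes` and `compute_snapshot_sizes` are hypothetical names, the directory listings are passed in as plain vectors rather than read from disk, and all files are assumed to live on one filesystem so that the inode number alone identifies a file (the real code also compares the device id).

```cpp
#include <cstdint>
#include <unordered_map>
#include <vector>

// Simplified view of one directory entry (fields mirror what stat() reports).
struct file_entry {
    uint64_t inode;
    uint64_t size;
    uint64_t nlink; // hard-link count
};

struct snapshot_sizes {
    uint64_t total = 0; // bytes of all files listed in the snapshot
    uint64_t live = 0;  // bytes of files found only in the snapshot
};

// Hypothetical helper: index base and staging files by inode, then classify
// each snapshot entry as shared with the table or held only by the snapshot.
snapshot_sizes compute_snapshot_sizes(const std::vector<file_entry>& snapshot,
                                      const std::vector<file_entry>& base,
                                      const std::vector<file_entry>& staging) {
    std::unordered_map<uint64_t, uint64_t> table_files; // inode -> size
    for (const auto* dir : {&base, &staging}) {
        for (const auto& e : *dir) {
            table_files.emplace(e.inode, e.size);
        }
    }

    snapshot_sizes result;
    for (const auto& e : snapshot) {
        result.total += e.size;
        // A link count of 1 means the snapshot holds the only link, so the
        // map lookup can be skipped entirely.
        if (e.nlink == 1 || table_files.find(e.inode) == table_files.end()) {
            result.live += e.size;
        }
    }
    return result;
}
```

With a snapshot holding three files, one also linked in the base directory and one also linked in staging, only the third file counts toward the live size. Ignoring the staging listing would wrongly add the staging file to it.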
Dario Mirovic	f89315d02f	replica: database: flush_all_tables log on completion In database::flush_all_tables add log on completion. This slightly improves the readability of logs when debugging an issue. Refs #26932	2025-12-18 12:54:42 +01:00
Benny Halevy	798714183e	table: get_snapshot_details: add FIXME comments Ref https://github.com/scylladb/seastar/pull/3163 We can optimize the stat calls we use here by using open_directory to open the snapshot, base, and staging directory once, and using statat calls for the relative name instead of the full blown file_stat that needs to traverse the whole path prefix for every call (the dirents are likely to be cached, but still why waste cpu cycles on that over and over again). Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-16 18:45:56 +02:00
Benny Halevy	f5ca3657e2	table: get_snapshot_details: lookup entries also in the staging directory Since the sstables in the snapshot may still be in the staging dir. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-16 18:42:05 +02:00
Benny Halevy	dc00461adf	table: get_snapshot_details: optimize using the entry number_of_links If the number_of_links equals 1, we can be sure that the file exists only in the snapshot directory so there is no need to look it up in the data directory. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-16 18:42:05 +02:00
Benny Halevy	be6d87648c	table: get_snapshot_details: continue loop for manifest and schema entries Now that we're using a simple loop in the coroutine just continue the loop for files we want to ignore. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-16 18:42:05 +02:00
Benny Halevy	004c08f525	table: get_snapshot_details: use directory_lister It is more efficient to use the coroutine generator to list the directory. Brewing changes in seastar would make the generator buffered as well as adding an extended generation that would return the file stat data for each entry, that would become useful in the next patch that optimizes the algorithm by considering the entry's link count. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-16 18:42:05 +02:00
Benny Halevy	f4a4671ad6	table: seal_snapshot: avoid oversized allocation when dumping manifest.json Currently, we first print the json contents into a stringstream buffer and then we write it as a whole to the manifest.json file output stream. This is not scalable and may cause large allocation for large enough number of files. Fixes #24216 Signed-off-by: Benny Halevy <bhalevy@scylladb.com> Closes scylladb/scylladb#27542	2025-12-15 15:19:24 +03:00
Pavel Emelyanov	3f7ee3ce5d	Merge 'batchlog: make replay (flush) faster' from Botond Dénes The batchlog table contains an entry for each logged batch that is processed by the local node as coordinator. These entries are typically very short lived, they are inserted when the batch is processed and deleted immediately after the batch is successfully applied. When a table has `tombstone_gc = {'mode': 'repair'}` enabled, every repair has to flush all hints and batchlogs, so that we can be certain that there is no live data in any of these, older than the last repair. Since batches can contain member queries from any number of tables, the whole batchlog has to be flushed, even if repair-mode tombstone-gc is enabled for a single table. Flushing the batchlog table happens by doing a batchlog replay. This involves reading the entire content of this table, and attempting to replay+delete any live entries (that are old enough to be replayed). Under normal operating circumstances, 99%+ of the content of the batchlog table is partition tombstones. Because of this, scanning the content of this table has to process thousands to millions of tombstones. This was observed to require up to 20 minutes to finish, causing repairs to slow down to a crawl, as the batchlog-flush has to be repeated at the end of the repair of each token-range. When trying to address this problem, the first idea was that we should expedite the garbage-collection of these accumulated tombstones. This experiment failed, see https://github.com/scylladb/scylladb/pull/23752. The commitlog proved to be an impossible-to-bypass barrier, preventing quick garbage-collection of tombstones. So long as a single commit-log segment is alive, holding content from the batchlog table, all tombstones written after are blocked from GC. The second approach, represented by this PR, is to not rely on tombstone GC to reduce the tombstone amount. 
Instead restructure the table such that a single higher-order tombstone can be used to shadow and allow for the eviction of the myriads of individual batchlog entry tombstones. This is realized by reorganizing the batchlog table such that individual batches are rows, not partitions. This new schema is introduced by the new `system.batchlog_v2` table, introduced by this PR: CREATE TABLE system.batchlog_v2 ( version int, stage int, shard int, written_at timestamp, id uuid, data blob, PRIMARY KEY ((version, stage, shard), written_at, id)); The new schema organization has the following goals: 1) Make post-replay batchlog cleanup possible with a simple range-tombstone. This allows dropping the individual dead batchlog entries, as they are shadowed by a higher level tombstone. This enables dropping tombstones without tombstone GC. 2) To make the above possible, introduce the stage key component: batchlog entries that fail the first replay attempt, are moved to the failed_replay stage, so the initial stage can be cleaned up safely. 3) Spread out the data among Scylla shards, via the batchlog shard column. 4) Make batchlog entries ordered by the batchlog create time (id). This allows for selecting batchlogs to replay, without post-filtering of batchlogs that are too young to be replayed. Fixes: https://github.com/scylladb/scylladb/issues/23358 This is an improvement, normally not a backport-candidate. We might override this and backport to allow wider use of `tombstone_gc: {'mode': 'repair'}`. 
Closes scylladb/scylladb#26671 * github.com:scylladb/scylladb: db/config: change batchlog_replay_cleanup_after_replays default to 1 test/boost/batchlog_manager_test: add test for batchlog cleanup replica/mutation_dump: always set position weight for clustering positions service/storage_proxy: s/batch_replay_throw/storage_proxy_fail_replay_batch/ test/lib: introduce error_injection.hh utils/error_injection: add debug log to disable() and disable_all() test/lib/cql_test_env: forward config to batchlog test/lib/cql_test_env: add batch type to execute_batch() test/lib/cql_assertions: add with_size(predicate) overload test/lib/cql_assertions: add source location to fail messages test/lib/cql_assertions: columns_assertions: add assert_for_columns_of_each_row() test/lib/cql_assertions: rows_assertions::assert_for_columns_of_row(): add index bound check test/lib/cql_assertions: columns_assertions: add T* with_typed_column() overload db/batchlog_manager: config: s/write_timeout/reply_timeot/ db,service: switch to system.batchlog_v2 db/system_keyspace: introduce system.batchlog_v2 service,db: extract generation of batchlog delete mutation service,db: extract get_batchlog_mutation_for() from storage-proxy db/batchlog_manager: only consider propagation delay with tombstone-gc=repair db/batchlog_manager: don't drop entire batch if one mutations' table was dropped data_dictionary: table: add get_truncation_time() db/batchlog_manager: batch(): replace map_reduce() with simple loop db/batchlog_manager: finish coroutinizing replay_all_failed_batches db/batchlog_manager: improve replayAllFailedBatches logs	2025-12-15 15:05:19 +03:00
copilot-swe-agent[bot]	77ee7f3417	Revert "Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy" This reverts commit `8192f45e84`. The merge exposed a bug where truncate (via drop) fails and causes Raft errors, leading to schema inconsistencies across nodes. This results in test_table_drop_with_auto_snapshot failures with 'Keyspace test does not exist' errors. The specific problematic change was in commit `19b6207f` which modified truncate_table_on_all_shards to set use_sstable_identifier = true. This causes exceptions during truncate that are not properly handled, leading to Raft applier fiber stopping and nodes losing schema synchronization.	2025-12-12 03:55:13 +00:00
Tomasz Grabiec	0e51a1f812	replica: Remove unnecessary noexcept Can potentially lead to unnecessary abort. compaction_groups() and for_each_compaction_group() can throw. Co-authored-by: bhalevy <20910904+bhalevy@users.noreply.github.com>	2025-12-10 14:51:35 +01:00
Tomasz Grabiec	8b807b299e	replica: Remove noexcept from compaction_groups() functions They can throw during merge, when the number of compaction groups is higher than 3. Callers can deal with that, so we shouldn't abort.	2025-12-10 14:48:23 +01:00
Tomasz Grabiec	07ff659849	replica: Remove noexcept from storage_group::for_each_compaction_group They don't really have to be noexcept. And "action" may actually throw, leading to abort. It was observed to throw when creating memtable readers: terminate called after throwing an instance of 'utils::memory_limit_reached' what(): kill limit triggered on semaphore sl:users by permit xxx Aborting on shard 4, in scheduling group sl:users. std::terminate() at ??:0 __clang_call_terminate at main.cc:0 replica::storage_group::for_each_compaction_group(std::function<void (seastar::lw_shared_ptr<replica::compaction_group> const&)>) const at ./replica/table.cc:920 (inlined by) replica::table::add_memtables_to_reader_list(std::vector<mutation_reader, std::allocator<mutation_reader>>&, seastar::lw_shared_ptr<schema const> const&, reader_permit const&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr const&, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>, std::function<void (unsigned long)>) const at ./replica/table.cc:196 (inlined by) replica::table::make_reader_v2(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at ./replica/table.cc:243 (inlined by) replica::table::as_mutation_source() const::$_0::operator()(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at ./replica/table.cc:3673 (inlined by) mutation_reader std::__invoke_impl<mutation_reader, replica::table::as_mutation_source() const::$_0&, 
seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>>(std::__invoke_other, replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>&&, reader_permit&&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr&&, seastar::bool_class<streamed_mutation::forwarding_tag>&&, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>&&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:61 (inlined by) std::enable_if<is_invocable_r_v<mutation_reader, replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>>, mutation_reader>::type std::__invoke_r<mutation_reader, replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>>(replica::table::as_mutation_source() const::$_0&, seastar::lw_shared_ptr<schema const>&&, reader_permit&&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr&&, seastar::bool_class<streamed_mutation::forwarding_tag>&&, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>&&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:114 (inlined by) std::_Function_handler<mutation_reader (seastar::lw_shared_ptr<schema const>, reader_permit, 
interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>), replica::table::as_mutation_source() const::$_0>::_M_invoke(std::_Any_data const&, seastar::lw_shared_ptr<schema const>&&, reader_permit&&, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr&&, seastar::bool_class<streamed_mutation::forwarding_tag>&&, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>&&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:290 (inlined by) std::function<mutation_reader (seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>)>::operator()(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:591 (inlined by) mutation_source::make_reader_v2(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position> const&, query::partition_slice const&, tracing::trace_state_ptr, seastar::bool_class<streamed_mutation::forwarding_tag>, seastar::bool_class<mutation_reader::partition_range_forwarding_tag>) const at ././readers/mutation_source.hh:143 query::querier_base::querier_base(seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position>, query::partition_slice, mutation_source const&, tracing::trace_state_ptr, query::querier_base::querier_config) at ././querier.hh:91 (inlined by) 
query::querier::querier(mutation_source const&, seastar::lw_shared_ptr<schema const>, reader_permit, interval<dht::ring_position>, query::partition_slice, tracing::trace_state_ptr, query::querier_base::querier_config) at ././querier.hh:164 (inlined by) replica::table::query(seastar::lw_shared_ptr<schema const>, reader_permit, query::read_command const&, query::result_options, std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>> const&, tracing::trace_state_ptr, query::result_memory_limiter&, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l>>>, std::optional<query::querier>) at ./replica/table.cc:3583 replica::database::query(seastar::lw_shared_ptr<schema const>, query::read_command const&, query::result_options, std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>> const&, tracing::trace_state_ptr, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l>>>, std::variant<std::monostate, db::per_partition_rate_limit::account_only, db::per_partition_rate_limit::account_and_enforce>)::$_0::operator()(reader_permit) const at ./replica/database.cc:1533 (inlined by) seastar::noncopyable_function<seastar::future<void> (reader_permit)>::indirect_vtable_for<replica::database::query(seastar::lw_shared_ptr<schema const>, query::read_command const&, query::result_options, std::vector<interval<dht::ring_position>, std::allocator<interval<dht::ring_position>>> const&, tracing::trace_state_ptr, std::chrono::time_point<seastar::lowres_clock, std::chrono::duration<long, std::ratio<1l, 1000000000l>>>, std::variant<std::monostate, db::per_partition_rate_limit::account_only, db::per_partition_rate_limit::account_and_enforce>)::$_0>::call(seastar::noncopyable_function<seastar::future<void> (reader_permit)> const, reader_permit) (.llvm.13537529942037499926) at ././seastar/include/seastar/util/noncopyable_function.hh:158 
seastar::noncopyable_function<seastar::future<void> (reader_permit)>::operator()(reader_permit) const at ././seastar/include/seastar/util/noncopyable_function.hh:215 (inlined by) reader_concurrency_semaphore::execution_loop() (.resume) at ./reader_concurrency_semaphore.cc:980 std::__n4861::coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/coroutine:242 (inlined by) seastar::internal::coroutine_traits_base<void>::promise_type::run_and_dispose() at ./build/release/seastar/./seastar/include/seastar/core/coroutine.hh:122 (inlined by) seastar::reactor::run_tasks(seastar::reactor::task_queue&) at ./build/release/seastar/./seastar/src/core/reactor.cc:2627 (inlined by) seastar::reactor::run_some_tasks() at ./build/release/seastar/./seastar/src/core/reactor.cc:3099 seastar::reactor::do_run() at ./build/release/seastar/./seastar/src/core/reactor.cc:3267 seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0::operator()() const at ./build/release/seastar/./seastar/src/core/reactor.cc:4591 (inlined by) void std::__invoke_impl<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&>(std::__invoke_other, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:61 (inlined by) std::enable_if<is_invocable_r_v<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&>, void>::type std::__invoke_r<void, seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&>(seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/invoke.h:111 (inlined by) std::_Function_handler<void (), 
seastar::smp::configure(seastar::smp_options const&, seastar::reactor_options const&)::$_0>::_M_invoke(std::_Any_data const&) at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:290 std::function<void ()>::operator()() const at /usr/lib/gcc/x86_64-redhat-linux/14/../../../../include/c++/14/bits/std_function.h:591 Fixes #27475 Co-authored-by: bhalevy <20910904+bhalevy@users.noreply.github.com>	2025-12-10 14:48:11 +01:00
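The failure mode behind the three noexcept-removal commits above can be illustrated with a small standalone sketch. The names `group_list` and `caller_survives_throwing_action` are hypothetical and only model the shape of the problem: a for-each helper that runs a caller-supplied action which may throw.

```cpp
#include <functional>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for a container of compaction groups.
struct group_list {
    std::vector<int> groups;

    // Not noexcept: if `action` throws, the exception propagates to the
    // caller. Declaring this function noexcept would instead turn the same
    // throw into std::terminate(), which is the abort seen in the backtrace.
    void for_each(const std::function<void(int)>& action) const {
        for (int g : groups) {
            action(g);
        }
    }
};

// Returns true if the caller was able to catch the failure.
bool caller_survives_throwing_action() {
    group_list gl{{1, 2, 3}};
    try {
        gl.for_each([](int g) {
            if (g == 2) {
                throw std::runtime_error("memory limit reached");
            }
        });
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}
```

Because `for_each` is not marked noexcept, the exception thrown by the action reaches the `catch` block and the caller can fail just the one operation.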
Piotr Dulikowski	386309d6a0	Merge 'Improve the way distributed-loader constructs storage_options for backup sstables' from Pavel Emelyanov The distributed_loader::get_sstables_from_object_store() method accepts an endpoint parameter and internally wants to get storage type for that endpoint (s3 or gcs). This is needed to construct storage_options object to create an sstable object. To get the type, the method scans db::config option, but there's much simpler way to get one. Code cleanup, no need to backport Closes scylladb/scylladb#27381 * github.com:scylladb/scylladb: sstables_loader: Provide endpoint type for get_sstables_from_object_store() storage_manager: Introduce get_endpoint_type() method storage_manager: Split get_endpoint_client()	2025-12-08 16:55:20 +01:00
Pavel Emelyanov	8192f45e84	Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy This change adds a new option to the REST api and correspondingly, to scylla nodetool: use_sstable_identifier. When set, we use the sstable identifier, if available, to name each sstable in the snapshots directory and the manifest.json file, rather than using the sstable generation. This can be used by the user (e.g. Scylla Manager) for global deduplication with tablets, where an sstable may be migrated across shards or across nodes, and in this case, its generation may change, but its sstable identifier remains stable. Currently, Scylla Manager uses the sstable generation to detect sstables that are already backed up to object storage and exist in previous backed up snapshots. Historically, the sstable generation was guaranteed to be unique only per table per node, so the dedup code currently checks for deduplication in the node scope. However, with tablet migration, sstables are renamed when migrated to a different shard, i.e. their generation changes, and they may be renamed when migrated to another node, but even if they are not, the dedup logic still assumes uniqueness only within a node. To address both cases, we keep the sstable_id stable throughout the sstable life cycle (since `3a12ad96c7`). Given the globally unique sstable identifier, Scylla Manager can now detect duplicate sstables in a wider scope. This can be cluster-wide, but we practically need only rack-wide deduplication or dc-wide, as tablets are migrated across racks only on rare occasions (like when converting from a numerical replication factor to a rack list containing a subset of the available racks in a datacenter). 
Fixes #27181 * New feature, no backport required Closes scylladb/scylladb#27184 * github.com:scylladb/scylladb: database: truncate_table_on_all_shards: set use_sstable_identifier to true nodetool: snapshot: add --use-sstable-identifier option api: storage_service: take_snapshot: add use_sstable_identifier option test: database_test: add snapshot_use_sstable_identifier_works test: database_test: snapshot_works: add validate_manifest sstable: write_scylla_metadata: add random_sstable_identifier error injection table: snapshot_on_all_shards: take snapshot_options sstable: add get_format getter sstable: snapshot: add use_sstable_identifier option db: snapshot_ctl: snapshot_options: add use_sstable_identifier options db: snapshot_ctl: move skip_flush to struct snapshot_options	2025-12-08 12:56:12 +03:00
Avi Kivity	9696ee64d0	database: fix overflow when computing data distribution over shards We store the per-shard chunk count in a uint64_t vector global_offset, and then convert the counts to offsets with a prefix sum: ```c++ // [1, 2, 3, 0] --> [0, 1, 3, 6] std::exclusive_scan(global_offset.begin(), global_offset.end(), global_offset.begin(), 0, std::plus()); ``` However, std::exclusive_scan takes the accumulator type from the initial value, 0, which is an int, instead of from the range being iterated, which is of uint64_t. As a result, the prefix sum is computed as a 32-bit integer value. If it exceeds 0x8000'0000, it becomes negative. It is then extended to 64 bits and stored. The result is a huge 64-bit number. Later on we try to find an sstable with this chunk and fail, crashing on an assertion. An example of the failure can be seen here: https://godbolt.org/z/6M8aEbo57 The fix is simple: the initial value is passed as uint64_t instead of int. Fixes https://github.com/scylladb/scylladb/issues/27417 Closes scylladb/scylladb#27418	2025-12-04 14:10:53 +01:00
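The fix described above can be reproduced in isolation. The sketch below is not the patched Scylla code: `counts_to_offsets` is a hypothetical helper name, and the only point it demonstrates is that `std::exclusive_scan` accumulates in the type of its initial value.

```cpp
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Convert per-shard chunk counts into starting offsets (exclusive prefix sum).
// std::exclusive_scan accumulates in the type of the init argument, so the
// init value is passed as uint64_t rather than as the int literal 0.
std::vector<uint64_t> counts_to_offsets(std::vector<uint64_t> counts) {
    std::exclusive_scan(counts.begin(), counts.end(), counts.begin(),
                        uint64_t(0), std::plus<>());
    return counts;
}
```

For the counts `[1, 2, 3, 0]` this yields `[0, 1, 3, 6]`, matching the comment in the commit message. With counts whose running sum passes 0x8000'0000, the offsets stay correct because the sum is carried in 64 bits.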
Benny Halevy	19b6207f17	database: truncate_table_on_all_shards: set use_sstable_identifier to true To facilitate global sstable deduplication on backup. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-04 11:57:39 +02:00
Benny Halevy	9b3fbedc8c	table: snapshot_on_all_shards: take snapshot_options And pass the use_sstable_identifier down the stack to the sstables layer. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-04 11:55:50 +02:00
Benny Halevy	c18133b6cb	db: snapshot_ctl: move skip_flush to struct snapshot_options Prepare for adding another option: use_sstable_identifier. Signed-off-by: Benny Halevy <bhalevy@scylladb.com>	2025-12-04 09:46:35 +02:00
Botond Dénes	fb84b30f88	replica/mutation_dump: always set position weight for clustering positions SELECT * FROM MUTATION_FRAGMENTS() queries have a transformed schema (mutation-fragment schema), which is a superset of the queried table's schema. The mutation fragment schema represents position_in_partition of mutation fragments expressed as clustering columns. This presents some challenges, as some position_in_partition fields are null for some positions. This was solved by setting these clustering key components to bytes{}. In the process, a mistake was made: when the clustering key is missing in the position_in_partition, the position_weight is also set to bytes{}. This is not correct, it is possible for some positions to have no key but to still have a position_weight. An example is position_in_partition::before_all_clustered_rows(). Fix this by always filling in the position_weight for positions which have region() == clustered, instead of the earlier condition on the key presence. This is a minor bug affecting range tombstone changes at the two extremes: position_in_partition::{before,after}_all_clustered_rows(). In both cases, the position_weight can be deduced by a human looking at the results, based on the position of the range tombstone change, relative to other fragments.	2025-12-02 14:21:26 +02:00
Botond Dénes	55704908a0	data_dictionary: table: add get_truncation_time() So the batchlog manager can avoid looking up the real table and instead just work with data dictionary.	2025-12-02 14:21:25 +02:00
Pavel Emelyanov	6c115c691f	sstables_loader: Provide endpoint type for get_sstables_from_object_store() Currently the method scans db::config to find one. It has some drawbacks. First, it's not very nice. Second, it needs to handle the case when the endpoint is missing, while it really never is. Third, the type in config entry is not necessarily set. It's nicer to get the type from storage manager. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-12-02 11:18:32 +03:00
Botond Dénes	384bffb8da	Merge 'compaction: limit the maximum shares allocated to a compaction scheduling class' from Raphael Raph Carvalho This PR adds support for limiting the maximum shares allocated to a compaction scheduling class by the compaction controller. It introduces a new configuration parameter, compaction_max_shares, which, when set to a non zero value, will cap the shares allocated to compaction jobs. This PR also exposes the shares computed by the compaction controller via metrics, for observability purposes. Fixes https://github.com/scylladb/scylladb/issues/9431 Enhancement. No need to backport. NOTE: Replaces PR https://github.com/scylladb/scylladb/pull/26696 Ran a test in which the backlog raised the need for max shares (normalized backlog above normalization_factor), and played with different values for new option compaction_max_shares to see it works (500, 1000, 2000, 250, 50) Closes scylladb/scylladb#27024 * github.com:scylladb/scylladb: db/config: introduce new config parameter `compaction_max_shares` compaction_manager:config: introduce max_shares compaction_controller: add configurable maximum shares compaction_controller: introduce `set_max_shares()`	2025-11-26 06:51:30 +02:00
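The capping rule stated in the merge above (a non-zero `compaction_max_shares` limits the shares, zero leaves them unlimited) can be written as a one-line helper. This is a sketch only: `cap_shares` is a hypothetical name, and the real compaction controller applies the limit to its control points rather than through a free function like this.

```cpp
#include <algorithm>

// Hypothetical helper: limit the shares computed by the controller.
// max_shares == 0 is treated as "no cap", mirroring the described config.
float cap_shares(float computed, float max_shares) {
    return max_shares > 0 ? std::min(computed, max_shares) : computed;
}
```

For example, `cap_shares(1000.0f, 500.0f)` returns 500, while `cap_shares(1000.0f, 0.0f)` leaves the value at 1000.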
Lakshmi Narayanan Sreethar	853811be90	compaction_controller: introduce `set_max_shares()` Add a method to dynamically adjust the maximum output of control points in the compaction controller. This is required for supporting runtime configuration of the maximum shares allocated to the compaction process by the controller. Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>	2025-11-24 11:43:20 -03:00
Aleksandra Martyniuk	19a7d8e248	replica: database: change type of tables_metadata::_ks_cf_to_uuid If there are a lot of tables, a node reports an oversized allocation in _ks_cf_to_uuid of type flat_hash_map. Change the type to std::unordered_map to prevent oversized allocations. Fixes: https://github.com/scylladb/scylladb/issues/26787. Closes scylladb/scylladb#27165	2025-11-24 06:42:40 +02:00
Radosław Cybulski	d589e68642	Add precompiled headers to CMakeLists.txt Add precompiled header support to CMakeLists.txt and configure.py - it improves compilation time by approximately 10%. New header `stdafx.hh` is added, don't include it manually - the compiler will include it for you. The header contains includes from external libraries used by Scylla - seastar, standard library, linux headers and zlib. The feature is enabled by default, use CMake option `Scylla_USE_PRECOMPILED_HEADER` or configure.py --disable-precompiled-header to disable. The feature should be disabled, when trying to check headers - otherwise you might get false negatives on missing includes from seastar / abseil and so on. Note: following configuration needs to be added to ccache.conf: sloppiness = pch_defines,time_macros,include_file_mtime,include_file_ctime Closes scylladb/scylladb#26617	2025-11-21 12:27:41 +02:00
Botond Dénes	0cc5208f8e	Merge 'Add sstables_manager::config' from Pavel Emelyanov Currently sstables_manager keeps a reference on global db::config to configure itself. Most of other services use their own specific configs with much less data on-board for the same purposes (e.g. #24841, #19051 and #23705 did same for other services) This PR applies this approach to sstables_manager as well. Mostly it moves various values from db::config onto newly introduced struct sstables_manager::config, but it also adds specific tracking of sstable_file_io_extensions and patches tools/scylla-sstable not to use sstables_manager as "proxy" object to get db::config from along its calls. Shuffling components dependencies, no need to backport Closes scylladb/scylladb#27021 * github.com:scylladb/scylladb: sstables_manager: Drop db::config from sstables_manager tools/sstable: Make shard_of_with_tablets use db::config argument tools/sstable: Add db::config& to all operations tools/sstable: Get endpoints from storage manager sstables_manager: Hold sstable IO extensions on it sstables: Manager helper to grab file io extensions sstables_manager: Move default format on config sstables_manager: Move enable_sstable_data_integrity_check on config sstables_manager: Move data_file_directories on config sstables_manager: Move components_memory_reclaim_threshold on config sstables_manager: Move column_index_auto_scale_threshold on config sstables_manager: Move column_index_size on config sstables_manager: Move sstable_summary_ratio on config sstables_manager: Move enable_sstable_key_validation on config sstables_manager: Move available_memory on config code: Introduce sstables_manager::config sstables: Patch get_local_directories() to work on vector of paths code: Rename sstables_manager::config() into db_config()	2025-11-21 10:21:41 +02:00
Raphael S. Carvalho	74ecedfb5c	replica: Fail timed-out single-key read on cleaned up tablet replica Consider the following: 1) single-key read starts, blocks on replica e.g. waiting for memory. 2) the same replica is migrated away 3) single-key read expires, coordinator abandons it, releases erm. 4) migration advances to cleanup stage, barrier doesn't wait on timed-out read 5) compaction group of the replica is deallocated on cleanup 6) that single-key read resumes, but doesn't find sstable set (post cleanup) 7) with abort-on-internal-error turned on, node crashes It's fine for abandoned (= timed out) reads to fail, since the coordinator is gone. For active reads (non timed out), the barrier will wait for them since their coordinator holds erm. This solution consists of failing reads whose underlying tablet replica has been cleaned up, by just converting the internal error to a plain exception. Fixes #26229. Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com> Closes scylladb/scylladb#27078	2025-11-20 11:44:03 +02:00
Botond Dénes	6ee0f1f3a7	Merge 'replica/table: add a metric for hypothetical total file size without compression' from Michał Chojnowski This patch adds a metric for pre-compression size of sstable files. This patch adds a per-table metric `scylla_column_family_total_disk_space_before_compression`, which measures the hypothetical total size of sstables on disk, if Data.db was replaced with an uncompressed equivalent. As for the implementation: Before the patch, tables and sstable sets are already tracking their total physical file size. Whenever sstables are added or removed, the size delta is propagated from the sstable up through sstable sets into table_stats. To implement the new metric, we turn the size delta that is getting passed around from a one-dimensional to a two-dimensional value, which includes both the physical and the pre-compression size. New functionality, no backport needed. Closes scylladb/scylladb#26996 * github.com:scylladb/scylladb: replica/table: add a metric for hypothetical total file size without compression replica/table: keep track of total pre-compression file size	2025-11-20 09:10:38 +02:00
Pavel Emelyanov	9cb776dee8	sstables_manager: Drop db::config from sstables_manager Now it has all it needs via its own specific config. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 19:31:50 +03:00
Pavel Emelyanov	675eb3be98	sstables_manager: Hold sstable IO extensions on it Currently manager holds a reference on db::config and when sstables IO extensions are needed it grabs them from this config. Since db::config is going to be removed from sstables manager, it should either keep track of all config extensions, or only those that it needs. This patch makes the latter choice and keeps reference to sstable_file_io_ext. on manager. The reference is passed as constructor argument, not via manager config, but it's a random choice, no specific reason why not putting it on config itself. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 19:31:50 +03:00
Pavel Emelyanov	9868341c73	sstables_manager: Move default format on config It's explicitly `me` type by default, but places that can write sstables override it with db::config value: replica::database, tests and scylla sstable tool. Live-updateable, so use updateable_value<> type. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 19:31:50 +03:00
Pavel Emelyanov	e6dee8aab5	sstables_manager: Move enable_sstable_data_integrity_check on config Set its default value to the one from db/config.cc. Only replica::database may want to re-configure it. Also not live-updateable. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 19:31:50 +03:00
Pavel Emelyanov	78ab31118e	sstables_manager: Move data_file_directories on config Make it a reference, so all the code that configures it is updated to provide the target. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 19:31:50 +03:00
Pavel Emelyanov	cb1679d299	sstables_manager: Move components_memory_reclaim_threshold on config Set its default value to the one from db/config.cc. Only the replica::database and tests may want to re-configure it. This one is live-updateable, so use updateable_value<> type. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 19:31:42 +03:00
Botond Dénes	8579e20bd1	Merge 'Enable digest+checksum verification for streaming/repair' from Taras Veretilnyk This PR enables integrity check of both checksum and digest for repair/streaming. In the past, streaming readers only verified the checksum of compressed SSTables. This change extends the checks to include the digest and the checksum (CRC) for both compressed and uncompressed SSTables. These additional checks require reading the digest and CRC components from disk, which may cause some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data, while for compressed SSTables - where checksums are already verified inline - the only extra cost is reading and verifying the digest. If the reader range doesn't cover the full SSTable, the digest is not loaded and the check is skipped. To support testing of these changes, a new option was added to the random_mutation_generator that allows disabling compression. Several new test cases were added to verify that the repair_reader correctly detects corruption. These tests corrupt the digest or data component of an SSTable and confirm that the system throws the expected `malformed_sstable_exception`. Backport is not required, it is an improvement Refs #21776 Closes scylladb/scylladb#26444 * github.com:scylladb/scylladb: boost/repair_test: add repair reader integrity verification test cases test/lib: allow to disable compression in random_mutation_generator sstables: Skip checksum and digest reads for unlinked SSTables table: enable integrity checks for streaming reader table: Add integrity option to table::make_sstable_reader() sstables: Add integrity option to create_single_key_sstable_reader	2025-11-14 18:00:33 +02:00
Pavel Emelyanov	604e5b6727	sstables_manager: Move column_index_auto_scale_threshold on config Set its default value to the one from db/config.cc. Only the replica::database may want to re-configure it. This one is live-updateable, so use updateable_value<> type. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 14:30:49 +03:00
Pavel Emelyanov	8f9f92728e	sstables_manager: Move column_index_size on config Set its default value to the one from db/config.cc. Only replica::database may want to re-configure it. Also not live-updateable. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 14:30:28 +03:00
Pavel Emelyanov	88bb203c9c	sstables_manager: Move sstable_summary_ratio on config Set its default value to the one from db/config.cc. Only replica::database may want to re-configure it. Also not live-updateable. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 14:29:34 +03:00
Pavel Emelyanov	1f6918be3f	sstables_manager: Move enable_sstable_key_validation on config Make it OFF by default and update only those callers, that may have it ON -- the replica::database, tests and scylla-sstable tool. Also not live-updateable, so plain bool. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 14:28:14 +03:00
Pavel Emelyanov	79d0f93693	sstables_manager: Move available_memory on config Currently, this parameter is passed to sstables_manager as explicit constructor argument. Also, it's not live-updateable, so a plain size_t type for it. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 14:27:14 +03:00
Pavel Emelyanov	218916e7c2	code: Introduce sstables_manager::config This is specific configuration for sstables_manager. All places that construct sstables manager are updated to provide config to it. For now the config is empty and exists alongside with db::config. Further patches will populate the former config with data and the latter config will be eventually removed. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>	2025-11-14 14:25:18 +03:00
Michał Chojnowski	346e0f64e2	replica/table: add a metric for hypothetical total file size without compression This patch adds a per-table metric `scylla_column_family_total_disk_space_before_compression`, which measures the hypothetical total size of sstables on disk, if Data.db was replaced with an uncompressed equivalent.	2025-11-13 11:28:19 +01:00
Michał Chojnowski	1cfce430f1	replica/table: keep track of total pre-compression file size Every table and sstable set keeps track of the total file size of contained sstables. Due to a feature request, we also want to keep track of the hypothetical file size if Data files were uncompressed, to add a metric that shows the compression ratio of sstables. We achieve this by replacing the relevant `uint_64 bytes_on_disk` counters everywhere with a struct that contains both the actual (post-compression) size and the hypothetical pre-compression size. This patch isn't supposed to change any observable behavior. In the next patch, we will use these changes to add a new metric.	2025-11-13 00:49:57 +01:00
Michael Litvak	de321218bc	storage_proxy: apply counter mutation on all write shards When applying a counter mutation, use apply_on_shards to apply the mutation on all write shards, similarly to the way other mutations are applied in the storage proxy. Previously the mutation was applied only on the current shard which is the read shard. This is needed to respect the write_both stages of intranode migration where we need to apply the mutation on both the old and the new shards.	2025-11-03 16:03:29 +01:00
Michael Litvak	c7e7a9e120	storage_proxy: move counter update coordination to storage proxy Refactor the counter update to split the functions and have them called by the storage proxy to prepare for a later change. Previously in mutate_counter the storage proxy calls the replica function apply_counter_update that does a few things: 1. checks that the operation can be done: check timeout, disk utilization 2. acquire counter locks 3. do read-modify-write and transform the counter mutation 4. apply the mutation in the replica In this commit we change it so that these functions are split and called from the storage proxy, so that we have better control from the storage proxy when we change it later to work across multiple shards. For example, we will want to acquire locks on multiple shards, transform it on one shard, and then apply the mutation on multiple shards. After the change it works as follows in storage proxy: 1. acquire counter locks 2. call replica prepare to check the operation and transform the mutation 3. call replica apply to apply the transformed mutation	2025-11-03 15:59:46 +01:00
Michael Litvak	7cc6b0d960	replica/db: add counter update guard Add a RAII guard for counter update that holds the counter locks and the table operation, and extract the creation of the guard to a separate function. This prepares it for a later change where we will want to obtain the guard externally from the storage proxy.	2025-11-03 08:43:11 +01:00
Michael Litvak	88fd9a34c4	replica/db: split counter update helper functions Split do_apply_counter_update to a few smaller and simpler functions to help prepare for a later change.	2025-11-03 08:43:11 +01:00
Lakshmi Narayanan Sreethar	3eb7193458	backlog_controller: compute backlog even when static shares are set The compaction manager backlog is exposed via metrics, but if static shares are set, the backlog is never calculated. As a result, there is no way to determine the backlog and if the static shares need adjustment. Fix that by calculating backlog even when static shares are set. Fixes #26287 Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com> Closes scylladb/scylladb#26778	2025-10-31 18:18:36 +02:00
Tomasz Grabiec	1c0d847281	Merge 'load_balancer: load_stats reconcile after tablet migration and table resize' from Ferenc Szili This change adds the ability to move tablets sizes in load_stats after a tablet migration or table resize (split/merge). This is needed because the size based load balancer needs to have tablet size data which is as accurate as possible, in order to work on fresh tablet size distribution and issue correct tablet migrations. This is the second part of the size based load balancing changes: - First part for tablet size collection via load_stats: #26035 - Second part reconcile load_stats: #26152 - The third part for load_sketch changes: #26153 - The fourth part which performs tablet load balancing based on tablet size: #26254 This is a new feature and backport is not needed. Closes scylladb/scylladb#26152 * github.com:scylladb/scylladb: load_balancer: load_stats reconcile after tablet migration and table resize load_stats: change data structure which contains tablet sizes	2025-10-31 09:58:25 +01:00

1764 Commits