scylladb

mirror of https://github.com/scylladb/scylladb.git synced 2026-05-22 15:52:13 +00:00

Author	SHA1	Message	Date
Szymon Malewski	6b2fce03f9	alternator: optional stripping of http response headers In Alternator's HTTP API, response headers can dominate bandwidth for small payloads. The Server, Date, and Content-Type headers were sent on every response but many clients never use them. This patch introduces three Alternator config options: - alternator_http_response_server_header, - alternator_http_response_disable_date_header, - alternator_http_response_disable_content_type_header, which allow customizing or suppressing the respective HTTP response headers. All three options support live update (no restart needed). The Server header is no longer sent by default; the Date and Content-Type defaults preserve the existing behavior. The Server and Date header suppression uses Seastar's set_server_header() and set_generate_date_header() APIs added in https://github.com/scylladb/seastar/pull/3217. This patch also fixes deprecation warnings from older Seastar HTTP APIs. Tests are in test/alternator/test_http_headers.py. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-70 Closes scylladb/scylladb#28288	2026-05-19 10:47:13 +03:00
Piotr Dulikowski	26671d4d5f	Merge 'Refactor view_update_builder' from Wojciech Mitros This series improves the readability and structure of view_update_builder, the component that generates materialized view updates from base-table mutations. The first four patches are pure renames and refactoring with no semantic changes: 1. Document that the builder operates on a single base partition. 2. Rename member fields to clearly distinguish readers (the mutation_reader streams) from the cached fragments (the last mutation_fragment_v2 read from each stream). 3. Rename advance/on_results methods to names that describe what they actually do: read the next fragment, or generate view updates. 4. Extract partition-start handling into its own method. The next two patches are minor optimizations: 5. Simplify clustering-row handling by moving the row out of the fragment before applying the tombstone, avoiding an unnecessary memory-usage recalculation in the reader permit. 6. Replace deep copies with moves in the existing-only tail path, matching the pattern used everywhere else. Finally, patch 7 deduplicates the fragment-consuming logic by extracting the three repeated blocks into consume_both_fragments(), consume_update_fragment(), and consume_existing_fragment(). Code reorganization - no backport needed Closes scylladb/scylladb#29497 * github.com:scylladb/scylladb: mv: deduplicate code for consuming fragments in view_update_builder mv: avoid unnecessary copies of existing rows in generate_updates() mv: simplify clustering row handling in generate_updates() mv: rename methods in view_update_builder for clarity mv: rename view_update_builder readers and cached fragments mv: drop redundant std::move from partition key extraction mv: document single-partition builder scope	2026-05-18 15:52:26 +02:00
Piotr Dulikowski	5efb43195e	Merge 'db/schema_tables: don't emit empty view_building_tasks mutation on ALTER TABLE' from Michał Jadwiszczak After a recent change (`1a32ccd`), `make_update_indices_mutations()` unconditionally adds a mutation for `system.view_building_tasks`, even when no indices are being dropped. In a mixed-version cluster, an older node may not have this table, causing the Raft schema applier to fail with 'Can't find a column family with UUID ...'. This patch fixes the bug by emitting the mutation only when indices are actually dropped (i.e., when the view building cleanup code path was entered). Fixes: SCYLLADB-2026 Refs: scylladb#26557 scylladb#26557 wasn't backported, so this patch also doesn't need to be. Closes scylladb/scylladb#29908 * github.com:scylladb/scylladb: db/schema_tables: don't emit empty view_building_tasks mutation on ALTER TABLE db/view_building_task_mutation_builder: add `empty()` method	2026-05-18 15:37:02 +02:00
Michał Jadwiszczak	a9b2baf36b	db/schema_tables: don't emit empty view_building_tasks mutation on ALTER TABLE After a recent change (`1a32ccd`), `make_update_indices_mutations()` unconditionally adds a mutation for `system.view_building_tasks`, even when no indices are being dropped. In a mixed-version cluster, an older node may not have this table, causing the Raft schema applier to fail with 'Can't find a column family with UUID ...'. This patch fixes the bug by emitting the mutation only when indices are actually dropped (i.e., when the view building cleanup code path was entered). Fixes: SCYLLADB-2026 Refs: scylladb#26557	2026-05-18 10:01:21 +02:00
Michał Jadwiszczak	82eb5611ab	db/view_building_task_mutation_builder: add `empty()` method The method checks whether the builder contains any changes, which makes it possible to skip emitting an empty mutation.	2026-05-18 09:54:26 +02:00
Petr Gusev	9e3209e4a3	cql: refactor add_tablet_info to take tablet_routing_info directly Change add_tablet_info() to accept locator::tablet_routing_info instead of destructured (tablet_replica_set, token_range) pair. This simplifies all three call sites. Remove the empty-replicas guard inside add_tablet_info(): the only producer of tablet_routing_info is tablet ERM's check_locality(), which returns either nullopt (correctly routed) or info with replicas copied from tablet_info — a tablet always has replicas. All callers already check for nullopt before calling add_tablet_info(), so by the time we enter the function replicas are guaranteed non-empty.	2026-05-15 12:28:33 +02:00
Michał Jadwiszczak	b175f5b97d	db/view/view_building_worker: add more logs when flushing base table Add debug logs around flushing the base table to show how long the flush takes when view building stalls. Refs SCYLLADB-1261	2026-05-14 10:23:42 +02:00
Piotr Dulikowski	3c2c814215	Merge 'db/view/view_building: replace system keyspace functions with mutation builder' from Michał Jadwiszczak `system.view_building_tasks` is a single partition table, so it makes more sense to use a mutation builder and generate 1 mutation per group0 command instead of generating multiple mutations. This PR removes all `make_..._mutation()` system keyspace functions related to view building tasks and replaces them with a mutation builder. Refs https://github.com/scylladb/scylladb/issues/25929 This patch doesn't fix any bug; it only reduces the number of generated mutations, so there is no need to backport it. Closes scylladb/scylladb#26557 * github.com:scylladb/scylladb: db/system_keyspace: replace `make_remove_view_building_task_mutation()` with mutation builder db/view/view_building_task_mutation_builder: make uuid generator optional db/system_keyspace: replace `make_view_building_task_mutation()` with mutation builder db/view/view_building_task_mutation_builder: add helper method	2026-05-13 16:10:55 +02:00
Michał Jadwiszczak	1a32ccd8f6	db/system_keyspace: replace `make_remove_view_building_task_mutation()` with mutation builder Again, get rid of the system keyspace method in favor of the mutation builder, because `system.view_building_tasks` is a single partition table.	2026-05-13 10:06:18 +02:00
Michał Jadwiszczak	2561cc1546	db/view/view_building_task_mutation_builder: make uuid generator optional After scylladb/scylladb#28929, `task_uuid_generator` became a necessary dependency of `view_building_task_mutation_builder`. However, to create the generator we need `view_building_state`, which in some parts of the code (schema_tables.cc, migration_manager.cc) requires a remote proxy to be obtained. But sometimes we need the mutation builder just to remove a view building task. In those cases, we don't need the uuid generator and the remote proxy requirement is not necessary.	2026-05-13 09:58:27 +02:00
Michał Jadwiszczak	e002665aa7	db/system_keyspace: replace `make_view_building_task_mutation()` with mutation builder `system.view_building_tasks` is a single partition table, so it makes more sense to use a mutation builder and generate 1 mutation per group0 command instead of generating multiple mutations.	2026-05-12 21:49:18 +02:00
Michał Jadwiszczak	4227cab5cb	db/view/view_building_task_mutation_builder: add helper method Add a method to set all of a task's fields.	2026-05-12 21:28:06 +02:00
Botond Dénes	e95eb21a16	Merge 'Tablet-aware restore' from Pavel Emelyanov The mechanics of the restore are as follows - A /storage_service/tablets/restore API is called with (keyspace, table, endpoint, bucket, manifests) parameters - First, it populates the system_distributed.snapshot_sstables table with the data read from the manifests - Then it emplaces a bunch of tablet transitions (of a new "restore" kind), one for each tablet - The topology coordinator handles the "restore" transition by calling a new RESTORE_TABLET RPC against all the current tablet replicas - Each replica handles the RPC verb by - Reading the snapshot_sstables table - Filtering the read sstable infos against current node and tablet being handled - Downloading and attaching the filtered sstables This PR includes system_distributed.snapshot_sstables table from @robertbindar and preparation work from @kreuzerkrieg that extracts raw sstables downloading and attaching from existing generic sstables loading code. This is the first step towards SCYLLADB-197 and lacks many things. 
In particular - the API only works for single-DC cluster - the caller needs to "lock" tablet boundaries with min/max tablet count - not abortable - no progress tracking - sub-optimal (re-kicking API on restore will re-download everything again) - not re-attachable (if API node dies, restoration proceeds, but the caller cannot "wait" for it to complete via other node) - nodes download sstables in maintenance/streaming sched group (should be moved to maintenance/backup) Other follow-up items: - have an actual swagger object specification for `backup_location` Closes #28436 Closes #28657 Closes #28773 Closes scylladb/scylladb#28763 * github.com:scylladb/scylladb: docs: Update topology_over_raft.md with `restore` transition kind test: Add test for backup vs migration race test: Restore resilience test sstables_loader: Fail tablet-restore task if not all sstables were downloaded sstables_loader: mark sstables as downloaded after attaching sstables_loader: return shared_sstable from attach_sstable db: add update_sstable_download_status method db: add downloaded column to snapshot_sstables db: extract snapshot_sstables TTL into class constant test: Add a test for tablet-aware restore tablets: Implement tablet-aware cluster-wide restore messaging: Add RESTORE_TABLET RPC verb sstables_loader: Add method to download and attach sstables for a tablet tablets: Add restore_config to tablet_transition_info sstables_loader: Add restore_tablets task skeleton test: Add rest_client helper to kick newly introduced API endpoint api: Add /storage_service/tablets/restore endpoint skeleton sstables_loader: Add keyspace and table arguments to manfiest loading helper sstables_loader_helpers: just reformat the code sstables_loader_helpers: generalize argument and variable names sstables_loader_helpers: generalize get_sstables_for_tablet sstables_loader_helpers: add token getters for tablet filtering sstables_loader_helpers: remove underscores from struct members sstables_loader: move 
download_sstable and get_sstables_for_tablet sstables_loader: extract single-tablet SST filtering sstables_loader: make download_sstable static sstables_loader: fix formating of the new `download_sstable` function sstables_loader: extract single SST download into a function sstables_loader: add shard_id to minimal_sst_info sstables_loader: add function for parsing backup manifests split utility functions for creating test data from database_test export make_storage_options_config from lib/test_services rjson: Add helpers for conversions to dht::token and sstable_id Add system_distributed_keyspace.snapshot_sstables add get_system_distributed_keyspace to cql_test_env code: Add system_distributed_keyspace dependency to sstables_loader storage_service: Export export handle_raft_rpc() helper storage_service: Export do_tablet_operation() storage_service: Split transit_tablet() into two tablets: Add braces around tablet_transition_kind::repair switch	2026-05-12 16:24:13 +03:00
Piotr Dulikowski	7c2b1ea0b5	Merge 'view_building: fix tombstone_warn_threshold warnings' from Michał Jadwiszczak `system.view_building_tasks` is a single-partition Raft group0 table (pk = `"view_building"`, CK = timeuuid). When `clean_finished_tasks()` deletes hundreds of finished tasks, the physical rows remain in SSTables until compaction. Any subsequent read of the partition counts every column of every tombstoned row as a dead cell, triggering `tombstone_warn_threshold` warnings in large clusters. Two-part fix: 1. Range tombstones instead of row tombstones (commits 2–3) Instead of one row tombstone per finished task, find the minimum alive task UUID (`min_alive_uuid`) and emit a single range tombstone `[before_all, min_alive_uuid)` covering all tasks below that boundary. This reduces the tombstone count significantly and also benefits future compaction. 2. Bounded scan with `min_task_id` (commits 4–6) Even with range tombstones, physical rows remain until compaction and still count as dead cells during reads. The only way to avoid them is to not read them at all. - Add a `min_task_id timeuuid` static column to `system.view_building_tasks`. - On every GC, write `min_task_id = min_alive_uuid` atomically with the range tombstone (same Raft batch). - On reload, read `min_task_id` first using a static-only partition slice (empty `_row_ranges` + `always_return_static_content`): the SSTable reader stops immediately after the static row before processing any clustering tombstones — zero dead cells counted. - Use `AND id >= min_task_id` as a lower bound for the main task scan, skipping all tombstoned rows. The static-only read and the bounded scan are gated on the `VIEW_BUILDING_TASKS_MIN_TASK_ID` cluster feature so mixed-version clusters fall back to the full scan. The issue is not critical, so the fix shouldn't be backported. 
Fixes SCYLLADB-657 Closes scylladb/scylladb#28929 * github.com:scylladb/scylladb: test/cluster/test_view_building_coordinator: add reproducer for tombstone threshold warning docs: document tombstone avoidance in view_building_tasks view_building: add `task_uuid_generator` to `view_building_task_mutation_builder` view_building: introduce `task_uuid_generator` view_building: store `min_alive_uuid` in view building state view_building: set min_task_id when GC-ing finished tasks view_building: add min_task_id support to view_building_task_mutation_builder view_building: add min_task_id static column and bounded scan to system_keyspace view_building: use range tombstone when GC-ing finished tasks view_building: add range tombstone support to view_building_task_mutation_builder view_building: introduce VIEW_BUILDING_TASKS_MIN_TASK_ID cluster feature	2026-05-12 12:38:25 +03:00
Pavel Emelyanov	1c0f8ab66e	Merge 'sstables: introduce --abort-on-malformed-sstable-error' from Botond Dénes When a malformed sstable error occurs, it is usually caused by actual sstable corruption — a cosmic ray, a bad disk write, etc. However, it can also be caused by memory corruption, where a data structure in memory happens to be read as sstable data. In the latter case, having a coredump of the process at the moment of the error is invaluable for post-mortem debugging, since the exception throwing/catching machinery destroys the stack frames that would point to the corruption site. This patch series introduces `--abort-on-malformed-sstable-error`, a new command-line option (with `LiveUpdate` support) that, when set, causes the server to call `std::abort()` instead of throwing an exception whenever any sstable parse error is detected. This covers all code paths: - Direct `throw malformed_sstable_exception(...)` sites (migrated to `throw_malformed_sstable_exception()`) - Direct `throw bufsize_mismatch_exception(...)` sites (migrated to `throw_bufsize_mismatch_exception()`) - `parse_assert()` failures (via `on_parse_error()`) - BTI parse errors (via `on_bti_parse_error()`) The implementation places the flag and helper functions in `sstables/sstables.cc`, next to the existing `on_parse_error()` / `on_bti_parse_error()` infrastructure. The flag defaults to `false`, preserving current behaviour. It is intended to be enabled temporarily when investigating suspected memory corruption. Commit breakdown: 1. Infrastructure: flag, getter/setter, and throw helpers in `sstables/sstables.cc`; config option wired up in `main.cc` 2. `on_parse_error()` and `on_bti_parse_error()` check the new flag 3. All ~50 `throw malformed_sstable_exception(...)` sites migrated 4. 
Both `throw bufsize_mismatch_exception(...)` sites migrated Refs: SCYLLADB-1087 Backport: new feature, no backport Closes scylladb/scylladb#29324 * github.com:scylladb/scylladb: sstables: migrate all bufsize_mismatch_exception throw sites to throw_bufsize_mismatch_exception() sstables: migrate all malformed_sstable_exception throw sites to throw_malformed_sstable_exception() sstables: make on_parse_error() and on_bti_parse_error() respect --abort-on-malformed-sstable-error sstables: disable abort-on-malformed-sstable-error in tests that corrupt sstables on purpose sstables: introduce --abort-on-malformed-sstable-error infrastructure sstables: refactor parse_path() to return std::expected<> instead of throwing	2026-05-12 12:38:25 +03:00
Ernest Zaslavsky	7eb921a142	db: add update_sstable_download_status method Add a method to update the downloaded status of a specific SSTable entry in system_distributed.snapshot_sstables. This will be used by the tablet restore process to mark SSTables as downloaded after they have been successfully attached to the local table.	2026-05-12 10:40:23 +03:00
Ernest Zaslavsky	83ec7e22b9	db: add downloaded column to snapshot_sstables Add a 'downloaded' boolean column to the snapshot_sstables table schema and the corresponding field to the snapshot_sstable_entry struct. Update insert_snapshot_sstable() and get_snapshot_sstables() to write and read this column. This column will be used to track which SSTables have been successfully downloaded during a tablet restore operation. Co-authored-by: Pavel Emelyanov <xemul@scylladb.com>	2026-05-12 10:40:23 +03:00
Ernest Zaslavsky	61c627a7c0	db: extract snapshot_sstables TTL into class constant Move the TTL value used for snapshot_sstables rows from a local variable in insert_snapshot_sstable() to a class-level constant SNAPSHOT_SSTABLES_TTL_SECONDS, making it reusable by other methods.	2026-05-12 10:40:23 +03:00
Robert Bindar	2f19d84ad7	Add system_distributed_keyspace.snapshot_sstables This patch adds the snapshot_sstables table with the following schema: ```cql CREATE TABLE system_distributed.snapshot_sstables ( snapshot_name text, keyspace text, table text, datacenter text, rack text, id uuid, first_token bigint, last_token bigint, toc_name text, prefix text) PRIMARY KEY ((snapshot_name, keyspace, table, datacenter, rack), first_token, id); ``` The table will be populated by the coordinator node during the restore phase (and later on during the backup phase to accommodate live-restore). The content of this table is meant to be consumed by the restore worker nodes which will use this data to filter and file-based download sstables. Fixes SCYLLADB-263 Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>	2026-05-12 10:40:21 +03:00
Taras Veretilnyk	881776b441	db: add large data metrics for rows, cells, and collections Previously only large_partition_exceeding_threshold was exposed as a metric. Add three new counters to large_data_handler::stats and register corresponding Prometheus metrics: - large_rows_exceeding_threshold - large_cell_exceeding_threshold - large_collection_exceeding_threshold The counters are incremented in maybe_record_large_rows() and maybe_record_large_cells() following the same pattern used by the existing partition metric.	2026-05-11 23:11:17 +02:00
Botond Dénes	ad7ac62835	Merge ' Add a node_owner column (locator::host_id) to system.sstables and make it part of the partition key' from Dimitrios Symonidis Add a node_owner column (locator::host_id) to system.sstables and make it part of the partition key, so the primary key becomes PRIMARY KEY ((table_id, node_owner), generation). This is the first step toward moving the sstables registry into system_distributed: once distributed, each node's startup scan must read only the rows it owns, which requires the owning node to be part of the partition key. Partitioning by (table_id, node_owner) turns that scan into a single-partition read of exactly the local node's rows. Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1562 No need to backport this, keyspace over object storage is an experimental feature Closes scylladb/scylladb#29659 * github.com:scylladb/scylladb: db, sstables: add node_owner to sstables registry primary key db, sstables: rename sstables registry column owner to table_id	2026-05-11 14:08:19 +03:00
Botond Dénes	f6dc2cb5f8	sstables: introduce --abort-on-malformed-sstable-error infrastructure Add the --abort-on-malformed-sstable-error command-line option and the supporting infrastructure. When set, any malformed sstable error will abort the process and generate a coredump instead of throwing an exception. This is useful for debugging memory corruption that may manifest as apparent sstable corruption. The implementation introduces: - throw_malformed_sstable_exception() and throw_bufsize_mismatch_exception() helper functions in sstables/sstables.cc, which check the new flag and either abort (with logging) or throw the appropriate exception. - set_abort_on_malformed_sstable_error() / abort_on_malformed_sstable_error() to control the per-process atomic flag. - abort_on_malformed_sstable_error config option (LiveUpdate, default false) wired up in main.cc alongside abort_on_internal_error. Call-site migration will follow in subsequent commits.	2026-05-11 11:58:14 +03:00
Botond Dénes	c3daa6379c	sstables: refactor parse_path() to return std::expected<> instead of throwing make_entry_descriptor() and the two overloads of parse_path() used to signal parse failures by throwing malformed_sstable_exception, which made parse_path() expensive to use as a probe (e.g. to classify directory entries). Change make_entry_descriptor() and both parse_path() overloads to return std::expected<T, sstring>, where the sstring carries the error message on failure, eliminating the exception overhead at probe call sites. Call sites that previously caught malformed_sstable_exception to treat the path as a non-SSTable file (utils/directories.cc, db/snapshot/backup_task.cc, tools/scylla-sstable.cc) now check the expected result directly. Call sites where a parse failure is a genuine error (sstable_directory.cc, sstables.cc, tools/schema_loader.cc, tools/scylla-sstable.cc) re-throw explicitly as malformed_sstable_exception using the error string, preserving the existing error propagation behaviour.	2026-05-11 11:58:14 +03:00
Botond Dénes	9b2dfab2e5	Merge 'Don't use database.get_config() to fetch calculate_view_update_throttling_delay option' from Pavel Emelyanov This option is used in two places -- proxy and view-update-generator both need it to calculate the calculate_view_update_throttling_delay() value. This PR moves the option onto the view_update_backlog top-level service, makes the calculation helper a method of that class and patches the callers to use it. This eliminates more places that abuse database as a db::config accessor. Code dependencies refactoring, not backporting Closes scylladb/scylladb#29635 * github.com:scylladb/scylladb: view: Turn calculate_view_update_throttling_delay into node_update_backlog member view: Place view_flow_control_delay_limit_in_ms on node_update_backlog view: Add node_update_backlog reference to view_update_generator	2026-05-11 10:30:24 +03:00
Botond Dénes	3f72852d8c	Merge 'Fix missing format string placeholders across the codebase (33 bugs across 14 modules )' from Yaniv Kaul Fix 28 format string bugs plus 5 related format argument bugs across 14 modules where `{}` placeholders were missing or arguments were wrong, causing arguments to be silently dropped or misleading output from the `{fmt}` library. Inspired by https://github.com/scylladb/scylladb/pull/29143 (which fixed a single instance in `replica/table.cc`), a comprehensive audit of the entire codebase was performed to find all similar issues. - Missing `{}` placeholder (21 instances): format string simply lacks `{}` for a passed argument, e.g. `format("msg for table {}", group_id, table_id)` -- `group_id` is silently dropped - Spurious comma breaking C++ string literal concatenation (2 instances): a comma after a string literal prevents adjacent-literal concatenation, turning the continuation into a format argument instead of part of the format string - Printf-style `%s` in fmtlib context (4 instances): `%s` has no meaning in fmtlib and appears as literal text while the argument is silently ignored - Extra spurious argument (1 instance): an extraneous `t.tomb()` argument inserted between correct arguments, causing wrong values in the wrong slots - Wrong variable in error message (4 instances in `types/map.hh`): error messages for oversized map keys/values reported `map_size` (total entry count) instead of the actual `elem.first.size()` or `elem.second.size()` that exceeded the limit - Swapped argument order (1 instance in `data_dictionary/data_dictionary.cc`): format string says `"Extraneous options for {type}: {values}"` but the values and type arguments were passed in reverse order \| Module \| Bugs Fixed \| Files \| \|--------\|:---------:\|-------\| \| `replica/` \| 1 \| `table.cc` \| \| `service/` \| 4 \| `raft_group0.cc`, `storage_service.cc` \| \| `db/` \| 6 \| `heat_load_balance.cc`, `commitlog_replayer.cc`, `view_update_generator.cc`, 
`view_building_worker.cc`, `row_locking.cc` \| \| `cql3/` \| 2 \| `prepare_expr.cc`, `statement_restrictions.cc` \| \| `transport/` \| 4 \| `event_notifier.cc` \| \| `sstables/` \| 3 \| `partition_reversing_data_source.cc`, `reader.cc` \| \| `alternator/` \| 1 \| `conditions.cc` \| \| `cdc/` \| 1 \| `split.cc` \| \| `raft/` \| 1 \| `server.cc` \| \| `utils/` \| 2 \| `gcp/object_storage.cc`, `s3/client.cc` \| \| `mutation/` \| 1 \| `mutation_partition.hh` \| \| `ent/` \| 2 \| `kmip_host.cc`, `kms_host.cc` \| \| `types/` \| 4 \| `map.hh` \| \| `data_dictionary/` \| 1 \| `data_dictionary.cc` \| The `{fmt}` library's compile-time checker validates that each `{}` placeholder references a valid argument, but does not verify the reverse -- that every argument has a corresponding placeholder. Extra arguments are silently ignored at both compile time and runtime. Build verified with `dbuild ninja build/dev/scylla` -- compiles cleanly. --- Note: Commits were amended to fix the author name from "Yaniv Michael Kaul" to "Yaniv Kaul". 
Closes scylladb/scylladb#29448 * github.com:scylladb/scylladb: data_dictionary: fix swapped arguments in extraneous options error types: fix wrong variable in map key/value size error messages ent: fix missing format placeholders in encryption error/log messages mutation: fix spurious argument in shadowable_tombstone formatter utils: fix missing format placeholders in object storage log messages raft: fix missing format placeholder in server ostream operator cdc: fix missing format placeholder in error message alternator: fix missing format placeholder in error message sstables: fix missing format placeholders in error messages transport: fix printf-style format specifiers in fmtlib log calls cql3: fix missing format placeholders in error messages db: fix missing format placeholders in log and error messages service: fix missing format placeholders in log messages replica: fix missing format placeholder in cleanup log message	2026-05-11 07:04:42 +03:00
Nadav Har'El	df8c9b17b8	Merge 'alternator: Graduate Alternator Streams from experimental' from Piotr Szymaniak As a final step for https://scylladb.atlassian.net/browse/SCYLLADB-461 we need to graduate Alternator Streams from experimental. So let's remove `--experimental-features=alternator-streams` and map the obsolete config string to `UNUSED` for backward compatibility. Also, remove the related gating of the feature. Finally, stop providing the config flag in test configs. Fixes SCYLLADB-1680 Fixes #16367 The documentation work tracked by https://scylladb.atlassian.net/browse/SCYLLADB-462 still remains. This PR needs to hit 2026.2, so (only) if it branches before the PR is merged to `master`, we'd need to backport. Closes scylladb/scylladb#29604 * github.com:scylladb/scylladb: test: Stop providing alternator-streams experimental flag alternator: Graduate Alternator Streams from experimental	2026-05-10 22:10:03 +03:00
Yaniv Kaul	fdebed5746	db: fix missing format placeholders in log and error messages Fix six format string bugs where arguments were silently dropped: - heat_load_balance.cc: pp value was passed but had no {} placeholder. - commitlog_replayer.cc: column_family_id was passed but table= had no {} placeholder. - view_update_generator.cc: _sstables_with_tables.size() was passed but had no {} placeholder. - view_building_worker.cc: exception pointer was passed but the trailing colon had no {} placeholder. - row_locking.cc: partition key and clustering key were passed in error messages but had no {} placeholders. Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>	2026-05-10 17:49:50 +03:00
Avi Kivity	5a887362e3	Merge 'Remove legacy tables creation code' from Gleb Natapov Drop the creation code for the `service_levels` and `cdc_generation_descriptions_v2` tables since it is no longer needed. Old clusters will still have these tables because they were created earlier. The series also contains a small improvement around group0 creation. No backport needed since this removes functionality. Closes scylladb/scylladb#29482 * github.com:scylladb/scylladb: db/system_distributed_keyspace: remove system_distributed_everywhere since it is unused db/system_distributed_keyspace: drop CDC_TOPOLOGY_DESCRIPTION and CDC_GENERATIONS_V2 db/system_distributed_keyspace: remove unused code db/system_distributed_keyspace: drop old cdc_generation_descriptions_v2 table db/system_distributed_keyspace: drop old service_levels table fix indent after the previous patch group0: call setup_group0 only when needed	2026-05-10 14:46:21 +03:00
Calle Wilund	8d65a03951	commitlog: Fix segment/chunk overhead maybe not included in next_position calculation Refs: SCYLLADB-1757 Refs: SCYLLADB-1815 If we're in a brand new chunk (no buffer yet allocated), we would miscalculate the actual size of an entry to write, possibly causing segment size overshoot. Break out some logic to share between this calc and new_buffer. Also remove redundant (and possibly wrong) constant in oversized allocation.	2026-05-05 14:39:06 +02:00
Dimitrios Symonidis	c40842f60a	db, sstables: add node_owner to sstables registry primary key Add a node_owner column (locator::host_id) to system.sstables and make it part of the partition key, so the primary key becomes PRIMARY KEY ((table_id, node_owner), generation). This is the first step toward moving the sstables registry into system_distributed: once distributed, each node's startup scan must read only the rows it owns, which requires the owning node to be part of the partition key. Partitioning by (table_id, node_owner) turns that scan into a single-partition read of exactly the local node's rows. The new column is populated via sstables_manager::get_local_host_id(). No backward compatibility is preserved; the feature is experimental and gated by keyspace-storage-options.	2026-04-24 16:41:09 +02:00
Dimitrios Symonidis	ce78c5113e	db, sstables: rename sstables registry column owner to table_id The partition-key column in system.sstables named 'owner' actually holds a table_id. Rename the CQL column and the matching C++ parameter and member names so the identifier describes what it stores. No behavior change. This prepares the schema for an upcoming node_owner partition-key column (the local host id), which needs a free name.	2026-04-24 16:24:07 +02:00
Pavel Emelyanov	111165d9de	view: Turn calculate_view_update_throttling_delay into node_update_backlog member The free function calculate_view_update_throttling_delay() took the view_flow_control_delay_limit_in_ms as a parameter, which forced its two callers (storage_proxy and view_update_generator) to fish the option out of db::config via database::get_config(). Now that the option lives on node_update_backlog, make the throttling calculation a member of node_update_backlog and have the callers invoke it on their node_update_backlog reference. This removes two database::get_config() call sites. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-24 13:52:12 +03:00
Pavel Emelyanov	855372db3c	view: Place view_flow_control_delay_limit_in_ms on node_update_backlog Store the view_flow_control_delay_limit_in_ms config option as an updateable_value on node_update_backlog. The value is threaded from main.cc into the backlog object at construction time. Existing call sites (tests) that construct node_update_backlog without the option continue to work via a default argument. Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-24 13:47:54 +03:00
Pavel Emelyanov	ec2339e635	view: Add node_update_backlog reference to view_update_generator Pass node_update_backlog explicitly to view_update_generator via its constructor and start() call. This is plumbing only; no behavior change. A subsequent patch will use this reference to compute view update throttling delays without going through database::get_config(). Signed-off-by: Pavel Emelyanov <xemul@scylladb.com> Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-24 13:45:46 +03:00
Piotr Szymaniak	870013b437	alternator: Graduate Alternator Streams from experimental Alternator Streams were experimental until 2026.2, when they became GA. Stop requiring `--experimental-features=alternator-streams` by: - Removing ALTERNATOR_STREAMS from the experimental feature enum - Mapping "alternator-streams" to UNUSED for backward compatibility - Removing the gating that disabled the ALTERNATOR_STREAMS gossip feature when the experimental flag was absent - Removing the runtime guard that rejected StreamSpecification requests without the feature flag - Updating config_test to reflect the new UNUSED mapping The gms::feature alternator_streams is kept for rolling upgrade compatibility with older nodes. Fixes SCYLLADB-1680	2026-04-22 15:22:15 +02:00
Botond Dénes	18ceeaf3ef	Merge 'Restrict tombstone GC sstable set to repaired sstables for tombstone_gc=repair mode' from Raphael Raph Carvalho When tombstone_gc=repair, the repaired compaction view's sstable_set_for_tombstone_gc() previously returned all sstables across all three views (unrepaired, repairing, repaired). This is correct but unnecessarily expensive: the unrepaired and repairing sets are never the source of a GC-blocking shadow when tombstone_gc=repair, for base tables. The key ordering guarantee that makes this safe is: - topology_coordinator sends send_tablet_repair RPC and waits for it to complete. Inside that RPC, mark_sstable_as_repaired() runs on all replicas, moving D from repairing → repaired (repaired_at stamped on disk). - Only after the RPC returns does the coordinator commit repair_time + sstables_repaired_at to Raft. - gc_before = repair_time - propagation_delay only advances once that Raft commit applies. Therefore, when a tombstone T in the repaired set first becomes GC-eligible (its deletion_time < gc_before), any data D it shadows is already in the repaired set on every replica. This holds because: - The memtable is flushed before the repairing snapshot is taken (take_storage_snapshot calls sg->flush()), capturing all data present at repair time. - Hints and batchlog are flushed before the snapshot, ensuring remotely-hinted writes arrive before the snapshot boundary. - Legitimate unrepaired data has timestamps close to 'now', always newer than any GC-eligible tombstone (USING TIMESTAMP to write backdated data is user error / UB). Excluding the repairing and unrepaired sets from the GC shadow check cannot cause any tombstone to be wrongly collected. The memtable check is also skipped for the same reason: memtable data is either newer than the GC-eligible tombstone, or was flushed into the repairing/repaired set before gc_before advanced. Safety restriction — materialized views: The optimization IS applied to materialized view tables. 
Two possible paths could inject D_view into the MV's unrepaired set after MV repair: view hints and staging via the view-update-generator. Both are safe: (1) View hints: flush_hints() creates a sync point covering BOTH _hints_manager (base mutations) AND _hints_for_views_manager (view mutations). It waits until ALL pending view hints — including D_view entries queued in _hints_for_views_manager while the target MV replica was down — have been replayed to the target node before take_storage_snapshot() is called. D_view therefore lands in the MV's repairing sstable and is promoted to repaired. When a repaired compaction then checks for shadows it finds D_view in the repaired set, keeping T_mv non-purgeable. (2) View-update-generator staging path: Base table repair can write a missing D_base to a replica via a staging sstable. The view-update-generator processes the staging sstable ASYNCHRONOUSLY: it may fire arbitrarily later, even after MV repair has committed repair_time and T_mv has been GC'd from the repaired set. However, the staging processor calls stream_view_replica_updates() which performs a READ-BEFORE-WRITE via as_mutation_source_excluding_staging(): it reads the CURRENT base table state before building the view update. If T_base was written to the base table (as it always is before the base replica can be repaired and the MV tombstone can become GC-eligible), the view_update_builder sees T_base as the existing partition tombstone. D_base's row marker (ts_d < ts_t) is expired by T_base, so the view update is a no-op: D_view is never dispatched to the MV replica. No resurrection can occur regardless of how long staging is delayed. A potential sub-edge-case is T_base being purged BEFORE staging fires (leaving D_base as the sole survivor, so stream_view_replica_updates would dispatch D_view). 
This is blocked by an additional invariant: for tablet-based tables, the repair writer stamps repaired_at on staging sstables (repair_writer_impl::create_writer sets mark_as_repaired = true and perform_component_rewrite writes repaired_at = sstables_repaired_at + 1 on every staging sstable). After base repair commits sstables_repaired_at to Raft, the staging sstable satisfies is_repaired(sstables_repaired_at, staging_sst) and therefore appears in make_repaired_sstable_set(). Any subsequent base repair that advances sstables_repaired_at further still includes the staging sstable (its repaired_at ≤ new sstables_repaired_at). D_base in the staging sstable thus shadows T_base in every repaired compaction's shadow check, keeping T_base non-purgeable as long as D_base remains in staging. A base table hint also cannot bypass this. A base hint is replayed as a base mutation. The resulting view update is generated synchronously on the base replica and sent to the MV replica via _hints_for_views_manager (path 1 above), not via staging. USING TIMESTAMP with timestamps predating (gc_before + propagation_delay) is explicitly UB and excluded from the safety argument. For tombstone_gc modes other than repair (timeout, immediate, disabled) the invariant does not hold for base tables either, so the full storage-group set is returned. The expected gain is reduced bloom filter and memtable key-lookup I/O during repaired compactions: the unrepaired set is typically the largest (it holds all recent writes), yet for tombstone_gc=repair it never influences GC decisions. Fixes https://scylladb.atlassian.net/browse/SCYLLADB-231. Closes scylladb/scylladb#29310 * github.com:scylladb/scylladb: compaction: Restrict tombstone GC sstable set to repaired sstables for tombstone_gc=repair mode test/repair: Add tombstone GC safety tests for incremental repair	2026-04-22 10:21:37 +03:00
Michał Jadwiszczak	1162fd315e	view_building: add `task_uuid_generator` to `view_building_task_mutation_builder` Following previous commit, use the generator in view building task mutation builder.	2026-04-22 09:10:14 +02:00
Michał Jadwiszczak	b64f2d2e90	view_building: introduce `task_uuid_generator` With the new `min_alive_uuid` saved in the group0 table, we need to make sure that all new tasks are created with a time uuid greater than the value saved in `min_alive_uuid`. This patch introduces the `task_uuid_generator`, which ensures that when we are generating multiple tasks in one group0 command, each task will have a unique time uuid and each time uuid will be greater than `min_alive_uuid`.	2026-04-22 09:10:14 +02:00
Michał Jadwiszczak	e5a6ed72b9	view_building: store `min_alive_uuid` in view building state Because we're now limiting the range we read from the view building tasks table, we need to make sure that new tasks are created with a larger uuid than the `min_alive_uuid`. To do that, we need to be able to see the current `min_alive_uuid` while creating new tasks.	2026-04-22 09:10:14 +02:00
Michał Jadwiszczak	8d0943ce35	view_building: set min_task_id when GC-ing finished tasks When VIEW_BUILDING_TASKS_MIN_TASK_ID feature is active, write min_task_id alongside the range tombstone in the same Raft batch. min_task_id is set to min_alive_uuid so subsequent get_view_building_tasks() scans start exactly at the first alive row, skipping all tombstoned rows. When all tasks are deleted, min_task_id is set to a freshly generated UUID to ensure future tasks (which will have larger timeuuids) are not skipped. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-22 09:10:14 +02:00
Michał Jadwiszczak	b689de0414	view_building: add min_task_id support to view_building_task_mutation_builder Add set_min_task_id(id) which writes the min_task_id static cell to the main "view_building" partition. The static cell is written as part of the same mutation as the range tombstone, keeping everything in one Raft batch. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-22 09:10:14 +02:00
Michał Jadwiszczak	8670111cd4	view_building: add min_task_id static column and bounded scan to system_keyspace Add a min_task_id timeuuid static column to system.view_building_tasks. When VIEW_BUILDING_TASKS_MIN_TASK_ID feature is active, get_view_building_tasks() reads min_task_id first using a static-only partition slice (empty _row_ranges + always_return_static_content). This makes the SSTable reader stop immediately after the static row before processing any clustering tombstones, so the read never triggers tombstone_warn_threshold warnings. min_task_id is then used as AND id >= ? lower bound for the main task scan, skipping all tombstoned rows below the boundary. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-22 09:10:14 +02:00
Michał Jadwiszczak	8f741b462b	view_building: use range tombstone when GC-ing finished tasks Instead of issuing one row tombstone per finished task, collect all tasks to delete, find the smallest timeuuid among alive tasks (min_alive_uuid), then emit a single range tombstone [before_all, min_alive_uuid) covering all tasks below that boundary. Tasks above the boundary (rare: finished task interleaved with alive tasks) still get individual row tombstones. When no alive tasks remain, del_all_tasks() covers the entire partition with a single range tombstone. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-22 09:10:13 +02:00
Michał Jadwiszczak	91697d597c	view_building: add range tombstone support to view_building_task_mutation_builder Add del_tasks_before(id) which emits a range tombstone [before_all, id) and del_all_tasks() which covers the entire clustering range. These will be used by the coordinator to delete finished tasks in bulk instead of issuing one row tombstone per task. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>	2026-04-22 09:10:13 +02:00
Tomasz Grabiec	cddde464ca	Merge 'service: Support adding/removing a datacenter with tablets by changing RF' from Aleksandra Martyniuk With this change, you can add or remove one or more DCs in a single ALTER KEYSPACE statement. It requires the keyspace to use rack list replication factor. In the existing approach, during an RF change all tablet replicas are rebuilt at once. This isn't the case now. In global_topology_request::keyspace_rf_change the request is added to ongoing_rf_changes - a new column in the system.topology table. In a new column in system_schema.keyspaces - next_replication - we keep the target RF. In make_rf_change_plan, the load balancer schedules necessary migrations, considering the load of nodes and other pending tablet transitions. Requests from ongoing_rf_changes are processed concurrently, independently from one another. In each request racks are processed concurrently. No tablet replica will be removed until all required replicas are added. While adding replicas to each rack we always start with base tables and won't proceed with views until they are done (while removing - the other way around). The intermediary steps aren't reflected in schema. When the RF change is finished: - in system_schema.keyspaces: - next_replication is cleared; - new keyspace properties are saved; - request is removed from ongoing_rf_changes; - the request is marked as done in system.topology_requests. Until the request is done, DESCRIBE KEYSPACE shows the replication_v2. If a request hasn't started to remove replicas, it can be aborted using task manager. system.topology_requests::error is set (but the request isn't marked as done) and next_replication = replication_v2. This is interpreted by the load balancer, which starts the rollback of the request. After the rollback is done, we set the relevant system.topology_requests entry as done (failed), clear the request id from system.topology::ongoing_rf_changes, and remove next_replication. Fixes: SCYLLADB-567. 
No backport needed; new feature. Closes scylladb/scylladb#24421 * github.com:scylladb/scylladb: service: fix indentation docs: update documentation test: test multi RF changes service: tasks: allow aborting ongoing RF changes cql3: allow changing RF by more than one when adding or removing a DC service: handle multi_rf_change service: implement make_rf_change_plan service: add keyspace_rf_change_plan to migration_plan service: extend tablet_migration_info to handle rebuilds service: split update_node_load_on_migration service: rearrange keyspace_rf_change handler db: add columns to system_schema.keyspaces db: service: add ongoing_rf_changes to system.topology gms: add keyspace_multi_rf_change feature	2026-04-22 01:46:11 +02:00
Wojciech Mitros	667a928e81	mv: deduplicate code for consuming fragments in view_update_builder Deduplicate the fragment-consuming logic in view_update_builder::generate_updates() by extracting it into three private methods: consume_both_fragments(), consume_update_fragment(), and consume_existing_fragment(). The three inlined blocks for cmp < 0, cmp > 0, and cmp == 0 were identical to the trailing "update only" and "existing only" blocks. The only semantic change is in the trailing "existing only" path: the outer tombstone guard is replaced by per-branch tombstone checks inside consume_existing_fragment(), which is both sufficient and more precise for the static_row case (uses partition tombstone only, not range tombstone which is irrelevant for static rows).	2026-04-22 00:26:52 +02:00
Wojciech Mitros	00be36e08f	mv: avoid unnecessary copies of existing rows in generate_updates() In the existing-only tail block of generate_updates(), the clustering row and static row were extracted from the fragment using a deep copy constructor (e.g. clustering_row(*_schema, fragment.as_clustering_row())) even though the fragment is not used afterwards. Replace with moves, matching the pattern used in all other cases.	2026-04-22 00:26:52 +02:00
Wojciech Mitros	74902dceac	mv: simplify clustering row handling in generate_updates() Two of the three clustering-row cases in generate_updates() used mutate_as_clustering_row() to apply a tombstone to the row in-place, then immediately moved the row out of the fragment. This triggered an unnecessary memory usage recalculation in the reader permit, since: 1. apply(tombstone) does not change external memory usage (tombstone is stored inline, not heap-allocated), so the recalculation will yield the same result. 2. The fragment is consumed on the very next line, so the tracking window is effectively zero. Simplify these two cases to match the first case (cmp < 0), which already uses the simpler pattern of moving the row out of the fragment first, then applying the tombstone on the extracted row.	2026-04-22 00:26:52 +02:00
Wojciech Mitros	7727a37085	mv: rename methods in view_update_builder for clarity Rename advance_all(), advance_updates() and advance_existings() to read_both_next_fragments(), read_next_update_fragment() and read_next_existing_fragment(), respectively. The new names make it clear that these methods read the next mutation fragment from the corresponding reader into the cached fragment member. Also rename on_results() to generate_updates(), which better describes its role of generating view updates from the previously read fragments.	2026-04-22 00:26:52 +02:00
Wojciech Mitros	490b3f5c6f	mv: rename view_update_builder readers and cached fragments Rename the members of view_update_builder to reflect their roles more precisely: _updates -> _update_reader _existings -> _existing_reader _update -> _update_fragment _existing -> _existing_fragment This makes the code easier to follow by distinguishing the readers (which produce a stream of fragments) from the cached fragments (the most recently read mutation_fragment_v2 from each reader).	2026-04-22 00:26:52 +02:00

4972 Commits