In Alternator's HTTP API, response headers can dominate bandwidth for
small payloads. The Server, Date, and Content-Type headers were sent on
every response but many clients never use them.
This patch introduces three Alternator config options:
- alternator_http_response_server_header,
- alternator_http_response_disable_date_header,
- alternator_http_response_disable_content_type_header,
which allow customizing or suppressing the respective HTTP response
headers. All three options support live update (no restart needed).
The Server header is no longer sent by default; the Date and
Content-Type defaults preserve the existing behavior.
The Server and Date header suppression uses Seastar's
set_server_header() and set_generate_date_header() APIs added in
https://github.com/scylladb/seastar/pull/3217. This patch also
fixes deprecation warnings from older Seastar HTTP APIs.
Tests are in test/alternator/test_http_headers.py.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-70
Closes scylladb/scylladb#28288
This series improves the readability and structure of
view_update_builder, the component that generates materialized view
updates from base-table mutations.
The first four patches are pure renames and refactoring with no
semantic changes:
1. Document that the builder operates on a single base partition.
2. Rename member fields to clearly distinguish readers (the
mutation_reader streams) from the cached fragments (the last
mutation_fragment_v2 read from each stream).
3. Rename advance/on_results methods to names that describe what
they actually do: read the next fragment, or generate view
updates.
4. Extract partition-start handling into its own method.
The next two patches are minor optimizations:
5. Simplify clustering-row handling by moving the row out of the
fragment before applying the tombstone, avoiding an unnecessary
memory-usage recalculation in the reader permit.
6. Replace deep copies with moves in the existing-only tail path,
matching the pattern used everywhere else.
Finally, patch 7 deduplicates the fragment-consuming logic by
extracting the three repeated blocks into consume_both_fragments(),
consume_update_fragment(), and consume_existing_fragment().
Code reorganization - no backport needed
Closes scylladb/scylladb#29497
* github.com:scylladb/scylladb:
mv: deduplicate code for consuming fragments in view_update_builder
mv: avoid unnecessary copies of existing rows in generate_updates()
mv: simplify clustering row handling in generate_updates()
mv: rename methods in view_update_builder for clarity
mv: rename view_update_builder readers and cached fragments
mv: drop redundant std::move from partition key extraction
mv: document single-partition builder scope
After a recent change (1a32ccd), `make_update_indices_mutations()` unconditionally adds a mutation for `system.view_building_tasks`, even when no indices are being dropped.
In a mixed-version cluster, the older node may not have this table, causing the Raft schema applier to fail with 'Can't find a column family with UUID ...'.
This patch fixes the bug by emitting the mutation only when indices are actually dropped (i.e., when the view building cleanup code path was entered).
Fixes: SCYLLADB-2026
Refs: scylladb#26557
scylladb#26557 wasn't backported, so this patch also doesn't need to be.
Closes scylladb/scylladb#29908
* github.com:scylladb/scylladb:
db/schema_tables: don't emit empty view_building_tasks mutation on ALTER TABLE
db/view_building_task_mutation_builder: add `empty()` method
After a recent change (1a32ccd), `make_update_indices_mutations()` unconditionally
adds a mutation for `system.view_building_tasks`, even when no indices are being dropped.
In a mixed-version cluster, the older node may not have this table,
causing the Raft schema applier to fail with 'Can't find a column
family with UUID ...'.
This patch fixes the bug by emitting the mutation only when indices are actually
dropped (i.e., when the view building cleanup code path was entered).
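The guard described above can be sketched as a toy model in Python. The class and function names here are hypothetical stand-ins rather than the actual C++ code; the only point illustrated is that a mutation is emitted when the builder has something to delete, and omitted otherwise.

```python
from dataclasses import dataclass, field


@dataclass
class TaskMutationBuilderSketch:
    """Hypothetical stand-in for view_building_task_mutation_builder."""
    deletions: list = field(default_factory=list)

    def del_task(self, task_id: str) -> None:
        self.deletions.append(task_id)

    def empty(self) -> bool:
        # True when nothing was recorded, i.e. no index was dropped.
        return not self.deletions

    def build(self) -> dict:
        return {"table": "system.view_building_tasks",
                "deleted": list(self.deletions)}


def make_update_indices_mutations_sketch(dropped_index_tasks: list) -> list:
    """Return schema-change mutations, adding the tasks mutation only if needed."""
    builder = TaskMutationBuilderSketch()
    for task_id in dropped_index_tasks:
        builder.del_task(task_id)
    mutations = []
    if not builder.empty():
        # Older nodes without the table never see a mutation for it
        # unless an index was actually dropped.
        mutations.append(builder.build())
    return mutations
```

With no dropped indices the returned list is empty, so a node that lacks the table has nothing to apply.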
Fixes: SCYLLADB-2026
Refs: scylladb#26557
Change add_tablet_info() to accept locator::tablet_routing_info instead
of destructured (tablet_replica_set, token_range) pair. This simplifies
all three call sites.
Remove the empty-replicas guard inside add_tablet_info(): the only
producer of tablet_routing_info is tablet ERM's check_locality(), which
returns either nullopt (correctly routed) or info with replicas copied
from tablet_info — a tablet always has replicas. All callers already
check for nullopt before calling add_tablet_info(), so by the time we
enter the function replicas are guaranteed non-empty.
`system.view_building_tasks` is a single partition table, so it makes more sense to use a mutation builder and generate 1 mutation per group0 command instead of generating multiple mutations.
This PR removes all `make_..._mutation()` system keyspace functions related to view building tasks and replaces them with mutation builder.
Refs https://github.com/scylladb/scylladb/issues/25929
This patch doesn't fix any bug; it only reduces the number of generated mutations, so there is no need to backport it.
Closes scylladb/scylladb#26557
* github.com:scylladb/scylladb:
db/system_keyspace: replace `make_remove_view_building_task_mutation()` with mutation builder
db/view/view_building_task_mutation_builder: make uuid generator optional
db/system_keyspace: replace `make_view_building_task_mutation()` with mutation builder
db/view/view_building_task_mutation_builder: add helper method
After scylladb/scylladb#28929, `task_uuid_generator` became a necessary
dependency of `view_building_task_mutation_builder`.
However, to create the generator we need `view_building_state`, which in
some parts of the code (schema_tables.cc, migration_manager.cc) requires
a remote proxy to be obtained.
But sometimes we need the mutation builder just to remove some view
building task. In those cases, we don't need the uuid generator, and the
remote proxy requirement is unnecessary.
`system.view_building_tasks` is a single partition table, so it makes
more sense to use a mutation builder and generate 1 mutation per group0
command instead of generating multiple mutations.
The mechanics of the restore are as follows:
- A /storage_service/tablets/restore API is called with (keyspace, table, endpoint, bucket, manifests) parameters
- First, it populates the system_distributed.snapshot_sstables table with the data read from the manifests
- Then it emplaces a bunch of tablet transitions (of a new "restore" kind), one for each tablet
- The topology coordinator handles the "restore" transition by calling a new RESTORE_TABLET RPC against all the current tablet replicas
- Each replica handles the RPC verb by
  - Reading the snapshot_sstables table
  - Filtering the read sstable infos against current node and tablet being handled
  - Downloading and attaching the filtered sstables
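
The replica-side filtering step can be sketched in Python as follows. This is a rough model under assumptions that the description above does not confirm: that "current node" is matched by datacenter and rack, and that an sstable belongs to a tablet when its token span overlaps the tablet's inclusive token range. The type and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotSSTableSketch:
    """Hypothetical subset of a snapshot_sstables row."""
    datacenter: str
    rack: str
    first_token: int
    last_token: int
    toc_name: str


def sstables_for_tablet(entries, dc, rack, tablet_first, tablet_last):
    """Keep entries for this node's dc/rack whose token span overlaps
    the inclusive tablet range [tablet_first, tablet_last]."""
    return [
        e for e in entries
        if e.datacenter == dc
        and e.rack == rack
        and e.first_token <= tablet_last
        and e.last_token >= tablet_first
    ]


entries = [
    SnapshotSSTableSketch("dc1", "r1", -100, -10, "a-TOC.txt"),
    SnapshotSSTableSketch("dc1", "r1", 0, 50, "b-TOC.txt"),
    SnapshotSSTableSketch("dc1", "r2", 0, 50, "c-TOC.txt"),
]
# Only the sstable on dc1/r1 that overlaps tokens [0, 100] is selected.
selected = sstables_for_tablet(entries, "dc1", "r1", 0, 100)
```

The selected entries are the ones a replica would then download and attach.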
This PR includes system_distributed.snapshot_sstables table from @robertbindar and preparation work from @kreuzerkrieg that extracts raw sstables downloading and attaching from existing generic sstables loading code.
This is the first step towards SCYLLADB-197 and lacks many things. In particular:
- the API only works for single-DC cluster
- the caller needs to "lock" tablet boundaries with min/max tablet count
- not abortable
- no progress tracking
- sub-optimal (re-kicking API on restore will re-download everything again)
- not re-attachable (if the API node dies, restoration proceeds, but the caller cannot "wait" for it to complete via another node)
- nodes download sstables in the maintenance/streaming scheduling group (should be moved to maintenance/backup)
Other follow-up items:
- have an actual swagger object specification for `backup_location`
Closes #28436
Closes #28657
Closes #28773
Closes scylladb/scylladb#28763
* github.com:scylladb/scylladb:
docs: Update topology_over_raft.md with `restore` transition kind
test: Add test for backup vs migration race
test: Restore resilience test
sstables_loader: Fail tablet-restore task if not all sstables were downloaded
sstables_loader: mark sstables as downloaded after attaching
sstables_loader: return shared_sstable from attach_sstable
db: add update_sstable_download_status method
db: add downloaded column to snapshot_sstables
db: extract snapshot_sstables TTL into class constant
test: Add a test for tablet-aware restore
tablets: Implement tablet-aware cluster-wide restore
messaging: Add RESTORE_TABLET RPC verb
sstables_loader: Add method to download and attach sstables for a tablet
tablets: Add restore_config to tablet_transition_info
sstables_loader: Add restore_tablets task skeleton
test: Add rest_client helper to kick newly introduced API endpoint
api: Add /storage_service/tablets/restore endpoint skeleton
sstables_loader: Add keyspace and table arguments to manfiest loading helper
sstables_loader_helpers: just reformat the code
sstables_loader_helpers: generalize argument and variable names
sstables_loader_helpers: generalize get_sstables_for_tablet
sstables_loader_helpers: add token getters for tablet filtering
sstables_loader_helpers: remove underscores from struct members
sstables_loader: move download_sstable and get_sstables_for_tablet
sstables_loader: extract single-tablet SST filtering
sstables_loader: make download_sstable static
sstables_loader: fix formating of the new `download_sstable` function
sstables_loader: extract single SST download into a function
sstables_loader: add shard_id to minimal_sst_info
sstables_loader: add function for parsing backup manifests
split utility functions for creating test data from database_test
export make_storage_options_config from lib/test_services
rjson: Add helpers for conversions to dht::token and sstable_id
Add system_distributed_keyspace.snapshot_sstables
add get_system_distributed_keyspace to cql_test_env
code: Add system_distributed_keyspace dependency to sstables_loader
storage_service: Export export handle_raft_rpc() helper
storage_service: Export do_tablet_operation()
storage_service: Split transit_tablet() into two
tablets: Add braces around tablet_transition_kind::repair switch
`system.view_building_tasks` is a single-partition Raft group0 table (PK = `"view_building"`, CK = timeuuid). When `clean_finished_tasks()` deletes hundreds of finished tasks, the physical rows remain in SSTables until compaction. Any subsequent read of the partition counts every column of every tombstoned row as a dead cell, triggering `tombstone_warn_threshold` warnings in large clusters.
Two-part fix:
**1. Range tombstones instead of row tombstones (commits 2–3)**
Instead of one row tombstone per finished task, find the minimum alive task UUID (`min_alive_uuid`) and emit a single range tombstone `[before_all, min_alive_uuid)` covering all tasks below that boundary. This reduces the tombstone count significantly and also benefits future compaction.
**2. Bounded scan with `min_task_id` (commits 4–6)**
Even with range tombstones, physical rows remain until compaction and still count as dead cells during reads. The only way to avoid them is to not read them at all.
- Add a `min_task_id timeuuid` static column to `system.view_building_tasks`.
- On every GC, write `min_task_id = min_alive_uuid` atomically with the range tombstone (same Raft batch).
- On reload, read `min_task_id` first using a **static-only partition slice** (empty `_row_ranges` + `always_return_static_content`): the SSTable reader stops immediately after the static row before processing any clustering tombstones — zero dead cells counted.
- Use `AND id >= min_task_id` as a lower bound for the main task scan, skipping all tombstoned rows.
The static-only read and the bounded scan are gated on the `VIEW_BUILDING_TASKS_MIN_TASK_ID` cluster feature so mixed-version clusters fall back to the full scan.
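
A toy Python model of the two parts may help. It uses plain integers in place of timeuuids and a dict in place of the partition, and it returns nothing to delete when no task is alive, which is a simplification this description does not cover. All names are illustrative only.

```python
def gc_finished_tasks(tasks):
    """tasks maps task id -> finished flag.

    Returns (min_alive_id, ids_covered), where ids_covered are the rows
    a single range tombstone [before_all, min_alive_id) would cover.
    """
    alive = [task_id for task_id, finished in tasks.items() if not finished]
    if not alive:
        # Simplification of this sketch: no boundary when nothing is alive.
        return None, []
    min_alive = min(alive)
    covered = sorted(task_id for task_id in tasks if task_id < min_alive)
    return min_alive, covered


def bounded_scan(physical_rows, min_task_id):
    """Rows a reader visits when it starts at id >= min_task_id."""
    return sorted(
        task_id for task_id in physical_rows
        if min_task_id is None or task_id >= min_task_id
    )


tasks = {1: True, 2: True, 3: False, 4: True, 5: False}
min_alive, covered = gc_finished_tasks(tasks)
visited = bounded_scan(tasks, min_alive)
```

Here one range tombstone replaces two row tombstones, and the bounded scan never visits the rows below the boundary.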
The issue is not critical, so the fix shouldn't be backported.
Fixes SCYLLADB-657
Closes scylladb/scylladb#28929
* github.com:scylladb/scylladb:
test/cluster/test_view_building_coordinator: add reproducer for tombstone threshold warning
docs: document tombstone avoidance in view_building_tasks
view_building: add `task_uuid_generator` to `view_building_task_mutation_builder`
view_building: introduce `task_uuid_generator`
view_building: store `min_alive_uuid` in view building state
view_building: set min_task_id when GC-ing finished tasks
view_building: add min_task_id support to view_building_task_mutation_builder
view_building: add min_task_id static column and bounded scan to system_keyspace
view_building: use range tombstone when GC-ing finished tasks
view_building: add range tombstone support to view_building_task_mutation_builder
view_building: introduce VIEW_BUILDING_TASKS_MIN_TASK_ID cluster feature
When a malformed sstable error occurs, it is usually caused by actual sstable corruption — a cosmic ray, a bad disk write, etc. However, it can also be caused by memory corruption, where a data structure in memory happens to be read as sstable data. In the latter case, having a coredump of the process at the moment of the error is invaluable for post-mortem debugging, since the exception throwing/catching machinery destroys the stack frames that would point to the corruption site.
This patch series introduces `--abort-on-malformed-sstable-error`, a new command-line option (with `LiveUpdate` support) that, when set, causes the server to call `std::abort()` instead of throwing an exception whenever any sstable parse error is detected. This covers all code paths:
- Direct `throw malformed_sstable_exception(...)` sites (migrated to `throw_malformed_sstable_exception()`)
- Direct `throw bufsize_mismatch_exception(...)` sites (migrated to `throw_bufsize_mismatch_exception()`)
- `parse_assert()` failures (via `on_parse_error()`)
- BTI parse errors (via `on_bti_parse_error()`)
The implementation places the flag and helper functions in `sstables/sstables.cc`, next to the existing `on_parse_error()` / `on_bti_parse_error()` infrastructure.
The flag defaults to `false`, preserving current behaviour. It is intended to be enabled temporarily when investigating suspected memory corruption.
**Commit breakdown:**
1. Infrastructure: flag, getter/setter, and throw helpers in `sstables/sstables.cc`; config option wired up in `main.cc`
2. `on_parse_error()` and `on_bti_parse_error()` check the new flag
3. All ~50 `throw malformed_sstable_exception(...)` sites migrated
4. Both `throw bufsize_mismatch_exception(...)` sites migrated
Refs: SCYLLADB-1087
Backport: new feature, no backport
Closes scylladb/scylladb#29324
* github.com:scylladb/scylladb:
sstables: migrate all bufsize_mismatch_exception throw sites to throw_bufsize_mismatch_exception()
sstables: migrate all malformed_sstable_exception throw sites to throw_malformed_sstable_exception()
sstables: make on_parse_error() and on_bti_parse_error() respect --abort-on-malformed-sstable-error
sstables: disable abort-on-malformed-sstable-error in tests that corrupt sstables on purpose
sstables: introduce --abort-on-malformed-sstable-error infrastructure
sstables: refactor parse_path() to return std::expected<> instead of throwing
Add a method to update the downloaded status of a specific SSTable
entry in system_distributed.snapshot_sstables. This will be used
by the tablet restore process to mark SSTables as downloaded after
they have been successfully attached to the local table.
Add a 'downloaded' boolean column to the snapshot_sstables table
schema and the corresponding field to the snapshot_sstable_entry
struct. Update insert_snapshot_sstable() and get_snapshot_sstables()
to write and read this column.
This column will be used to track which SSTables have been
successfully downloaded during a tablet restore operation.
Co-authored-by: Pavel Emelyanov <xemul@scylladb.com>
Move the TTL value used for snapshot_sstables rows from a local
variable in insert_snapshot_sstable() to a class-level constant
SNAPSHOT_SSTABLES_TTL_SECONDS, making it reusable by other methods.
This patch adds the snapshot_sstables table with the following
schema:
```cql
CREATE TABLE system_distributed.snapshot_sstables (
snapshot_name text,
keyspace text, table text,
datacenter text, rack text,
id uuid,
first_token bigint, last_token bigint,
toc_name text, prefix text,
PRIMARY KEY ((snapshot_name, keyspace, table, datacenter, rack), first_token, id));
```
The table will be populated by the coordinator node during the restore
phase (and later on during the backup phase to accommodate live-restore).
The content of this table is meant to be consumed by the restore worker nodes,
which will use this data to filter sstables and download them as files.
Fixes SCYLLADB-263
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Previously only large_partition_exceeding_threshold was exposed as a
metric. Add three new counters to large_data_handler::stats and register
corresponding Prometheus metrics:
- large_rows_exceeding_threshold
- large_cell_exceeding_threshold
- large_collection_exceeding_threshold
The counters are incremented in maybe_record_large_rows() and
maybe_record_large_cells() following the same pattern used by the
existing partition metric.
Add a node_owner column (locator::host_id) to system.sstables and make it part of the partition key, so the primary key becomes PRIMARY KEY ((table_id, node_owner), generation).
This is the first step toward moving the sstables registry into system_distributed: once distributed, each node's startup scan must read only the rows it owns, which requires the owning node to be part of the partition key. Partitioning by (table_id, node_owner) turns that scan into a single-partition read of exactly the local node's rows.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1562
No need to backport this; keyspace over object storage is an experimental feature.
Closes scylladb/scylladb#29659
* github.com:scylladb/scylladb:
db, sstables: add node_owner to sstables registry primary key
db, sstables: rename sstables registry column owner to table_id
Add the --abort-on-malformed-sstable-error command-line option and the
supporting infrastructure. When set, any malformed sstable error will
abort the process and generate a coredump instead of throwing an
exception. This is useful for debugging memory corruption that may
manifest as apparent sstable corruption.
The implementation introduces:
- throw_malformed_sstable_exception() and throw_bufsize_mismatch_exception()
helper functions in sstables/sstables.cc, which check the new flag and
either abort (with logging) or throw the appropriate exception.
- set_abort_on_malformed_sstable_error() / abort_on_malformed_sstable_error()
to control the per-process atomic flag.
- abort_on_malformed_sstable_error config option (LiveUpdate, default false)
wired up in main.cc alongside abort_on_internal_error.
Call-site migration will follow in subsequent commits.
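
The control flow of the helpers can be sketched in Python. This is only an analogy for the C++ code: the exception class is hypothetical, and `threading.Event` stands in for the per-process atomic flag.

```python
import os
import sys
import threading


class MalformedSSTableError(Exception):
    """Hypothetical stand-in for malformed_sstable_exception."""


# Stand-in for the per-process atomic flag; cleared (false) by default.
_abort_flag = threading.Event()


def set_abort_on_malformed_sstable_error(value: bool) -> None:
    if value:
        _abort_flag.set()
    else:
        _abort_flag.clear()


def abort_on_malformed_sstable_error() -> bool:
    return _abort_flag.is_set()


def throw_malformed_sstable_exception(message: str):
    """Abort (producing a core dump where enabled) or raise, per the flag."""
    if abort_on_malformed_sstable_error():
        # Log first, then abort before any unwinding destroys the stack.
        print(f"malformed sstable: {message}", file=sys.stderr)
        os.abort()
    raise MalformedSSTableError(message)
```

Routing every throw site through one helper is what lets a single flag change the behaviour of all of them.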
make_entry_descriptor() and the two overloads of parse_path() used to signal
parse failures by throwing malformed_sstable_exception, which made parse_path()
expensive to use as a probe (e.g. to classify directory entries).
Change make_entry_descriptor() and both parse_path() overloads to return
std::expected<T, sstring>, where the sstring carries the error message on
failure, eliminating the exception overhead at probe call sites.
Call sites that previously caught malformed_sstable_exception to treat the
path as a non-SSTable file (utils/directories.cc, db/snapshot/backup_task.cc,
tools/scylla-sstable.cc) now check the expected result directly.
Call sites where a parse failure is a genuine error (sstable_directory.cc,
sstables.cc, tools/schema_loader.cc, tools/scylla-sstable.cc) re-throw
explicitly as malformed_sstable_exception using the error string, preserving
the existing error propagation behaviour.
The view_flow_control_delay_limit_in_ms option is used in two places: proxy and view-update-generator both need it to compute the calculate_view_update_throttling_delay() value. This PR moves the option onto the node_update_backlog top-level service, makes the calculating helper a method of that class, and patches the callers to use it. This eliminates more places that abuse database as a db::config accessor.
Code dependencies refactoring, not backporting
Closes scylladb/scylladb#29635
* github.com:scylladb/scylladb:
view: Turn calculate_view_update_throttling_delay into node_update_backlog member
view: Place view_flow_control_delay_limit_in_ms on node_update_backlog
view: Add node_update_backlog reference to view_update_generator
Fix 28 format string bugs plus 5 related format argument bugs across 14 modules
where `{}` placeholders were missing or arguments were wrong, causing arguments to
be silently dropped or misleading output from the `{fmt}` library.
Inspired by https://github.com/scylladb/scylladb/pull/29143 (which fixed a single
instance in `replica/table.cc`), a comprehensive audit of the entire codebase was
performed to find all similar issues.
- **Missing `{}` placeholder** (21 instances): format string simply lacks `{}` for a
passed argument, e.g. `format("msg for table {}", group_id, table_id)` -- `group_id`
is silently dropped
- **Spurious comma breaking C++ string literal concatenation** (2 instances): a comma
after a string literal prevents adjacent-literal concatenation, turning the
continuation into a format argument instead of part of the format string
- **Printf-style `%s` in fmtlib context** (4 instances): `%s` has no meaning in fmtlib
and appears as literal text while the argument is silently ignored
- **Extra spurious argument** (1 instance): an extraneous `t.tomb()` argument inserted
between correct arguments, causing wrong values in the wrong slots
- **Wrong variable in error message** (4 instances in `types/map.hh`): error messages
for oversized map keys/values reported `map_size` (total entry count) instead of the
actual `elem.first.size()` or `elem.second.size()` that exceeded the limit
- **Swapped argument order** (1 instance in `data_dictionary/data_dictionary.cc`):
format string says `"Extraneous options for {type}: {values}"` but the values and
type arguments were passed in reverse order
| Module | Bugs Fixed | Files |
|--------|:---------:|-------|
| `replica/` | 1 | `table.cc` |
| `service/` | 4 | `raft_group0.cc`, `storage_service.cc` |
| `db/` | 6 | `heat_load_balance.cc`, `commitlog_replayer.cc`, `view_update_generator.cc`, `view_building_worker.cc`, `row_locking.cc` |
| `cql3/` | 2 | `prepare_expr.cc`, `statement_restrictions.cc` |
| `transport/` | 4 | `event_notifier.cc` |
| `sstables/` | 3 | `partition_reversing_data_source.cc`, `reader.cc` |
| `alternator/` | 1 | `conditions.cc` |
| `cdc/` | 1 | `split.cc` |
| `raft/` | 1 | `server.cc` |
| `utils/` | 2 | `gcp/object_storage.cc`, `s3/client.cc` |
| `mutation/` | 1 | `mutation_partition.hh` |
| `ent/` | 2 | `kmip_host.cc`, `kms_host.cc` |
| `types/` | 4 | `map.hh` |
| `data_dictionary/` | 1 | `data_dictionary.cc` |
The `{fmt}` library's compile-time checker validates that each `{}` placeholder
references a valid argument, but does **not** verify the reverse -- that every
argument has a corresponding placeholder. Extra arguments are silently ignored
at both compile time and runtime.
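
The same class of bug can be reproduced in Python, whose `str.format` also ignores unused positional arguments. The sketch below is an analogy rather than `{fmt}` itself, and the `dropped_arguments` helper is a hypothetical lint that only handles simple, non-nested placeholders.

```python
from string import Formatter


def count_placeholders(fmt: str) -> int:
    """Count replacement fields in a format string (escaped braces excluded)."""
    return sum(
        1 for _, field_name, _, _ in Formatter().parse(fmt)
        if field_name is not None
    )


def dropped_arguments(fmt: str, *args) -> int:
    """Number of positional arguments that have no placeholder to land in."""
    return max(0, len(args) - count_placeholders(fmt))


# The second argument is silently dropped, exactly as in the audited bugs.
rendered = "msg for table {}".format("group-1", "table-1")

# A reverse check flags the mismatch that the formatter itself accepts.
missing = dropped_arguments("msg for table {}", "group-1", "table-1")
```

A check of this kind, run over call sites, is one way to catch arguments that outnumber placeholders.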
Build verified with `dbuild ninja build/dev/scylla` -- compiles cleanly.
---
**Note:** Commits were amended to fix the author name from "Yaniv Michael Kaul" to "Yaniv Kaul".
Closes scylladb/scylladb#29448
* github.com:scylladb/scylladb:
data_dictionary: fix swapped arguments in extraneous options error
types: fix wrong variable in map key/value size error messages
ent: fix missing format placeholders in encryption error/log messages
mutation: fix spurious argument in shadowable_tombstone formatter
utils: fix missing format placeholders in object storage log messages
raft: fix missing format placeholder in server ostream operator
cdc: fix missing format placeholder in error message
alternator: fix missing format placeholder in error message
sstables: fix missing format placeholders in error messages
transport: fix printf-style format specifiers in fmtlib log calls
cql3: fix missing format placeholders in error messages
db: fix missing format placeholders in log and error messages
service: fix missing format placeholders in log messages
replica: fix missing format placeholder in cleanup log message
As a final step for https://scylladb.atlassian.net/browse/SCYLLADB-461 we need to graduate Alternator Streams from experimental.
So let's remove `--experimental-features=alternator-streams` and map the obsolete config string to `UNUSED` for backward compatibility. Also, remove the related gating of the feature.
Finally, stop providing the config flag in test configs.
Fixes SCYLLADB-1680
Fixes #16367
The documentation work tracked by https://scylladb.atlassian.net/browse/SCYLLADB-462 still remains.
This PR needs to hit 2026.2, so (only) if it branches before the PR is merged to `master`, we'd need to backport.
Closes scylladb/scylladb#29604
* github.com:scylladb/scylladb:
test: Stop providing alternator-streams experimental flag
alternator: Graduate Alternator Streams from experimental
Fix six format string bugs where arguments were silently dropped:
- heat_load_balance.cc: pp value was passed but had no {} placeholder.
- commitlog_replayer.cc: column_family_id was passed but table= had
no {} placeholder.
- view_update_generator.cc: _sstables_with_tables.size() was passed
but had no {} placeholder.
- view_building_worker.cc: exception pointer was passed but the
trailing colon had no {} placeholder.
- row_locking.cc: partition key and clustering key were passed in
error messages but had no {} placeholders.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Drop the creation code for the `service_levels` and `cdc_generation_descriptions_v2` tables since they are no longer needed. Old clusters will still have these tables because they were created earlier. The series also contains a small improvement around group0 creation.
No backport needed since this removes functionality.
Closes scylladb/scylladb#29482
* github.com:scylladb/scylladb:
db/system_distributed_keyspace: remove system_distributed_everywhere since it is unused
db/system_distributed_keyspace: drop CDC_TOPOLOGY_DESCRIPTION and CDC_GENERATIONS_V2
db/system_distributed_keyspace: remove unused code
db/system_distributed_keyspace: drop old cdc_generation_descriptions_v2 table
db/system_distributed_keyspace: drop old service_levels table
fix indent after the previous patch
group0: call setup_group0 only when needed
Refs: SCYLLADB-1757
Refs: SCYLLADB-1815
If we're in a brand new chunk (no buffer yet allocated), we would miscalculate the
actual size of an entry to write, possibly causing segment size overshoot.
Break out some logic to share between this calc and new_buffer. Also remove redundant
(and possibly wrong) constant in oversized allocation.
Add a node_owner column (locator::host_id) to system.sstables and
make it part of the partition key, so the primary key becomes
PRIMARY KEY ((table_id, node_owner), generation).
This is the first step toward moving the sstables registry into
system_distributed: once distributed, each node's startup scan
must read only the rows it owns, which requires the owning node
to be part of the partition key. Partitioning by (table_id,
node_owner) turns that scan into a single-partition read of
exactly the local node's rows.
The new column is populated via sstables_manager::get_local_host_id().
No backward compatibility is preserved; the feature is experimental
and gated by keyspace-storage-options.
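
The effect of the new partition key can be modelled in Python with a dict keyed by (table_id, node_owner). This is a toy illustration with hypothetical names, not the registry implementation.

```python
from collections import defaultdict


class SSTablesRegistrySketch:
    """Toy model: partition key (table_id, node_owner), clustering key generation."""

    def __init__(self):
        self._partitions = defaultdict(dict)

    def add(self, table_id, node_owner, generation, row):
        self._partitions[(table_id, node_owner)][generation] = row

    def local_sstables(self, table_id, local_host_id):
        # A single-partition read: only rows owned by the local node.
        return dict(self._partitions.get((table_id, local_host_id), {}))


registry = SSTablesRegistrySketch()
registry.add("t1", "host-a", 1, "sst-1")
registry.add("t1", "host-a", 2, "sst-2")
registry.add("t1", "host-b", 1, "sst-3")
local = registry.local_sstables("t1", "host-a")
```

The startup scan for host-a touches one partition and never sees host-b's rows.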
The partition-key column in system.sstables named 'owner' actually
holds a table_id. Rename the CQL column and the matching C++
parameter and member names so the identifier describes what it
stores. No behavior change.
This prepares the schema for an upcoming node_owner partition-key
column (the local host id), which needs a free name.
The free function calculate_view_update_throttling_delay() took the
view_flow_control_delay_limit_in_ms as a parameter, which forced its
two callers (storage_proxy and view_update_generator) to fish the
option out of db::config via database::get_config(). Now that the
option lives on node_update_backlog, make the throttling calculation a
member of node_update_backlog and have the callers invoke it on their
node_update_backlog reference.
This removes two database::get_config() call sites.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Store the view_flow_control_delay_limit_in_ms config option as an
updateable_value on node_update_backlog. The value is threaded from
main.cc into the backlog object at construction time. Existing call
sites (tests) that construct node_update_backlog without the option
continue to work via a default argument.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pass node_update_backlog explicitly to view_update_generator via its
constructor and start() call. This is plumbing only; no behavior change.
A subsequent patch will use this reference to compute view update
throttling delays without going through database::get_config().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Alternator Streams were experimental until 2026.2, when they became GA.
Stop requiring `--experimental-features=alternator-streams` by:
- Removing ALTERNATOR_STREAMS from the experimental feature enum
- Mapping "alternator-streams" to UNUSED for backward compatibility
- Removing the gating that disabled the ALTERNATOR_STREAMS gossip
feature when the experimental flag was absent
- Removing the runtime guard that rejected StreamSpecification requests
without the feature flag
- Updating config_test to reflect the new UNUSED mapping
The gms::feature alternator_streams is kept for rolling upgrade
compatibility with older nodes.
Fixes SCYLLADB-1680
When tombstone_gc=repair, the repaired compaction view's sstable_set_for_tombstone_gc()
previously returned all sstables across all three views (unrepaired, repairing, repaired).
This is correct but unnecessarily expensive: the unrepaired and repairing sets are never
the source of a GC-blocking shadow when tombstone_gc=repair, for base tables.
The key ordering guarantee that makes this safe is:
- topology_coordinator sends send_tablet_repair RPC and waits for it to complete.
Inside that RPC, mark_sstable_as_repaired() runs on all replicas, moving D from
repairing → repaired (repaired_at stamped on disk).
- Only after the RPC returns does the coordinator commit repair_time + sstables_repaired_at
to Raft.
- gc_before = repair_time - propagation_delay only advances once that Raft commit applies.
Therefore, when a tombstone T in the repaired set first becomes GC-eligible (its
deletion_time < gc_before), any data D it shadows is already in the repaired set on
every replica. This holds because:
- The memtable is flushed before the repairing snapshot is taken (take_storage_snapshot
calls sg->flush()), capturing all data present at repair time.
- Hints and batchlog are flushed before the snapshot, ensuring remotely-hinted writes
arrive before the snapshot boundary.
- Legitimate unrepaired data has timestamps close to 'now', always newer than any
GC-eligible tombstone (USING TIMESTAMP to write backdated data is user error / UB).
Excluding the repairing and unrepaired sets from the GC shadow check cannot cause any
tombstone to be wrongly collected. The memtable check is also skipped for the same
reason: memtable data is either newer than the GC-eligible tombstone, or was flushed
into the repairing/repaired set before gc_before advanced.
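
The eligibility rule used in this argument can be written down as a small Python sketch. Times are plain integers in arbitrary units, and the function names are illustrative only.

```python
def gc_before(repair_time: int, propagation_delay: int) -> int:
    """Boundary below which tombstones may be collected."""
    return repair_time - propagation_delay


def is_gc_eligible(deletion_time: int, repair_time: int,
                   propagation_delay: int) -> bool:
    """A tombstone is GC-eligible once its deletion_time < gc_before."""
    return deletion_time < gc_before(repair_time, propagation_delay)


# repair_time only advances after the Raft commit, which follows the
# repair RPC, so eligibility cannot precede the move to the repaired set.
before_commit = is_gc_eligible(deletion_time=100, repair_time=90,
                               propagation_delay=10)
after_commit = is_gc_eligible(deletion_time=100, repair_time=200,
                              propagation_delay=10)
```

The tombstone becomes collectable only once the committed repair_time has moved past it by more than the propagation delay.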
Safety restriction — materialized views:
The optimization IS applied to materialized view tables. Two possible paths could inject
D_view into the MV's unrepaired set after MV repair: view hints and staging via the
view-update-generator. Both are safe:
(1) View hints: flush_hints() creates a sync point covering BOTH _hints_manager (base
mutations) AND _hints_for_views_manager (view mutations). It waits until ALL pending view
hints — including D_view entries queued in _hints_for_views_manager while the target MV
replica was down — have been replayed to the target node before take_storage_snapshot() is
called. D_view therefore lands in the MV's repairing sstable and is promoted to repaired.
When a repaired compaction then checks for shadows it finds D_view in the repaired set,
keeping T_mv non-purgeable.
(2) View-update-generator staging path: Base table repair can write a missing D_base to a
replica via a staging sstable. The view-update-generator processes the staging sstable
ASYNCHRONOUSLY: it may fire arbitrarily later, even after MV repair has committed
repair_time and T_mv has been GC'd from the repaired set. However, the staging processor
calls stream_view_replica_updates() which performs a READ-BEFORE-WRITE via
as_mutation_source_excluding_staging(): it reads the CURRENT base table state before
building the view update. If T_base was written to the base table (as it always is before
the base replica can be repaired and the MV tombstone can become GC-eligible), the
view_update_builder sees T_base as the existing partition tombstone. D_base's row marker
(ts_d < ts_t) is expired by T_base, so the view update is a no-op: D_view is never
dispatched to the MV replica. No resurrection can occur regardless of how long staging is
delayed.
A potential sub-edge-case is T_base being purged BEFORE staging fires (leaving D_base as
the sole survivor, so stream_view_replica_updates would dispatch D_view). This is blocked
by an additional invariant: for tablet-based tables, the repair writer stamps repaired_at
on staging sstables (repair_writer_impl::create_writer sets mark_as_repaired = true and
perform_component_rewrite writes repaired_at = sstables_repaired_at + 1 on every staging
sstable). After base repair commits sstables_repaired_at to Raft, the staging sstable
satisfies is_repaired(sstables_repaired_at, staging_sst) and therefore appears in
make_repaired_sstable_set(). Any subsequent base repair that advances sstables_repaired_at
further still includes the staging sstable (its repaired_at ≤ new sstables_repaired_at).
D_base in the staging sstable thus shadows T_base in every repaired compaction's shadow
check, keeping T_base non-purgeable as long as D_base remains in staging.
A base table hint also cannot bypass this. A base hint is replayed as a base mutation. The
resulting view update is generated synchronously on the base replica and sent to the MV
replica via _hints_for_views_manager (path 1 above), not via staging.
USING TIMESTAMP with timestamps predating (gc_before + propagation_delay) is explicitly
UB and excluded from the safety argument.
For tombstone_gc modes other than repair (timeout, immediate, disabled) the invariant
does not hold for base tables either, so the full storage-group set is returned.
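A minimal sketch of the set-selection rule described above, under the assumption that the three sstable sets can be modelled as plain lists. `TombstoneGcMode` and `shadow_check_sets` are hypothetical names, not identifiers from the patch.

```python
from enum import Enum


class TombstoneGcMode(Enum):
    REPAIR = "repair"
    TIMEOUT = "timeout"
    IMMEDIATE = "immediate"
    DISABLED = "disabled"


def shadow_check_sets(mode, repaired, repairing, unrepaired):
    """Return the sstables consulted by the GC shadow check (illustrative model)."""
    if mode is TombstoneGcMode.REPAIR:
        # Only the repaired set can hold data shadowed by a GC-eligible tombstone.
        return list(repaired)
    # For every other mode the invariant does not hold: use the full set.
    return [*repaired, *repairing, *unrepaired]
```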
The expected gain is reduced bloom filter and memtable key-lookup I/O during repaired
compactions: the unrepaired set is typically the largest (it holds all recent writes),
yet for tombstone_gc=repair it never influences GC decisions.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-231.
Closes scylladb/scylladb#29310
* github.com:scylladb/scylladb:
compaction: Restrict tombstone GC sstable set to repaired sstables for tombstone_gc=repair mode
test/repair: Add tombstone GC safety tests for incremental repair
With the new `min_alive_uuid` saved in the group0 table,
we need to make sure that all new tasks are created with a time uuid
greater than the value saved in `min_alive_uuid`.
This patch introduces the `task_uuid_generator`, which ensures that
when we generate multiple tasks in one group0 command, each task
has a unique time uuid and each time uuid is greater than
`min_alive_uuid`.
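The guarantee described above can be modelled with a small sketch. This is not the actual `task_uuid_generator`: the class name `TaskIdGenerator`, its methods, and the use of plain integers in place of timeuuids are all assumptions for illustration.

```python
class TaskIdGenerator:
    """Hypothetical model: integers stand in for timeuuid timestamps."""

    def __init__(self, min_alive_id: int):
        # Every id handed out must be strictly greater than this value.
        self._last = min_alive_id

    def next_id(self, now: int) -> int:
        # Use the current time when it is already ahead; otherwise bump the
        # previous id by one so ids stay unique and strictly increasing.
        self._last = max(now, self._last + 1)
        return self._last
```

In this model, several ids requested within one command at the same `now` still come out distinct and above `min_alive_id`.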
Because we now limit the range we read from the view building
tasks table, we need to make sure that new tasks are created with a
uuid larger than `min_alive_uuid`.
To do that, we need to be able to see the current `min_alive_uuid`
while creating new tasks.
When VIEW_BUILDING_TASKS_MIN_TASK_ID feature is active, write min_task_id
alongside the range tombstone in the same Raft batch. min_task_id is set
to min_alive_uuid so subsequent get_view_building_tasks() scans start
exactly at the first alive row, skipping all tombstoned rows.
When all tasks are deleted, min_task_id is set to a freshly generated UUID
to ensure future tasks (which will have larger timeuuids) are not skipped.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add set_min_task_id(id) which writes the min_task_id static cell to the main
"view_building" partition. The static cell is written as part of the same
mutation as the range tombstone, keeping everything in one Raft batch.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a min_task_id timeuuid static column to system.view_building_tasks.
When VIEW_BUILDING_TASKS_MIN_TASK_ID feature is active, get_view_building_tasks()
reads min_task_id first using a static-only partition slice (empty _row_ranges +
always_return_static_content). This makes the SSTable reader stop immediately
after the static row before processing any clustering tombstones, so the read
never triggers tombstone_warn_threshold warnings.
min_task_id is then used as AND id >= ? lower bound for the main task scan,
skipping all tombstoned rows below the boundary.
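To illustrate why the lower bound helps, here is a toy model of the scan. It assumes rows are kept in a list sorted by id and flagged alive or tombstoned; `scan_tasks` is a hypothetical helper, not the real get_view_building_tasks().

```python
from bisect import bisect_left


def scan_tasks(rows, min_task_id=None):
    """rows: list of (task_id, is_alive) sorted by task_id.

    Returns (alive_ids, tombstones_seen) for the scanned range.
    """
    start = 0
    if min_task_id is not None:
        # Equivalent of the "AND id >= ?" lower bound: skip straight to it.
        ids = [task_id for task_id, _ in rows]
        start = bisect_left(ids, min_task_id)
    alive, tombstones_seen = [], 0
    for task_id, is_alive in rows[start:]:
        if is_alive:
            alive.append(task_id)
        else:
            tombstones_seen += 1
    return alive, tombstones_seen
```

With the bound set to the first alive id, the scan in this model returns the same alive tasks while touching only tombstones interleaved above the boundary.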
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Instead of issuing one row tombstone per finished task, collect all tasks
to delete, find the smallest timeuuid among alive tasks (min_alive_uuid),
then emit a single range tombstone [before_all, min_alive_uuid) covering
all tasks below that boundary. Tasks above the boundary (rare: finished
task interleaved with alive tasks) still get individual row tombstones.
When no alive tasks remain, del_all_tasks() covers the entire partition
with a single range tombstone.
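The deletion strategy above can be sketched as follows. This is a simplified model with integers in place of timeuuids; `DeletionPlan` and `plan_deletions` are hypothetical names used only for this example.

```python
from typing import List, NamedTuple, Optional


class DeletionPlan(NamedTuple):
    delete_all: bool             # True when no alive tasks remain
    range_end: Optional[int]     # exclusive end of [before_all, range_end)
    row_tombstones: List[int]    # finished tasks above the boundary


def plan_deletions(finished, alive) -> DeletionPlan:
    if not alive:
        # No alive tasks: one range tombstone covers the whole partition.
        return DeletionPlan(True, None, [])
    min_alive = min(alive)
    # Finished tasks interleaved with alive ones still need row tombstones.
    rows = sorted(t for t in finished if t > min_alive)
    return DeletionPlan(False, min_alive, rows)
```

For example, with finished tasks {1, 2, 5} and alive tasks {3, 7}, the plan is one range tombstone ending at 3 plus a single row tombstone for task 5.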
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add del_tasks_before(id) which emits a range tombstone [before_all, id)
and del_all_tasks() which covers the entire clustering range. These will
be used by the coordinator to delete finished tasks in bulk instead of
issuing one row tombstone per task.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
With this change, you can add or remove one or more DCs in a single ALTER KEYSPACE statement. This requires the keyspace to use a rack list replication factor.
In the existing approach, all tablet replicas are rebuilt at once during an RF change. This is no longer the case. In global_topology_request::keyspace_rf_change the request is added to ongoing_rf_changes, a new column in the system.topology table. The target RF is kept in next_replication, a new column in system_schema.keyspaces.
In make_rf_change_plan, the load balancer schedules the necessary migrations, considering the load of nodes and other pending tablet transitions. Requests from ongoing_rf_changes are processed concurrently, independently of one another. Within each request, racks are processed concurrently. No tablet replica is removed until all required replicas are added. When adding replicas to a rack we always start with base tables and do not proceed with views until the base tables are done (when removing, the order is reversed). The intermediate steps aren't reflected in the schema. When the RF change is finished:
- in system_schema.keyspaces:
  - next_replication is cleared;
  - new keyspace properties are saved;
- the request is removed from ongoing_rf_changes;
- the request is marked as done in system.topology_requests.
Until the request is done, DESCRIBE KEYSPACE shows the replication_v2.
If a request has not yet started removing replicas, it can be aborted using the task manager. In that case system.topology_requests::error is set (but the request isn't marked as done) and next_replication = replication_v2. The load balancer interprets this state and starts rolling the request back. After the rollback is done, we mark the relevant system.topology_requests entry as done (failed), clear the request id from system.topology::ongoing_rf_changes, and remove next_replication.
Fixes: SCYLLADB-567.
No backport needed; new feature.
Closes scylladb/scylladb#24421
* github.com:scylladb/scylladb:
service: fix indentation
docs: update documentation
test: test multi RF changes
service: tasks: allow aborting ongoing RF changes
cql3: allow changing RF by more than one when adding or removing a DC
service: handle multi_rf_change
service: implement make_rf_change_plan
service: add keyspace_rf_change_plan to migration_plan
service: extend tablet_migration_info to handle rebuilds
service: split update_node_load_on_migration
service: rearrange keyspace_rf_change handler
db: add columns to system_schema.keyspaces
db: service: add ongoing_rf_changes to system.topology
gms: add keyspace_multi_rf_change feature
Deduplicate the fragment-consuming logic in
view_update_builder::generate_updates() by extracting it into three
private methods: consume_both_fragments(), consume_update_fragment(),
and consume_existing_fragment().
The three inlined blocks for cmp < 0, cmp > 0, and cmp == 0 were
identical to the trailing "update only" and "existing only" blocks.
The only semantic change is in the trailing "existing only" path: the
outer tombstone guard is replaced by per-branch tombstone checks inside
consume_existing_fragment(), which is both sufficient and more precise
for the static_row case (uses partition tombstone only, not range
tombstone which is irrelevant for static rows).
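The shape of the refactoring can be illustrated with a generic two-stream merge. This is a sketch of the general technique only, not the view_update_builder code: positions are plain integers and the three helpers merely tag which path was taken.

```python
def consume_update(u):
    return ("update", u)


def consume_existing(e):
    return ("existing", e)


def consume_both(u, e):
    return ("both", u)


def merge_streams(updates, existings):
    """Walk two position-sorted lists, dispatching to one of three helpers."""
    out = []
    i = j = 0
    while i < len(updates) and j < len(existings):
        u, e = updates[i], existings[j]
        if u < e:        # cmp < 0: only the update stream has this position
            out.append(consume_update(u))
            i += 1
        elif u > e:      # cmp > 0: only the existing stream has it
            out.append(consume_existing(e))
            j += 1
        else:            # cmp == 0: both streams have it
            out.append(consume_both(u, e))
            i += 1
            j += 1
    # The trailing "update only" and "existing only" paths reuse the helpers.
    out.extend(consume_update(u) for u in updates[i:])
    out.extend(consume_existing(e) for e in existings[j:])
    return out
```

Routing both the in-loop branches and the tails through the same helpers is what removes the duplicated blocks.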
In the existing-only tail block of generate_updates(), the clustering
row and static row were extracted from the fragment using a deep copy
constructor (e.g. clustering_row(*_schema, fragment.as_clustering_row()))
even though the fragment is not used afterwards. Replace with moves,
matching the pattern used in all other cases.
Two of the three clustering-row cases in generate_updates() used
mutate_as_clustering_row() to apply a tombstone to the row in-place,
then immediately moved the row out of the fragment. This triggered an
unnecessary memory usage recalculation in the reader permit, since:
1. apply(tombstone) does not change external memory usage (tombstone
is stored inline, not heap-allocated), so the recalculation will
yield the same result.
2. The fragment is consumed on the very next line, so the tracking
window is effectively zero.
Simplify these two cases to match the first case (cmp < 0), which
already uses the simpler pattern of moving the row out of the fragment
first, then applying the tombstone on the extracted row.
Rename advance_all(), advance_updates() and advance_existings() to
read_both_next_fragments(), read_next_update_fragment() and
read_next_existing_fragment(), respectively. The new names make it
clear that these methods read the next mutation fragment from the
corresponding reader into the cached fragment member.
Also rename on_results() to generate_updates(), which better describes
its role of generating view updates from the previously read fragments.
Rename the members of view_update_builder to reflect their roles
more precisely:
_updates -> _update_reader
_existings -> _existing_reader
_update -> _update_fragment
_existing -> _existing_fragment
This makes the code easier to follow by distinguishing the readers
(which produce a stream of fragments) from the cached fragments
(the most recently read mutation_fragment_v2 from each reader).