When skip_service_levels_v2_initialization is used, write an explicit
v1 service level version marker while skipping v2 initialization. This
lets the restart test exercise self-healing from v1 to v2.
Add self_heal_service_levels_version() and use it during startup when
the node is already on raft topology but service levels are still marked
as v1.
In that stale state, migrate service levels to v2 through group0 instead
of failing startup.
migrate_to_v2() was removed after gossip-based service level migration
support was dropped, since upgraded nodes were expected to already use
service levels v2.
However, clusters affected by the old migration bug may reach raft topology
while system.scylla_local still has a stale service level version. Restore
the migration helper so startup can self-heal those nodes by writing the v2
state through group0.
Update the hardcoded 2025.1.0 binary URL to the latest 2025.1.12
release for upgrade tests.
The 2025.1.12 binary now supports and enforces the
rf_rack_valid_keyspaces option which the test harness enables by
default. Since test_sstable_compression_dictionaries_upgrade creates
a 2-node cluster in a single rack with RF=2, it violates the
constraint. Disable the option explicitly for this test.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Closesscylladb/scylladb#29714
In the test we perform 2 consecutive writes where the first write
is supposed to increase the view update backlog above the mv
admission control threshold and the second one is expected to be
rejected because of that.
On each node/shard we have 2 types of view update backlogs:
1. for deciding whether we should admit writes
2. for propagating the backlog information to other nodes/shards.
For the second write to be rejected, it must be performed on a node
and shard which updated its backlog of type 1.
The view update backlog of type 2. is immediately increased on the
base table replica. For this backlog to be registered as a backlog
of type 1., it needs to be either carried by gossip (happening once
every second) or by attaching it to a replica write response. We
don't want to increase the runtime of tests unnecessarily, so we don't
wait and we rely on the second mechanism. The response to the first
base table write (the one causing increase in the backlog) carries
the increased backlog to the coordinator of this write. So for the
second write to observe the increased backlog, it needs to be coordinated
on the same node+shard as the first write.
We make sure that both writes are coordinated on the same node+shard by
using prepared statements combined with setting the host in `run_async`.
Both writes target the same partition and with prepared statements we
route them directly to the correct shard.
That was the idea, at least. In practice, for the driver to learn the
correct shard, it first needs to learn the token->shard mapping from
the server. For vnodes it can expect a shard by calculating the token
of the affected partition, but for tablets, it had no opportunity to
learn the tablet->shard mapping so the first write may route to any shard.
Additionally, we aren't guaranteed that the driver established connections
to all shards on all nodes at the point of any write. So if a connection
finishes establishing between the two writes, this may also cause us to
coordinate these 2 writes on different shards, leading to a missed view
backlog growth and not-rejected second write.
We fix this in this patch by running the test using one shard on each node.
This way, as long as we perform both writes on the same node, they'll also
be coordinated on the same shard. This also makes the prepared statement and
BoundStatement unnecessary — we can use SimpleStatement with
FallthroughRetryPolicy directly.
Fixes: SCYLLADB-1901
Closesscylladb/scylladb#29862
Add per-shard metrics for strong consistency coordinator operations (latency, timeouts, bounces, status unknown) under the `"strong_consistency_coordinator"` category. These are analogous to the eventual consistency metrics in `storage_proxy_stats`, enabling direct performance comparison between the two consistency modes.
The metrics are simplified compared to `storage_proxy_stats` — no breakdown by table, tablet, scheduling group, or DC, only per-shard.
Fixes SCYLLADB-1343
Strong consistency is still in experimental phase, no need to backport.
Closesscylladb/scylladb#29318
* github.com:scylladb/scylladb:
test/strong_consistency: verify metrics
strong_consistency: wire up metrics to operations
strong_consistency: add stats struct and metrics registration
The mechanics of the restore is like this
- A /storage_service/tablets/restore API is called with (keyspace, table, endpoint, bucket, manifests) parameters
- First, it populates the system_distributed.snapshot_sstables table with the data read from the manifests
- Then it emplaces a bunch of tablet transitions (of a new "restore" kind), one for each tablet
- The topology coordinator handles the "restore" transition by calling a new RESTORE_TABLET RPC against all the current tablet replicas
- Each replica handles the RPC verb by
- Reading the snapshot_sstables table
- Filtering the read sstable infos against current node and tablet being handled
- Downloading and attaching the filtered sstables
This PR includes system_distributed.snapshot_sstables table from @robertbindar and preparation work from @kreuzerkrieg that extracts raw sstables downloading and attaching from existing generic sstables loading code.
This is first step towards SCYLLADB-197 and lacks many things. In particular
- the API only works for single-DC cluster
- the caller needs to "lock" tablet boundaries with min/max tablet count
- not abortable
- no progress tracking
- sub-optimal (re-kicking API on restore will re-download everything again)
- not re-attacheable (if API node dies, restoration proceeds, but the caller cannot "wait" for it to complete via other node)
- nodes download sstables in maintenance/streaming sched gorup (should be moved to maintenance/backup)
Other follow-up items:
- have an actual swagger object specification for `backup_location`
Closes#28436Closes#28657Closes#28773Closesscylladb/scylladb#28763
* github.com:scylladb/scylladb:
docs: Update topology_over_raft.md with `restore` transition kind
test: Add test for backup vs migration race
test: Restore resilience test
sstables_loader: Fail tablet-restore task if not all sstables were downloaded
sstables_loader: mark sstables as downloaded after attaching
sstables_loader: return shared_sstable from attach_sstable
db: add update_sstable_download_status method
db: add downloaded column to snapshot_sstables
db: extract snapshot_sstables TTL into class constant
test: Add a test for tablet-aware restore
tablets: Implement tablet-aware cluster-wide restore
messaging: Add RESTORE_TABLET RPC verb
sstables_loader: Add method to download and attach sstables for a tablet
tablets: Add restore_config to tablet_transition_info
sstables_loader: Add restore_tablets task skeleton
test: Add rest_client helper to kick newly introduced API endpoint
api: Add /storage_service/tablets/restore endpoint skeleton
sstables_loader: Add keyspace and table arguments to manfiest loading helper
sstables_loader_helpers: just reformat the code
sstables_loader_helpers: generalize argument and variable names
sstables_loader_helpers: generalize get_sstables_for_tablet
sstables_loader_helpers: add token getters for tablet filtering
sstables_loader_helpers: remove underscores from struct members
sstables_loader: move download_sstable and get_sstables_for_tablet
sstables_loader: extract single-tablet SST filtering
sstables_loader: make download_sstable static
sstables_loader: fix formating of the new `download_sstable` function
sstables_loader: extract single SST download into a function
sstables_loader: add shard_id to minimal_sst_info
sstables_loader: add function for parsing backup manifests
split utility functions for creating test data from database_test
export make_storage_options_config from lib/test_services
rjson: Add helpers for conversions to dht::token and sstable_id
Add system_distributed_keyspace.snapshot_sstables
add get_system_distributed_keyspace to cql_test_env
code: Add system_distributed_keyspace dependency to sstables_loader
storage_service: Export export handle_raft_rpc() helper
storage_service: Export do_tablet_operation()
storage_service: Split transit_tablet() into two
tablets: Add braces around tablet_transition_kind::repair switch
The container image inherits kernel.core_pattern from the host. When
the host pipes core dumps to a handler (e.g. Ubuntu's apport), that
handler does not exist or work correctly inside the container, so core
dumps are silently lost.
Override any pipe-based core_pattern with a file-based pattern that
writes directly to /var/lib/scylla/coredump/. The override is attempted
both from the entrypoint (scyllasetup.coredumpSetup) and from
scylla-server.sh when running as root; it succeeds only when the
container has write access to /proc/sys/kernel/core_pattern and is
silently skipped otherwise.
Fixes: SCYLLADB-1366
Closesscylladb/scylladb#29337
Issue #8627 is fixed, so test_too_large_indexed_value_build now passes and should run normally instead of XPASSing under strict xfail.
Fixes: SCYLLADB-1938
Closesscylladb/scylladb#29853
When `force_capacity_based_balancing` is enabled and a node is being drained/excluded, the tablet allocator incorrectly aborts balancing due to incomplete tablet stats - even though capacity-based balancing doesn't depend on tablet sizes.
The tablet allocator normally waits for complete load stats before balancing. An exception exists for drained+excluded nodes (they're unreachable and won't return stats). However, when forced capacity-based balancing is active, this exception was not being applied, causing the balancer to reject the drain plan.
Adjust the condition in `tablet_allocator.cc` so that the "ignore missing data for drained nodes" logic applies regardless of whether capacity-based balancing is forced.
Added a Boost unit test that forces capacity-based balancing and verifies a drained/excluded node gets its tablets migrated even when tablet size stats are missing.
This bug was introduced in 2026.1, so this needs to be backported to 2026.1 and 2026.2
Fixes: SCYLLADB-1803
Closesscylladb/scylladb#29791
* github.com:scylladb/scylladb:
test: boost: add drain test for forced capacity-based balancing
service: allow draining with forced capacity-based balancing
The test samples sl:default runtime before and after setup writes to
prove that it measures the scheduling group used by regular CQL writes.
The metric is exported in milliseconds, so a single 200-row batch may
not be visible immediately, or may be too small in some environments.
Keep the original 200-row table size, but wait up to 30 seconds for the
metric to advance. If it does not, retry the same writes before TTL is
enabled. The retries update the same keys, so the expiration part of the
test still waits for exactly the original number of rows.
In a local 100-run with N=200 rows, the observed delta of
`ms_statement_before - ms_statement_before_write` was: min=4.0,
max=16.0, mean=8.13, and median=8.0. Therefore, it looks possible that
in a rare corner case the delta drops even to 0.
Fixes SCYLLADB-1869
Closesscylladb/scylladb#29797
The first patch improves the input validation of the CONTAINS operator. I believe this is not a critical fix, because RapidJSON already has exception-throwing RAPIDJSON_ASSERT() that check for unexpected JSON structure (like something we expect to be a list isn't actually a list), but it's cleaner to do these checks explicitly.
The second patch just removes an unnecessary call to format() on a constant string.
Closesscylladb/scylladb#28506
* github.com:scylladb/scylladb:
alternator: remove unneeded call to format()
alternator: improve CONTAINS operator's validity checking
When a table's compaction is disabled via 'enabled': 'false', the DESCRIBE
output incorrectly showed NullCompactionStrategy instead of the actual strategy.
This happened because schema_properties() called compaction_strategy(), which
returns compaction_strategy_type::null when compaction is disabled. Fix it by
using configured_compaction_strategy(), which always returns the real strategy
type - consistent with how schema_tables.cc serializes it to disk.
Fixes SCYLLADB-1353
Closesscylladb/scylladb#29804
`system.view_building_tasks` is a single-partition Raft group0 table (pk = `"view_building"`, CK = timeuuid). When `clean_finished_tasks()` deletes hundreds of finished tasks, the physical rows remain in SSTables until compaction. Any subsequent read of the partition counts every column of every tombstoned row
as a dead cell, triggering `tombstone_warn_threshold` warnings in large clusters.
Two-part fix:
**1. Range tombstones instead of row tombstones (commits 2–3)**
Instead of one row tombstone per finished task, find the minimum alive task UUID (`min_alive_uuid`) and emit a single range tombstone `[before_all, min_alive_uuid)` covering all tasks below that boundary. This reduces the tombstone count significantly and also benefits future compaction.
**2. Bounded scan with `min_task_id` (commits 4–6)**
Even with range tombstones, physical rows remain until compaction and still count as dead cells during reads. The only way to avoid them is to not read them at all.
- Add a `min_task_id timeuuid` static column to `system.view_building_tasks`.
- On every GC, write `min_task_id = min_alive_uuid` atomically with the range tombstone (same Raft batch).
- On reload, read `min_task_id` first using a **static-only partition slice** (empty `_row_ranges` + `always_return_static_content`): the SSTable reader stops immediately after the static row before processing any clustering tombstones — zero dead cells counted.
- Use `AND id >= min_task_id` as a lower bound for the main task scan, skipping all tombstoned rows.
The static-only read and the bounded scan are gated on the `VIEW_BUILDING_TASKS_MIN_TASK_ID` cluster feature so mixed-version clusters fall back to the full scan.
The issue is not critical, so the fix shouldn't be backported.
Fixes SCYLLADB-657
Closesscylladb/scylladb#28929
* github.com:scylladb/scylladb:
test/cluster/test_view_building_coordinator: add reproducer for tombstone threshold warning
docs: document tombstone avoidance in view_building_tasks
view_building: add `task_uuid_generator` to `view_building_task_mutation_builder`
view_building: introduce `task_uuid_generator`
view_building: store `min_alive_uuid` in view building state
view_building: set min_task_id when GC-ing finished tasks
view_building: add min_task_id support to view_building_task_mutation_builder
view_building: add min_task_id static column and bounded scan to system_keyspace
view_building: use range tombstone when GC-ing finished tasks
view_building: add range tombstone support to view_building_task_mutation_builder
view_building: introduce VIEW_BUILDING_TASKS_MIN_TASK_ID cluster feature
Seastar deprecated default-constructing input_stream and output_stream
(they are useless in that state), and also deprecated move-assigning
them after the fact.
Fix by wrapping both fields in std::optional, and using emplace() to
construct them in-place once the connected socket is available.
It would be nicer to make connect() a static method that returns
a connection, but that's a larger change.
Closesscylladb/scylladb#29627
When a malformed sstable error occurs, it is usually caused by actual sstable corruption — a cosmic ray, a bad disk write, etc. However, it can also be caused by memory corruption, where a data structure in memory happens to be read as sstable data. In the latter case, having a coredump of the process at the moment of the error is invaluable for post-mortem debugging, since the exception throwing/catching machinery destroys the stack frames that would point to the corruption site.
This patch series introduces `--abort-on-malformed-sstable-error`, a new command-line option (with `LiveUpdate` support) that, when set, causes the server to call `std::abort()` instead of throwing an exception whenever any sstable parse error is detected. This covers all code paths:
- Direct `throw malformed_sstable_exception(...)` sites (migrated to `throw_malformed_sstable_exception()`)
- Direct `throw bufsize_mismatch_exception(...)` sites (migrated to `throw_bufsize_mismatch_exception()`)
- `parse_assert()` failures (via `on_parse_error()`)
- BTI parse errors (via `on_bti_parse_error()`)
The implementation places the flag and helper functions in `sstables/sstables.cc`, next to the existing `on_parse_error()` / `on_bti_parse_error()` infrastructure.
The flag defaults to `false`, preserving current behaviour. It is intended to be enabled temporarily when investigating suspected memory corruption.
**Commit breakdown:**
1. Infrastructure: flag, getter/setter, and throw helpers in `sstables/sstables.cc`; config option wired up in `main.cc`
2. `on_parse_error()` and `on_bti_parse_error()` check the new flag
3. All ~50 `throw malformed_sstable_exception(...)` sites migrated
4. Both `throw bufsize_mismatch_exception(...)` sites migrated
Refs: SCYLLADB-1087
Backport: new feature, no backport
Closesscylladb/scylladb#29324
* github.com:scylladb/scylladb:
sstables: migrate all bufsize_mismatch_exception throw sites to throw_bufsize_mismatch_exception()
sstables: migrate all malformed_sstable_exception throw sites to throw_malformed_sstable_exception()
sstables: make on_parse_error() and on_bti_parse_error() respect --abort-on-malformed-sstable-error
sstables: disable abort-on-malformed-sstable-error in tests that corrupt sstables on purpose
sstables: introduce --abort-on-malformed-sstable-error infrastructure
sstables: refactor parse_path() to return std::expected<> instead of throwing
This series adds per-test bucket isolation to all S3 and GCS object storage tests. Previously, every test shared a single pre-created bucket, which meant tests could interfere with each other through leftover objects and could not run concurrently across multiple `test.py` processes without risking collisions.
New `create_bucket`, `delete_bucket`, and `delete_bucket_with_objects` methods on `s3::client`, following the existing `make_request` pattern. `create_bucket` handles the `BUCKET_ALREADY_OWNED_BY_YOU` error gracefully.
A new `s3_test_fixture` RAII class for C++ Boost tests that creates a uniquely-named bucket on construction (derived from the Boost test name and pid) and tears down everything — objects, bucket, client — on destruction. All S3 tests in `s3_test.cc` are migrated to use it, removing manual `deferred_delete_object` and `deferred_close` boilerplate. The minio server policy is broadened to allow dynamic bucket creation/deletion.
A `client::make` overload that accepts a custom `retry_strategy`, used in tests with a fast 1ms retry delay instead of exponential backoff, significantly reducing test runtime for transient errors during bucket lifecycle operations.
Python-side (`test/cluster/object_store`): each pytest fixture (`object_storage`, `s3_storage`, `s3_server`) now creates a unique bucket per test function via `create_test_bucket()` and destroys it on teardown. Bucket names are sanitized from the pytest node name with a short UUID suffix for uniqueness.
Object storage helpers (`S3Server`, `MinioWrapper`, `GSFront`, `GSServerImpl`, factory functions, CQL helpers, `s3_server` fixture) are extracted from `test/cluster/object_store/conftest.py` into a shared `test/pylib/object_storage.py` module, eliminating duplication across test suites. The conftest becomes a thin re-export wrapper. Old class names are preserved as aliases for backward compatibility.
| Test Name | new test specific retry strategy execution time (ms) | original execution time (ms) | Δ (ms) | Speedup |
|--------------------------------------------------------------|----------------:|-------------:|---------:|--------:|
| test_client_upload_file_multi_part_with_remainder_proxy | 19,261 | 61,395 | −42,134 | **3.2×** |
| test_client_upload_file_multi_part_without_remainder_proxy | 16,901 | 53,688 | −36,787 | **3.2×** |
| test_client_upload_file_single_part_proxy | 3,478 | 6,789 | −3,311 | **2.0×** |
| test_client_multipart_copy_upload_proxy | 1,303 | 1,619 | −316 | 1.2× |
| test_client_put_get_object_proxy | 150 | 365 | −215 | **2.4×** |
| test_client_readable_file_stream_proxy | 125 | 327 | −202 | **2.6×** |
| test_small_object_copy_proxy | 205 | 389 | −184 | 1.9× |
| test_client_put_get_tagging_proxy | 181 | 350 | −169 | 1.9× |
| test_client_multipart_upload_proxy | 1,252 | 1,416 | −164 | 1.1× |
| test_client_list_objects_proxy | 729 | 881 | −152 | 1.2× |
| test_chunked_download_data_source_with_delays_proxy | 830 | 960 | −130 | 1.2× |
| test_client_readable_file_proxy | 148 | 279 | −131 | 1.9× |
| test_client_upload_file_multi_part_with_remainder_minio | 3,358 | 3,170 | +188 | 0.9× |
| test_client_upload_file_multi_part_without_remainder_minio | 3,131 | 2,929 | +202 | 0.9× |
| test_client_upload_file_single_part_minio | 519 | 421 | +98 | 0.8× |
| test_download_data_source_proxy | 180 | 237 | −57 | 1.3× |
| test_client_list_objects_incomplete_proxy | 590 | 641 | −51 | 1.1× |
| test_large_object_copy_proxy | 952 | 991 | −39 | 1.0× |
| test_client_multipart_upload_fallback_proxy | 148 | 185 | −37 | 1.3× |
| test_client_multipart_copy_upload_minio | 641 | 674 | −33 | 1.1× |
No backport needed — this is a test infrastructure improvement with no production code impact beyond the new `s3::client` methods.
Closesscylladb/scylladb#29508
* github.com:scylladb/scylladb:
test: extract object storage helpers to test/pylib/object_storage.py
test: add per-test bucket isolation to object_store fixtures
s3: add client::make overload with custom retry strategy
test: add s3_test_fixture and migrate tests to per-bucket isolation
s3: add create_bucket and delete_bucket to client
s3_storage::make_source previously ignored its file f parameter and
constructed a fresh s3::client::readable_file per call. The new
file's _stats cache was empty, so the first dma_read_bulk issued a
HEAD via maybe_update_stats just to learn the object size before
the ranged GET -- one ~50 ms RTT per uncached read.
The file f passed in by the two callers (sstable::data_stream for
Data.db reads and index_reader::make_context for Index.db reads)
already wraps the sstable's _data_file or _index_file. Those file
objects had their stats populated at sstable open time by
update_info_for_opened_data, and they were wrapped with the
configured file_io_extensions when opened via open_component. Reusing
them is exactly what filesystem_storage::make_source does (one-line
make_file_data_source over f), so the s3 path simply matches it.
readable_file::size() is also updated to route through
maybe_update_stats(), so a .size() call populates the _stats cache
the same way .stat() does -- preventing a redundant HEAD on the
first subsequent read of components opened with .size() (Index,
Partitions, Rows in update_info_for_opened_data).
Closesscylladb/scylladb#29766
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Add some text about how the new transition works. It doesn't include
full feature description, just concentrates on the new transition and
the way it interacts with the rest of topology coordinator machinery.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test starts regular backup+restore on a smaller cluster, but prior
to it spawns tablet migration from one node to another and locks it in
the middle with the help of block_tablet_streaming injection (even
though tablets have no data and there's nothing to stream, the injection
is located early enough to work).
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test checks that losing one of nodes from the cluster while restore
is handled. In particular:
- losing an API node makes the task waiting API to throw (apparently)
- losing coordinator or replica node makes the API call to fail, because
some tablets should fail to get restored. If the coordinator is lost,
it triggers coordinator re-election and new coordinator still notices
that a tablet that was replicated to "old" coordinator failed to get
restored and fails the restore anyway
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When the storage_service::restore_tablets() resolves, it only means that
tablet transitions are done, including restore transitions, but not
necessarily that they succeeded. So before resolving the restoration
task with success need to check if all sstables were downloaded and, if
not, resolve the task with exception.
Test included. It uses fault-injection to abort downloading of a single
sstable early, then checks that the error was properly propagated back
to the task waiting API
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
After each SSTable is successfully attached to the local table in
download_tablet_sstables(), update its downloaded status in
system_distributed.snapshot_sstables to true. This enables tracking
restore progress by counting how many SSTables have been downloaded.
Change attach_sstable() return type from future<> to
future<sstables::shared_sstable>, returning the SSTable that was
attached. This will be used to extract the SSTable identifier and
first token for updating the download status.
Add a method to update the downloaded status of a specific SSTable
entry in system_distributed.snapshot_sstables. This will be used
by the tablet restore process to mark SSTables as downloaded after
they have been successfully attached to the local table.
Add a 'downloaded' boolean column to the snapshot_sstables table
schema and the corresponding field to the snapshot_sstable_entry
struct. Update insert_snapshot_sstable() and get_snapshot_sstables()
to write and read this column.
This column will be used to track which SSTables have been
successfully downloaded during a tablet restore operation.
Co-authored-by: Pavel Emelyanov <xemul@scylladb.com>
Move the TTL value used for snapshot_sstables rows from a local
variable in insert_snapshot_sstable() to a class-level constant
SNAPSHOT_SSTABLES_TTL_SECONDS, making it reusable by other methods.
The test is derived from test_restore_with_streaming_scopes() one, with
the excaption that it doesn't check for streaming directions, doesn't
check mutations right after creation and doesn't loop over scoped
sub-tests, because there's no scope concept here.
Also it verifies just two topologies, it seems to be enough. The scopes
test has many topologies because of the nature of the scoped restore,
with cluster-wide restore such flexibility is not required.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This patch adds
- Changes in sstables_loader::restore_tablets() method
It populates the system_distributed_keyspace.snapshot_sstables table
with the information read from the manifest
- Implementation of tablet_restore_task_impl::run() method
It emplaces a bunch of tablet migrations with "restore" kind
- Topology coordinator handling of tablet_transition_stage::restore
When seen, the coordinator calls RESTORE_TABLET RPC against all tablet
replicas
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The topology coordinator will need to call this verb against existing
tablet replicas to ask them restore tablet sstables. Here's the RPC verb
to do it.
It now returns an empty restore_result to make it "synchronous" -- the
co_await send_restore_tablets() won't resolve until client call
finishes.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Extracts the data from snapshot_sstables tables and filters only
sstables belonging to current node and tablet in question, then starts
downloading the matched sstables
Extracted from Ernest PR #28701 and piggy-backs the refactoring from
another Ernest PR #28773. Will be used by next patches.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
When doing cluster-wide restore using topology coordinator, the
coordinator will need to serve a bunch of new tablet transition kinds --
the restore one. For that, it will need to receive information about
from where to perform the restore -- the endpoint and bucket pair. This
data can be grabbed from nowhere but the tablet transition itself, so
add the "restore_config" member with this data.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The new cluster-wide tablets restore API is going to be asynchronous,
just like existing node-local one is. For that the task_manager tasks
will be used.
This patch adds a skeleton for tablets-restore task with empty run
method. Next patches will populate it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Withdrawn from #28701. The endpoint implementation from the PR is going
to be reworked, but the swagger description and set/unset placeholders
are very useful.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Ernest Zaslavsky <ernest.zaslavsky@scylladb.com>
When restoring a backup into a keyspace under a different name, than the
one at which it existed during backup, the snapshot_sstables table must
be populated with the _new_ keyspace name, not the one taken from
manifest. Same is true for table name.
This patch makes it possible to override keyspace/table loaded from
manifest file with the provided values. in the future it will also be
good to check that if those values are not provided by user, then values
read from different manifest files are the same.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Rename arguments and local variables in get_sstables_for_tablet to avoid
references to SSTable-specific terminology. This makes the function more
generic and better suited for reuse with different range types.
Generalize get_sstables_for_tablet by templating the return type so it
produces vectors matching the input range’s value type. This makes the
function more flexible and prepares it for reuse in tablet‑aware
restore.
Add getters for the first and last tokens in get_sstables_for_tablet to
make the function more generic and suitable for future use in the
tablet-aware restore code.
Move the download_sstable and get_sstables_for_tablet static functions
from sstables_loader into a new file to make them reusable by the
tablet-aware restore code.
Extract single-tablet range filtering into a new
get_sstables_for_tablet function, taken from the existing
get_sstables_for_tablets. This will later be reused in the
tablet-aware restore code.
Extract the logic for downloading a single SST into a dedicated
function and reuse it in download_fully_contained_sstables. This
supports upcoming changes that consolidate common code.
Add a shard_id member to the minimal_sst_info struct as part of the
tablet-aware restore refactoring. This will support upcoming changes
that extract common code.
This change adds functionality for parsing backup manifests
and populating system_distributed.snapshot_sstables with
the content of the manifests.
This change is useful for tablet-aware restore. The function
introduced here will be called by the coordinator node
when restore starts to populate the snapshot_sstables table
with the data that workers need to execute the restore process.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Co-authored-by: Pavel Emelyanov <xemul@scylladb.com>
This patch adds the snapshot_sstables table with the following
schema:
```cql
CREATE TABLE system_distributed.snapshot_sstables (
snapshot_name text,
keyspace text, table text,
datacenter text, rack text,
id uuid,
first_token bigint, last_token bigint,
toc_name text, prefix text)
PRIMARY KEY ((snapshot_name, keyspace, table, datacenter, rack), first_token, id);
```
The table will be populated by the coordinator node during the restore
phase (and later on during the backup phase to accomodate live-restore).
The content of this table is meant to be consumed by the restore worker nodes
which will use this data to filter and file-based download sstables.
Fixes SCYLLADB-263
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
The loader will need to populate and read data from
system_distributed.snapshot_sstables table added recently, so this
dependency is truly needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Next patches will introduce an RPC handler to restore a tablet on
replica. The handler will be registered by sstables_loader, and it will
have to call that helper from storage_service which thus needs to be
moved to public scope.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The goal of the split is to have try_transit_tablet() that
- doesn't throw if tablet is in transition, but reports it back
- doesn't wait for the submitted transition to finish
The user will be in tablet-aware-restore, it will call this new trying
helper in parallel, then wait for all transitions to finish.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
- Add `large_rows_exceeding_threshold`, `large_cell_exceeding_threshold`, and `large_collection_exceeding_threshold` metrics to complement the existing `large_partition_exceeding_threshold`
- Add unit tests verifying stats counters increment correctly during SSTable writes
Backport is not needed
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1095Closesscylladb/scylladb#29722
* github.com:scylladb/scylladb:
test/boost: add tests for large data stats counters
db: add large data metrics for rows, cells, and collections
Replace the local buffer_data_sink_impl and buffer_data_source_impl
classes in create_memory_sink() and create_memory_source() with
seastar::util::memory_data_sink and seastar::util::memory_data_source
respectively, which are now available upstream.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#29616
Add test_large_data_stats_large_rows, test_large_data_stats_large_cells,
and test_large_data_stats_large_collections to verify that the
large_data_handler stats counters are correctly incremented during
SSTable writes and that unrelated counters remain at zero.
Previously only large_partition_exceeding_threshold was exposed as a
metric. Add three new counters to large_data_handler::stats and register
corresponding Prometheus metrics:
- large_rows_exceeding_threshold
- large_cell_exceeding_threshold
- large_collection_exceeding_threshold
The counters are incremented in maybe_record_large_rows() and
maybe_record_large_cells() following the same pattern used by the
existing partition metric.
The procedure to migrate a vnodes-based keyspace to tablets-based keyspace
has been labeled as experimental.
Fixes SCYLLADB-1932
Closesscylladb/scylladb#29834
Fixes: SCYLLADB-1693
In case we abort a decommission operation, the snapshot/backup
mechanism need to remain open.
This change moves it to after raft_decommission.
In the case of a cluster snapshot, our nodes ownership
or not of tables will be serialized by raft anyway, so
should remain consistent. In that case we at worst coordinate
from a node in "leave" status
In the case of a local snapshot, ownership matters less,
only sstables on disk, which should not change.
In the case of backup, this operates on a snapshot, state of which
is not affected.
Adds an injection point for testing.
v2:
- Added injection point to ensure test can abort decommission
Closesscylladb/scylladb#29667
6165124fcc has left statement_restrictions.cc scarred and
deformed. Restore it to standard 4-space indentation. This patch
contains only whitespace changes.
Closesscylladb/scylladb#29598
When a topology command (e.g., rebuild) fails on a target node, the
exception message was being swallowed at multiple levels:
1. raft_topology_cmd_handler caught exceptions and returned a bare
fail status with no error details.
2. exec_direct_command_helper saw the fail status and threw a generic
"failed status returned from {id}" message.
3. The rebuilding handler caught that and stored a hardcoded
"streaming failed" message.
This meant users only saw "rebuild failed: streaming failed" instead
of the actionable error from the safety check (e.g., "it is unsafe
to use source_dc=dc2 to rebuild keyspace=...").
Fix by:
- Adding an error_message field to raft_topology_cmd_result (with
[[version 2026.2]] for wire compatibility).
- Populating error_message with the exception text in the handler's
catch blocks.
- Including error_message in the exception thrown by
exec_direct_command_helper.
- Passing the actual error through to rtbuilder.done() instead of
the hardcoded "streaming failed".
A follow-up test is in https://github.com/scylladb/scylladb/pull/29363
Fixes: SCYLLADB-1404
Closesscylladb/scylladb#29362
Add explicit empty permissions block (permissions: {}) since this
workflow only syncs milestones to Jira using its own secrets and needs
no GITHUB_TOKEN permissions. Fixes code scanning alert #171.
Closesscylladb/scylladb#29184
When finalizing a tablet split, all data must have been moved into
split-ready compaction groups before the storage groups can be remapped
to the new tablet count. If split-unready groups still hold data at that
point, handle_tablet_split_completion() calls on_internal_error(), which
previously only reported the tablet and table IDs — giving no insight
into why the split-unready groups were not empty.
Add fmt::formatter specializations for compaction_group and storage_group
so the full state of the offending storage_group is included in the error
message. The storage_group formatter emits:
main=<cg>, merging=[<cg>...], split_ready=[<cg>...]
Each compaction_group formatter emits:
[sstables=[<sstable_desc>...], memtable_empty=<bool>, sstable_add_gate=<count>]
where sstable_desc includes filename, origin, identifier and originating
host, memtable_empty reflects whether all memtables have been flushed,
and sstable_add_gate count reveals whether an in-flight sstable add is
holding data in the group.
Supporting changes:
- compaction_group: add memtable_empty() const noexcept (delegates to
memtable_list::empty()) and a const overload of sstable_add_gate()
so both are accessible from a const compaction_group reference inside
the formatter.
- Promote sstable_desc from a local lambda in compaction_group_for_sstable
to a static free function so it is reusable by the formatter.
Refs https://scylladb.atlassian.net/browse/SCYLLADB-1019.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#29178
For ms-format (trie-based) sstables, the traditional summary structure
is not populated. Instead, read equivalent metadata from the
_partitions_db_footer field: first_key, last_key, partition_count,
and trie_root_position.
This is a follow-up to the crash fix for SCYLLADB-1180, replacing the
informational-only message with actual useful output.
Refs: SCYLLADB-1180
Closesscylladb/scylladb#29164
When a partition key or clustering key value exceeds the 64 KiB limit
(65535 bytes serialized), Scylla used to raise a generic
std::runtime_error "Key size too large: N > M" from the low-level
compound-key serializer. That error surfaced to clients as a CQL
server error (code 0x0000, "NoHostAvailable"-looking), which is both
ugly and incompatible with Cassandra - Cassandra returns a clean
InvalidRequest with the message "Key length of N is longer than
maximum of M".
Fix this at the single chokepoint: compound_type::serialize_value in
keys/compound.hh. The serializer is on every path that materializes a
key - INSERT/UPDATE/DELETE/BATCH build mutations through it, and
SELECT builds partition and clustering ranges through it - so a single
throw replacement produces a clean InvalidRequest consistently across
all paths and all key shapes (single, compound PK, composite CK).
The previous approach on this PR branch patched three call sites in
cql3/restrictions/statement_restrictions.cc, which only covered
SELECT, duplicated the check, and placed it mid-restrictions code
(flagged in review). Dropping those changes in favour of the
root-cause fix here.
Un-xfail the tests this fixes:
- test/cqlpy/test_key_length.py: test_insert_65k_pk, test_insert_65k_ck,
test_where_65k_pk, test_where_65k_ck, test_insert_65k_ck_composite,
test_insert_total_compound_pk_err, test_insert_total_composite_ck_err.
- test/cqlpy/cassandra_tests/.../insert_test.py: testPKInsertWithValueOver64K,
testCKInsertWithValueOver64K.
- test/cqlpy/cassandra_tests/.../select_test.py: testPKQueryWithValueOver64K.
test_insert_65k_pk_compound stays xfail: its oversized value gets
rejected by the Python driver's CQL wire-protocol encoder (see
CASSANDRA-19270) before reaching the server, so the fix can't apply.
Updated its reason. testCKQueryWithValueOver64K stays xfail with an
updated reason: Cassandra silently returns empty for an oversized
clustering key in WHERE, while Scylla now throws InvalidRequest - a
deliberate choice mirroring the partition-key case, documented in
the discussion on #10366.
Add three tight-boundary tests (addressing review feedback on the
previous revision) that pin MAX+1 behaviour for SELECT and INSERT of
both partition and clustering keys.
Update test/cluster/dtest/limits_test.py to match the new message
("Key length of \\d+ is longer than maximum of 65535").
fixes#10366fixes#12247
Co-authored-by: Alexander Turetskiy <someone.tur@gmail.com>
Closesscylladb/scylladb#23433
When a multi-column IN restriction contains tuples with a different
number of elements than the number of restricted columns (e.g.
`(b, c, d) IN ((1, 2), (2, 1, 4))`), Scylla would either produce an
inconsistent error message or, for over-sized tuples, an internal
type-mismatch error referencing the list literal representation.
Validate each tuple's arity against the number of restricted columns
while building the IN restriction and raise a clear
"Expected N elements in value tuple, but got M" error in both the
under- and over-sized cases.
Fixes#13241
Co-authored-by: Alexander Turetskiy <someone.tur@gmail.com>
Closesscylladb/scylladb#18407
This PR adds the `fulltext_index` custom index class, laying the groundwork for full-text search in ScyllaDB. It focuses on the CQL-facing layer - schema validation, option parsing, and metadata - without implementing the search backend itself.
Users can now write:
```cql
CREATE CUSTOM INDEX ON t(content) USING 'fulltext_index'
WITH OPTIONS = {'analyzer': 'english', 'positions': 'false'};
```
The implementation follows the same custom index pattern established by vector search: a `custom_index` subclass registered in the factory map, with no backing materialized view. This keeps the door open for a CDC-based indexing pipeline similar to the one vector search uses.
As part of this work, the option validation helpers (`validate_enumerated_option`, `validate_positive_option`, `validate_factor_option`) were extracted from `vector_index.cc` into a shared header so both index types can reuse them. The `custom_index` base class also gained a virtual `index_type_name()` method, giving each subclass a self-describing name for error messages without hardcoding strings in shared code.
The PR is split into three commits:
1. Extract shared validation utilities and add `index_type_name()` to `custom_index`
2. Implement `fulltext_index` with column type and option validation
3. Integration tests covering creation, validation, describe, and metadata
Fixes: SCYLLADB-1517
Fixes: SCYLLADB-1510
References: SCYLLADB-1516
Closesscylladb/scylladb#29658
* github.com:scylladb/scylladb:
test/cqlpy: add integration tests for `fulltext_index`
index: unify custom index description
index: add `fulltext_index` custom index implementation
index: extract option validation helpers
selection::used_functions() pushed the UDA, its SFUNC and its FINALFUNC,
but never the REDUCEFUNC. The reducefunc is invoked by the distributed
aggregation path in service::mapreduce_service, so a user could cause it
to run server-side without holding EXECUTE on it as long as the query
took the mapreduce path.
Also push agg.state_reduction_function so select_statement::check_access
requires EXECUTE on it too.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1756
Backport: no, it's a minor fix and UDFs are experimental feature in Scylla
Closesscylladb/scylladb#29717
* github.com:scylladb/scylladb:
test/cqlpy: add test for EXECUTE permission on UDA sub-functions
cql3/selection: require EXECUTE on UDA REDUCEFUNC at SELECT time
After `load_and_stream` (e.g. via `nodetool refresh --load-and-stream`)
returns success, source sstable files in the `upload/` directory may
still be on disk. `mark_for_deletion()` only sets an in-memory flag; the
actual file deletion runs lazily when the last `shared_sstable`
reference drops.
This leaves a window between API success and physical deletion where a
follow-up scan of the upload directory can detected sstables that will be deleted soon.
This might cause failure because SSTable will be already wiped during processing.
For fix:
Force unlink to complete before `stream()` returns, so the upload
directory is in a consistent state by the time the API reports success.
For tablet streaming, partially-contained sstables participate in
multiple per-tablet batches; eagerly unlinking after each batch would
break the next batch that still needs to read the file. A
`defer_unlinking` flag on the streamer postpones the explicit unlink
until after all batches complete (called once at the end of
`tablet_sstable_streamer::stream()`). Vnode streaming unlink eagerly at the end of
`stream_sstable_mutations`.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1647
Backport is required, as it is a bug fix that was introduced in 517a4dc4df.
Closesscylladb/scylladb#29599
* github.com:scylladb/scylladb:
sstables_loader: synchronously unlink streamed sstables before returning
sstables: make sstable::unlink() idempotent
When a user calls the repair API with identical startToken and endToken
values, the code creates a wrapping interval (T, T]. This causes
unwrap() to split it into (-inf, T] and (T, +inf), covering the entire
token ring and triggering a full repair.
Reject such requests early with an error message matching
Cassandra's behavior: "Start and end tokens must be different."
Fixes: https://scylladb.atlassian.net/browse/CUSTOMER-358Closesscylladb/scylladb#29821
Add a node_owner column (locator::host_id) to system.sstables and make it part of the partition key, so the primary key becomesv PRIMARY KEY ((table_id, node_owner), generation).
This is the first step toward moving the sstables registry into system_distributed: once distributed, each node's startup scan must read only the rows it owns, which requires the owning node to be part of the partition key. Partitioning by (table_id, node_owner) turns that scan into a single-partition read of exactly the local node's rows.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1562
No need to backport this, keyspace over object storage is experimental feature
Closesscylladb/scylladb#29659
* github.com:scylladb/scylladb:
db, sstables: add node_owner to sstables registry primary key
db, sstables: rename sstables registry column owner to table_id
Replace the two remaining direct 'throw bufsize_mismatch_exception(...)'
call sites with the new throw_bufsize_mismatch_exception() helper, which
routes through throw_malformed_sstable_exception() and thus also respects
the --abort-on-malformed-sstable-error flag.
Affected files:
- sstables/sstables.cc (1 site, in check_buf_size())
- sstables/m_format_read_helpers.cc (1 site, in check_buf_size())
Replace all direct 'throw malformed_sstable_exception(...)' call sites
with the new throw_malformed_sstable_exception() helper, which respects
the --abort-on-malformed-sstable-error flag.
Both functions now check abort_on_malformed_sstable_error() first. If
set, they log the error and call std::abort() directly, generating a
coredump. Otherwise they fall through to the existing on_internal_error()
path, which is in turn controlled by --abort-on-internal-error.
Add scoped_no_abort_on_malformed_sstable_error RAII guard (modeled after
seastar::testing::scoped_no_abort_on_internal_error) and use it in all
tests that intentionally corrupt sstables and expect
malformed_sstable_exception to be thrown rather than the process aborting.
Add the --abort-on-malformed-sstable-error command-line option and the
supporting infrastructure. When set, any malformed sstable error will
abort the process and generate a coredump instead of throwing an
exception. This is useful for debugging memory corruption that may
manifest as apparent sstable corruption.
The implementation introduces:
- throw_malformed_sstable_exception() and throw_bufsize_mismatch_exception()
helper functions in sstables/sstables.cc, which check the new flag and
either abort (with logging) or throw the appropriate exception.
- set_abort_on_malformed_sstable_error() / abort_on_malformed_sstable_error()
to control the per-process atomic flag.
- abort_on_malformed_sstable_error config option (LiveUpdate, default false)
wired up in main.cc alongside abort_on_internal_error.
Call-site migration will follow in subsequent commits.
make_entry_descriptor() and the two overloads of parse_path() used to signal
parse failures by throwing malformed_sstable_exception, which made parse_path()
expensive to use as a probe (e.g. to classify directory entries).
Change make_entry_descriptor() and both parse_path() overloads to return
std::expected<T, sstring>, where the sstring carries the error message on
failure, eliminating the exception overhead at probe call sites.
Call sites that previously caught malformed_sstable_exception to treat the
path as a non-SSTable file (utils/directories.cc, db/snapshot/backup_task.cc,
tools/scylla-sstable.cc) now check the expected result directly.
Call sites where a parse failure is a genuine error (sstable_directory.cc,
sstables.cc, tools/schema_loader.cc, tools/scylla-sstable.cc) re-throw
explicitly as malformed_sstable_exception using the error string, preserving
the existing error propagation behaviour.
Verify that SELECT of a UDA requires EXECUTE on its SFUNC, FINALFUNC,
and REDUCEFUNC individually. If any one permission is missing, the
query must be rejected at planning time (even on an empty table).
The test is parameterized over the three sub-functions and uses
Lua on Scylla or Java on Cassandra, so it runs on both backends.
The REDUCEFUNC case is skipped on Cassandra since REDUCEFUNC is a
Scylla extension.
Refs SCYLLADB-1756
The MV Select Statement description was missing the word "columns" and
used incorrect verb agreement, making the sentence grammatically broken
and ambiguous.
docs/cql/mv.rst: "which of the base table is included" →
"which of the base table columns are included"
Fixes#29662Closes#29663
Co-authored-by: annastuchlik <37244380+annastuchlik@users.noreply.github.com>
The timeout_config (more exactly -- updatable_timeout_config) is used by alternator/controller and transport/controller. Both create a local copy of that opbject by constructing one out of db::config. Also some options from this config are needed by storage_proxy, but since it doesn't have access to any timeout_config-s, it just uses db::config by getting it from the database.
This PR introduces top-level sharded<updateable_timeout_config>, initializes it from db::config values and makes existing users plus storage_proxy us it where required. Motivation -- remove more replica::database::get_config() users. A side effect -- timeout_config is not duplicated by transport and alternator controllers.
Components' dependencies cleanup, not backporting.
Closesscylladb/scylladb#29636
* github.com:scylladb/scylladb:
storage_proxy: Use shared updateable_timeout_config for CAS contention timeout
alternator: Use shared updateable_timeout_config by reference
cql_transport: Use shared updateable_timeout_config by reference
storage_proxy: Use shared updateable_timeout_config by reference
main: Introduce sharded<updateable_timeout_config>
storage_proxy: Keep own updateable_timeout_config
This option is used in two places -- proxy and view-update-generator both need it to calculate the calculate_view_update_throttling_delay() value. This PR moves the option onto view_update_backlog top-level service, makes the calculating helper be method of that class and patches the callers to use it. This eliminates more places that abuse database as db::config accessor.
Code dependencies refactoring, not backporting
Closesscylladb/scylladb#29635
* github.com:scylladb/scylladb:
view: Turn calculate_view_update_throttling_delay into node_update_backlog member
view: Place view_flow_control_delay_limit_in_ms on node_update_backlog
view: Add node_update_backlog reference to view_update_generator
Stop reading maintenance_mode through replica::database's db::config.
Add a properly typed maintenance_mode_enabled field to
storage_proxy::config, populate it in main.cc from cfg->maintenance_mode()
(same as messaging_service::config), and use a cached member in
storage_proxy instead of db.local().get_config().maintenance_mode().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Closesscylladb/scylladb#29637
Add .set_skip_when_empty() to compaction_manager::validation_errors.
This metric only increments when scrubbing encounters out-of-order or
invalid mutation fragments in SSTables, indicating data corruption.
It is almost always zero and creates unnecessary reporting overhead.
AI-Assisted: yes
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Closesscylladb/scylladb#29349
Add a --time-trace flag to configure.py and a Scylla_TIME_TRACE CMake
option that enable Clang's -ftime-trace on all C++ compilations. When
enabled, each .o file produces a companion .json trace that can be
analyzed with ClangBuildAnalyzer or loaded in chrome://tracing to
identify slow headers and costly template instantiations.
This is the first step toward data-driven build speed improvements.
Refs #1
Usage:
configure.py: ./configure.py --time-trace --mode dev
CMake: cmake -DScylla_TIME_TRACE=ON -DCMAKE_BUILD_TYPE=Dev ..
Closesscylladb/scylladb#29462
rust/**/target and Cargo.lock files under rust/inc/ and
rust/wasmtime_bindings/ were not ignored, nor was
test/resource/wasm/rust/target/.
Closesscylladb/scylladb#28943
Fix 28 format string bugs plus 5 related format argument bugs across 14 modules
where `{}` placeholders were missing or arguments were wrong, causing arguments to
be silently dropped or misleading output from the `{fmt}` library.
Inspired by https://github.com/scylladb/scylladb/pull/29143 (which fixed a single
instance in `replica/table.cc`), a comprehensive audit of the entire codebase was
performed to find all similar issues.
- **Missing `{}` placeholder** (21 instances): format string simply lacks `{}` for a
passed argument, e.g. `format("msg for table {}", group_id, table_id)` -- `group_id`
is silently dropped
- **Spurious comma breaking C++ string literal concatenation** (2 instances): a comma
after a string literal prevents adjacent-literal concatenation, turning the
continuation into a format argument instead of part of the format string
- **Printf-style `%s` in fmtlib context** (4 instances): `%s` has no meaning in fmtlib
and appears as literal text while the argument is silently ignored
- **Extra spurious argument** (1 instance): an extraneous `t.tomb()` argument inserted
between correct arguments, causing wrong values in the wrong slots
- **Wrong variable in error message** (4 instances in `types/map.hh`): error messages
for oversized map keys/values reported `map_size` (total entry count) instead of the
actual `elem.first.size()` or `elem.second.size()` that exceeded the limit
- **Swapped argument order** (1 instance in `data_dictionary/data_dictionary.cc`):
format string says `"Extraneous options for {type}: {values}"` but the values and
type arguments were passed in reverse order
| Module | Bugs Fixed | Files |
|--------|:---------:|-------|
| `replica/` | 1 | `table.cc` |
| `service/` | 4 | `raft_group0.cc`, `storage_service.cc` |
| `db/` | 6 | `heat_load_balance.cc`, `commitlog_replayer.cc`, `view_update_generator.cc`, `view_building_worker.cc`, `row_locking.cc` |
| `cql3/` | 2 | `prepare_expr.cc`, `statement_restrictions.cc` |
| `transport/` | 4 | `event_notifier.cc` |
| `sstables/` | 3 | `partition_reversing_data_source.cc`, `reader.cc` |
| `alternator/` | 1 | `conditions.cc` |
| `cdc/` | 1 | `split.cc` |
| `raft/` | 1 | `server.cc` |
| `utils/` | 2 | `gcp/object_storage.cc`, `s3/client.cc` |
| `mutation/` | 1 | `mutation_partition.hh` |
| `ent/` | 2 | `kmip_host.cc`, `kms_host.cc` |
| `types/` | 4 | `map.hh` |
| `data_dictionary/` | 1 | `data_dictionary.cc` |
The `{fmt}` library's compile-time checker validates that each `{}` placeholder
references a valid argument, but does **not** verify the reverse -- that every
argument has a corresponding placeholder. Extra arguments are silently ignored
at both compile time and runtime.
Build verified with `dbuild ninja build/dev/scylla` -- compiles cleanly.
---
**Note:** Commits were amended to fix the author name from "Yaniv Michael Kaul" to "Yaniv Kaul".
Closesscylladb/scylladb#29448
* github.com:scylladb/scylladb:
data_dictionary: fix swapped arguments in extraneous options error
types: fix wrong variable in map key/value size error messages
ent: fix missing format placeholders in encryption error/log messages
mutation: fix spurious argument in shadowable_tombstone formatter
utils: fix missing format placeholders in object storage log messages
raft: fix missing format placeholder in server ostream operator
cdc: fix missing format placeholder in error message
alternator: fix missing format placeholder in error message
sstables: fix missing format placeholders in error messages
transport: fix printf-style format specifiers in fmtlib log calls
cql3: fix missing format placeholders in error messages
db: fix missing format placeholders in log and error messages
service: fix missing format placeholders in log messages
replica: fix missing format placeholder in cleanup log message
Build artifacts are currently scattered across
build/dist/$mode/redhat/, tools/python3/build/, tools/cqlsh/build/, etc. with unpredictable names. Add a new 'collect-dist' ninja target that
gathers all distributable artifacts into a well-known structure:
build/$mode/dist/rpm/ -- all binary RPMs (no SRPMs)
build/$mode/dist/deb/ -- all .deb packages
build/$mode/dist/tar/ -- relocatable tarballs (already here)
The collection is done via a reusable 'collect_pkgs' ninja rule defined
directly in configure.py, which knows all the source paths. No external
script is needed.
Fixes: SCYLLADB-75
Closesscylladb/scylladb#29475
configure.py creates compile_commands.json in the root directory as a
symbolic link to the file in one of the build directories. If the file
already exists it does nothing.
However it may happen that the file exists but the target file does not
exist. For example, if the build directory is removed and then building
with a different mode. Then the file will remain as a stale symbolic
link.
To address this, when the file exists check also if it's a valid
symbolic link. If not, then recreate it with a valid target.
Closesscylladb/scylladb#29680
Alternator Streams now supports tablets, so stop skipping the TTL Streams test in tablet mode and stop forcing vnodes in the Streams audit test.
Refs SCYLLADB-463
Closesscylladb/scylladb#29697
As a final step for https://scylladb.atlassian.net/browse/SCYLLADB-461 we need to graduate Alternator Streams from experimental.
So let's remove `--experimental-features=alternator-streams` and map the obsolete config string to `UNUSED` for backward compatibility. Also, remove the related gating of the feature.
Finally, stop providing the config flag in test configs.
Fixes SCYLLADB-1680
Fixes#16367
To documentation tracked by https://scylladb.atlassian.net/browse/SCYLLADB-462 still remains.
This PR needs to hit 2026.2, so (only) if it branches before the PR is merged to `master`, we'd need to backport.
Closesscylladb/scylladb#29604
* github.com:scylladb/scylladb:
test: Stop providing alternator-streams experimental flag
alternator: Graduate Alternator Streams from experimental
Migrate vector search (ANN ordered select query) CQL tests from C++/Boost suite to pytest.
This migration includes:
- New pytest tests in `test/cqlpy/test_vector_search_with_vector_store_mock.py`
- VectorStoreMock server as pytest fixture to simulate vector store responses
The benefits of this migration are:
- Extended test coverage to verify CQL protocol serialization and driver
- Reduced overall test time (no compilation required for pytest)
Fixes SCYLLADB-695
No backport needed as this is a refactoring.
Closesscylladb/scylladb#29593
* github.com:scylladb/scylladb:
vector_search: test: migrate paging warnings tests to Python
vector_search: test: migrate local_vector_index to Python
vector_search: test: migrate vector_index_with_additional_filtering_column to Python
vector_search: test: migrate cql_error_contains_http_error_description to Python
vector_search: test: migrate pk in restriction test to Python
- Handle dropped tables gracefully in the tablet load balancer's `get_schema_and_rs()` instead of aborting with `on_internal_error`
- The load balancer operates on a token metadata snapshot but accesses the live schema for table lookups. A DROP TABLE applied by another fiber between coroutine yield points can remove a table from the live schema while it still exists in the snapshot, causing an abort.
`get_schema_and_rs()` now returns `std::optional` and logs a warning in debug log level instead of aborting when a table is missing. All callers skip dropped tables:
- `make_sizing_plan`: skips to next table
- `make_resize_plan`: skips to next table (merge suppression is moot)
- `check_constraints`: returns `skip_info{}` with empty viable targets
- `get_rs`: returns `nullptr`, checked by `check_constraints`
The call chain is: `make_plan` → `make_internode_plan` → `check_constraints` → `get_rs` → `get_schema_and_rs`. The `make_internode_plan` coroutine has multiple `co_await` yield points (`maybe_yield`, `pick_candidate`) between building the candidate tablet list and checking replication constraints. A DROP TABLE schema mutation applied during any of these yields removes the table from `_db.get_tables_metadata()` while the candidate list still references it.
Added `test_load_balancing_with_dropped_table` which simulates the race by capturing a token metadata snapshot, dropping the table, then calling `balance_tablets` with the stale snapshot.
Fixes: SCYLLADB-1664
This fix needs to be backported to versions: 2025.4, 2026.1
Closesscylladb/scylladb#29585
* github.com:scylladb/scylladb:
test: verify load balancer handles dropped tables gracefully
tablet_allocator: handle dropped tables gracefully in get_schema_and_rs
When an Alternator stream is disabled, the data should continue to be accessible so that consumers can finish reading. When the stream is later re-enabled, a new StreamArn is produced and only then the old data is purged.
On disable, the existing CDC options (including preimage and postimage) are preserved so that DescribeStream can still report StreamViewType. All stream APIs continue to work on the disabled stream, with all shards reported as closed (EndingSequenceNumber set). No new CDC records are written; existing data expires via TTL after 24 hours.
On re-enable, the old CDC log table is dropped as a separate Raft group0 schema change and a fresh one is created with a new UUID, giving a new StreamArn. This is Alternator-specific — CQL CDC keeps reusing the log table. Re-enabling is the only way to immediately purge old stream data.
Old stream data is removed immediately upon re-enable (a discrepancy with DynamoDB, which keeps it readable for 24 hours through the old StreamArn).
Tests updated to cover the new disable and re-enable behavior.
Fixes#7239
Fixes SCYLLADB-523
Closesscylladb/scylladb#29413
* github.com:scylladb/scylladb:
alternator/streams: remove dead next_iter in get_records
test/alternator: fix stream wait timeouts to use wall-clock time
docs/alternator: document stream disable/re-enable behavior
alternator/streams: keep disabled streams usable and purge on re-enable
Introduce `read_from_collection_cell_view()` which reads a `collection_mutation` directly from the IDL representation of a collection (`ser::collection_cell_view`). This cuts down the number of allocations required drastically compared to the current method of:
IDL -> collection_mutatio_description -> collection_mutation
Reduces the number of allocations to unfreeze a collection from O(collection_cell_count) -> O(1) (actually, due to buffer fragmentation, it is O(collection_size)).
The new method is used when unfreezing frozen mutations and frozen mutation fragments. This is on the hot path: all writes with collections benefit.
Add a `--collection` flag to `perf-simple-query` to allow measuring the performance improvement of this PR.
With `dbuild -it -- build/release/scylla perf-simple-query --collection=16 -c1 -m2G --default-log-level=error --write` the number of allocations drop from ~123 to 102, which is a significant amount of allocations shaved off.
Refs: https://github.com/scylladb/scylladb/issues/3602 (solves one use-case out of the many listed therein)
Fixes: SCYLLADB-1046
Fixes: SCYLLADB-1077
Backport: this is an optimization so normally not a backport candidate, but we may have to backport to relieve certain customers
Closesscylladb/scylladb#29033
* github.com:scylladb/scylladb:
test/perf/perf_simple_query: add --collection=N
test/boost/frozen_mutation_test: add freeze/unfreeze test for large collections
mutation/mutation_partition_view: use read_from_collection_cell_view() to read collections
mutation/collection_mutation: introduce read_from_collection_cell_view()
mutation/atomic_cell: atomic_cell_type: add write*() and *serialized_size()
mutation/collection_mutation: generalize serialize_collection_mutation
mutation/mutation_partition_view: avoid copying collection
mutation/mutation_partition_view: accept collection_mutation in the consume API
partition_builder: add move variant of accept_*_cell() collection overloads
Copilot who review the implementation of the CONTAINS operator
complained that in some places we assume without checking that the
user-providing parameter to CONTAINS has the expected structure.
Not doing all the checks explicitly is actually not terrible in
RapidJSON, because its methods like BeginMembers() always validate the
type before trying to follow a pointer, throwing an exception if it
the JSON value doesn't have the right type. But it's still cleaner
to do these checks explicitly, and throw a clean SerializationError
instead of some internal server error. So this is what this patch does.
If the malformed object doesn't come from the query but rather comes
from the data, we just silently return false. This is our usual
convention - we don't expect malformed data in our database, but if
we do have some (see issue #8070) we shouldn't tell the user that
there was an error in his completely valid query.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The format string says "Extraneous options for {type}: {values}"
but the arguments were passed in the wrong order (values first, type
second), producing misleading error messages like
"Extraneous options for bucket,endpoint: S3" instead of
"Extraneous options for S3: bucket,endpoint".
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Four error messages for oversized map keys/values reported map_size
(the total number of entries) instead of the actual key or value size
that exceeded the limit. The condition checks elem.first.size() or
elem.second.size(), but the error message printed map_size. This
affects both the bytes and managed_bytes serialization overloads.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Fix two format string bugs:
- kmip_host.cc: cmd_in was passed as an argument to a trace log but
had no {} placeholder, so the command was silently dropped.
- kms_host.cc: the XML node name (what) was passed to the error
message but had no {} placeholder, so the error never showed which
XML node was missing.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
The formatter for shadowable_tombstone had a spurious t.tomb()
argument between the timestamp and deletion_time arguments. This
caused t.tomb() (the whole tombstone) to be formatted into the
deletion_time={} slot, while the actual deletion_time count was
silently dropped. Remove the extra argument.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Fix two format string bugs:
- gcp/object_storage.cc: _session_path was passed but the format
string had empty parentheses () instead of ({}), so the session
path was silently dropped from the debug output.
- s3/client.cc: part_number was passed as an argument but had no {}
placeholder. The upload_id ended up in the etag slot and was
silently dropped. Add {} for all three values.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
The FSM state was passed as an argument but the format string had
empty parentheses () instead of ({}), causing the FSM state to be
silently dropped from the output.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
The collection type name was passed as an argument but the format
string only had a trailing colon without a {} placeholder, so the
type name was silently dropped from the error message.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
The values count was passed as an argument but had no {} placeholder,
so it was silently dropped. The analogous BETWEEN check on the line
above correctly uses {} -- apply the same pattern here.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Fix three format string bugs:
- partition_reversing_data_source.cc: _row_start was passed as an
argument but had no {} placeholder in the invariant error message.
Add {} for all three values to show the full diagnostic.
- reader.cc: two "Invalid boundary type" error messages passed the
type value as an argument but had no {} placeholder, so the actual
invalid type was never shown.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Four logger calls used %s (printf-style) instead of {} (fmtlib-style),
causing __func__ to be silently ignored and the literal text "%s" to
appear in the log output. The same file already uses {} correctly in
the on_create_function and on_create_aggregate handlers.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Fix two format string bugs where arguments were silently dropped:
- prepare_expr.cc: the bad argument to count() was passed but had no
{} placeholder, so users never saw what was actually passed.
- statement_restrictions.cc: the unsupported multi-column relation was
passed but the trailing colon had no {} placeholder.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Fix six format string bugs where arguments were silently dropped:
- heat_load_balance.cc: pp value was passed but had no {} placeholder.
- commitlog_replayer.cc: column_family_id was passed but table= had
no {} placeholder.
- view_update_generator.cc: _sstables_with_tables.size() was passed
but had no {} placeholder.
- view_building_worker.cc: exception pointer was passed but the
trailing colon had no {} placeholder.
- row_locking.cc: partition key and clustering key were passed in
error messages but had no {} placeholders.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Fix four format string bugs:
- raft_group0.cc: the exception from sleep_and_abort was passed as an
argument but had no {} placeholder, so it was silently dropped.
- storage_service.cc: loading topology trace was missing a placeholder
for the cleanup field (9 args but only 8 placeholders).
- storage_service.cc: two join-rejection warnings had a spurious comma
after the first string literal, breaking C++ string concatenation.
This caused the continuation string to be treated as a separate
format argument instead of being part of the format string, and
params.host_id was silently dropped.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
The log message for tablet cleanup invalidation was missing a {}
placeholder for the table name (cf_name), causing it to be silently
dropped from the output. Add {}.{} to show both keyspace and table
name, consistent with the convention used elsewhere in the file.
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
The function database::create_local_system_table calls
get_tables_metadata().hold_write_lock(), but does not co_await the
returned future. Effectively, this code does not guarantee mutual
exclusion because it does not wait for the lock to be acquired and does
not guarantee that the lock is held long enough.
Fix this by adding the co_await that was missing.
Found by manual inspection. This code is not known to have caused any
problems so far, but it's clearly wrong - hence the fix.
Closesscylladb/scylladb#29806
Drop creation of `service_levels` and `cdc_generation_descriptions_v2` table creation code since they are no longer needed. Old clusters will still have it because they were created earlier. Also the series contains a small improvement around group0 creation.
No backport needed since this removes functionality.
Closesscylladb/scylladb#29482
* github.com:scylladb/scylladb:
db/system_distributed_keyspace: remove system_distributed_everywhere since it is unused
db/system_distributed_keyspace: drop CDC_TOPOLOGY_DESCRIPTION and CDC_GENERATIONS_V2
db/system_distributed_keyspace: remove unused code
db/system_distributed_keyspace: drop old cdc_generation_descriptions_v2 table
db/system_distributed_keyspace: drop old service_levels table
fix indent after the previous patch
group0: call setup_group0 only when needed
interval switched from std::optional<> to union + bools for bound
storage in 42d7ae1082.
Update the printer to work with the new layout. Keep the code
backwards compatible, 2025.1 still uses optionals and is still
supported.
Closesscylladb/scylladb#29738
Fix 4 genuine CQL syntax errors in documentation examples, found by automated extraction and execution of doc code blocks against a live ScyllaDB instance.
- **insert.rst**: `USING TTL 86400 IF NOT EXISTS` → `IF NOT EXISTS USING TTL 86400` (wrong clause order produces syntax error)
- **ddl.rst**: Missing opening quote in ALTER KEYSPACE example (`dc2'` → `'dc2'`)
- **ddl.rst**: Hyphenated column names need double-quoting; also fix PRIMARY KEY referencing non-existent `customer_id` instead of `cust_id`
- **types.rst**: UDT `address` contains nested collections, so it must be `frozen<address>` when used as a column type
Built a CQL extractor that parses `.. code-block:: cql` blocks from RST docs, then executed all 194 extracted statements against ScyllaDB 2026.2.0-rc0. These 4 are confirmed syntax/semantic errors in the documentation.
Closesscylladb/scylladb#29765
* github.com:scylladb/scylladb:
test/cqlpy: add tests for hyphenated column names
docs/cql: fix UDT example to use frozen<address>
docs/cql: fix CREATE TABLE example with hyphenated column names
docs/cql: fix missing opening quote in ALTER KEYSPACE example
docs/cql: fix INSERT example clause order (IF NOT EXISTS before USING)
Prepare for truncate of tables on object storage, where we want to
limit the atomic deletion batches to produce smaller batch mutations.
This is safe since truncate does not really need to delete all sstables
in the table atomically — it is already non-atomic since each node and
each shard deletes its own sstables. The atomic deletion mechanism is
used for convenience.
Previously, discard_sstables collected all sstables from all compaction
groups on the shard into a single vector and issued one atomic delete
for all of them. Change to track removed sstables per compaction group
and issue separate atomic deletes per group using
coroutine::parallel_for_each, allowing concurrent deletion across
groups.
Closesscylladb/scylladb#29789
The BTI partition index trie writer flushes all buffered nodes at the
end of each SSTable via complete_until_depth(0), called from
bti_partition_index_writer_impl::finish(). This is a tight synchronous
loop that writes trie nodes through file_writer::write(), which uses a
buffered output_stream: individual writes that fit in the buffer are
plain memcpy operations returning a ready future, so .get() never
yields. As a result the reactor can stall for several milliseconds on
large SSTables.
The entire call chain runs inside seastar::async() (via
sstable::write_components()), so seastar::thread::maybe_yield() is
safe to call here. Add it at the top of both tight loops:
- complete_until_depth(), which iterates over trie depth
- lay_out_children(), which iterates over child branches per node
Fixes SCYLLADB-1885
Closesscylladb/scylladb#29798
selection::used_functions() pushed the UDA, its SFUNC and its FINALFUNC,
but never the REDUCEFUNC. The reducefunc is invoked by the distributed
aggregation path in service::mapreduce_service, so a user could cause it
to run server-side without holding EXECUTE on it as long as the query
took the mapreduce path.
Also push agg.state_reduction_function so select_statement::check_access
requires EXECUTE on it too.
Fixes SCYLLADB-1756
Previously, when a snapshot load subsumed a committed entry before apply()
was called locally, add_entry would throw commit_status_unknown -- even
though the entry was known to be committed and included in the snapshot.
This was overly pessimistic. Normal state machine implementations
shouldn't care whether an entry was applied via apply() or via a snapshot load.
Unnecessary commit_status_unknown caused flakiness of
test_frequent_snapshotting and unnecessary retries in group0. Raft groups
from strongly consistent tables couldn't hit unnecessary
commit_status_unknown's because they use wait_type::committed and
`enable_forwarding == false`.
Three sites are changed:
1. wait_for_entry (truncation case): the snapshot-term match optimization
that proved the entry was committed now applies to both wait_type::committed
and wait_type::applied, not just committed.
2. wait_for_entry (snapshot covers entry): instead of throwing
commit_status_unknown when the snapshot index >= entry index, return
successfully. The entry's effects are included in the state machine's
state via the snapshot.
3. drop_waiters: when called from load_snapshot, pass the snapshot term.
Waiters whose term matches the snapshot term are resolved successfully
(set_value) instead of failing with commit_status_unknown, since the
Log Matching Property guarantees they were committed and included.
This deflakes test_frequent_snapshotting: the test uses aggressive
snapshot settings (snapshot_threshold=1) causing wait_for_entry to
occasionally find the snapshot covering its entry. Previously this
threw commit_status_unknown, failing the test. With this fix,
wait_for_entry returns success. Note that apply() is never actually
skipped in this test -- the leader always applies entries locally
before taking a snapshot.
The nemesis test is updated to handle the new behavior:
call() detects when add_entry succeeded but the output channel was
not written (apply() skipped locally) and returns apply_skipped instead
of hanging. The linearizability checker in basic_generator_test counts
skipped applies separately from failures. basic_generator_test
exercises this path: skipped_applies > 0 occurs in some runs.
Fixes: SCYLLADB-1264
No backport: the changes are quite risky and the test being fixed
fails very rarely.
Closesscylladb/scylladb#29685
* github.com:scylladb/scylladb:
test/raft: fix duplicate check in connected::operator()
test/raft: add tests for add_entry snapshot interactions
raft: do not throw commit_status_unknown from add_entry when possible
raft: change drop_waiters parameter from index to snapshot descriptor
raft: server: fix a typo
Add `test_fulltext_index.py` covering the `fulltext_index` custom index:
- Creation on text, varchar, and ascii columns
- Rejection of non-text types (int, blob, vector)
- Validation of analyzer and positions options
- Rejection of unsupported option keys
- Case-insensitive class name lookup
- DESCRIBE INDEX output with and without options
- No backing materialized view in `system_schema.views`
- IF NOT EXISTS idempotent behavior
- Metadata correctness in `system_schema.indexes`
Move common description logic into a protected helper
`describe_with_target` on `custom_index`, so subclasses can delegate
to it when implementing the `describe()` virtual method.
Introduce `fulltext_index`, a new `custom_index` subclass
for full-text search (FTS).
The index validates that the target column is a text type
(text, varchar, or ascii) and supports two WITH OPTIONS keys:
- 'analyzer': one of standard, english, german, french, spanish,
italian, portuguese, russian, chinese, japanese, korean, simple,
whitespace
- 'positions': boolean controlling whether term positions are stored
`view_should_exist()` returns false — no backing materialized view is
created, matching the CDC-backed pattern used by `vector_index`.
Fixes: SCYLLADB-1517
Move `validate_enumerated_option`, `validate_positive_option`,
and `validate_factor_option` into shared index option utilities
under the `secondary_index::util` namespace. These functions were
previously defined as file-local statics in `vector_index.cc` with
hardcoded index names in error messages.
The shared versions take `index_type_name` as a parameter, allowing
each `custom_index` subclass to pass its own name via the virtual
`index_type_name()` method at the call site. The options maps use
`std::bind_front` to bind config params (supported values, limits),
leaving `index_name` as the first unbound argument passed by
`check_index_options()`.
Add `index_type_name()` as a pure virtual method on `custom_index`.
Move the shared utility implementations into `index_option_utils.cc`
and update `vector_index.cc` to use them.
The operator had a copy-paste bug: it checked
disconnected.contains({id1, id2}) twice instead of checking both
directions ({id1, id2} and {id2, id1}).
Reduce the operator to a single directional check: {id1, id2}. It works
for all current callers, and checking both directions correctly would
break the new block_receive() function.
Add six tests covering add_entry with wait_type::applied and
wait_type::committed for three snapshot scenarios affected in the
previous commit:
1. Snapshot at the entry's index (wait_for_entry, term_for returns
snapshot term).
2. Snapshot past the entry's index (wait_for_entry, term_for returns
nullopt).
3. Follower's waiter is resolved via drop_waiters when a snapshot
is loaded.
Without the fix in the previous commit, 4 of 6 tests fail:
all 3 wait_type::applied tests and the wait_type::committed
drop_waiters test. The remaining two tests pass because the changes
don't affect them.
We don't write tests covering the scenarios when add_entry should
still throw commit_status_unknown (that is when the entry's term
doesn't match the snapshot's term) because:
- these tests would be very complicated,
- a bug that would make these tests fail should also make the
nemesis tests fail, as there would be an issue with linearizability.
Previously, when a snapshot load subsumed a committed entry before apply()
was called locally, add_entry would throw commit_status_unknown -- even
though the entry was known to be committed and included in the snapshot.
This was overly pessimistic. Normal state machine implementations
shouldn't care whether an entry was applied via apply() or via a snapshot load.
Unnecessary commit_status_unknown caused flakiness of
test_frequent_snapshotting and unnecessary retries in group0. Raft groups
from strongly consistent tables couldn't hit unnecessary
commit_status_unknown's because they use wait_type::committed and
`enable_forwarding == false`.
Three sites are changed:
1. wait_for_entry (truncation case): the snapshot-term match optimization
that proved the entry was committed now applies to both wait_type::committed
and wait_type::applied, not just committed.
2. wait_for_entry (snapshot covers entry): instead of throwing
commit_status_unknown when the snapshot index >= entry index, return
successfully. The entry's effects are included in the state machine's
state via the snapshot.
3. drop_waiters: when called from load_snapshot, pass the snapshot term.
Waiters whose term matches the snapshot term are resolved successfully
(set_value) instead of failing with commit_status_unknown, since the
Log Matching Property guarantees they were committed and included.
This deflakes test_frequent_snapshotting: the test uses aggressive
snapshot settings (snapshot_threshold=1) causing wait_for_entry to
occasionally find the snapshot covering its entry. Previously this
threw commit_status_unknown, failing the test. With this fix,
wait_for_entry returns success. Note that apply() is never actually
skipped in this test -- the leader always applies entries locally
before taking a snapshot.
The nemesis test is updated to handle the new behavior:
call() detects when add_entry succeeded but the output channel was
not written (apply() skipped locally) and returns apply_skipped instead
of hanging. The linearizability checker in basic_generator_test counts
skipped applies separately from failures. basic_generator_test
exercises this path: skipped_applies > 0 occurs in some runs.
Fixes: SCYLLADB-1264
Change drop_waiters(std::optional<index_t> idx) to
drop_waiters(const snapshot_descriptor* snp). The only caller that passes
an index is load_snapshot, which already has the full snapshot descriptor.
Using it directly makes the parameter self-documenting and prepares for the
following commit which will also need the snapshot term (a field of
snapshot_descriptor).
Several sstable_compaction_test cases run prohibitively slowly on S3 and GCS backends — some taking 4+ minutes — because they create hundreds of SSTables sequentially over high-latency HTTP connections and perform redundant validation (checksumming) round-trips on every one. The twcs_reshape_with_disjoint_set S3 variant was even disabled entirely because of this.
The changes apply three complementary optimizations, per-test:
**Skip SSTable validation on remote storage.** The compaction tests verify strategy logic, not data integrity. SSTable validation triggers additional read-back I/O which is cheap on local disk but expensive over HTTP. A `do_validate` flag now conditionally skips validation when the storage backend is not local.
**Parallelize SSTable creation with async coroutines.** A new `make_sstable_containing_async` coroutine overload is added alongside the existing synchronous `make_sstable_containing`. Sequential creation loops are replaced with `parallel_for_each` using coroutine lambdas that call the async overload directly, overlapping S3/GCS uploads without spawning a dedicated Seastar thread per SSTable. The async validation path performs the same content checks as the synchronous version (mutation merging and `is_equal_to_compacted` assertions). Operations that depend on the created SSTables (e.g. `add_sstable_and_update_cache`, `owned_token_ranges` population) remain sequential.
**Reduce SSTable count for remote variants.** Tests like twcs_reshape_with_disjoint_set and stcs_reshape_overlapping used a hardcoded count of 256. The count is now a function parameter (default 256 for local, 64 for S3/GCS), which is sufficient to exercise the compaction strategy logic while avoiding excessive remote I/O.
Infrastructure changes: S3 endpoint max_connections raised from the default to 32 to support the higher upload concurrency, and trace-level logging added for s3, gcp_storage, http, and default_http_retry_strategy to aid future debugging.
The previously disabled twcs_reshape_with_disjoint_set_s3_test is re-enabled with these optimizations.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1428
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1843
No backport needed — this is a test-only performance improvement.
Closesscylladb/scylladb#29416
* github.com:scylladb/scylladb:
test: optimize compaction_strategy_cleanup_method for remote storage
test: optimize stcs_reshape_overlapping for remote storage
test: optimize twcs_reshape_with_disjoint_set for remote storage
test: parallelize SSTable creation in cleanup_during_offstrategy_incremental
test: parallelize SSTable creation in run_incremental_compaction_test
test: parallelize SSTable creation in offstrategy_sstable_compaction
test: parallelize SSTable creation in twcs_partition_estimate
test: add trace-level logging for S3 and HTTP in compaction tests
test: make sstable test utilities natively async The original make_memtable used seastar::thread::yield() for preemption, which required all callers to run inside a seastar::thread context. This prevented the utilities from being used directly in coroutines or parallel_for_each lambdas. Make the primary functions — make_memtable, make_sstable_containing, and verify_mutation — return future<> directly. Callers now .get() explicitly when in seastar::thread context, or co_await when in a coroutine. make_memtable now uses coroutine::maybe_yield() instead of seastar::thread::yield(). verify_mutation is converted to coroutines as well. Requested in: https://github.com/scylladb/scylladb/pull/29416#pullrequestreview-4112296282
test: move make_memtable out of external_updater in row_cache_test
test: increase S3 max connections for compaction tests
The variable was constructed but never used — the original iterator
is returned instead. Fix the misleading comment to explain the
open-shard semantics of returning the original iterator.
Both disable_stream and wait_for_active_stream used time.process_time()
for their timeouts, but process_time measures CPU time, not wall-clock
time. Since these loops spend most of their time sleeping and waiting on
API calls, the timeouts could last far longer than intended. Use
time.time() instead to enforce actual wall-clock deadlines.
Previously, disabling Alternator Streams would create a blank
cdc::options with only enabled=false, which meant losing access also
to stored Streams's data (including preimage and postimage).
Now, when a stream is disabled:
- The existing CDC options are preserved (only 'enabled' is flipped to
false), so StreamViewType remains available.
- DescribeStream enumerates all shards with EndingSequenceNumber set,
indicating they are closed.
- GetRecords omits NextShardIterator for disabled streams.
- DescribeTable (supplement_table_stream_info) reports the stream ARN
and StreamEnabled: false when the CDC log table still exists.
- ListStreams uses get_base_table instead of is_log_for_some_table so
that disabled streams whose log table still exists are listed.
When a stream is re-enabled on an Alternator table that has an existing
(disabled) CDC log table, the old log table is dropped and a fresh one
is created with a new UUID, producing a new StreamArn. This is
Alternator-specific behavior; CQL CDC tables continue to reuse the
existing log table.
The old stream data is lost immediately upon re-enable. DynamoDB keeps
it readable for 24 hours.
Tests:
- test_streams_closed_read, test_streams_disabled_stream: remove xfail
now that disabled streams are usable.
- test_streams_reenable: new test verifying that re-enabling produces
a new ARN and the old data is still readable via the old ARN (xfail
because Scylla currently purges old data on re-enable).
Fixesscylladb/scylladb#7239
Add a Boost unit test that forces capacity-based balancing through
configuration and verifies that a drained and excluded node will be
drained of its tablets when tablet size stats are missing.
The test covers the regression where the allocator rejected the plan due
to incomplete tablet stats, even though forced capacity-based balancing
does not depend on tablet sizes.
When force_capacity_based_balancing is enabled, the tablet allocator
balances by node and shard capacity rather than by tablet sizes.
When the data needed for load balancing is incomplete, the balancer
fails and waits until load_stats is available and correct for all the
nodes. An exception to this is when a node is being drained and
excluded: it is unreachable, and will not return. In this case
the balancer has to do its best and ignore the missing data.
This patch fixes a bug where forcing capacity based balancing made the
balancer not ignore missing data in these cases, and instead abort the
balancing.
In the test_delete_partition_rows_from_table_with_mv case we perform
a deletion of a large partition to verify that the deletion will
self-throttle when generating many view updates.
Before the deletion, we first build the materialized view, which causes
the view update backlog to grow. The backlog should be back to empty
when the view building finishes, and we do wait for that to happen, but
the information about the backlog drop may not be propagated to the
delete coordinator in time - the gossip interval is 1s and we perform
no other writes between the nodes in the meantime, so we don't make use
of the "piggyback" mechanism of propagating view backlog either. If the
coordinator thinks that the backlog is high on the replica, it may reject
the delete, failing this test.
We change this in this patch - after the view is built, we perform an
extra write from the coordinator. When the write finishes, the coordinator
will have the up-to-date view backlog and can proceed with the DELETE.
Additionally, we enable the "update_backlog_immediately" injection, which
makes the node backlog (the highest backlog across shards) update immediately
after each change.
Fixes: SCYLLADB-1795
Closesscylladb/scylladb#29775
The test used a real-time sleep to move the queued permit into the
preemptive-abort window. If the reactor did not get CPU for long
enough, admission could run only after the permit's timeout had
expired, making the expected abort path flaky.
The test also exhausted memory together with count resources, so the
queued permit could wait for memory. Preemptive abort is intentionally
not applied to permits waiting for memory, so keep enough memory
available and assert that the permit is queued only on count.
Use an immediate preemptive-abort threshold and a long finite timeout
to exercise admission-time abort without relying on scheduler timing.
Fixes: SCYLLADB-1796
Closesscylladb/scylladb#29736
CreateTable and UpdateTable call wait_for_schema_agreement() after announcing the schema change, to ensure all live nodes have applied the new schema before returning to the user. This wait has a hard- coded 10 second timeout, and on some overloaded test machines we saw it not completing in time, and causing tests to become flaky.
This patch increases this timeout from 10 seconds to 30 seconds. It's still hard-coded and not configurable via alternator_timeout_in_ms because it is unlikely any user will want to change it - it just needs to be long.
The patch also improves the behavior of a schema-agreement timeout, when it happens:
1. Provide an InternalServerError with more descriptive text.
2. This InternalServerError tells the user that the result of the operation is unknown; So the user will repeat the CreateTable, and will get a ResourceInUseException because the table exists. In that case too, we need to wait for schema agreement. So we added this missing wait.
Fixes SCYLLADB-1804
Refs #5052 (claiming CreateTable shouldn't wait at all)
This patch is only important to improve test stability in extremely slow test machines where schema agreement sometimes (very rarely) takes over 10 seconds. It's not important to backport it to branches that don't run CI very often on slow machines.
Closesscylladb/scylladb#29744
* https://github.com/scylladb/scylladb:
alternator: improve CreateTable/UpdateTable schema agreement timeout
migration_manager: unique timeout exception for wait_for_schema_agreement()
In debug mode, this test can timeout during tablets merge. While the
test already decreases the number of tables in debug mode (20 tables,
instead of 200 for dev mode), this is not enough, and the test can still
timeout during merge. This change reduces the number of tables from 20
to 5 in debug mode.
It also drops the log level for lead_balancer to debug. This should make
any potential future problems with this test easier to investigate.
Fixes: SCYLLADB-1717
Closesscylladb/scylladb#29682
Replace the naive host.is_up check with wait_for_cql_and_get_hosts() which
actually executes a query against each host, ensuring the driver's connection
pool is fully re-established before proceeding to stop the last server.
The is_up flag is set asynchronously via gossip and doesn't guarantee the
connection pool has live TCP connections. After a server restart, the flag
may be True while the pool still holds stale connections. When the pool
monitor later discovers them dead it briefly marks the host DOWN, causing
NoHostAvailable if another server is being stopped concurrently.
Fixes SCYLLADB-1840
Closesscylladb/scylladb#29769
This PR includes two changes that make gossiper much less likely to mark
nodes as down in tests unexpectedly, and cause test flakiness in issues
like SCYLLADB-864:
- fixing false node conviction when echo succeeds,
- increasing the failure_detector_timeout fixture.
Fixes: SCYLLADB-864
No need for backport: related CI failures are rare, and merging #29522
made them even more unlikely (I haven't seen one since then, but it's
still possible to reproduce locally on dev machines).
Closesscylladb/scylladb#29755
* github.com:scylladb/scylladb:
test/cluster: increase failure_detector_timeout
gossiper: fix false node conviction when echo succeeds
The `vector_store_client_test_dns_resolving_repeated` test was intermittently
timing out on CI. The exact root cause is not fully understood, but the
hypothesis is that a single trigger signal can be lost somewhere (not exactly
known where). This is not an issue for the production code because refresh
trigger will be called multiple times whenever all configured nodes will be
unreachable.
Fixes SCYLLADB-1794
Backport to 2026.1 and 2026.2, as the same CI flakiness can occur on these branches.
Closesscylladb/scylladb#29752
* github.com:scylladb/scylladb:
vector_search: test: default timeout in test_dns_resolving_repeated
vector_search: test: fix flaky test_dns_resolving_repeated
Fixes: SCYLLADB-1815
If we're in a brand new chunk (no buffer yet allocated), we would miscalculate the actual size of an entry to write, possibly causing segment size overshoot. Break out some logic to share between this calc and new_buffer. Also remove redundant (and possibly wrong) constant in oversized allocation.
As for the test:
Checking segment sizes should not use a size filter that rounds (up) sizes.
More importantly, the estimate for what is acceptable limit for commitlog disk usage should be aligned. Simplified the calc, and also made logging more useful in case of failure.
Closesscylladb/scylladb#29753
* github.com:scylladb/scylladb:
commitlog_test.py: Fix size check aliasing, and threshold calc.
commitlog: Fix segment/chunk overhead maybe not included in next_position calculation
The auth cache crashes when it encounters rows in role_permissions that have a live row marker but no permissions column. These “ghost rows” were created by the now-removed auth v2 migration, which used INSERT (creating row markers) instead of UPDATE.
When permissions were later revoked, the row marker remained while the permissions column became null. An empty collection appears as null, since its lifetime is based only on its element's cells.
As a result, when the cache reloads and expects the permissions column to exist, it hits a missing_column exception.
The series removes dead code that was the primary crash site, adds has() guards to the remaining access paths, and includes a test reproducer.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1816
Backport: all supported versions 2026.1, 2025.4, 2025.1
Closesscylladb/scylladb#29757
* github.com:scylladb/scylladb:
test: add reproducer for auth cache crash on missing permissions column
auth: tolerate missing permissions column in authorize()
auth: add defensive has() guard for role_attributes value column
auth: remove unused permissions field from cache role_record
Commit cf237e060a introduced 'from test.pylib.driver_utils import
safe_driver_shutdown' in pgo/exec_cql.py. This module runs during PGO
profile training (a build step) where the test package is not on the
Python path, causing an immediate ModuleNotFoundError on both x86 and
ARM. Revert to plain cluster.shutdown() which is sufficient for the
single-use PGO training scenario.
Fixes: SCYLLADB-1792
Closesscylladb/scylladb#29746
Verify that double-quoted column names with hyphens (e.g. "my-col")
work correctly for CREATE TABLE, INSERT, and SELECT. Also verify that
unquoted hyphenated names are rejected with a syntax error.
The 'address' UDT contains a nested collection (map<text, frozen<phone>>),
so it must be frozen when used as a column type. Non-frozen UDTs with
nested non-frozen collections are not supported.
Column names containing hyphens must be double-quoted. Also fix
the PRIMARY KEY reference from 'customer_id' (non-existent) to
'cust_id' (the actual column).
The grammar requires IF NOT EXISTS to appear before USING TTL,
not after. The example had 'USING TTL 86400 IF NOT EXISTS' which
produces a syntax error.
Move the local vector index test from C++ vector_store_client_test to
Python test_vector_search_with_vector_store_mock.
The test creates a local vector index on ((pk1, pk2), embedding) and
verifies that SELECT with partition key restriction and ANN ordering
works correctly.
Move the SCYLLADB-635 regression test from C++ vector_store_client_test
to Python test_vector_search_with_vector_store_mock.
The test creates a vector index on (embedding, ck1) and verifies that
SELECT with ANN ordering works correctly when additional filtering
columns are included in the index definition.
Move the test that verifies HTTP error descriptions from the vector
store are propagated through CQL InvalidRequest messages from the C++
vector_store_client_test to the Python test_vector_search_with_vector_store_mock.
The test configures the mock to return HTTP 404 with 'index does not
exist' and asserts the CQL SELECT raises InvalidRequest containing '404'.
Move vector search (ANN ordered select query) with IN restrictions on
partition key from C++/Boost test suite to pytest (cqlpy).
Add VectorStoreMock server as pytest fixture to simulate vector store
responses.
Replace explicit 1-second timeouts in repeat_until() with the default
STANDARD_WAIT (10s). The 1-second timeout could be too aggressive for
loaded CI environments where lowres_clock granularity (~10ms) combined
with OS scheduling delays and resource contention (-c2 -m2G) could cause
the loop to expire before the DNS refresh task completes its cycle.
This also unifies test timeouts across test cases.
Move trigger_dns_resolver() inside the repeat_until loop instead of
calling it once before the loop.
The test was intermittently timing out on CI. The exact root cause is not
fully understood, but the hypothesis is that a single trigger signal can
be lost somewhere (not exactly known where). This is not an issue for the
production code because refresh trigger will be called multiple times -
in every query where all configured nodes will be unreachable.
By triggering inside the loop, we ensure the signal is re-sent on
each iteration until the resolver actually performs the refresh and
picks up the new (failing) DNS resolution. This makes the test
resilient to timing-dependent signal loss without changing production
code.
Fixes: SCYLLADB-1794
Ghost rows in role_permissions with a live row marker but no permissions
column can occur when permissions created via INSERT (e.g. by the removed
auth v2 migration) are later revoked. The row marker survives the revoke,
leaving a row visible to queries but with permissions=null.
Add a has() guard before accessing the permissions column, matching the
pattern already used in list_all(). Return NONE permissions for such
ghost rows instead of crashing.
Add a has() check before accessing the value column in role_attributes
to tolerate ghost rows with missing regular columns. In practice this
is unlikely to be a problem since attributes are not typically revoked,
but the guard is added for consistency and defensive programming.
The permissions field in role_record was populated by fetch_role() but
never read. Authorization uses cached_permissions instead, which is
loaded via the permission_loader callback. Remove the dead field and
its fetch code.
The removed code also did not check for missing columns before accessing
the permissions set, which could crash on ghost rows left by the removed
auth v2 migration. The migration used INSERT (creating row markers),
and when permissions were later revoked, the row marker survived while
the permissions column became null.
When `permissions_validity_in_ms` is set to 0, executing a prepared statement under authentication crashes with:
```
Assertion `caching_enabled()' failed.
at utils/loading_cache.hh:319
in authorized_prepared_statements_cache::insert
```
`loading_cache::get_ptr()` asserts when caching is disabled (expiry == 0), but `authorized_prepared_statements_cache::insert()` was using it purely for its side effect of populating the cache, which is meaningless when caching is off.
Add a new `loading_cache::insert(k, load)` method that is a no-op when caching is disabled and otherwise forwards to `get_ptr()`. Switch `authorized_prepared_statements_cache::insert()` to use it. This
completes the disabled-mode safety contract of the cache for the write side, mirroring the fallback that `get()` already provides for the read side.
Includes a regression test in `test/boost/loading_cache_test.cc` plus a positive test for the new `insert()` overload.
Fixes SCYLLADB-1699
The crash is introduced a long time ago. It is present on all the live versions, from 2025.1 onward. No client tickets, but it should be backported.
Closesscylladb/scylladb#29638
* github.com:scylladb/scylladb:
test: boost: regression test for loading_cache::insert with caching disabled
utils: loading_cache: add insert() that is a no-op when caching is disabled
Scaling the timeout by build mode (#29522) turned out to be not sufficient.
Nodes can still be unexpectedly marked as down, even with a 4s timeout in
dev mode. I managed to reproduce SCYLLADB-864 in such conditions.
Increasing failure_detector_timeout will proportionally slow down tests
that use it. That's bad, but currently these tests' flakiness is a much
bigger problem than the tests' slowness. Also, not many tests use this
fixture, and we hope to make it unneeded eventually (see #28495).
failure_detector_loop_for_node() could falsely convict a healthy node
even when the echo succeeded. The code computed diff = now - last
(time since last successful echo) and checked diff > max_duration
unconditionally, regardless of whether the current echo failed or
succeeded.
This caused flakiness in tests that decrease the failure detector
timeout. We currently run #CPUs tests concurrently, and since cluster
tests start multiple nodes with 2 shards, multiple shards contend for
one CPU. As a result, some tasks can become abnormally slow and block
the failure detector loop execution for a few seconds.
Fix by only checking diff > max_duration when the echo actually
failed.
Note that we send echo with the timeout equal to `max_duration` anyway,
so the receiver will be marked as down if it really doesn't respond.
Add more logging to barrier and drain rpc to try and pinpoint https://github.com/scylladb/scylladb/issues/26281
Bakport since we want to have it if it happens in the field.
Fixes: SCYLLADB-1821
Refs: #26281Closesscylladb/scylladb#29735
* https://github.com/scylladb/scylladb:
session, raft_topology: add periodic warnings for hung drain and stale version waits
session: add info-level logging to drain_closing_sessions
raft_topology: log sub-step progress in local_topology_barrier
raft_topology: log read_barrier progress in topology cmd handler
Fixes: SCYLLADB-1815
Checking segment sizes should not use a size filter that rounds
(up) sizes.
More importantly, the estimate for what is acceptable limit for
commitlog disk usage should be aligned. Simplified the calc, and
also made logging more useful in case of failure.
CreateTable and UpdateTable call wait_for_schema_agreement() after
announcing the schema change, to ensure all live nodes have applied
the new schema before returning to the user. This wait has a hard-
coded 10 second timeout, and on some overloaded test machines we
saw it not completing in time, and causing tests to become flaky.
This patch increases this timeout from 10 seconds to 30 seconds.
It's still hard-coded and not configurable via alternator_timeout_in_ms
because it is unlikely any user will want to change it - it just needs
to be long.
The patch also improves the behavior of a schema-agreement timeout,
when it happens:
1. Provide an InternalServerError with more descriptive text.
2. This InternalServerError tells the user that the result of the
operation is unknown; So the user will repeat the CreateTable, and
will get a ResourceInUseException because the table exists. In that
case too, we need to wait for schema agreement. So we added this
missing wait.
Fixes SCYLLADB-1804
Refs #5052 (claiming CreateTable shouldn't wait at all)
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Refs: SCYLLADB-1757
Refs: SCYLLADB-1815
If we're in a branch new chunk (no buffer yet allocated), we would miscalculate the
actual size of an entry to write, possibly causing segment size overshoot.
Break out some logic to share between this calc and new_buffer. Also remove redundant
(and possibly wrong) constant in oversized allocation.
Before this patch, if wait_for_schema_agreement() times out, it threw
a generic std::runtime_error, making it inconvenient for callers to
catch this error only. So in this patch we create and use a new exception
type, schema_agreement_timeout, based on seastar::timed_out_error.
Although wait_for_schema_agreement() was added in commit
a429018a8a was a utility function used in
a dozen places, it has become less interesting after we introduced schema
changes over Raft, and over the years most of the callers to this function
were removed, except one in view.cc which uses an infinite timeout, so
doesn't care about the timeout exception type.
In the next patch we want to add a new caller which *does* care about
the time exception type - hence this patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add periodic warning timers (every 5 minutes) to help diagnose hangs in
barrier_and_drain:
- drain_closing_sessions(): warn if semaphore acquisition or session gate
close is taking too long, reporting the gate count to show how many
guards are still alive.
- local_topology_barrier(): warn if stale_versions_in_use() is taking
too long, reporting the current stale version trackers.
- session::gate_count(): new public accessor for diagnostic purposes.
These warnings help distinguish between the two possible hang points
in barrier_and_drain (stale versions vs session drain) and provide
ongoing visibility into what's blocking progress.
drain_closing_sessions() is called as part of the barrier_and_drain
topology command and can block on two things: acquiring the drain
semaphore (if another drain is in progress) and waiting for individual
sessions to close (which blocks until all session guards are released).
Previously, all logging in this function was at debug level, making it
invisible in production logs. When barrier_and_drain hangs, there is no
way to tell whether the function is waiting for the semaphore, waiting
for a specific session to close, or was never called.
Promote logging to info level and add messages at each blocking point:
before/after semaphore acquisition (with count of sessions to drain),
before/after each individual session close (with session id), and at
function completion. This makes it possible to identify the exact
session blocking a topology operation from the node log alone.
When a node processes a barrier_and_drain topology command, it performs
two potentially long-running operations inside local_topology_barrier():
waiting for stale token metadata versions to be released
(stale_versions_in_use) and draining closing sessions
(drain_closing_sessions). Either of these can hang indefinitely -- for
example, stale_versions_in_use blocks until all references to previous
token metadata versions are released, which depends on in-flight
requests completing.
Previously, the only logging was a single 'done' message at the end,
making it impossible to determine which sub-step was blocking when a
barrier_and_drain RPC appeared stuck on a node. In a recent CI failure,
a node never responded to barrier_and_drain during a removenode
operation, and the logs showed the RPC was received but nothing about
what it was waiting on internally.
Add info-level logging before each blocking sub-step, including the
topology version for correlation. This allows diagnosing hangs by
showing whether the node is stuck waiting for stale metadata versions,
stuck draining sessions, or never reached these steps at all.
When a raft topology command (e.g. barrier_and_drain) is received by a
node, the handler first performs a raft read_barrier to ensure it sees
the latest topology state. This read_barrier can hang indefinitely if
raft cannot achieve quorum, but there was no logging around it, making
it impossible to tell whether the handler was stuck at this step or
somewhere else.
Add info-level logging before and after the read_barrier call in
raft_topology_cmd_handler, including the command type, index, and term.
This allows diagnosing hangs by showing whether the node entered the
read_barrier and whether it completed, narrowing down the root cause
when a topology command RPC appears stuck on the receiver side.
Add two test cases for the new loading_cache::insert() method:
* test_loading_cache_insert verifies that insert() populates the cache
and invokes the loader exactly once per key when caching is enabled.
* test_loading_cache_insert_caching_disabled is a regression test for
SCYLLADB-1699: when the cache is constructed with expiry == 0
(caching disabled), insert() must be a no-op rather than asserting
in loading_cache::get_ptr() via caching_enabled(). The loader must
not be invoked and the cache must remain empty.
Refs SCYLLADB-1699
When the cache is constructed with expiry == 0 the underlying storage is
never instantiated and get_ptr() asserts via caching_enabled(). This is
fine for callers that need a handle into the cache, but it makes get_ptr()
unusable for write-only insertions on caches whose expiry is configurable
at runtime (e.g. caches driven by a LiveUpdate config option that the
operator may set to 0).
Add a new insert(k, load) method on loading_cache that returns a future<>
and is a no-op when caching is disabled, otherwise forwards to
get_ptr(k, load) and discards the resulting handle. This completes the
disabled-mode safety contract of the cache for the write side, mirroring
the fallback that get() already provides for the read side.
Switch authorized_prepared_statements_cache::insert() from
get_ptr().discard_result() to the new insert(), which fixes the crash
'Assertion caching_enabled() failed' in
authorized_prepared_statements_cache::insert() that occurs when
permissions_validity_in_ms is set to 0 and a prepared statement is
executed under authentication.
Fixes SCYLLADB-1699
Parallelize SSTable creation using parallel_for_each. The file
count is made a parameter with a default of 64, allowing future
S3/GCS variants to use a smaller count if needed.
Parallelize SSTable creation using parallel_for_each and reduce
the SSTable count from 256 to 64 for S3/GCS variants. The local
test variant retains the original 256 count.
Parallelize SSTable creation across all sub-tests using
parallel_for_each and reduce the SSTable count from 256 to 64 for
S3/GCS variants.
Re-enable the S3 test variant that was previously disabled due to
taking 4+ minutes. With parallel creation and reduced count, the
test now completes in a reasonable time.
Pre-extract mutation pairs and use parallel_for_each with
make_sstable_containing_async to create SSTables concurrently
instead of sequentially. The post-creation loop still runs serially
to collect token ranges and generations.
Pre-extract mutation pairs and use parallel_for_each with
make_sstable_containing_async to create SSTables concurrently
instead of sequentially. The post-creation loop still runs serially
to collect token ranges and generations that depend on SSTable order.
Use parallel_for_each with make_sstable_containing_async to create
SSTables concurrently instead of sequentially, reducing wall-clock
time on remote storage backends (S3/GCS).
Use parallel_for_each with make_sstable_containing_async to create
SSTables concurrently instead of sequentially, reducing wall-clock
time on remote storage backends (S3/GCS).
Raise log levels for s3 and gcp_storage from debug to trace, and add
trace-level logging for http and default_http_retry_strategy modules.
This provides better visibility into storage backend interactions
when debugging slow or failing compaction tests on remote storage.
The original make_memtable used seastar::thread::yield() for
preemption, which required all callers to run inside a
seastar::thread context. This prevented the utilities from being
used directly in coroutines or parallel_for_each lambdas.
Make the primary functions — make_memtable, make_sstable_containing,
and verify_mutation — return future<> directly. Callers now .get()
explicitly when in seastar::thread context, or co_await when in
a coroutine.
make_memtable now uses coroutine::maybe_yield() instead of
seastar::thread::yield(). verify_mutation is converted to
coroutines as well.
Requested in:
https://github.com/scylladb/scylladb/pull/29416#pullrequestreview-4112296282
test_exception_safety_of_update_from_memtable called make_memtable
inside the row_cache::external_updater callback. external_updater
runs as a synchronous execute() call that must not yield, but
make_memtable calls seastar::thread::yield() every 10th mutation.
The bug was latent because the test only inserted 5 mutations, so
the yield was never reached. Move the call before the callback.
Prerequisite for the next patch, which changes make_memtable to
call make_memtable_async().get() -- that would yield on every
mutation via coroutine::maybe_yield(), making this bug visible.
Increase max_connections from the default to 32 for the S3 endpoint
used in tests. This allows more concurrent HTTP connections to the S3
backend, which is needed to benefit from parallel SSTable creation
that will be introduced in subsequent commits.
mark_for_deletion() only set an in-memory flag; the actual file
deletion ran lazily when the last shared_sstable reference dropped,
leaving a window in which a follow-up scan of the upload directory
(e.g. a second 'nodetool refresh --load-and-stream') could observe a
partially-deleted sstable and fail with malformed_sstable_exception.
Force the unlink to complete before stream() returns. For tablet
streaming, partially-contained sstables span multiple per-tablet
batches, so a defer_unlinking flag postpones the unlink until after
all sstables are streamed; for vnodes and fully-contained sstables are streamed
only once and could be removed just after being streamed.
Added a FIXME on object_storage_base::wipe and strengthened the doc on storage::wipe to
make the never-fails contract explicit
Add test_load_balancing_with_dropped_table that simulates the race between
DROP TABLE and the load balancer by capturing a token metadata snapshot
before dropping the table, then passing the stale snapshot to
balance_tablets(). Verifies it completes without aborting and produces no
migrations for the dropped table.
The load balancer's get_schema_and_rs() would trigger on_internal_error when
a table present in the token metadata snapshot had been concurrently dropped
from the live schema. This race is possible because the balancer coroutine
yields between building the candidate list and checking replication
constraints, allowing a DROP TABLE schema mutation to be applied by another
fiber in the meantime.
Change get_schema_and_rs() to return {nullptr, nullptr} for dropped tables
instead of aborting. Update all callers to skip dropped tables:
- make_sizing_plan: continue to next table
- make_resize_plan: continue to next table (merge suppression is moot)
- check_constraints: return skip_info with empty viable targets
- get_rs: return nullptr, checked by check_constraints
Add a node_owner column (locator::host_id) to system.sstables and
make it part of the partition key, so the primary key becomes
PRIMARY KEY ((table_id, node_owner), generation).
This is the first step toward moving the sstables registry into
system_distributed: once distributed, each node's startup scan
must read only the rows it owns, which requires the owning node
to be part of the partition key. Partitioning by (table_id,
node_owner) turns that scan into a single-partition read of
exactly the local node's rows.
The new column is populated via sstables_manager::get_local_host_id().
No backward compatibility is preserved; the feature is experimental
and gated by keyspace-storage-options.
The partition-key column in system.sstables named 'owner' actually
holds a table_id. Rename the CQL column and the matching C++
parameter and member names so the identifier describes what it
stores. No behavior change.
This prepares the schema for an upcoming node_owner partition-key
column (the local host id), which needs a free name.
The cas_contention_timeout_in_ms option is already exposed via the
shared updateable_timeout_config as cas_timeout_in_ms. Read it from
there instead of going through db::config, dropping another use of
database as a db::config proxy.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pass sharded<updateable_timeout_config>& into alternator::controller
and through to alternator::server, which now stores a reference
instead of constructing its own updateable_timeout_config from
proxy.data_dictionary().get_config(). This removes the last
creator of a per-owner updateable_timeout_config copy and completes
the consolidation onto the single sharded<updateable_timeout_config>
instance built in main.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pass sharded<updateable_timeout_config>& into cql_transport::controller,
which feeds the shard-local instance as a reference into
cql_server_config::timeout_config. This drops the per-shard local
updateable_timeout_config constructed from db::config inside the
controller's sharded_parameter lambda, replacing it with a reference
into the shared sharded instance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Drop storage_proxy's own updateable_timeout_config member built from
db::config and take a reference to the shared sharded instance
introduced by the previous patch. Both main and cql_test_env pass
std::ref(timeout_cfg) into storage_proxy::start so each shard's
storage_proxy references its shard-local updateable_timeout_config.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Build a single sharded updateable_timeout_config from db::config in
both main and cql_test_env, sitting next to sharded<cql_config>.
Subsequent patches migrate storage_proxy, the CQL transport controller
and alternator server from their per-owner updateable_timeout_config
copies to references into this shared instance.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Storage_proxy was reading read_request_timeout_in_ms and
write_request_timeout_in_ms directly from db::config via
database::get_config() at four call sites. Give storage_proxy its own
updateable_timeout_config member (built from db::config the same way
cql transport controller and alternator server do) and use its
read_timeout_in_ms / write_timeout_in_ms observers instead.
Storage_proxy no longer needs database::get_config() for coordinator
timeout values. A later refactor may turn these per-owner copies into
references to a single shared updateable_timeout_config.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The free function calculate_view_update_throttling_delay() took the
view_flow_control_delay_limit_in_ms as a parameter, which forced its
two callers (storage_proxy and view_update_generator) to fish the
option out of db::config via database::get_config(). Now that the
option lives on node_update_backlog, make the throttling calculation a
member of node_update_backlog and have the callers invoke it on their
node_update_backlog reference.
This removes two database::get_config() call sites.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Store the view_flow_control_delay_limit_in_ms config option as an
updateable_value on node_update_backlog. The value is threaded from
main.cc into the backlog object at construction time. Existing call
sites (tests) that construct node_update_backlog without the option
continue to work via a default argument.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pass node_update_backlog explicitly to view_update_generator via its
constructor and start() call. This is plumbing only; no behavior change.
A subsequent patch will use this reference to compute view update
throttling delays without going through database::get_config().
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Alternator Streams were experimental until 2026.2, when they became GA.
Stop requiring `--experimental-features=alternator-streams` by:
- Removing ALTERNATOR_STREAMS from the experimental feature enum
- Mapping "alternator-streams" to UNUSED for backward compatibility
- Removing the gating that disabled the ALTERNATOR_STREAMS gossip
feature when the experimental flag was absent
- Removing the runtime guard that rejected StreamSpecification requests
without the feature flag
- Updating config_test to reflect the new UNUSED mapping
The gms::feature alternator_streams is kept for rolling upgrade
compatibility with older nodes.
Fixes SCYLLADB-1680
With the new `min_alive_uuid` saved in the group0 table,
we need to make sure that all new tasks are created with time uuid
greater than the value saved in `min_alive_uuid`.
This patch introduces the `task_uuid_generator` which ensures that
when we are generating multiple tasks in one group0 command, each task
will have an unique time uuid and each time uuid will be greater than
`min_alive_uuid`.
Because now we're limiting the range we're reading from view building
tasks table, we need to make sure that new tasks are created with larger
uuid then the `min_alive_uuid`.
In order to do it, we need to be able to see current `min_alive_uuid`
while creating new tasks.
When VIEW_BUILDING_TASKS_MIN_TASK_ID feature is active, write min_task_id
alongside the range tombstone in the same Raft batch. min_task_id is set
to min_alive_uuid so subsequent get_view_building_tasks() scans start
exactly at the first alive row, skipping all tombstoned rows.
When all tasks are deleted, min_task_id is set to a freshly generated UUID
to ensure future tasks (which will have larger timeuuids) are not skipped.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add set_min_task_id(id) which writes the min_task_id static cell to the main
"view_building" partition. The static cell is written as part of the same
mutation as the range tombstone, keeping everything in one Raft batch.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a min_task_id timeuuid static column to system.view_building_tasks.
When VIEW_BUILDING_TASKS_MIN_TASK_ID feature is active, get_view_building_tasks()
reads min_task_id first using a static-only partition slice (empty _row_ranges +
always_return_static_content). This makes the SSTable reader stop immediately
after the static row before processing any clustering tombstones, so the read
never triggers tombstone_warn_threshold warnings.
min_task_id is then used as AND id >= ? lower bound for the main task scan,
skipping all tombstoned rows below the boundary.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Instead of issuing one row tombstone per finished task, collect all tasks
to delete, find the smallest timeuuid among alive tasks (min_alive_uuid),
then emit a single range tombstone [before_all, min_alive_uuid) covering
all tasks below that boundary. Tasks above the boundary (rare: finished
task interleaved with alive tasks) still get individual row tombstones.
When no alive tasks remain, del_all_tasks() covers the entire partition
with a single range tombstone.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add del_tasks_before(id) which emits a range tombstone [before_all, id)
and del_all_tasks() which covers the entire clustering range. These will
be used by the coordinator to delete finished tasks in bulk instead of
issuing one row tombstone per task.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
This feature will be used to gate the use of min_task_id static column
in system.view_building_tasks, which will be added in a subsequent commit.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Track write and read latency using latency_counter in
coordinator::mutate() and coordinator::query().
Count commit_status_unknown errors in coordinator::mutate().
Count node and shard bounces in redirect_statement(), passing the
coordinator's stats from both modification_statement and
select_statement.
Introduce per-shard metrics infrastructure for strong consistency
operations under the "strong_consistency_coordinator" metrics category.
The stats struct contains latency histograms/summaries for reads and
writes (using timed_rate_moving_average_summary_and_histogram, same as
storage_proxy uses for eventual consistency), and uint64_t counters for
write_status_unknown, node bounces, and shard bounces.
Metrics are registered in the coordinator constructor but are not yet
wired to actual operations — all counters remain at zero.
Avoid duplicate work when unlink() is called more than once on the
same sstable. This happens when a caller invokes unlink() explicitly
on an sstable that is also marked for deletion: the destructor's
close_files() path would otherwise call unlink() again, re-firing
_on_delete, double-counting _stats.on_delete() and double-invoking
_manager.on_unlink().
Move S3/GCS server classes (S3Server, MinioWrapper, GSFront, GSServer),
factory functions (create_s3_server, create_gs_server), CQL helpers
(format_tuples, keyspace_options), bucket naming (_make_bucket_name),
and the s3_server fixture from test/cluster/object_store/conftest.py
into a shared module at test/pylib/object_storage.py.
The conftest.py is now a thin wrapper that re-exports symbols and
defines only the fixtures specific to the object_store suite
(object_storage, s3_storage). All external importers are updated.
Old class names (S3_Server, GSServer) are kept as aliases for
backward compatibility.
Create a unique S3/GCS bucket for each test function using the pytest
test name (from request.node.name), sanitized into a valid bucket name.
This ensures tests do not share state through a common bucket and makes
bucket names meaningful for debugging (e.g. test-basic-s3-a1b2c3d4).
Each fixture now calls create_test_bucket() on setup and
destroy_test_bucket() on teardown.
Add a client::make overload that accepts a custom retry strategy,
allowing callers to override the default exponential backoff.
Use this in s3_test.cc with a test_retry_strategy that sleeps only
1ms between retries instead of exponential backoff, significantly
reducing test runtime for tests that encounter transient errors
during bucket creation/deletion.
Add s3_test_fixture, an RAII class that creates a unique S3 bucket
on construction and tears down everything (delete all objects, delete
bucket, close client) on destruction. Bucket names are derived from
the Boost test name, pid, and a counter to guarantee uniqueness
across concurrent test processes. Names are sanitized to comply with
S3 bucket naming rules (lowercase, hyphens, 3-63 chars).
Migrate all S3 tests that create objects to use the fixture, removing
manual bucket name construction, deferred_delete_object cleanup, and
per-test deferred_close calls. The fixture owns the client lifecycle.
Tests with special semaphore requirements (broken semaphore for
fallback test, small semaphore for abort test, 1MiB for memory
test) create the fixture with a separate normal-sized semaphore and
use their own constrained client for the test operation.
The upload_file tests are converted from SEASTAR_TEST_CASE
(coroutine) to SEASTAR_THREAD_TEST_CASE since the fixture requires
thread context for .get() calls.
Broaden the minio policy to allow the test user to create and delete
arbitrary buckets (s3:CreateBucket, s3:DeleteBucket, s3:ListAllMyBuckets
on arn:aws:s3:::*), and operate on objects in any bucket.
Add create_bucket (PUT /<bucket>) and delete_bucket (DELETE /<bucket>)
methods to s3::client, following the same make_request pattern used by
existing object operations.
These will be used by the test infrastructure to create per-test
isolated buckets.
They are used only to prevent permission change, but since tables are
unused even if they exists there is no problem changing their
permissions, so no point keeping the definitions just for that.
setup_group0 and setup_group0_if_exist have hidden condition inside that
make them no-op. It is not clear at the call site that functions may do
nothing. Change the code to check the conditions at the call site
instead.
Defaults to 0. When N > 0, adds a map<blob, blob> collection column to
the schema. Each row will have a collection cell with N elements.
Allows benchmarking collection handling.
This cuts back on the number of allocations required for deserializing
collections, from O(num_cells) to O(1).
The visitor now receives an rvalue, so update all callers of
read_and_visit_row(), patching their vistors to take advantage of this
and move the serialized collection instead of copying it.
Reads a collection_mutation directly from the IDL representation of a
collection. This cuts down the number of allocations required
drastically compared to the current method of:
IDL -> collection_mutatio_description -> collection_mutation
Intended to be used in frozen_mutation::unfreeze() and similar use-cases.
atomic_cell_type has various static make_*() methods which create a
serialized cell based on the parameters. This patch adds write_()
methods which mirror the existing make_*() ones, with the exception that
the write methods write into caller-provided buffer. The make methods
are refactored to call the appropriate write overload.
*_serialized_size() methods are added as well, to calculate how many
bytes the serialized data will take after the appropriate write call.
This allows code to write cells directly into a pre-arranged buffer,
perhaps even multiple ones into the same one.
Since the intended use-case this patch prepares for is serializing an
entire collection directly into a single buffer, only make variants
which are legal in collections are handled. I.e. counters are not.
This is already a template on Iterator, but generalize it further by
adding an Adaptor template which adapts the Iterator::value_type to the
requirements of the method. This allows passing Iterators with
value_type other than atomic_cell[_view].
Instead of collection_mutation_view. Follow-suit of the atomic_cell
overloads, which already accept a value, to allow for caller to move the
value along. The current interface forces collections to be copied.
throwapi_error::internal(fmt::format("The operation was successful, but unable to confirm cluster-wide schema agreement after {} seconds. Please retry the operation, and wait for the retry to report an error since the operation was already done.",schema_agreement_seconds));
"summary":"Starts copying SSTables from a designated bucket in object storage to a specified keyspace",
"type":"string",
"nickname":"tablet_aware_restore",
"produces":[
"application/json"
],
"parameters":[
{
"name":"keyspace",
"description":"Name of a keyspace to copy SSTables to",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"table",
"description":"Name of a table to copy SSTables to",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"snapshot",
"description":"Name of the snapshot to restore from",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"backup_location",
"description":"JSON array of backup location objects. Each object must contain: 'datacenter' (string), 'endpoint' (string), 'bucket' (string), and 'manifests' (array of strings). Currently, the array must contain exactly one entry.",
sm::description("Holds the sum of normalized compaction backlog for all tables in the system. Backlog is normalized by dividing backlog by shard's available memory.")),
,enable_shard_aware_drivers(this,"enable_shard_aware_drivers",value_status::Used,true,"Enable native transport drivers to use connection-per-shard for better performance.")
,abort_on_internal_error(this,"abort_on_internal_error",liveness::LiveUpdate,value_status::Used,false,"Abort the server instead of throwing exception when internal invariants are violated.")
"Abort the server and generate a coredump instead of throwing an exception when any sstable parse error is detected (malformed_sstable_exception, bufsize_mismatch_exception, parse_assert() failures, or BTI parse errors). Intended for debugging memory corruption that may manifest as sstable corruption. Defaults to true in debug and dev builds.")
staticconstsstringupdate_query=format("UPDATE {}.{} USING TTL {} SET downloaded = ? WHERE snapshot_name = ? AND \"keyspace\" = ? AND \"table\" = ? AND "
"datacenter = ? AND rack = ? AND first_token = ? AND sstable_id = ?",
staticconstsstringbounded_query=format("SELECT id, type, aborted, base_id, view_id, last_token, host_id, shard FROM {}.{} WHERE key = '{}' AND id >= ?",
There are a few system tables that object storage related code needs to touch in order to operate.
* [system_distributed.snapshot_sstables](docs/dev/snapshot_sstables.md) - Used during restore by worker nodes to get the list of SSTables that need to be downloaded from object storage and restored locally.
* [system.sstables](docs/dev/system_keyspace.md#systemsstables) - Used to keep track of SSTables on object storage when a keyspace is created with object storage storage_options.
# Manipulating S3 data
This section intends to give an overview of where, when and how we store data in S3 and provide a quick set of commands
To only mark the node as permanently down without doing actual removal, use :doc:`nodetool excludenode </operating-scylla/nodetool-commands/excludenode>`:
Nodes in the cluster finished streaming data to the new node:
@@ -86,7 +86,7 @@ Procedure
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.1.201 112.82 KB 256 32.7% 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c B1
UN 192.168.1.202 91.11 KB 256 32.9% 125ed9f4-7777-1dbn-mac8-43fddce9123e B1
UN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-can8-43fddce952gy B1
UN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-ca08-43fddce952de B1
#. When the new node status is Up Normal (UN), run the :doc:`nodetool cleanup </operating-scylla/nodetool-commands/cleanup>` command on all nodes in the cluster except for the new node that has just been added. Cleanup removes keys that were streamed to the newly added node and are no longer owned by the node.
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.1.201 112.82 KB 256 32.7% 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c B1
UN 192.168.1.202 91.11 KB 256 32.9% 125ed9f4-7777-1dbn-mac8-43fddce9123e B1
UN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-can8-43fddce952gy B1
UN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-ca08-43fddce952de B1
#. If the node status is **Up Normal (UN)**, run the :doc:`nodetool decommission </operating-scylla/nodetool-commands/decommission>` command
to remove the node you are connected to. Using ``nodetool decommission`` is the recommended method for cluster scale-down operations. It prevents data loss
@@ -75,7 +75,7 @@ command providing the Host ID of the node you are removing. See :doc:`nodetool r
*`Metrics Update Between 2025.3 and 2025.4 <https://docs.scylladb.com/manual/branch-2025.4/upgrade/upgrade-guides/upgrade-guide-from-2025.x-to-2025.4/metric-update-2025.x-to-2025.4.html>`_
*`Metrics Update Between 2025.2 and 2025.3 <https://docs.scylladb.com/manual/branch-2025.3/upgrade/upgrade-guides/upgrade-guide-from-2025.2-to-2025.3/metric-update-2025.2-to-2025.3.html>`_
*`Metrics Update Between 2025.1 and 2025.2 <https://docs.scylladb.com/manual/branch-2025.2/upgrade/upgrade-guides/upgrade-guide-from-2025.1-to-2025.2/metric-update-2025.1-to-2025.2.html>`_
Some files were not shown because too many files have changed in this diff
Show More
Reference in New Issue
Block a user
Blocking a user prevents them from interacting with repositories, such as opening or commenting on pull requests or issues. Learn more about blocking a user.