Compare commits

309 Commits

Author SHA1 Message Date
Alex
a0303bfd41 test/auth_cluster: simulate v1 state in self-heal test
When skip_service_levels_v2_initialization is used, write an explicit
v1 service level version marker while skipping v2 initialization. This
lets the restart test exercise self-healing from v1 to v2.
2026-05-13 16:00:02 +03:00
Alex
12dfd9b487 qos: self-heal stale service levels version on startup
Add self_heal_service_levels_version() and use it during startup when
the node is already on raft topology but service levels are still marked
as v1.

In that stale state, migrate service levels to v2 through group0 instead
of failing startup.
2026-05-13 16:00:02 +03:00
Alex
ac0a19aab8 qos: reintroduce service levels v2 migration self-heal
migrate_to_v2() was removed after gossip-based service level migration
support was dropped, since upgraded nodes were expected to already use
service levels v2.

However, clusters affected by the old migration bug may reach raft topology
while system.scylla_local still has a stale service level version. Restore
the migration helper so startup can self-heal those nodes by writing the v2
state through group0.
2026-05-13 10:16:02 +03:00
Yaniv Michael Kaul
5d6f160129 test: update get_scylla_2025_1_executable() to use 2025.1.12
Update the hardcoded 2025.1.0 binary URL to the latest 2025.1.12
release for upgrade tests.

The 2025.1.12 binary now supports and enforces the
rf_rack_valid_keyspaces option which the test harness enables by
default. Since test_sstable_compression_dictionaries_upgrade creates
a 2-node cluster in a single rack with RF=2, it violates the
constraint. Disable the option explicitly for this test.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#29714
2026-05-12 23:20:55 +02:00
Wojciech Mitros
f3cf20803b test: run test_mv_admission_control_exception on one shard
In the test we perform 2 consecutive writes where the first write
is supposed to increase the view update backlog above the mv
admission control threshold and the second one is expected to be
rejected because of that.

On each node/shard we have 2 types of view update backlogs:
 1. for deciding whether we should admit writes
 2. for propagating the backlog information to other nodes/shards.

For the second write to be rejected, it must be performed on a node
and shard which updated its backlog of type 1.

The view update backlog of type 2. is immediately increased on the
base table replica. For this backlog to be registered as a backlog
of type 1., it needs to be either carried by gossip (happening once
every second) or by attaching it to a replica write response. We
don't want to increase the runtime of tests unnecessarily, so we don't
wait and we rely on the second mechanism. The response to the first
base table write (the one causing increase in the backlog) carries
the increased backlog to the coordinator of this write. So for the
second write to observe the increased backlog, it needs to be coordinated
on the same node+shard as the first write.

We make sure that both writes are coordinated on the same node+shard by
using prepared statements combined with setting the host in `run_async`.
Both writes target the same partition and with prepared statements we
route them directly to the correct shard.

That was the idea, at least. In practice, for the driver to learn the
correct shard, it first needs to learn the token->shard mapping from
the server. For vnodes it can predict the shard by calculating the token
of the affected partition, but for tablets, it has had no opportunity to
learn the tablet->shard mapping, so the first write may route to any shard.
Additionally, we aren't guaranteed that the driver established connections
to all shards on all nodes at the point of any write. So if a connection
finishes establishing between the two writes, this may also cause us to
coordinate these 2 writes on different shards, leading to a missed view
backlog growth and not-rejected second write.

We fix this in this patch by running the test using one shard on each node.
This way, as long as we perform both writes on the same node, they'll also
be coordinated on the same shard. This also makes the prepared statement and
BoundStatement unnecessary — we can use SimpleStatement with
FallthroughRetryPolicy directly.

Fixes: SCYLLADB-1901

Closes scylladb/scylladb#29862
2026-05-12 17:34:19 +02:00
Piotr Dulikowski
129f193116 Merge 'strong_consistency: implement basic coordinator metrics' from Michał Jadwiszczak
Add per-shard metrics for strong consistency coordinator operations (latency, timeouts, bounces, status unknown) under the `"strong_consistency_coordinator"` category. These are analogous to the eventual consistency metrics in `storage_proxy_stats`, enabling direct performance comparison between the two consistency modes.

The metrics are simplified compared to `storage_proxy_stats` — no breakdown by table, tablet, scheduling group, or DC, only per-shard.

Fixes SCYLLADB-1343

Strong consistency is still in experimental phase, no need to backport.

Closes scylladb/scylladb#29318

* github.com:scylladb/scylladb:
  test/strong_consistency: verify metrics
  strong_consistency: wire up metrics to operations
  strong_consistency: add stats struct and metrics registration
2026-05-12 16:15:51 +02:00
Botond Dénes
e95eb21a16 Merge 'Tablet-aware restore' from Pavel Emelyanov
The mechanics of the restore are as follows:

- A /storage_service/tablets/restore API is called with (keyspace, table, endpoint, bucket, manifests) parameters
  - First, it populates the system_distributed.snapshot_sstables table with the data read from the manifests
  - Then it emplaces a bunch of tablet transitions (of a new "restore" kind), one for each tablet
- The topology coordinator handles the "restore" transition by calling a new RESTORE_TABLET RPC against all the current tablet replicas
- Each replica handles the RPC verb by
  - Reading the snapshot_sstables table
  - Filtering the read sstable infos against current node and tablet being handled
  - Downloading and attaching the filtered sstables

This PR includes system_distributed.snapshot_sstables table from @robertbindar and preparation work from @kreuzerkrieg that extracts raw sstables downloading and attaching from existing generic sstables loading code.

This is the first step towards SCYLLADB-197 and lacks many things. In particular:
- the API only works for single-DC clusters
- the caller needs to "lock" tablet boundaries with min/max tablet count
- not abortable
- no progress tracking
- sub-optimal (re-kicking the API on restore will re-download everything)
- not re-attachable (if the API node dies, restoration proceeds, but the caller cannot "wait" for it to complete via another node)
- nodes download sstables in the maintenance/streaming sched group (should be moved to maintenance/backup)

Other follow-up items:
- have an actual swagger object specification for `backup_location`

Closes #28436
Closes #28657
Closes #28773

Closes scylladb/scylladb#28763

* github.com:scylladb/scylladb:
  docs: Update topology_over_raft.md with `restore` transition kind
  test: Add test for backup vs migration race
  test: Restore resilience test
  sstables_loader: Fail tablet-restore task if not all sstables were downloaded
  sstables_loader: mark sstables as downloaded after attaching
  sstables_loader: return shared_sstable from attach_sstable
  db: add update_sstable_download_status method
  db: add downloaded column to snapshot_sstables
  db: extract snapshot_sstables TTL into class constant
  test: Add a test for tablet-aware restore
  tablets: Implement tablet-aware cluster-wide restore
  messaging: Add RESTORE_TABLET RPC verb
  sstables_loader: Add method to download and attach sstables for a tablet
  tablets: Add restore_config to tablet_transition_info
  sstables_loader: Add restore_tablets task skeleton
  test: Add rest_client helper to kick newly introduced API endpoint
  api: Add /storage_service/tablets/restore endpoint skeleton
  sstables_loader: Add keyspace and table arguments to manfiest loading helper
  sstables_loader_helpers: just reformat the code
  sstables_loader_helpers: generalize argument and variable names
  sstables_loader_helpers: generalize get_sstables_for_tablet
  sstables_loader_helpers: add token getters for tablet filtering
  sstables_loader_helpers: remove underscores from struct members
  sstables_loader: move download_sstable and get_sstables_for_tablet
  sstables_loader: extract single-tablet SST filtering
  sstables_loader: make download_sstable static
  sstables_loader: fix formating of the new `download_sstable` function
  sstables_loader: extract single SST download into a function
  sstables_loader: add shard_id to minimal_sst_info
  sstables_loader: add function for parsing backup manifests
  split utility functions for creating test data from database_test
  export make_storage_options_config from lib/test_services
  rjson: Add helpers for conversions to dht::token and sstable_id
  Add system_distributed_keyspace.snapshot_sstables
  add get_system_distributed_keyspace to cql_test_env
  code: Add system_distributed_keyspace dependency to sstables_loader
  storage_service: Export export handle_raft_rpc() helper
  storage_service: Export do_tablet_operation()
  storage_service: Split transit_tablet() into two
  tablets: Add braces around tablet_transition_kind::repair switch
2026-05-12 16:24:13 +03:00
Yaniv Michael Kaul
c359a09189 test: add UDF/UDA keyspace isolation and UDT tests
Port 3 tests from scylla-dtest user_functions_test.py:
- test_udf_with_udt: UDF taking frozen UDT arg, verifies DROP TYPE blocked
- test_udf_with_udt_keyspace_isolation: cross-keyspace UDT references rejected
- test_aggregate_with_udt_keyspace_isolation: cross-keyspace UDT in UDA rejected

All tests use Lua (Scylla's supported UDF language).
Reproduces CASSANDRA-9409.

Closes scylladb/scylladb#1928

Closes scylladb/scylladb#29843
2026-05-12 14:57:14 +03:00
Yaniv Michael Kaul
f55a55fbf3 docker: fix coredump collection when host uses pipe-based core_pattern
The container image inherits kernel.core_pattern from the host.  When
the host pipes core dumps to a handler (e.g. Ubuntu's apport), that
handler does not exist or work correctly inside the container, so core
dumps are silently lost.

Override any pipe-based core_pattern with a file-based pattern that
writes directly to /var/lib/scylla/coredump/.  The override is attempted
both from the entrypoint (scyllasetup.coredumpSetup) and from
scylla-server.sh when running as root; it succeeds only when the
container has write access to /proc/sys/kernel/core_pattern and is
silently skipped otherwise.
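
A minimal Python sketch of the override logic described above. It is not the actual scyllasetup code: the function names and the exact file pattern (`core.%e.%p.%t`) are assumptions made for illustration; only the target directory and the "skip silently when not writable" behaviour come from the commit message.

```python
from pathlib import Path

# Assumed pattern; the commit only states that cores go to /var/lib/scylla/coredump/.
FILE_PATTERN = "/var/lib/scylla/coredump/core.%e.%p.%t"


def choose_core_pattern(current: str, file_pattern: str = FILE_PATTERN) -> str:
    """Replace a pipe-based pattern (leading '|'); keep a file-based one."""
    if current.lstrip().startswith("|"):
        return file_pattern
    return current.strip()


def override_core_pattern(proc_path: str = "/proc/sys/kernel/core_pattern") -> bool:
    """Try to apply the override; return False when the file is not accessible."""
    path = Path(proc_path)
    try:
        current = path.read_text()
        wanted = choose_core_pattern(current)
        if wanted != current.strip():
            path.write_text(wanted + "\n")
        return True
    except OSError:
        # No write access to the sysctl (or no such file): skip silently.
        return False
```

Keeping the decision (`choose_core_pattern`) separate from the write makes the logic testable without root privileges.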

Fixes: SCYLLADB-1366

Closes scylladb/scylladb#29337
2026-05-12 14:16:22 +03:00
Piotr Smaron
1018710e38 test/cqlpy: un-xfail oversized indexed value build test
Issue #8627 is fixed, so test_too_large_indexed_value_build now passes and should run normally instead of XPASSing under strict xfail.

Fixes: SCYLLADB-1938

Closes scylladb/scylladb#29853
2026-05-12 11:40:53 +02:00
Avi Kivity
ddb1181103 Merge 'load_balance: fix drain with forced capacity-based balancing' from Ferenc Szili
When `force_capacity_based_balancing` is enabled and a node is being drained/excluded, the tablet allocator incorrectly aborts balancing due to incomplete tablet stats - even though capacity-based balancing doesn't depend on tablet sizes.

The tablet allocator normally waits for complete load stats before balancing. An exception exists for drained+excluded nodes (they're unreachable and won't return stats). However, when forced capacity-based balancing is active, this exception was not being applied, causing the balancer to reject the drain plan.

Adjust the condition in `tablet_allocator.cc` so that the "ignore missing data for drained nodes" logic applies regardless of whether capacity-based balancing is forced.
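
The change in the condition can be pictured with the following Python sketch. It is a reconstruction from the description above, not the code in `tablet_allocator.cc`; both function names and the exact shape of the pre-fix condition are assumptions.

```python
def ignore_missing_stats_before_fix(drained: bool, excluded: bool,
                                    force_capacity_based: bool) -> bool:
    # Assumed shape of the old condition: the exception for drained+excluded
    # nodes was not applied when capacity-based balancing was forced.
    return drained and excluded and not force_capacity_based


def ignore_missing_stats_after_fix(drained: bool, excluded: bool,
                                   force_capacity_based: bool) -> bool:
    # After the fix the exception applies regardless of the forced mode,
    # so force_capacity_based no longer influences the result.
    return drained and excluded
```

With `drained=True, excluded=True, force_capacity_based=True`, the first function returns `False` (the balancer would reject the drain plan) and the second returns `True`.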

Added a Boost unit test that forces capacity-based balancing and verifies a drained/excluded node gets its tablets migrated even when tablet size stats are missing.

This bug was introduced in 2026.1, so this needs to be backported to 2026.1 and 2026.2

Fixes: SCYLLADB-1803

Closes scylladb/scylladb#29791

* github.com:scylladb/scylladb:
  test: boost: add drain test for forced capacity-based balancing
  service: allow draining with forced capacity-based balancing
2026-05-12 12:38:25 +03:00
Andrzej Jackowski
89261bf759 test: wait for TTL scheduling sanity metric
The test samples sl:default runtime before and after setup writes to
prove that it measures the scheduling group used by regular CQL writes.
The metric is exported in milliseconds, so a single 200-row batch may
not be visible immediately, or may be too small in some environments.

Keep the original 200-row table size, but wait up to 30 seconds for the
metric to advance. If it does not, retry the same writes before TTL is
enabled. The retries update the same keys, so the expiration part of the
test still waits for exactly the original number of rows.
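
A rough Python sketch of the wait-then-retry idea, assuming hypothetical helper names (`wait_for_advance`, `write_until_metric_advances`) and a hypothetical retry count; the real test may be structured differently. The clock and sleep functions are injectable only to make the sketch easy to exercise.

```python
import time
from typing import Callable


def wait_for_advance(read_metric: Callable[[], float], baseline: float,
                     timeout: float = 30.0, interval: float = 0.1,
                     clock: Callable[[], float] = time.monotonic,
                     sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the metric exceeds the baseline or the timeout expires."""
    deadline = clock() + timeout
    while True:
        if read_metric() > baseline:
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def write_until_metric_advances(do_writes: Callable[[], None],
                                read_metric: Callable[[], float],
                                attempts: int = 3, **wait_kwargs) -> bool:
    """Sample the metric once, then repeat the same writes until it advances."""
    baseline = read_metric()
    for _ in range(attempts):
        do_writes()  # rewrites the same keys, so the row count stays unchanged
        if wait_for_advance(read_metric, baseline, **wait_kwargs):
            return True
    return False
```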

In a local 100-run with N=200 rows, the observed delta of
`ms_statement_before - ms_statement_before_write` was: min=4.0,
max=16.0, mean=8.13, and median=8.0. Therefore, it looks possible that
in a rare corner case the delta drops even to 0.

Fixes SCYLLADB-1869

Closes scylladb/scylladb#29797
2026-05-12 12:38:25 +03:00
Avi Kivity
6fca064ac8 Merge 'alternator: a couple of small cleanups suggested by copilot' from Nadav Har'El
The first patch improves the input validation of the CONTAINS operator. I believe this is not a critical fix, because RapidJSON already has an exception-throwing RAPIDJSON_ASSERT() that checks for unexpected JSON structure (e.g. something we expect to be a list isn't actually a list), but it's cleaner to do these checks explicitly.

The second patch just removes an unnecessary call to format() on a constant string.

Closes scylladb/scylladb#28506

* github.com:scylladb/scylladb:
  alternator: remove unneeded call to format()
  alternator: improve CONTAINS operator's validity checking
2026-05-12 12:38:25 +03:00
Botond Dénes
8d6f031a4a schema: fix DESCRIBE showing NullCompactionStrategy when compaction is disabled
When a table's compaction is disabled via 'enabled': 'false', the DESCRIBE
output incorrectly showed NullCompactionStrategy instead of the actual strategy.
This happened because schema_properties() called compaction_strategy(), which
returns compaction_strategy_type::null when compaction is disabled. Fix it by
using configured_compaction_strategy(), which always returns the real strategy
type - consistent with how schema_tables.cc serializes it to disk.
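
The difference between the two accessors can be modelled in Python as below. The class is a toy stand-in for the C++ schema object; only the two method names and the NullCompactionStrategy behaviour are taken from the commit message, and the shape of the returned dict is an assumption.

```python
from dataclasses import dataclass


@dataclass
class TableSchema:
    configured_strategy: str
    compaction_enabled: bool = True

    def compaction_strategy(self) -> str:
        # Effective strategy: reports the null strategy when compaction is disabled.
        if not self.compaction_enabled:
            return "NullCompactionStrategy"
        return self.configured_strategy

    def configured_compaction_strategy(self) -> str:
        # Always the strategy the user configured.
        return self.configured_strategy


def describe_compaction(schema: TableSchema) -> dict:
    """Build the compaction options for DESCRIBE from the configured strategy."""
    return {
        "class": schema.configured_compaction_strategy(),
        "enabled": str(schema.compaction_enabled).lower(),
    }
```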

Fixes SCYLLADB-1353

Closes scylladb/scylladb#29804
2026-05-12 12:38:25 +03:00
Piotr Dulikowski
7c2b1ea0b5 Merge 'view_building: fix tombstone_warn_threshold warnings' from Michał Jadwiszczak
`system.view_building_tasks` is a single-partition Raft group0 table (pk = `"view_building"`, CK = timeuuid). When `clean_finished_tasks()` deletes hundreds of finished tasks, the physical rows remain in SSTables until compaction. Any subsequent read of the partition counts every column of every tombstoned row as a dead cell, triggering `tombstone_warn_threshold` warnings in large clusters.

Two-part fix:

**1. Range tombstones instead of row tombstones (commits 2–3)**

Instead of one row tombstone per finished task, find the minimum alive task UUID (`min_alive_uuid`) and emit a single range tombstone `[before_all, min_alive_uuid)` covering all tasks below that boundary. This reduces the tombstone count significantly and also benefits future compaction.

**2. Bounded scan with `min_task_id` (commits 4–6)**

Even with range tombstones, physical rows remain until compaction and still count as dead cells during reads. The only way to avoid them is to not read them at all.

   - Add a `min_task_id timeuuid` static column to `system.view_building_tasks`.
   - On every GC, write `min_task_id = min_alive_uuid` atomically with the range tombstone (same Raft batch).
   - On reload, read `min_task_id` first using a **static-only partition slice** (empty `_row_ranges` + `always_return_static_content`): the SSTable reader stops immediately after the static row before processing any clustering tombstones — zero dead cells counted.
   - Use `AND id >= min_task_id` as a lower bound for the main task scan, skipping all tombstoned rows.

The static-only read and the bounded scan are gated on the `VIEW_BUILDING_TASKS_MIN_TASK_ID` cluster feature so mixed-version clusters fall back to the full scan.
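
The two parts of the fix can be sketched in Python as follows. This is an illustration only: integers stand in for timeuuids, `Task`, `gc_finished` and `bounded_scan` are hypothetical names, and doing nothing when no task is alive is an assumption the description does not cover.

```python
from typing import List, NamedTuple, Optional, Tuple


class Task(NamedTuple):
    id: int          # stand-in for the timeuuid clustering key
    finished: bool


def min_alive_id(tasks: List[Task]) -> Optional[int]:
    alive = [t.id for t in tasks if not t.finished]
    return min(alive) if alive else None


def gc_finished(tasks: List[Task]) -> Tuple[List[Task], Optional[int]]:
    """Apply one range tombstone [before_all, min_alive) and return the
    surviving rows together with the new min_task_id."""
    boundary = min_alive_id(tasks)
    if boundary is None:
        return list(tasks), None  # assumption: nothing alive, nothing deleted
    return [t for t in tasks if t.id >= boundary], boundary


def bounded_scan(rows: List[Task], min_task_id: Optional[int]) -> List[Task]:
    """Read only rows with id >= min_task_id; full scan when it is unset."""
    if min_task_id is None:
        return list(rows)
    return [t for t in rows if t.id >= min_task_id]
```

Note that a finished task above the boundary (id 4 in a list where 3 is the lowest alive id) is not covered by the range tombstone in this sketch, since the range only covers tasks below `min_alive_uuid`.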

The issue is not critical, so the fix shouldn't be backported.

Fixes SCYLLADB-657

Closes scylladb/scylladb#28929

* github.com:scylladb/scylladb:
  test/cluster/test_view_building_coordinator: add reproducer for tombstone threshold warning
  docs: document tombstone avoidance in view_building_tasks
  view_building: add `task_uuid_generator` to `view_building_task_mutation_builder`
  view_building: introduce `task_uuid_generator`
  view_building: store `min_alive_uuid` in view building state
  view_building: set min_task_id when GC-ing finished tasks
  view_building: add min_task_id support to view_building_task_mutation_builder
  view_building: add min_task_id static column and bounded scan to system_keyspace
  view_building: use range tombstone when GC-ing finished tasks
  view_building: add range tombstone support to view_building_task_mutation_builder
  view_building: introduce VIEW_BUILDING_TASKS_MIN_TASK_ID cluster feature
2026-05-12 12:38:25 +03:00
Avi Kivity
cf50f0191a encryption: fix deprecated input_stream/output_stream usage in KMIP connection
Seastar deprecated default-constructing input_stream and output_stream
(they are useless in that state), and also deprecated move-assigning
them after the fact.

Fix by wrapping both fields in std::optional, and using emplace() to
construct them in-place once the connected socket is available.

It would be nicer to make connect() a static method that returns
a connection, but that's a larger change.

Closes scylladb/scylladb#29627
2026-05-12 12:38:25 +03:00
Pavel Emelyanov
1c0f8ab66e Merge 'sstables: introduce --abort-on-malformed-sstable-error' from Botond Dénes
When a malformed sstable error occurs, it is usually caused by actual sstable corruption — a cosmic ray, a bad disk write, etc. However, it can also be caused by memory corruption, where a data structure in memory happens to be read as sstable data. In the latter case, having a coredump of the process at the moment of the error is invaluable for post-mortem debugging, since the exception throwing/catching machinery destroys the stack frames that would point to the corruption site.

This patch series introduces `--abort-on-malformed-sstable-error`, a new command-line option (with `LiveUpdate` support) that, when set, causes the server to call `std::abort()` instead of throwing an exception whenever any sstable parse error is detected. This covers all code paths:

- Direct `throw malformed_sstable_exception(...)` sites (migrated to `throw_malformed_sstable_exception()`)
- Direct `throw bufsize_mismatch_exception(...)` sites (migrated to `throw_bufsize_mismatch_exception()`)
- `parse_assert()` failures (via `on_parse_error()`)
- BTI parse errors (via `on_bti_parse_error()`)

The implementation places the flag and helper functions in `sstables/sstables.cc`, next to the existing `on_parse_error()` / `on_bti_parse_error()` infrastructure.

The flag defaults to `false`, preserving current behaviour. It is intended to be enabled temporarily when investigating suspected memory corruption.

**Commit breakdown:**
1. Infrastructure: flag, getter/setter, and throw helpers in `sstables/sstables.cc`; config option wired up in `main.cc`
2. `on_parse_error()` and `on_bti_parse_error()` check the new flag
3. All ~50 `throw malformed_sstable_exception(...)` sites migrated
4. Both `throw bufsize_mismatch_exception(...)` sites migrated

Refs: SCYLLADB-1087
Backport: new feature, no backport

Closes scylladb/scylladb#29324

* github.com:scylladb/scylladb:
  sstables: migrate all bufsize_mismatch_exception throw sites to throw_bufsize_mismatch_exception()
  sstables: migrate all malformed_sstable_exception throw sites to throw_malformed_sstable_exception()
  sstables: make on_parse_error() and on_bti_parse_error() respect --abort-on-malformed-sstable-error
  sstables: disable abort-on-malformed-sstable-error in tests that corrupt sstables on purpose
  sstables: introduce --abort-on-malformed-sstable-error infrastructure
  sstables: refactor parse_path() to return std::expected<> instead of throwing
2026-05-12 12:38:25 +03:00
Pavel Emelyanov
150345cc52 Merge 'test: per-bucket isolation for S3/GCS object storage tests' from Ernest Zaslavsky
This series adds per-test bucket isolation to all S3 and GCS object storage tests. Previously, every test shared a single pre-created bucket, which meant tests could interfere with each other through leftover objects and could not run concurrently across multiple `test.py` processes without risking collisions.

New `create_bucket`, `delete_bucket`, and `delete_bucket_with_objects` methods on `s3::client`, following the existing `make_request` pattern. `create_bucket` handles the `BUCKET_ALREADY_OWNED_BY_YOU` error gracefully.

A new `s3_test_fixture` RAII class for C++ Boost tests that creates a uniquely-named bucket on construction (derived from the Boost test name and pid) and tears down everything — objects, bucket, client — on destruction. All S3 tests in `s3_test.cc` are migrated to use it, removing manual `deferred_delete_object` and `deferred_close` boilerplate. The minio server policy is broadened to allow dynamic bucket creation/deletion.

A `client::make` overload that accepts a custom `retry_strategy`, used in tests with a fast 1ms retry delay instead of exponential backoff, significantly reducing test runtime for transient errors during bucket lifecycle operations.

Python-side (`test/cluster/object_store`): each pytest fixture (`object_storage`, `s3_storage`, `s3_server`) now creates a unique bucket per test function via `create_test_bucket()` and destroys it on teardown. Bucket names are sanitized from the pytest node name with a short UUID suffix for uniqueness.
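
One possible way to derive such a bucket name is sketched below. `make_bucket_name` is a hypothetical helper, not the code behind `create_test_bucket()`; it assumes the usual S3 naming constraints (3 to 63 characters, lowercase letters, digits and hyphens, starting and ending with a letter or digit).

```python
import re
import uuid


def make_bucket_name(node_name: str, suffix_len: int = 8) -> str:
    """Sanitize a pytest node name and append a short random suffix."""
    suffix = uuid.uuid4().hex[:suffix_len]
    # Collapse every run of disallowed characters into a single hyphen.
    base = re.sub(r"[^a-z0-9]+", "-", node_name.lower()).strip("-")
    # Leave room for "-<suffix>" within the 63-character limit.
    base = base[: 63 - suffix_len - 1].rstrip("-") or "test"
    return f"{base}-{suffix}"
```

For a node name such as `test_backup.py::test_simple_backup[s3]` this yields something like `test-backup-py-test-simple-backup-s3-1a2b3c4d`, with a different suffix on every call.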

Object storage helpers (`S3Server`, `MinioWrapper`, `GSFront`, `GSServerImpl`, factory functions, CQL helpers, `s3_server` fixture) are extracted from `test/cluster/object_store/conftest.py` into a shared `test/pylib/object_storage.py` module, eliminating duplication across test suites. The conftest becomes a thin re-export wrapper. Old class names are preserved as aliases for backward compatibility.

| Test Name                                                    | With test-specific retry strategy (ms) | Original (ms) |   Δ (ms) | Speedup |
|--------------------------------------------------------------|----------------:|-------------:|---------:|--------:|
| test_client_upload_file_multi_part_with_remainder_proxy      |          19,261 |       61,395 | −42,134  | **3.2×** |
| test_client_upload_file_multi_part_without_remainder_proxy   |          16,901 |       53,688 | −36,787  | **3.2×** |
| test_client_upload_file_single_part_proxy                    |           3,478 |        6,789 |  −3,311  | **2.0×** |
| test_client_multipart_copy_upload_proxy                      |           1,303 |        1,619 |    −316  | 1.2×    |
| test_client_put_get_object_proxy                             |             150 |          365 |    −215  | **2.4×** |
| test_client_readable_file_stream_proxy                       |             125 |          327 |    −202  | **2.6×** |
| test_small_object_copy_proxy                                 |             205 |          389 |    −184  | 1.9×    |
| test_client_put_get_tagging_proxy                            |             181 |          350 |    −169  | 1.9×    |
| test_client_multipart_upload_proxy                           |           1,252 |        1,416 |    −164  | 1.1×    |
| test_client_list_objects_proxy                               |             729 |          881 |    −152  | 1.2×    |
| test_chunked_download_data_source_with_delays_proxy          |             830 |          960 |    −130  | 1.2×    |
| test_client_readable_file_proxy                              |             148 |          279 |    −131  | 1.9×    |
| test_client_upload_file_multi_part_with_remainder_minio      |           3,358 |        3,170 |    +188  | 0.9×    |
| test_client_upload_file_multi_part_without_remainder_minio   |           3,131 |        2,929 |    +202  | 0.9×    |
| test_client_upload_file_single_part_minio                    |             519 |          421 |     +98  | 0.8×    |
| test_download_data_source_proxy                              |             180 |          237 |     −57  | 1.3×    |
| test_client_list_objects_incomplete_proxy                    |             590 |          641 |     −51  | 1.1×    |
| test_large_object_copy_proxy                                 |             952 |          991 |     −39  | 1.0×    |
| test_client_multipart_upload_fallback_proxy                  |             148 |          185 |     −37  | 1.3×    |
| test_client_multipart_copy_upload_minio                      |             641 |          674 |     −33  | 1.1×    |

No backport needed — this is a test infrastructure improvement with no production code impact beyond the new `s3::client` methods.

Closes scylladb/scylladb#29508

* github.com:scylladb/scylladb:
  test: extract object storage helpers to test/pylib/object_storage.py
  test: add per-test bucket isolation to object_store fixtures
  s3: add client::make overload with custom retry strategy
  test: add s3_test_fixture and migrate tests to per-bucket isolation
  s3: add create_bucket and delete_bucket to client
2026-05-12 12:38:24 +03:00
Dimitrios Symonidis
94bc0245f9 sstables, utils/s3: reuse caller-provided file in s3_storage::make_source
s3_storage::make_source previously ignored its file f parameter and
constructed a fresh s3::client::readable_file per call. The new
file's _stats cache was empty, so the first dma_read_bulk issued a
HEAD via maybe_update_stats just to learn the object size before
the ranged GET -- one ~50 ms RTT per uncached read.

The file f passed in by the two callers (sstable::data_stream for
Data.db reads and index_reader::make_context for Index.db reads)
already wraps the sstable's _data_file or _index_file. Those file
objects had their stats populated at sstable open time by
update_info_for_opened_data, and they were wrapped with the
configured file_io_extensions when opened via open_component. Reusing
them is exactly what filesystem_storage::make_source does (one-line
make_file_data_source over f), so the s3 path simply matches it.

readable_file::size() is also updated to route through
maybe_update_stats(), so a .size() call populates the _stats cache
the same way .stat() does -- preventing a redundant HEAD on the
first subsequent read of components opened with .size() (Index,
Partitions, Rows in update_info_for_opened_data).
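
The effect of the stats cache can be illustrated with a toy Python model. `FakeObjectStore` and `ReadableFile` are illustrative stand-ins, not the C++ `s3::client` classes; the sketch only shows that a reused file object issues one HEAD, while a freshly constructed one pays for another.

```python
class FakeObjectStore:
    """In-memory stand-in for an object store that counts HEAD requests."""

    def __init__(self, objects: dict):
        self.objects = objects
        self.head_requests = 0

    def head(self, key: str) -> int:
        self.head_requests += 1
        return len(self.objects[key])

    def get_range(self, key: str, offset: int, length: int) -> bytes:
        return self.objects[key][offset:offset + length]


class ReadableFile:
    def __init__(self, store: FakeObjectStore, key: str):
        self._store, self._key, self._size = store, key, None

    def _maybe_update_stats(self) -> None:
        if self._size is None:  # only the first caller pays for the HEAD
            self._size = self._store.head(self._key)

    def size(self) -> int:
        self._maybe_update_stats()  # size() fills the cache as well
        return self._size

    def read(self, offset: int, length: int) -> bytes:
        self._maybe_update_stats()
        end = min(offset + length, self._size)
        return self._store.get_range(self._key, offset, max(0, end - offset))
```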

Closes scylladb/scylladb#29766

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 12:38:24 +03:00
Pavel Emelyanov
896de77b99 docs: Update topology_over_raft.md with restore transition kind
Add some text about how the new transition works. It doesn't include
a full feature description; it concentrates on the new transition and
the way it interacts with the rest of the topology coordinator machinery.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:24 +03:00
Pavel Emelyanov
19820910f8 test: Add test for backup vs migration race
The test starts a regular backup+restore on a smaller cluster, but before
that it spawns a tablet migration from one node to another and blocks it in
the middle with the help of the block_tablet_streaming injection (even
though the tablets have no data and there's nothing to stream, the injection
is located early enough to work).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:24 +03:00
Pavel Emelyanov
3bcefa42c5 test: Restore resilience test
The test checks that losing one of the nodes from the cluster during restore
is handled. In particular:

- losing an API node makes the task-waiting API throw (apparently)
- losing the coordinator or a replica node makes the API call fail, because
  some tablets are expected to fail to get restored. If the coordinator is
  lost, it triggers coordinator re-election, and the new coordinator still
  notices that a tablet that was replicated to the "old" coordinator failed
  to get restored and fails the restore anyway

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:24 +03:00
Pavel Emelyanov
69b8f76a32 sstables_loader: Fail tablet-restore task if not all sstables were downloaded
When storage_service::restore_tablets() resolves, it only means that
tablet transitions are done, including restore transitions, but not
necessarily that they succeeded. So before resolving the restoration
task with success, we need to check whether all sstables were downloaded
and, if not, resolve the task with an exception.

Test included. It uses fault injection to abort the download of a single
sstable early, then checks that the error is properly propagated back
to the task-waiting API.
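
The completion check can be sketched in Python like this. The `downloaded` flag mirrors the column added to snapshot_sstables elsewhere in this series; the class and function names, the `sstable_id` field and the exception type are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class SnapshotSSTable:
    sstable_id: str
    downloaded: bool


class RestoreIncomplete(Exception):
    """Raised when tablet transitions finished but some sstables are missing."""


def finish_restore_task(entries: Iterable[SnapshotSSTable]) -> None:
    """Succeed only if every sstable recorded for the restore was downloaded."""
    missing = [e.sstable_id for e in entries if not e.downloaded]
    if missing:
        raise RestoreIncomplete(
            f"{len(missing)} sstable(s) not downloaded: {', '.join(missing)}"
        )
```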

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:24 +03:00
Ernest Zaslavsky
bdc5976bcd sstables_loader: mark sstables as downloaded after attaching
After each SSTable is successfully attached to the local table in
download_tablet_sstables(), update its downloaded status in
system_distributed.snapshot_sstables to true. This enables tracking
restore progress by counting how many SSTables have been downloaded.
2026-05-12 10:40:24 +03:00
Ernest Zaslavsky
0d8de9becd sstables_loader: return shared_sstable from attach_sstable
Change attach_sstable() return type from future<> to
future<sstables::shared_sstable>, returning the SSTable that was
attached. This will be used to extract the SSTable identifier and
first token for updating the download status.
2026-05-12 10:40:24 +03:00
Ernest Zaslavsky
7eb921a142 db: add update_sstable_download_status method
Add a method to update the downloaded status of a specific SSTable
entry in system_distributed.snapshot_sstables. This will be used
by the tablet restore process to mark SSTables as downloaded after
they have been successfully attached to the local table.
2026-05-12 10:40:23 +03:00
Ernest Zaslavsky
83ec7e22b9 db: add downloaded column to snapshot_sstables
Add a 'downloaded' boolean column to the snapshot_sstables table
schema and the corresponding field to the snapshot_sstable_entry
struct. Update insert_snapshot_sstable() and get_snapshot_sstables()
to write and read this column.
This column will be used to track which SSTables have been
successfully downloaded during a tablet restore operation.

Co-authored-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:23 +03:00
Ernest Zaslavsky
61c627a7c0 db: extract snapshot_sstables TTL into class constant
Move the TTL value used for snapshot_sstables rows from a local
variable in insert_snapshot_sstable() to a class-level constant
SNAPSHOT_SSTABLES_TTL_SECONDS, making it reusable by other methods.
2026-05-12 10:40:23 +03:00
Pavel Emelyanov
4137211cf4 test: Add a test for tablet-aware restore
The test is derived from test_restore_with_streaming_scopes(), with
the exception that it doesn't check for streaming directions, doesn't
check mutations right after creation and doesn't loop over scoped
sub-tests, because there's no scope concept here.

Also, it verifies just two topologies, which seems to be enough. The scopes
test has many topologies because of the nature of the scoped restore;
with cluster-wide restore such flexibility is not required.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:23 +03:00
Pavel Emelyanov
17384d42e3 tablets: Implement tablet-aware cluster-wide restore
This patch adds

- Changes in sstables_loader::restore_tablets() method

It populates the system_distributed_keyspace.snapshot_sstables table
with the information read from the manifest

- Implementation of tablet_restore_task_impl::run() method

It emplaces a bunch of tablet migrations with "restore" kind

- Topology coordinator handling of tablet_transition_stage::restore

When seen, the coordinator calls RESTORE_TABLET RPC against all tablet
replicas

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:23 +03:00
Pavel Emelyanov
39ae59da9c messaging: Add RESTORE_TABLET RPC verb
The topology coordinator will need to call this verb against existing
tablet replicas to ask them to restore tablet sstables. Here's the RPC verb
to do it.

It now returns an empty restore_result to make it "synchronous" -- the
co_await send_restore_tablets() won't resolve until client call
finishes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:23 +03:00
Pavel Emelyanov
8514b73f4b sstables_loader: Add method to download and attach sstables for a tablet
Extracts the data from snapshot_sstables tables and filters only
sstables belonging to current node and tablet in question, then starts
downloading the matched sstables

Extracted from Ernest's PR #28701 and piggy-backs on the refactoring from
another of Ernest's PRs, #28773. Will be used by next patches.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:23 +03:00
Pavel Emelyanov
cf21471391 tablets: Add restore_config to tablet_transition_info
When doing cluster-wide restore using topology coordinator, the
coordinator will need to serve a new tablet transition kind -- the
restore one. For that, it will need to know where to perform the
restore from -- the endpoint and bucket pair. This data can be grabbed
from nowhere but the tablet transition itself, so add the
"restore_config" member with this data.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:23 +03:00
Pavel Emelyanov
2eaa9035df sstables_loader: Add restore_tablets task skeleton
The new cluster-wide tablets restore API is going to be asynchronous,
just like existing node-local one is. For that the task_manager tasks
will be used.

This patch adds a skeleton for tablets-restore task with empty run
method. Next patches will populate it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:23 +03:00
Pavel Emelyanov
dcd490666b test: Add rest_client helper to kick newly introduced API endpoint
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:23 +03:00
Ernest Zaslavsky
5f235e105a api: Add /storage_service/tablets/restore endpoint skeleton
Withdrawn from #28701. The endpoint implementation from the PR is going
to be reworked, but the swagger description and set/unset placeholders
are very useful.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Ernest Zaslavsky <ernest.zaslavsky@scylladb.com>
2026-05-12 10:40:23 +03:00
Pavel Emelyanov
d280987f2c sstables_loader: Add keyspace and table arguments to manifest loading helper
When restoring a backup into a keyspace under a different name than the
one it had during backup, the snapshot_sstables table must
be populated with the _new_ keyspace name, not the one taken from
manifest. Same is true for table name.

This patch makes it possible to override keyspace/table loaded from
manifest file with the provided values. In the future it will also be
good to check that, if those values are not provided by the user, the
values read from different manifest files are the same.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:23 +03:00
Ernest Zaslavsky
e0f4813c2f sstables_loader_helpers: just reformat the code
Reformat get_sstables_for_tablet to wrap extremely long line
2026-05-12 10:40:22 +03:00
Ernest Zaslavsky
19554466f6 sstables_loader_helpers: generalize argument and variable names
Rename arguments and local variables in get_sstables_for_tablet to avoid
references to SSTable-specific terminology. This makes the function more
generic and better suited for reuse with different range types.
2026-05-12 10:40:22 +03:00
Ernest Zaslavsky
2e37f9dc90 sstables_loader_helpers: generalize get_sstables_for_tablet
Generalize get_sstables_for_tablet by templating the return type so it
produces vectors matching the input range’s value type. This makes the
function more flexible and prepares it for reuse in tablet‑aware
restore.
2026-05-12 10:40:22 +03:00
Ernest Zaslavsky
17b415ccde sstables_loader_helpers: add token getters for tablet filtering
Add getters for the first and last tokens in get_sstables_for_tablet to
make the function more generic and suitable for future use in the
tablet-aware restore code.
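
A minimal Python model of getter-based tablet filtering, not the C++ helper itself. `SstInfo`, the inclusive token bounds and the split into fully and partially contained items are assumptions made for illustration:

```python
from dataclasses import dataclass


@dataclass
class SstInfo:
    """Hypothetical stand-in for an sstable descriptor."""
    name: str
    first: int
    last: int


def get_items_for_tablet(items, tablet_first, tablet_last, get_first, get_last):
    """Classify items against one tablet's token range.

    get_first/get_last extract the token bounds, so the function does not
    need to know the item type. Bounds are treated as inclusive here.
    """
    fully, partially = [], []
    for item in items:
        first, last = get_first(item), get_last(item)
        if last < tablet_first or first > tablet_last:
            continue  # no overlap with this tablet
        if tablet_first <= first and last <= tablet_last:
            fully.append(item)
        else:
            partially.append(item)
    return fully, partially
```

Because only the getters touch the item, the same function could be fed rows read from a table instead of sstable descriptors.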
2026-05-12 10:40:22 +03:00
Ernest Zaslavsky
1150f7cf24 sstables_loader_helpers: remove underscores from struct members
Remove underscores from minimal_sst_info struct members to comply with
our coding guidelines.
2026-05-12 10:40:22 +03:00
Ernest Zaslavsky
aa00048753 sstables_loader: move download_sstable and get_sstables_for_tablet
Move the download_sstable and get_sstables_for_tablet static functions
from sstables_loader into a new file to make them reusable by the
tablet-aware restore code.
2026-05-12 10:40:22 +03:00
Ernest Zaslavsky
991576ed73 sstables_loader: extract single-tablet SST filtering
Extract single-tablet range filtering into a new
get_sstables_for_tablet function, taken from the existing
get_sstables_for_tablets. This will later be reused in the
tablet-aware restore code.
2026-05-12 10:40:22 +03:00
Ernest Zaslavsky
b0f6cbb2a4 sstables_loader: make download_sstable static
Make the download_sstable function static to prepare it for extraction
as a helper function that will later be reused in tablet-aware restore.
2026-05-12 10:40:22 +03:00
Ernest Zaslavsky
60dd7de4b8 sstables_loader: fix formatting of the new download_sstable function
Just fix formatting of the new `download_sstable` function
2026-05-12 10:40:22 +03:00
Ernest Zaslavsky
9efc658bdd sstables_loader: extract single SST download into a function
Extract the logic for downloading a single SST into a dedicated
function and reuse it in download_fully_contained_sstables. This
supports upcoming changes that consolidate common code.
2026-05-12 10:40:22 +03:00
Ernest Zaslavsky
fd2043cad8 sstables_loader: add shard_id to minimal_sst_info
Add a shard_id member to the minimal_sst_info struct as part of the
tablet-aware restore refactoring. This will support upcoming changes
that extract common code.
2026-05-12 10:40:22 +03:00
Robert Bindar
c97232bb7b sstables_loader: add function for parsing backup manifests
This change adds functionality for parsing backup manifests
and populating system_distributed.snapshot_sstables with
the content of the manifests.
This change is useful for tablet-aware restore. The function
introduced here will be called by the coordinator node
when restore starts to populate the snapshot_sstables table
with the data that workers need to execute the restore process.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Co-authored-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:22 +03:00
Robert Bindar
f0e8d6c9dd split utility functions for creating test data from database_test
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-05-12 10:40:21 +03:00
Robert Bindar
b52e40e512 export make_storage_options_config from lib/test_services
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-05-12 10:40:21 +03:00
Robert Bindar
9c3abbb8f5 rjson: Add helpers for conversions to dht::token and sstable_id
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-05-12 10:40:21 +03:00
Robert Bindar
2f19d84ad7 Add system_distributed_keyspace.snapshot_sstables
This patch adds the snapshot_sstables table with the following
schema:
```cql
CREATE TABLE system_distributed.snapshot_sstables (
    snapshot_name text,
    keyspace text, table text,
    datacenter text, rack text,
    id uuid,
    first_token bigint, last_token bigint,
    toc_name text, prefix text,
    PRIMARY KEY ((snapshot_name, keyspace, table, datacenter, rack), first_token, id));
```
The table will be populated by the coordinator node during the restore
phase (and later on during the backup phase to accommodate live-restore).
The content of this table is meant to be consumed by the restore worker nodes
which will use this data to filter and file-based download sstables.

Fixes SCYLLADB-263

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-05-12 10:40:21 +03:00
Robert Bindar
31e9f04714 add get_system_distributed_keyspace to cql_test_env
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-05-12 10:17:40 +03:00
Pavel Emelyanov
90ff7c5de3 code: Add system_distributed_keyspace dependency to sstables_loader
The loader will need to populate and read data from
system_distributed.snapshot_sstables table added recently, so this
dependency is truly needed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:17:40 +03:00
Pavel Emelyanov
2c60d8f897 storage_service: Export handle_raft_rpc() helper
Just like do_tablet_operation, this one will be used by sstables_loader
restore-tablet RPC

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:17:40 +03:00
Pavel Emelyanov
1c0e04316b storage_service: Export do_tablet_operation()
Next patches will introduce an RPC handler to restore a tablet on
replica. The handler will be registered by sstables_loader, and it will
have to call that helper from storage_service which thus needs to be
moved to public scope.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:17:40 +03:00
Pavel Emelyanov
e5f04b0927 storage_service: Split transit_tablet() into two
The goal of the split is to have try_transit_tablet() that

- doesn't throw if tablet is in transition, but reports it back
- doesn't wait for the submitted transition to finish

The user will be tablet-aware restore: it will call this new trying
helper in parallel, then wait for all transitions to finish.
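
The calling pattern can be sketched as a small Python asyncio model. This is not the ScyllaDB code: the `Coordinator` class, its `in_transition` map and the `(started, future)` return value are hypothetical.

```python
import asyncio


class Coordinator:
    """Toy tracker of tablet transitions (hypothetical)."""

    def __init__(self):
        self.in_transition = {}  # tablet id -> future resolved when it ends

    def try_transit_tablet(self, tablet):
        # Report a busy tablet back to the caller instead of raising.
        if tablet in self.in_transition:
            return False, self.in_transition[tablet]
        # Submit the transition and return at once, without waiting for it.
        done = asyncio.ensure_future(self._run(tablet))
        self.in_transition[tablet] = done
        return True, done

    async def _run(self, tablet):
        await asyncio.sleep(0)  # placeholder for the real transition work
        del self.in_transition[tablet]
        return tablet


async def restore_all(coordinator, tablets):
    # Submit everything first, then wait for all transitions together.
    results = [coordinator.try_transit_tablet(t) for t in tablets]
    started = [t for t, (ok, _) in zip(tablets, results) if ok]
    await asyncio.gather(*(fut for _, fut in results))
    return started
```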

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:17:39 +03:00
Pavel Emelyanov
dd51acf014 tablets: Add braces around tablet_transition_kind::repair switch
This is just to reduce the churn in the next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:17:39 +03:00
Botond Dénes
866afe4c1e Merge ' db: add large data metrics for rows, cells, and collections' from Taras Veretilnyk
- Add `large_rows_exceeding_threshold`, `large_cell_exceeding_threshold`, and `large_collection_exceeding_threshold` metrics to complement the existing `large_partition_exceeding_threshold`
- Add unit tests verifying stats counters increment correctly during SSTable writes

Backport is not needed

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1095

Closes scylladb/scylladb#29722

* github.com:scylladb/scylladb:
  test/boost: add tests for large data stats counters
  db: add large data metrics for rows, cells, and collections
2026-05-12 10:04:53 +03:00
Pavel Emelyanov
30f1075544 utils: Replace local memory sink/source with seastar equivalents
Replace the local buffer_data_sink_impl and buffer_data_source_impl
classes in create_memory_sink() and create_memory_source() with
seastar::util::memory_data_sink and seastar::util::memory_data_source
respectively, which are now available upstream.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29616
2026-05-12 08:47:43 +03:00
Taras Veretilnyk
47b4fa920d test/boost: add tests for large data stats counters
Add test_large_data_stats_large_rows, test_large_data_stats_large_cells,
and test_large_data_stats_large_collections to verify that the
large_data_handler stats counters are correctly incremented during
SSTable writes and that unrelated counters remain at zero.
2026-05-11 23:42:14 +02:00
Taras Veretilnyk
881776b441 db: add large data metrics for rows, cells, and collections
Previously only large_partition_exceeding_threshold was exposed as a
metric. Add three new counters to large_data_handler::stats and register
corresponding Prometheus metrics:
- large_rows_exceeding_threshold
- large_cell_exceeding_threshold
- large_collection_exceeding_threshold

The counters are incremented in maybe_record_large_rows() and
maybe_record_large_cells() following the same pattern used by the
existing partition metric.
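
A minimal Python sketch of the counting pattern, assuming that "exceeding" means strictly greater than the threshold and that collections are measured by element count. The class shape and constructor arguments are hypothetical; only the counter and method names come from this commit.

```python
from collections import Counter


class LargeDataHandler:
    """Toy model of threshold-based counters (not the C++ class)."""

    def __init__(self, row_threshold, cell_threshold, collection_elements_threshold):
        self.row_threshold = row_threshold
        self.cell_threshold = cell_threshold
        self.collection_elements_threshold = collection_elements_threshold
        self.stats = Counter()  # missing counters read as zero

    def maybe_record_large_rows(self, row_size):
        if row_size > self.row_threshold:
            self.stats["large_rows_exceeding_threshold"] += 1

    def maybe_record_large_cells(self, cell_size, collection_elements=0):
        # One call may bump both counters: the cell can be large in bytes
        # and, if it is a collection, large in element count.
        if cell_size > self.cell_threshold:
            self.stats["large_cell_exceeding_threshold"] += 1
        if collection_elements > self.collection_elements_threshold:
            self.stats["large_collection_exceeding_threshold"] += 1
```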
2026-05-11 23:11:17 +02:00
Anna Stuchlik
1f7d20f701 doc: label Migration from Vnodes to Tablets as experimental
The procedure to migrate a vnodes-based keyspace to tablets-based keyspace
has been labeled as experimental.

Fixes SCYLLADB-1932

Closes scylladb/scylladb#29834
2026-05-11 17:07:39 +03:00
Yaniv Michael Kaul
377bbeb076 docs: fix invalid UUID characters in examples
Replace UUIDs containing non-hexadecimal characters (like 'g', 'n', 'y')
with valid UUIDs in documentation examples.

Fixes #26797

Closes scylladb/scylladb#29674
2026-05-11 17:05:30 +03:00
Calle Wilund
2cc1a2c406 storage_service: Disable snapshots after raft decommission
Fixes: SCYLLADB-1693

In case we abort a decommission operation, the snapshot/backup
mechanism needs to remain open.

This change moves it to after raft_decommission.

In the case of a cluster snapshot, our nodes' ownership
(or not) of tables will be serialized by raft anyway, so it
should remain consistent. In that case we at worst coordinate
from a node in "leave" status.

In the case of a local snapshot, ownership matters less,
only sstables on disk, which should not change.

In the case of backup, this operates on a snapshot, state of which
is not affected.

Adds an injection point for testing.

v2:
- Added injection point to ensure test can abort decommission

Closes scylladb/scylladb#29667
2026-05-11 17:04:09 +03:00
Anna Stuchlik
4c01556f79 doc: mark Vector Search in Alternator as Cloud-only
This commit adds the information missing from the Alternator docs
that Vector Search is only available in ScyllaDB Cloud.

Fixes https://github.com/scylladb/scylladb/issues/29661

Closes scylladb/scylladb#29664
2026-05-11 17:03:20 +03:00
Avi Kivity
f5ffbd3c3e cql3: restrictions: reindent statement_restrictions.cc
6165124fcc has left statement_restrictions.cc scarred and
deformed. Restore it to standard 4-space indentation. This patch
contains only whitespace changes.

Closes scylladb/scylladb#29598
2026-05-11 17:02:14 +03:00
Yaniv Michael Kaul
3cba27d25f topology: propagate error messages through raft_topology_cmd_result
When a topology command (e.g., rebuild) fails on a target node, the
exception message was being swallowed at multiple levels:

1. raft_topology_cmd_handler caught exceptions and returned a bare
   fail status with no error details.
2. exec_direct_command_helper saw the fail status and threw a generic
   "failed status returned from {id}" message.
3. The rebuilding handler caught that and stored a hardcoded
   "streaming failed" message.

This meant users only saw "rebuild failed: streaming failed" instead
of the actionable error from the safety check (e.g., "it is unsafe
to use source_dc=dc2 to rebuild keyspace=...").

Fix by:
- Adding an error_message field to raft_topology_cmd_result (with
  [[version 2026.2]] for wire compatibility).
- Populating error_message with the exception text in the handler's
  catch blocks.
- Including error_message in the exception thrown by
  exec_direct_command_helper.
- Passing the actual error through to rtbuilder.done() instead of
  the hardcoded "streaming failed".
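
The propagation chain can be modelled in a few lines of Python. This is a sketch only: `CmdResult`, `handle_cmd` and `exec_direct_command` are hypothetical stand-ins for the C++ types and functions named above.

```python
from dataclasses import dataclass


@dataclass
class CmdResult:
    status: str              # "success" or "fail"
    error_message: str = ""  # new field carrying the exception text


def handle_cmd(action):
    """Handler side: keep the exception text instead of dropping it."""
    try:
        action()
        return CmdResult("success")
    except Exception as e:
        return CmdResult("fail", str(e))


def exec_direct_command(node_id, action):
    """Caller side: include the remote error text in the raised exception."""
    result = handle_cmd(action)
    if result.status == "fail":
        raise RuntimeError(
            f"failed status returned from {node_id}: {result.error_message}")
    return result
```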

A follow-up test is in https://github.com/scylladb/scylladb/pull/29363

Fixes: SCYLLADB-1404

Closes scylladb/scylladb#29362
2026-05-11 17:01:15 +03:00
Yaniv Michael Kaul
cf9cde664c .github/workflows/call_sync_milestone_to_jira.yml: add missing workflow permissions
Add explicit empty permissions block (permissions: {}) since this
workflow only syncs milestones to Jira using its own secrets and needs
no GITHUB_TOKEN permissions. Fixes code scanning alert #171.

Closes scylladb/scylladb#29184
2026-05-11 17:00:10 +03:00
Raphael S. Carvalho
20fe1e6f68 replica: Improve diagnostics when tablet split fails due to non-empty split-unready groups
When finalizing a tablet split, all data must have been moved into
split-ready compaction groups before the storage groups can be remapped
to the new tablet count. If split-unready groups still hold data at that
point, handle_tablet_split_completion() calls on_internal_error(), which
previously only reported the tablet and table IDs — giving no insight
into why the split-unready groups were not empty.

Add fmt::formatter specializations for compaction_group and storage_group
so the full state of the offending storage_group is included in the error
message. The storage_group formatter emits:

  main=<cg>, merging=[<cg>...], split_ready=[<cg>...]

Each compaction_group formatter emits:

  [sstables=[<sstable_desc>...], memtable_empty=<bool>, sstable_add_gate=<count>]

where sstable_desc includes filename, origin, identifier and originating
host, memtable_empty reflects whether all memtables have been flushed,
and sstable_add_gate count reveals whether an in-flight sstable add is
holding data in the group.
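
The two output shapes can be illustrated with a Python sketch, assuming sstable descriptions are plain strings joined with ", " (the separator and the class layout are assumptions, not taken from the patch):

```python
from dataclasses import dataclass, field


@dataclass
class CompactionGroup:
    sstables: list = field(default_factory=list)  # pre-rendered descriptions
    memtable_empty: bool = True
    sstable_add_gate: int = 0                     # in-flight sstable adds

    def __str__(self):
        return (f"[sstables=[{', '.join(self.sstables)}], "
                f"memtable_empty={str(self.memtable_empty).lower()}, "
                f"sstable_add_gate={self.sstable_add_gate}]")


@dataclass
class StorageGroup:
    main: CompactionGroup
    merging: list = field(default_factory=list)
    split_ready: list = field(default_factory=list)

    def __str__(self):
        merging = ", ".join(str(cg) for cg in self.merging)
        ready = ", ".join(str(cg) for cg in self.split_ready)
        return f"main={self.main}, merging=[{merging}], split_ready=[{ready}]"
```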

Supporting changes:

- compaction_group: add memtable_empty() const noexcept (delegates to
  memtable_list::empty()) and a const overload of sstable_add_gate()
  so both are accessible from a const compaction_group reference inside
  the formatter.
- Promote sstable_desc from a local lambda in compaction_group_for_sstable
  to a static free function so it is reusable by the formatter.

Refs https://scylladb.atlassian.net/browse/SCYLLADB-1019.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29178
2026-05-11 16:59:05 +03:00
Yaniv Michael Kaul
3674deea54 scylla-gdb: display ms-format sstable summary from partitions db footer
For ms-format (trie-based) sstables, the traditional summary structure
is not populated. Instead, read equivalent metadata from the
_partitions_db_footer field: first_key, last_key, partition_count,
and trie_root_position.

This is a follow-up to the crash fix for SCYLLADB-1180, replacing the
informational-only message with actual useful output.

Refs: SCYLLADB-1180

Closes scylladb/scylladb#29164
2026-05-11 16:58:22 +03:00
Calle Wilund
db1b92c185 service::load_balancer: Add metrics for repair and rebuild count
Fixes #21115

Adds cluster counter for repairs, and dc counter for rebuilds

Closes scylladb/scylladb#28985
2026-05-11 16:57:46 +03:00
Piotr Smaron
71542206bc cql: return InvalidRequest for oversized partition/clustering keys
When a partition key or clustering key value exceeds the 64 KiB limit
(65535 bytes serialized), Scylla used to raise a generic
std::runtime_error "Key size too large: N > M" from the low-level
compound-key serializer. That error surfaced to clients as a CQL
server error (code 0x0000, "NoHostAvailable"-looking), which is both
ugly and incompatible with Cassandra - Cassandra returns a clean
InvalidRequest with the message "Key length of N is longer than
maximum of M".

Fix this at the single chokepoint: compound_type::serialize_value in
keys/compound.hh. The serializer is on every path that materializes a
key - INSERT/UPDATE/DELETE/BATCH build mutations through it, and
SELECT builds partition and clustering ranges through it - so a single
throw replacement produces a clean InvalidRequest consistently across
all paths and all key shapes (single, compound PK, composite CK).
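
A simplified Python model of a single-chokepoint check, assuming each key component is written with a 2-byte big-endian length prefix. `serialize_value` here is an illustration, not the C++ function, and `InvalidRequest` is a local placeholder class:

```python
MAX_KEY_COMPONENT_SIZE = 65535  # largest length a 2-byte prefix can hold


class InvalidRequest(Exception):
    """Placeholder for the CQL InvalidRequest error."""


def serialize_value(components):
    """Serialize key components, rejecting oversized ones in one place."""
    out = bytearray()
    for value in components:
        if len(value) > MAX_KEY_COMPONENT_SIZE:
            raise InvalidRequest(
                f"Key length of {len(value)} is longer than maximum of "
                f"{MAX_KEY_COMPONENT_SIZE}")
        out += len(value).to_bytes(2, "big") + value
    return bytes(out)
```

Every caller that builds a key goes through this one function, so reads and writes report the same error without per-statement checks.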

The previous approach on this PR branch patched three call sites in
cql3/restrictions/statement_restrictions.cc, which only covered
SELECT, duplicated the check, and placed it mid-restrictions code
(flagged in review). Dropping those changes in favour of the
root-cause fix here.

Un-xfail the tests this fixes:
- test/cqlpy/test_key_length.py: test_insert_65k_pk, test_insert_65k_ck,
  test_where_65k_pk, test_where_65k_ck, test_insert_65k_ck_composite,
  test_insert_total_compound_pk_err, test_insert_total_composite_ck_err.
- test/cqlpy/cassandra_tests/.../insert_test.py: testPKInsertWithValueOver64K,
  testCKInsertWithValueOver64K.
- test/cqlpy/cassandra_tests/.../select_test.py: testPKQueryWithValueOver64K.

test_insert_65k_pk_compound stays xfail: its oversized value gets
rejected by the Python driver's CQL wire-protocol encoder (see
CASSANDRA-19270) before reaching the server, so the fix can't apply.
Updated its reason. testCKQueryWithValueOver64K stays xfail with an
updated reason: Cassandra silently returns empty for an oversized
clustering key in WHERE, while Scylla now throws InvalidRequest - a
deliberate choice mirroring the partition-key case, documented in
the discussion on #10366.

Add three tight-boundary tests (addressing review feedback on the
previous revision) that pin MAX+1 behaviour for SELECT and INSERT of
both partition and clustering keys.

Update test/cluster/dtest/limits_test.py to match the new message
("Key length of \\d+ is longer than maximum of 65535").

fixes #10366
fixes #12247

Co-authored-by: Alexander Turetskiy <someone.tur@gmail.com>

Closes scylladb/scylladb#23433
2026-05-11 16:56:35 +03:00
Piotr Smaron
959f67b345 cql: verify tuples length in multi-column IN restriction
When a multi-column IN restriction contains tuples with a different
number of elements than the number of restricted columns (e.g.
`(b, c, d) IN ((1, 2), (2, 1, 4))`), Scylla would either produce an
inconsistent error message or, for over-sized tuples, an internal
type-mismatch error referencing the list literal representation.

Validate each tuple's arity against the number of restricted columns
while building the IN restriction and raise a clear
"Expected N elements in value tuple, but got M" error in both the
under- and over-sized cases.
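
The arity check can be sketched in Python as follows. The function name and the `InvalidRequest` class are hypothetical; the message text is the one quoted above.

```python
class InvalidRequest(Exception):
    """Placeholder for the CQL InvalidRequest error."""


def validate_in_tuples(columns, tuples):
    """Reject any IN tuple whose length differs from the column count."""
    expected = len(columns)
    for values in tuples:
        if len(values) != expected:
            raise InvalidRequest(
                f"Expected {expected} elements in value tuple, "
                f"but got {len(values)}")
```

For `(b, c, d) IN ((1, 2), (2, 1, 4))` the first tuple already fails the check, so the error names 3 expected and 2 received elements.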

Fixes #13241

Co-authored-by: Alexander Turetskiy <someone.tur@gmail.com>

Closes scylladb/scylladb#18407
2026-05-11 16:55:09 +03:00
Anna Stuchlik
a7b7019f90 doc: update the node size limit
This commit increases the node size limit from 256 to 4096 CPUs
based on be1f566488

Fixes SCYLLADB-1676

Closes scylladb/scylladb#29602
2026-05-11 16:38:53 +03:00
Nadav Har'El
f1b2b9bd52 Merge 'Register fulltext_index custom index type' from Dawid Pawlik
This PR adds the `fulltext_index` custom index class, laying the groundwork for full-text search in ScyllaDB. It focuses on the CQL-facing layer - schema validation, option parsing, and metadata - without implementing the search backend itself.

Users can now write:

```cql
CREATE CUSTOM INDEX ON t(content) USING 'fulltext_index'
WITH OPTIONS = {'analyzer': 'english', 'positions': 'false'};
```

The implementation follows the same custom index pattern established by vector search: a `custom_index` subclass registered in the factory map, with no backing materialized view. This keeps the door open for a CDC-based indexing pipeline similar to the one vector search uses.

As part of this work, the option validation helpers (`validate_enumerated_option`, `validate_positive_option`, `validate_factor_option`) were extracted from `vector_index.cc` into a shared header so both index types can reuse them. The `custom_index` base class also gained a virtual `index_type_name()` method, giving each subclass a self-describing name for error messages without hardcoding strings in shared code.

The PR is split into three commits:

1. Extract shared validation utilities and add `index_type_name()` to `custom_index`
2. Implement `fulltext_index` with column type and option validation
3. Integration tests covering creation, validation, describe, and metadata

Fixes: SCYLLADB-1517
Fixes: SCYLLADB-1510
References: SCYLLADB-1516

Closes scylladb/scylladb#29658

* github.com:scylladb/scylladb:
  test/cqlpy: add integration tests for `fulltext_index`
  index: unify custom index description
  index: add `fulltext_index` custom index implementation
  index: extract option validation helpers
2026-05-11 16:16:58 +03:00
Nadav Har'El
fcfad51284 Merge 'cql3/selection: require EXECUTE on UDA REDUCEFUNC at SELECT time' from Marcin Maliszkiewicz
selection::used_functions() pushed the UDA, its SFUNC and its FINALFUNC,
but never the REDUCEFUNC. The reducefunc is invoked by the distributed
aggregation path in service::mapreduce_service, so a user could cause it
to run server-side without holding EXECUTE on it as long as the query
took the mapreduce path.

Also push agg.state_reduction_function so select_statement::check_access
requires EXECUTE on it too.
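
A small Python model of the permission check, with a hypothetical `Aggregate` record and a set of granted function names standing in for the real authorization layer:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Aggregate:
    name: str
    sfunc: str
    finalfunc: Optional[str] = None
    reducefunc: Optional[str] = None


def used_functions(agg):
    """Every function a SELECT of this aggregate may run server-side."""
    funcs = [agg.name, agg.sfunc]
    if agg.finalfunc:
        funcs.append(agg.finalfunc)
    if agg.reducefunc:
        funcs.append(agg.reducefunc)  # the entry that used to be missing
    return funcs


def check_access(agg, granted):
    missing = [f for f in used_functions(agg) if f not in granted]
    if missing:
        raise PermissionError(f"no EXECUTE permission on {', '.join(missing)}")
```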

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1756
Backport: no, it's a minor fix and UDFs are an experimental feature in Scylla

Closes scylladb/scylladb#29717

* github.com:scylladb/scylladb:
  test/cqlpy: add test for EXECUTE permission on UDA sub-functions
  cql3/selection: require EXECUTE on UDA REDUCEFUNC at SELECT time
2026-05-11 16:14:38 +03:00
Botond Dénes
cf37f541a0 Merge ' sstables_loader: ensure upload directory is empty when load_and_stream returns' from Taras Veretilnyk
After `load_and_stream` (e.g. via `nodetool refresh --load-and-stream`)
returns success, source sstable files in the `upload/` directory may
still be on disk. `mark_for_deletion()` only sets an in-memory flag; the
actual file deletion runs lazily when the last `shared_sstable`
reference drops.

This leaves a window between API success and physical deletion where a
follow-up scan of the upload directory can detect sstables that are about to be deleted.
This might cause a failure because the SSTable may already be wiped during processing.

The fix:
Force unlink to complete before `stream()` returns, so the upload
directory is in a consistent state by the time the API reports success.
For tablet streaming, partially-contained sstables participate in
multiple per-tablet batches; eagerly unlinking after each batch would
break the next batch that still needs to read the file. A
`defer_unlinking` flag on the streamer postpones the explicit unlink
until after all batches complete (called once at the end of
`tablet_sstable_streamer::stream()`). Vnode streaming unlinks eagerly at the end of
`stream_sstable_mutations`.
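
Why the deferral is needed can be shown with a toy Python model. `UploadStreamer` and its fields are hypothetical; the set `on_disk` stands in for the upload directory.

```python
class UploadStreamer:
    """Toy model of eager vs. deferred unlinking of streamed files."""

    def __init__(self, files, defer_unlinking=False):
        self.on_disk = set(files)  # files still present in upload/
        self.streamed = []         # files read so far, in first-seen order
        self.defer_unlinking = defer_unlinking

    def stream_batch(self, batch):
        for name in batch:
            if name not in self.on_disk:
                raise FileNotFoundError(name)  # an earlier batch removed it
            if name not in self.streamed:
                self.streamed.append(name)
        if not self.defer_unlinking:
            self.unlink_streamed()  # eager mode: unlink after each batch

    def unlink_streamed(self):
        for name in self.streamed:
            self.on_disk.discard(name)  # safe to call more than once
```

With batches `["a", "b"]` and `["b", "c"]`, eager mode fails on the second batch because `b` is already gone, while deferred mode streams both and unlinks once at the end.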

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1647

Backport is required, as this fixes a bug that was introduced in 517a4dc4df.

Closes scylladb/scylladb#29599

* github.com:scylladb/scylladb:
  sstables_loader: synchronously unlink streamed sstables before returning
  sstables: make sstable::unlink() idempotent
2026-05-11 14:43:46 +03:00
Asias He
0204372156 repair: Reject repair requests where start and end tokens are equal
When a user calls the repair API with identical startToken and endToken
values, the code creates a wrapping interval (T, T]. This causes
unwrap() to split it into (-inf, T] and (T, +inf), covering the entire
token ring and triggering a full repair.

Reject such requests early with an error message matching
Cassandra's behavior: "Start and end tokens must be different."
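
The wrap-around effect can be reproduced with a short Python sketch. Representing a `(start, end]` interval as a tuple and the two function names are assumptions for illustration only:

```python
import math


def unwrap(start, end):
    """Split a (start, end] token interval into non-wrapping pieces."""
    if start < end:
        return [(start, end)]
    # start >= end: the interval wraps around the ring, so it becomes
    # (-inf, end] plus (start, +inf).
    return [(-math.inf, end), (start, math.inf)]


def validate_repair_range(start, end):
    """Reject the degenerate case before it turns into a full-ring repair."""
    if start == end:
        raise ValueError("Start and end tokens must be different.")
    return unwrap(start, end)
```

`unwrap(5, 5)` yields two pieces that together cover every token, which is why the request is rejected up front.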

Fixes: https://scylladb.atlassian.net/browse/CUSTOMER-358

Closes scylladb/scylladb#29821
2026-05-11 14:08:20 +03:00
Botond Dénes
ad7ac62835 Merge ' Add a node_owner column (locator::host_id) to system.sstables and make it part of the partition key' from Dimitrios Symonidis
Add a node_owner column (locator::host_id) to system.sstables and make it part of the partition key, so the primary key becomes PRIMARY KEY ((table_id, node_owner), generation).

This is the first step toward moving the sstables registry into system_distributed: once distributed, each node's startup scan must read only the rows it owns, which requires the owning node to be part of the partition key. Partitioning by (table_id, node_owner) turns that scan into a single-partition read of exactly the local node's rows.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1562
No need to backport this, keyspace over object storage is an experimental feature

Closes scylladb/scylladb#29659

* github.com:scylladb/scylladb:
  db, sstables: add node_owner to sstables registry primary key
  db, sstables: rename sstables registry column owner to table_id
2026-05-11 14:08:19 +03:00
Botond Dénes
2edfb91070 sstables: migrate all bufsize_mismatch_exception throw sites to throw_bufsize_mismatch_exception()
Replace the two remaining direct 'throw bufsize_mismatch_exception(...)'
call sites with the new throw_bufsize_mismatch_exception() helper, which
routes through throw_malformed_sstable_exception() and thus also respects
the --abort-on-malformed-sstable-error flag.

Affected files:
- sstables/sstables.cc (1 site, in check_buf_size())
- sstables/m_format_read_helpers.cc (1 site, in check_buf_size())
2026-05-11 11:58:14 +03:00
Botond Dénes
d65c1523c2 sstables: migrate all malformed_sstable_exception throw sites to throw_malformed_sstable_exception()
Replace all direct 'throw malformed_sstable_exception(...)' call sites
with the new throw_malformed_sstable_exception() helper, which respects
the --abort-on-malformed-sstable-error flag.
2026-05-11 11:58:14 +03:00
Botond Dénes
84c27658d9 sstables: make on_parse_error() and on_bti_parse_error() respect --abort-on-malformed-sstable-error
Both functions now check abort_on_malformed_sstable_error() first. If
set, they log the error and call std::abort() directly, generating a
coredump. Otherwise they fall through to the existing on_internal_error()
path, which is in turn controlled by --abort-on-internal-error.
2026-05-11 11:58:14 +03:00
Botond Dénes
4ebcc002d6 sstables: disable abort-on-malformed-sstable-error in tests that corrupt sstables on purpose
Add scoped_no_abort_on_malformed_sstable_error RAII guard (modeled after
seastar::testing::scoped_no_abort_on_internal_error) and use it in all
tests that intentionally corrupt sstables and expect
malformed_sstable_exception to be thrown rather than the process aborting.
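
The guard pattern translates to a Python context manager. This is a sketch: the module-level flag and its initial value of `True` are assumptions chosen so that the restore step is visible.

```python
from contextlib import contextmanager

_abort_on_malformed = True  # assumed to be enabled for the test run


def set_abort_on_malformed_sstable_error(value):
    """Set the flag and return its previous value."""
    global _abort_on_malformed
    previous = _abort_on_malformed
    _abort_on_malformed = value
    return previous


def abort_on_malformed_sstable_error():
    return _abort_on_malformed


@contextmanager
def scoped_no_abort_on_malformed_sstable_error():
    # Disable aborting for the duration of the block, then restore the
    # previous setting even if the block raises.
    previous = set_abort_on_malformed_sstable_error(False)
    try:
        yield
    finally:
        set_abort_on_malformed_sstable_error(previous)
```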
2026-05-11 11:58:14 +03:00
Botond Dénes
f6dc2cb5f8 sstables: introduce --abort-on-malformed-sstable-error infrastructure
Add the --abort-on-malformed-sstable-error command-line option and the
supporting infrastructure. When set, any malformed sstable error will
abort the process and generate a coredump instead of throwing an
exception. This is useful for debugging memory corruption that may
manifest as apparent sstable corruption.

The implementation introduces:
- throw_malformed_sstable_exception() and throw_bufsize_mismatch_exception()
  helper functions in sstables/sstables.cc, which check the new flag and
  either abort (with logging) or throw the appropriate exception.
- set_abort_on_malformed_sstable_error() / abort_on_malformed_sstable_error()
  to control the per-process atomic flag.
- abort_on_malformed_sstable_error config option (LiveUpdate, default false)
  wired up in main.cc alongside abort_on_internal_error.
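
A minimal Python sketch of the abort-or-throw helper. The `SstableErrorPolicy` class and the injectable `abort_fn` are hypothetical, added so the behaviour can be exercised without killing the process:

```python
import os
import sys


class MalformedSstableException(Exception):
    pass


class SstableErrorPolicy:
    """Toy model of a process-wide abort-on-malformed-sstable switch."""

    def __init__(self, abort_fn=os.abort):
        self.abort_on_malformed = False  # default: throw, do not abort
        self._abort = abort_fn

    def throw_malformed_sstable_exception(self, msg):
        if self.abort_on_malformed:
            # Log first, then abort so that a coredump captures the state.
            print(f"malformed sstable: {msg}", file=sys.stderr)
            self._abort()
        raise MalformedSstableException(msg)
```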

Call-site migration will follow in subsequent commits.
2026-05-11 11:58:14 +03:00
Botond Dénes
c3daa6379c sstables: refactor parse_path() to return std::expected<> instead of throwing
make_entry_descriptor() and the two overloads of parse_path() used to signal
parse failures by throwing malformed_sstable_exception, which made parse_path()
expensive to use as a probe (e.g. to classify directory entries).

Change make_entry_descriptor() and both parse_path() overloads to return
std::expected<T, sstring>, where the sstring carries the error message on
failure, eliminating the exception overhead at probe call sites.

Call sites that previously caught malformed_sstable_exception to treat the
path as a non-SSTable file (utils/directories.cc, db/snapshot/backup_task.cc,
tools/scylla-sstable.cc) now check the expected result directly.

Call sites where a parse failure is a genuine error (sstable_directory.cc,
sstables.cc, tools/schema_loader.cc, tools/scylla-sstable.cc) re-throw
explicitly as malformed_sstable_exception using the error string, preserving
the existing error propagation behaviour.
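
The two kinds of call sites can be sketched in Python, with a `(value, error)` pair standing in for `std::expected`. The file-name layout `<version>-<generation>-<format>-<component>` is a simplification and all function names besides `parse_path` are hypothetical:

```python
class MalformedSstableException(Exception):
    pass


def parse_path(name):
    """Return (descriptor, None) on success or (None, error) on failure."""
    parts = name.split("-", 3)
    if len(parts) != 4 or not all(parts):
        return None, f"invalid sstable file name: {name}"
    version, generation, fmt, component = parts
    return {"version": version, "generation": generation,
            "format": fmt, "component": component}, None


def is_sstable_file(name):
    """Probe call site: a failed parse just means 'not an sstable'."""
    descriptor, _ = parse_path(name)
    return descriptor is not None


def parse_path_or_throw(name):
    """Strict call site: a failed parse is re-raised as a real error."""
    descriptor, error = parse_path(name)
    if descriptor is None:
        raise MalformedSstableException(error)
    return descriptor
```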
2026-05-11 11:58:14 +03:00
Marcin Maliszkiewicz
fa9d15d31a test/cqlpy: add test for EXECUTE permission on UDA sub-functions
Verify that SELECT of a UDA requires EXECUTE on its SFUNC, FINALFUNC,
and REDUCEFUNC individually.  If any one permission is missing, the
query must be rejected at planning time (even on an empty table).

The test is parameterized over the three sub-functions and uses
Lua on Scylla or Java on Cassandra, so it runs on both backends.
The REDUCEFUNC case is skipped on Cassandra since REDUCEFUNC is a
Scylla extension.

Refs SCYLLADB-1756
2026-05-11 10:23:39 +02:00
copilot-swe-agent[bot]
9e7d67612c docs: fix typo in materialized views docs - "columns are" instead of "is"
The MV Select Statement description was missing the word "columns" and
used incorrect verb agreement, making the sentence grammatically broken
and ambiguous.

docs/cql/mv.rst: "which of the base table is included" →
"which of the base table columns are included"

Fixes #29662
Closes #29663

Co-authored-by: annastuchlik <37244380+annastuchlik@users.noreply.github.com>
2026-05-11 11:15:25 +03:00
Botond Dénes
eae15f4fdd Merge 'Share timeout_config between services' from Pavel Emelyanov
The timeout_config (more precisely, updateable_timeout_config) is used by alternator/controller and transport/controller. Both create a local copy of that object by constructing one out of db::config. Some options from this config are also needed by storage_proxy, but since it has no access to any timeout_config, it just reads db::config, which it gets from the database.

This PR introduces a top-level sharded<updateable_timeout_config>, initializes it from db::config values, and makes the existing users plus storage_proxy use it where required. Motivation: remove more replica::database::get_config() users. A side effect: timeout_config is no longer duplicated by the transport and alternator controllers.

Components' dependencies cleanup, not backporting.

Closes scylladb/scylladb#29636

* github.com:scylladb/scylladb:
  storage_proxy: Use shared updateable_timeout_config for CAS contention timeout
  alternator: Use shared updateable_timeout_config by reference
  cql_transport: Use shared updateable_timeout_config by reference
  storage_proxy: Use shared updateable_timeout_config by reference
  main: Introduce sharded<updateable_timeout_config>
  storage_proxy: Keep own updateable_timeout_config
2026-05-11 11:12:01 +03:00
Botond Dénes
9b2dfab2e5 Merge 'Don't use database.get_config() to fetch calculate_view_update_throttling_delay option' from Pavel Emelyanov
This option is used in two places: proxy and view-update-generator both need it to compute the calculate_view_update_throttling_delay() value. This PR moves the option onto the view_update_backlog top-level service, makes the calculating helper a method of that class, and patches the callers to use it. This eliminates more places that abuse database as a db::config accessor.

Code dependencies refactoring, not backporting

Closes scylladb/scylladb#29635

* github.com:scylladb/scylladb:
  view: Turn calculate_view_update_throttling_delay into node_update_backlog member
  view: Place view_flow_control_delay_limit_in_ms on node_update_backlog
  view: Add node_update_backlog reference to view_update_generator
2026-05-11 10:30:24 +03:00
Pavel Emelyanov
f39cbb1ec6 storage_proxy: Move maintenance_mode onto storage_proxy::config
Stop reading maintenance_mode through replica::database's db::config.
Add a properly typed maintenance_mode_enabled field to
storage_proxy::config, populate it in main.cc from cfg->maintenance_mode()
(same as messaging_service::config), and use a cached member in
storage_proxy instead of db.local().get_config().maintenance_mode().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29637
2026-05-11 10:11:20 +03:00
Yaniv Michael Kaul
631f1e1654 compaction: set_skip_when_empty() for validation_errors metric
Add .set_skip_when_empty() to compaction_manager::validation_errors.
This metric only increments when scrubbing encounters out-of-order or
invalid mutation fragments in SSTables, indicating data corruption.
It is almost always zero and creates unnecessary reporting overhead.

AI-Assisted: yes
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#29349
2026-05-11 09:12:40 +03:00
Yaniv Michael Kaul
b8a150e22c build: add -ftime-trace support for compilation profiling
Add a --time-trace flag to configure.py and a Scylla_TIME_TRACE CMake
option that enable Clang's -ftime-trace on all C++ compilations. When
enabled, each .o file produces a companion .json trace that can be
analyzed with ClangBuildAnalyzer or loaded in chrome://tracing to
identify slow headers and costly template instantiations.

This is the first step toward data-driven build speed improvements.

Refs #1

Usage:
  configure.py:  ./configure.py --time-trace --mode dev
  CMake:         cmake -DScylla_TIME_TRACE=ON -DCMAKE_BUILD_TYPE=Dev ..
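
One way to get a first look at the resulting traces without extra tools is to sum the time attributed to each included file. The sketch below is not part of this commit. It assumes the Chrome trace-event layout (a top-level `traceEvents` list whose complete events carry `name`, `dur` in microseconds, and `args.detail`), and the helper name `slowest_sources` is hypothetical.

```python
import json
from collections import defaultdict

def slowest_sources(trace_text, top=5):
    """Sum the duration of "Source" events per file in one trace JSON document."""
    events = json.loads(trace_text).get("traceEvents", [])
    totals = defaultdict(int)
    for ev in events:
        # Assumption: header parsing shows up as complete ("X") events named
        # "Source", with the file path in args.detail and dur in microseconds.
        if ev.get("ph") == "X" and ev.get("name") == "Source":
            path = ev.get("args", {}).get("detail", "<unknown>")
            totals[path] += ev.get("dur", 0)
    # Most expensive files first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top]

# Synthetic trace used only to exercise the function.
sample = json.dumps({"traceEvents": [
    {"ph": "X", "name": "Source", "dur": 1200, "args": {"detail": "a.hh"}},
    {"ph": "X", "name": "Source", "dur": 300, "args": {"detail": "b.hh"}},
    {"ph": "X", "name": "Source", "dur": 800, "args": {"detail": "a.hh"}},
    {"ph": "X", "name": "InstantiateClass", "dur": 5000, "args": {"detail": "foo<int>"}},
]})
print(slowest_sources(sample))
```

If the installed Clang reports header time with a different event shape, only the filter inside the loop needs to change.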

Closes scylladb/scylladb#29462
2026-05-11 08:55:33 +03:00
Dmitry Kropachev
85d0011b3c gitignore: add missing rust build artifacts
rust/**/target and Cargo.lock files under rust/inc/ and
rust/wasmtime_bindings/ were not ignored, nor was
test/resource/wasm/rust/target/.

Closes scylladb/scylladb#28943
2026-05-11 07:06:26 +03:00
Botond Dénes
3f72852d8c Merge 'Fix missing format string placeholders across the codebase (33 bugs across 14 modules )' from Yaniv Kaul
Fix 28 format string bugs plus 5 related format argument bugs across 14 modules
where `{}` placeholders were missing or arguments were wrong, causing arguments to
be silently dropped or misleading output from the `{fmt}` library.

Inspired by https://github.com/scylladb/scylladb/pull/29143 (which fixed a single
instance in `replica/table.cc`), a comprehensive audit of the entire codebase was
performed to find all similar issues.

- **Missing `{}` placeholder** (21 instances): format string simply lacks `{}` for a
  passed argument, e.g. `format("msg for table {}", group_id, table_id)` -- `group_id`
  is silently dropped
- **Spurious comma breaking C++ string literal concatenation** (2 instances): a comma
  after a string literal prevents adjacent-literal concatenation, turning the
  continuation into a format argument instead of part of the format string
- **Printf-style `%s` in fmtlib context** (4 instances): `%s` has no meaning in fmtlib
  and appears as literal text while the argument is silently ignored
- **Extra spurious argument** (1 instance): an extraneous `t.tomb()` argument inserted
  between correct arguments, causing wrong values in the wrong slots

- **Wrong variable in error message** (4 instances in `types/map.hh`): error messages
  for oversized map keys/values reported `map_size` (total entry count) instead of the
  actual `elem.first.size()` or `elem.second.size()` that exceeded the limit
- **Swapped argument order** (1 instance in `data_dictionary/data_dictionary.cc`):
  format string says `"Extraneous options for {type}: {values}"` but the values and
  type arguments were passed in reverse order

| Module | Bugs Fixed | Files |
|--------|:---------:|-------|
| `replica/` | 1 | `table.cc` |
| `service/` | 4 | `raft_group0.cc`, `storage_service.cc` |
| `db/` | 6 | `heat_load_balance.cc`, `commitlog_replayer.cc`, `view_update_generator.cc`, `view_building_worker.cc`, `row_locking.cc` |
| `cql3/` | 2 | `prepare_expr.cc`, `statement_restrictions.cc` |
| `transport/` | 4 | `event_notifier.cc` |
| `sstables/` | 3 | `partition_reversing_data_source.cc`, `reader.cc` |
| `alternator/` | 1 | `conditions.cc` |
| `cdc/` | 1 | `split.cc` |
| `raft/` | 1 | `server.cc` |
| `utils/` | 2 | `gcp/object_storage.cc`, `s3/client.cc` |
| `mutation/` | 1 | `mutation_partition.hh` |
| `ent/` | 2 | `kmip_host.cc`, `kms_host.cc` |
| `types/` | 4 | `map.hh` |
| `data_dictionary/` | 1 | `data_dictionary.cc` |

The `{fmt}` library's compile-time checker validates that each `{}` placeholder
references a valid argument, but does **not** verify the reverse -- that every
argument has a corresponding placeholder. Extra arguments are silently ignored
at both compile time and runtime.
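
Since the compiler does not flag surplus arguments, finding them means counting placeholders against arguments. The following is a minimal sketch of such a check, not the tool used for this audit. The helper names are hypothetical, and it understands only `{}`, `{N}`, `{:spec}` fields and the `{{` / `}}` escapes.

```python
import re

# Matches an escaped brace pair, or one replacement field such as {}, {0} or {:>8}.
_FIELD = re.compile(r"\{\{|\}\}|\{(\d*)(?::[^{}]*)?\}")

def count_placeholders(fmt):
    """Return how many arguments the format string consumes."""
    auto, highest = 0, -1
    for m in _FIELD.finditer(fmt):
        if m.group(0) in ("{{", "}}"):
            continue  # escaped brace, not a field
        if m.group(1):
            highest = max(highest, int(m.group(1)))  # explicit index {N}
        else:
            auto += 1  # automatic index {}
    return max(auto, highest + 1)

def surplus_arguments(fmt, nargs):
    """Number of passed arguments that no placeholder refers to."""
    return max(0, nargs - count_placeholders(fmt))

print(surplus_arguments("msg for table {}", 2))
```

A printf-style `%s` or an empty `()` contributes no placeholder, so both bug classes listed above show up as a non-zero surplus.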

Build verified with `dbuild ninja build/dev/scylla` -- compiles cleanly.

---

**Note:** Commits were amended to fix the author name from "Yaniv Michael Kaul" to "Yaniv Kaul".

Closes scylladb/scylladb#29448

* github.com:scylladb/scylladb:
  data_dictionary: fix swapped arguments in extraneous options error
  types: fix wrong variable in map key/value size error messages
  ent: fix missing format placeholders in encryption error/log messages
  mutation: fix spurious argument in shadowable_tombstone formatter
  utils: fix missing format placeholders in object storage log messages
  raft: fix missing format placeholder in server ostream operator
  cdc: fix missing format placeholder in error message
  alternator: fix missing format placeholder in error message
  sstables: fix missing format placeholders in error messages
  transport: fix printf-style format specifiers in fmtlib log calls
  cql3: fix missing format placeholders in error messages
  db: fix missing format placeholders in log and error messages
  service: fix missing format placeholders in log messages
  replica: fix missing format placeholder in cleanup log message
2026-05-11 07:04:42 +03:00
Yaron Kaikov
5694c93c12 build: add collect-dist target to organize build artifacts
Build artifacts are currently scattered across build/dist/$mode/redhat/,
tools/python3/build/, tools/cqlsh/build/, etc. with unpredictable names.
Add a new 'collect-dist' ninja target that gathers all distributable
artifacts into a well-known structure:

  build/$mode/dist/rpm/       -- all binary RPMs (no SRPMs)
  build/$mode/dist/deb/       -- all .deb packages
  build/$mode/dist/tar/       -- relocatable tarballs (already here)

The collection is done via a reusable 'collect_pkgs' ninja rule defined
directly in configure.py, which knows all the source paths. No external
script is needed.

Fixes: SCYLLADB-75

Closes scylladb/scylladb#29475
2026-05-11 06:54:29 +03:00
Michael Litvak
274024a76b configure.py: update compile_commands.json if stale
configure.py creates compile_commands.json in the root directory as a
symbolic link to the file in one of the build directories. If the file
already exists it does nothing.

However, the file may exist while its target does not. For example,
this happens if the build directory is removed and the project is then
built in a different mode. The file then remains as a stale symbolic
link.

To address this, when the file exists check also if it's a valid
symbolic link. If not, then recreate it with a valid target.
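
The check can be sketched as follows. This is an illustration of the idea rather than the code added to configure.py. The function name `refresh_symlink` and the file names in the demo are made up.

```python
import os
import tempfile

def refresh_symlink(link_path, target):
    """Create link_path -> target unless a valid link or file is already there."""
    # os.path.exists() follows symlinks, so it is False for a dangling link,
    # while os.path.islink() is still True for it.
    if os.path.islink(link_path) and not os.path.exists(link_path):
        os.unlink(link_path)  # stale link: its target is gone
    if not os.path.lexists(link_path):
        os.symlink(target, link_path)

with tempfile.TemporaryDirectory() as root:
    old = os.path.join(root, "dev_compile_commands.json")
    new = os.path.join(root, "release_compile_commands.json")
    link = os.path.join(root, "compile_commands.json")
    open(old, "w").close()
    open(new, "w").close()
    refresh_symlink(link, old)
    os.remove(old)              # simulate removing the old build directory
    refresh_symlink(link, new)  # the dangling link is replaced
    print(os.readlink(link) == new)
```

A link whose target still exists is left alone, which preserves the previous "do nothing if it already exists" behaviour for the common case.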

Closes scylladb/scylladb#29680
2026-05-10 22:17:16 +03:00
Piotr Szymaniak
459c1dc32f test/alternator: stop avoiding tablets in Streams tests
Alternator Streams now supports tablets, so stop skipping the TTL Streams test in tablet mode and stop forcing vnodes in the Streams audit test.

Refs SCYLLADB-463

Closes scylladb/scylladb#29697
2026-05-10 22:13:15 +03:00
Nadav Har'El
df8c9b17b8 Merge 'alternator: Graduate Alternator Streams from experimental' from Piotr Szymaniak
As a final step for https://scylladb.atlassian.net/browse/SCYLLADB-461 we need to graduate Alternator Streams from experimental.
So let's remove `--experimental-features=alternator-streams` and map the obsolete config string to `UNUSED` for backward compatibility. Also, remove the related gating of the feature.
Finally, stop providing the config flag in test configs.

Fixes SCYLLADB-1680
Fixes #16367

The documentation tracked by https://scylladb.atlassian.net/browse/SCYLLADB-462 still remains to be written.

This PR needs to hit 2026.2, so (only) if it branches before the PR is merged to `master`, we'd need to backport.

Closes scylladb/scylladb#29604

* github.com:scylladb/scylladb:
  test: Stop providing alternator-streams experimental flag
  alternator: Graduate Alternator Streams from experimental
2026-05-10 22:10:03 +03:00
Nadav Har'El
34136d3bc2 Merge 'vector_search: test: migrate CQL tests for vector search from C++/Boost to pytest' from Karol Nowacki
Migrate vector search (ANN ordered select query) CQL tests from C++/Boost suite to pytest.

This migration includes:
- New pytest tests in `test/cqlpy/test_vector_search_with_vector_store_mock.py`
- VectorStoreMock server as pytest fixture to simulate vector store responses

The benefits of this migration are:
- Extended test coverage to verify CQL protocol serialization and driver
- Reduced overall test time (no compilation required for pytest)

Fixes SCYLLADB-695

No backport needed as this is a refactoring.

Closes scylladb/scylladb#29593

* github.com:scylladb/scylladb:
  vector_search: test: migrate paging warnings tests to Python
  vector_search: test: migrate local_vector_index to Python
  vector_search: test: migrate vector_index_with_additional_filtering_column to Python
  vector_search: test: migrate cql_error_contains_http_error_description to Python
  vector_search: test: migrate pk in restriction test to Python
2026-05-10 22:09:17 +03:00
Nadav Har'El
d4aa528834 Merge 'load_balancer: fix tablet allocator dropped table' from Ferenc Szili
- Handle dropped tables gracefully in the tablet load balancer's `get_schema_and_rs()` instead of aborting with `on_internal_error`
- The load balancer operates on a token metadata snapshot but accesses the live schema for table lookups. A DROP TABLE applied by another fiber between coroutine yield points can remove a table from the live schema while it still exists in the snapshot, causing an abort.

`get_schema_and_rs()` now returns `std::optional` and logs a warning in debug log level instead of aborting when a table is missing. All callers skip dropped tables:
- `make_sizing_plan`: skips to next table
- `make_resize_plan`: skips to next table (merge suppression is moot)
- `check_constraints`: returns `skip_info{}` with empty viable targets
- `get_rs`: returns `nullptr`, checked by `check_constraints`

The call chain is: `make_plan` → `make_internode_plan` → `check_constraints` → `get_rs` → `get_schema_and_rs`. The `make_internode_plan` coroutine has multiple `co_await` yield points (`maybe_yield`, `pick_candidate`) between building the candidate tablet list and checking replication constraints. A DROP TABLE schema mutation applied during any of these yields removes the table from `_db.get_tables_metadata()` while the candidate list still references it.

Added `test_load_balancing_with_dropped_table` which simulates the race by capturing a token metadata snapshot, dropping the table, then calling `balance_tablets` with the stale snapshot.
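
The skip-instead-of-abort pattern can be illustrated with a small model. This is only a sketch: the real code is C++ and returns `std::optional`, while the names below (`get_schema`, `make_sizing_plan`, plain dicts and lists standing in for the live schema and the snapshot) are simplified stand-ins.

```python
from typing import Optional

def get_schema(live_tables: dict, table_id: str) -> Optional[str]:
    """Return the schema for table_id, or None if the table was dropped meanwhile."""
    return live_tables.get(table_id)

def make_sizing_plan(snapshot_tables, live_tables):
    """Plan only for tables that still exist in the live schema."""
    plan = []
    for table_id in snapshot_tables:
        schema = get_schema(live_tables, table_id)
        if schema is None:
            continue  # dropped after the snapshot was taken: skip, do not abort
        plan.append((table_id, schema))
    return plan

# The snapshot still lists t2, but it is already gone from the live schema.
print(make_sizing_plan(["t1", "t2"], {"t1": "schema-of-t1"}))
```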

Fixes: SCYLLADB-1664

This fix needs to be backported to versions: 2025.4, 2026.1

Closes scylladb/scylladb#29585

* github.com:scylladb/scylladb:
  test: verify load balancer handles dropped tables gracefully
  tablet_allocator: handle dropped tables gracefully in get_schema_and_rs
2026-05-10 22:07:51 +03:00
Nadav Har'El
63927e07ea Merge 'alternator/streams: keep disabled streams usable and purge on re-enable' from Piotr Szymaniak
When an Alternator stream is disabled, the data should continue to be accessible so that consumers can finish reading. When the stream is later re-enabled, a new StreamArn is produced and only then the old data is purged.

On disable, the existing CDC options (including preimage and postimage) are preserved so that DescribeStream can still report StreamViewType. All stream APIs continue to work on the disabled stream, with all shards reported as closed (EndingSequenceNumber set). No new CDC records are written; existing data expires via TTL after 24 hours.

On re-enable, the old CDC log table is dropped as a separate Raft group0 schema change and a fresh one is created with a new UUID, giving a new StreamArn. This is Alternator-specific — CQL CDC keeps reusing the log table. Re-enabling is the only way to immediately purge old stream data.

Old stream data is removed immediately upon re-enable (a discrepancy with DynamoDB, which keeps it readable for 24 hours through the old StreamArn).

Tests updated to cover the new disable and re-enable behavior.

Fixes #7239
Fixes SCYLLADB-523

Closes scylladb/scylladb#29413

* github.com:scylladb/scylladb:
  alternator/streams: remove dead next_iter in get_records
  test/alternator: fix stream wait timeouts to use wall-clock time
  docs/alternator: document stream disable/re-enable behavior
  alternator/streams: keep disabled streams usable and purge on re-enable
2026-05-10 22:04:35 +03:00
Nadav Har'El
e277f747bd Merge 'Make collection unfreezing more efficient' from Botond Dénes
Introduce `read_from_collection_cell_view()` which reads a `collection_mutation` directly from the IDL representation of a collection (`ser::collection_cell_view`). This cuts down the number of allocations required drastically compared to the current method of:

    IDL -> collection_mutation_description -> collection_mutation

Reduces the number of allocations needed to unfreeze a collection from O(collection_cell_count) to O(1) (actually, due to buffer fragmentation, it is O(collection_size)).
The new method is used when unfreezing frozen mutations and frozen mutation fragments. This is on the hot path: all writes with collections benefit.

Add a `--collection` flag to `perf-simple-query` to allow measuring the performance improvement of this PR.
With `dbuild -it -- build/release/scylla perf-simple-query --collection=16 -c1 -m2G --default-log-level=error --write`, the number of allocations drops from ~123 to 102, a significant reduction.

Refs: https://github.com/scylladb/scylladb/issues/3602 (solves one use-case out of the many listed therein)
Fixes: SCYLLADB-1046
Fixes: SCYLLADB-1077

Backport: this is an optimization so normally not a backport candidate, but we may have to backport to relieve certain customers

Closes scylladb/scylladb#29033

* github.com:scylladb/scylladb:
  test/perf/perf_simple_query: add --collection=N
  test/boost/frozen_mutation_test: add freeze/unfreeze test for large collections
  mutation/mutation_partition_view: use read_from_collection_cell_view() to read collections
  mutation/collection_mutation: introduce read_from_collection_cell_view()
  mutation/atomic_cell: atomic_cell_type: add write*() and *serialized_size()
  mutation/collection_mutation: generalize serialize_collection_mutation
  mutation/mutation_partition_view: avoid copying collection
  mutation/mutation_partition_view: accept collection_mutation in the consume API
  partition_builder: add move variant of accept_*_cell() collection overloads
2026-05-10 20:39:08 +03:00
Nadav Har'El
2501a22b10 alternator: remove unneeded call to format()
Removed a silly call to format() on a constant string without parameters.
2026-05-10 20:34:36 +03:00
Nadav Har'El
b3a62dc9d2 alternator: improve CONTAINS operator's validity checking
Copilot, which reviewed the implementation of the CONTAINS operator,
complained that in some places we assume without checking that the
user-provided parameter to CONTAINS has the expected structure.

Not doing all the checks explicitly is actually not terrible in
RapidJSON, because its methods like BeginMembers() always validate the
type before trying to follow a pointer, throwing an exception if
the JSON value doesn't have the right type. But it's still cleaner
to do these checks explicitly, and throw a clean SerializationError
instead of some internal server error. So this is what this patch does.

If the malformed object doesn't come from the query but rather comes
from the data, we just silently return false. This is our usual
convention - we don't expect malformed data in our database, but if
we do have some (see issue #8070) we shouldn't tell the user that
there was an error in his completely valid query.
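
The convention can be sketched as below. This is a deliberately simplified model covering only the list ("L") case. The real operator is implemented in C++ on top of RapidJSON, and the `SerializationError` class here is a stand-in for the error returned to the client.

```python
class SerializationError(Exception):
    """Stand-in for the client-visible error type."""

def _is_typed_value(v):
    # A DynamoDB-style value is an object with exactly one member: {"<type>": <payload>}.
    return isinstance(v, dict) and len(v) == 1

def contains(stored, operand):
    """Simplified CONTAINS: is `operand` an element of the stored list attribute?"""
    if not _is_typed_value(operand):
        # The operand comes from the request, so a bad shape is the user's error.
        raise SerializationError("CONTAINS operand must be a single-member object")
    if not _is_typed_value(stored):
        return False  # malformed stored data: not the fault of a valid query
    kind, payload = next(iter(stored.items()))
    if kind != "L" or not isinstance(payload, list):
        return False
    return operand in payload

print(contains({"L": [{"S": "a"}, {"S": "b"}]}, {"S": "a"}))
```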

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-10 20:34:36 +03:00
Yaniv Kaul
a6cf45f9e2 data_dictionary: fix swapped arguments in extraneous options error
The format string says "Extraneous options for {type}: {values}"
but the arguments were passed in the wrong order (values first, type
second), producing misleading error messages like
"Extraneous options for bucket,endpoint: S3" instead of
"Extraneous options for S3: bucket,endpoint".

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:51:20 +03:00
Yaniv Kaul
a13da94308 types: fix wrong variable in map key/value size error messages
Four error messages for oversized map keys/values reported map_size
(the total number of entries) instead of the actual key or value size
that exceeded the limit. The condition checks elem.first.size() or
elem.second.size(), but the error message printed map_size. This
affects both the bytes and managed_bytes serialization overloads.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:51:20 +03:00
Yaniv Kaul
bf1d59ad95 ent: fix missing format placeholders in encryption error/log messages
Fix two format string bugs:

- kmip_host.cc: cmd_in was passed as an argument to a trace log but
  had no {} placeholder, so the command was silently dropped.
- kms_host.cc: the XML node name (what) was passed to the error
  message but had no {} placeholder, so the error never showed which
  XML node was missing.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:51:20 +03:00
Yaniv Kaul
a76774f8f9 mutation: fix spurious argument in shadowable_tombstone formatter
The formatter for shadowable_tombstone had a spurious t.tomb()
argument between the timestamp and deletion_time arguments. This
caused t.tomb() (the whole tombstone) to be formatted into the
deletion_time={} slot, while the actual deletion_time count was
silently dropped. Remove the extra argument.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:51:19 +03:00
Yaniv Kaul
700b0b4c28 utils: fix missing format placeholders in object storage log messages
Fix two format string bugs:

- gcp/object_storage.cc: _session_path was passed but the format
  string had empty parentheses () instead of ({}), so the session
  path was silently dropped from the debug output.
- s3/client.cc: part_number was passed as an argument but had no {}
  placeholder. The upload_id ended up in the etag slot and was
  silently dropped. Add {} for all three values.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:51:19 +03:00
Yaniv Kaul
358f6fba9f raft: fix missing format placeholder in server ostream operator
The FSM state was passed as an argument but the format string had
empty parentheses () instead of ({}), causing the FSM state to be
silently dropped from the output.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:51:19 +03:00
Yaniv Kaul
605455f82d cdc: fix missing format placeholder in error message
The collection type name was passed as an argument but the format
string only had a trailing colon without a {} placeholder, so the
type name was silently dropped from the error message.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:51:19 +03:00
Yaniv Kaul
0c88ff6a40 alternator: fix missing format placeholder in error message
The values count was passed as an argument but had no {} placeholder,
so it was silently dropped. The analogous BETWEEN check on the line
above correctly uses {} -- apply the same pattern here.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:51:19 +03:00
Yaniv Kaul
e29f59347b sstables: fix missing format placeholders in error messages
Fix three format string bugs:

- partition_reversing_data_source.cc: _row_start was passed as an
  argument but had no {} placeholder in the invariant error message.
  Add {} for all three values to show the full diagnostic.
- reader.cc: two "Invalid boundary type" error messages passed the
  type value as an argument but had no {} placeholder, so the actual
  invalid type was never shown.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:51:19 +03:00
Yaniv Kaul
413497c9ce transport: fix printf-style format specifiers in fmtlib log calls
Four logger calls used %s (printf-style) instead of {} (fmtlib-style),
causing __func__ to be silently ignored and the literal text "%s" to
appear in the log output. The same file already uses {} correctly in
the on_create_function and on_create_aggregate handlers.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:51:19 +03:00
Yaniv Kaul
cfb568b5b5 cql3: fix missing format placeholders in error messages
Fix two format string bugs where arguments were silently dropped:

- prepare_expr.cc: the bad argument to count() was passed but had no
  {} placeholder, so users never saw what was actually passed.
- statement_restrictions.cc: the unsupported multi-column relation was
  passed but the trailing colon had no {} placeholder.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:51:19 +03:00
Yaniv Kaul
fdebed5746 db: fix missing format placeholders in log and error messages
Fix six format string bugs where arguments were silently dropped:

- heat_load_balance.cc: pp value was passed but had no {} placeholder.
- commitlog_replayer.cc: column_family_id was passed but table= had
  no {} placeholder.
- view_update_generator.cc: _sstables_with_tables.size() was passed
  but had no {} placeholder.
- view_building_worker.cc: exception pointer was passed but the
  trailing colon had no {} placeholder.
- row_locking.cc: partition key and clustering key were passed in
  error messages but had no {} placeholders.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:49:50 +03:00
Yaniv Kaul
4ee81f9b32 service: fix missing format placeholders in log messages
Fix four format string bugs:

- raft_group0.cc: the exception from sleep_and_abort was passed as an
  argument but had no {} placeholder, so it was silently dropped.
- storage_service.cc: loading topology trace was missing a placeholder
  for the cleanup field (9 args but only 8 placeholders).
- storage_service.cc: two join-rejection warnings had a spurious comma
  after the first string literal, breaking C++ string concatenation.
  This caused the continuation string to be treated as a separate
  format argument instead of being part of the format string, and
  params.host_id was silently dropped.
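
Python has the same adjacent-literal concatenation, and `str.format` likewise ignores surplus positional arguments, so the comma bug can be reproduced in a few lines. This is an analogy only: the real code is C++ using `{fmt}`, and the message text and values below are made up.

```python
def render(fmt, *args):
    """Format like a logger would: the first argument is the format string."""
    return fmt.format(*args)

ip, host_id = "10.0.0.1", "h-42"

# Intended: two adjacent literals form one format string with two placeholders.
good = render("node {} rejected: "
              "host id {}", ip, host_id)

# Bug: the comma ends the format string early, so the second literal becomes
# the first argument and the trailing arguments are never printed.
bad = render("node {} rejected: ",
             "host id {}", ip, host_id)

print(repr(good))
print(repr(bad))
```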

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:49:50 +03:00
Yaniv Kaul
f75248a734 replica: fix missing format placeholder in cleanup log message
The log message for tablet cleanup invalidation was missing a {}
placeholder for the table name (cf_name), causing it to be silently
dropped from the output. Add {}.{} to show both keyspace and table
name, consistent with the convention used elsewhere in the file.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:49:50 +03:00
Piotr Dulikowski
bc482bfdea database: add missing co_await on lock in create_local_system_table
The function database::create_local_system_table calls
get_tables_metadata().hold_write_lock(), but does not co_await the
returned future. Effectively, this code does not guarantee mutual
exclusion because it does not wait for the lock to be acquired and does
not guarantee that the lock is held long enough.

Fix this by adding the co_await that was missing.

Found by manual inspection. This code is not known to have caused any
problems so far, but it's clearly wrong - hence the fix.

Closes scylladb/scylladb#29806
2026-05-10 15:36:21 +03:00
Avi Kivity
5a887362e3 Merge 'Remove legacy tables creation code' from Gleb Natapov
Drop the creation code for the `service_levels` and `cdc_generation_descriptions_v2` tables since it is no longer needed. Old clusters will still have these tables because they were created earlier. The series also contains a small improvement around group0 creation.

No backport needed since this removes functionality.

Closes scylladb/scylladb#29482

* github.com:scylladb/scylladb:
  db/system_distributed_keyspace: remove system_distributed_everywhere since it is unused
  db/system_distributed_keyspace: drop CDC_TOPOLOGY_DESCRIPTION and CDC_GENERATIONS_V2
  db/system_distributed_keyspace: remove unused code
  db/system_distributed_keyspace: drop old cdc_generation_descriptions_v2 table
  db/system_distributed_keyspace: drop old service_levels table
  fix indent after the previous patch
  group0: call setup_group0 only when needed
2026-05-10 14:46:21 +03:00
Botond Dénes
67226e6f1b scylla-gdb.py: interval_printer: update for new layout
interval switched from std::optional<> to union + bools for bound
storage in 42d7ae1082.
Update the printer to work with the new layout. Keep the code
backwards compatible, 2025.1 still uses optionals and is still
supported.

Closes scylladb/scylladb#29738
2026-05-10 14:28:24 +03:00
Avi Kivity
ece4e0738f Merge 'docs/cql: fix syntax errors in CQL examples' from Yaniv Kaul
Fix 4 genuine CQL syntax errors in documentation examples, found by automated extraction and execution of doc code blocks against a live ScyllaDB instance.

- **insert.rst**: `USING TTL 86400 IF NOT EXISTS` → `IF NOT EXISTS USING TTL 86400` (wrong clause order produces syntax error)
- **ddl.rst**: Missing opening quote in ALTER KEYSPACE example (`dc2'` → `'dc2'`)
- **ddl.rst**: Hyphenated column names need double-quoting; also fix PRIMARY KEY referencing non-existent `customer_id` instead of `cust_id`
- **types.rst**: UDT `address` contains nested collections, so it must be `frozen<address>` when used as a column type

Built a CQL extractor that parses `.. code-block:: cql` blocks from RST docs, then executed all 194 extracted statements against ScyllaDB 2026.2.0-rc0. These 4 are confirmed syntax/semantic errors in the documentation.
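
An extractor of this kind can be quite small. The sketch below is an assumption about how such a tool might look, not the one used here: `extract_cql_blocks` is a hypothetical name, and directive options (lines such as `:emphasize-lines:`) are not handled and would end up in the block body.

```python
import textwrap

def extract_cql_blocks(rst_text):
    """Return the body of every `.. code-block:: cql` directive as a string."""
    blocks, lines, i = [], rst_text.splitlines(), 0
    while i < len(lines):
        line = lines[i]
        i += 1
        if line.strip() != ".. code-block:: cql":
            continue
        base = len(line) - len(line.lstrip())  # indentation of the directive itself
        body = []
        while i < len(lines):
            cur = lines[i]
            indent = len(cur) - len(cur.lstrip())
            if cur.strip() and indent <= base:
                break  # first non-blank line at or left of the directive ends the block
            body.append(cur)
            i += 1
        # Drop surrounding blank lines, then remove the common indentation.
        blocks.append(textwrap.dedent("\n".join(body).strip("\n")).strip())
    return blocks

doc = """Intro

.. code-block:: cql

   INSERT INTO t (a) VALUES (1)
   IF NOT EXISTS USING TTL 86400;

Next paragraph
"""
print(extract_cql_blocks(doc))
```

Each returned string can then be split into statements and sent to a test cluster. How to do that split and execution is outside this sketch.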

Closes scylladb/scylladb#29765

* github.com:scylladb/scylladb:
  test/cqlpy: add tests for hyphenated column names
  docs/cql: fix UDT example to use frozen<address>
  docs/cql: fix CREATE TABLE example with hyphenated column names
  docs/cql: fix missing opening quote in ALTER KEYSPACE example
  docs/cql: fix INSERT example clause order (IF NOT EXISTS before USING)
2026-05-10 14:23:30 +03:00
Anna Stuchlik
61d1cbfd20 doc: add the upgrade guide from 2026.1 to 2026.2
This commit adds the upgrade guide, including the updated metrics.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1746

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1765

Closes scylladb/scylladb#29694
2026-05-10 14:20:09 +03:00
Benny Halevy
a797c9f10b table: delete sstables atomically per compaction group during truncate
Prepare for truncate of tables on object storage, where we want to
limit the atomic deletion batches to produce smaller batch mutations.

This is safe since truncate does not really need to delete all sstables
in the table atomically — it is already non-atomic since each node and
each shard deletes its own sstables. The atomic deletion mechanism is
used for convenience.

Previously, discard_sstables collected all sstables from all compaction
groups on the shard into a single vector and issued one atomic delete
for all of them. Change to track removed sstables per compaction group
and issue separate atomic deletes per group using
coroutine::parallel_for_each, allowing concurrent deletion across
groups.

Closes scylladb/scylladb#29789
2026-05-10 14:08:10 +03:00
Botond Dénes
d0813769ec sstables/trie: add preemption points in trie_writer
The BTI partition index trie writer flushes all buffered nodes at the
end of each SSTable via complete_until_depth(0), called from
bti_partition_index_writer_impl::finish(). This is a tight synchronous
loop that writes trie nodes through file_writer::write(), which uses a
buffered output_stream: individual writes that fit in the buffer are
plain memcpy operations returning a ready future, so .get() never
yields. As a result the reactor can stall for several milliseconds on
large SSTables.

The entire call chain runs inside seastar::async() (via
sstable::write_components()), so seastar::thread::maybe_yield() is
safe to call here. Add it at the top of both tight loops:
- complete_until_depth(), which iterates over trie depth
- lay_out_children(), which iterates over child branches per node

Fixes SCYLLADB-1885

Closes scylladb/scylladb#29798
2026-05-10 11:30:59 +03:00
Marcin Maliszkiewicz
fb55bef0ac cql3/selection: require EXECUTE on UDA REDUCEFUNC at SELECT time
selection::used_functions() pushed the UDA, its SFUNC and its FINALFUNC,
but never the REDUCEFUNC. The reducefunc is invoked by the distributed
aggregation path in service::mapreduce_service, so a user could cause it
to run server-side without holding EXECUTE on it as long as the query
took the mapreduce path.

Also push agg.state_reduction_function so select_statement::check_access
requires EXECUTE on it too.
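
The rule can be sketched in a few lines of Python. This is an illustration only, not the C++ code touched by the commit: the `Aggregate` class, `used_functions()` and `check_access()` below are hypothetical stand-ins that only borrow the names from the description above.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Aggregate:
    """Hypothetical stand-in for a user-defined aggregate (UDA)."""
    name: str
    sfunc: str
    finalfunc: Optional[str] = None
    reducefunc: Optional[str] = None


def used_functions(agg: Aggregate) -> list:
    """Return every function the caller must hold EXECUTE on."""
    funcs = [agg.name, agg.sfunc]
    if agg.finalfunc is not None:
        funcs.append(agg.finalfunc)
    # The step described in this commit: the reduce function may run
    # server-side on the distributed aggregation path, so it is listed too.
    if agg.reducefunc is not None:
        funcs.append(agg.reducefunc)
    return funcs


def check_access(agg: Aggregate, granted: set) -> None:
    """Raise PermissionError if EXECUTE is missing on any used function."""
    missing = [f for f in used_functions(agg) if f not in granted]
    if missing:
        raise PermissionError("EXECUTE required on: " + ", ".join(missing))
```

With this shape, a role that holds EXECUTE on the aggregate, its SFUNC and its FINALFUNC but not on the REDUCEFUNC is rejected regardless of which execution path the query takes.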

Fixes SCYLLADB-1756
2026-05-08 16:37:52 +02:00
Botond Dénes
8ca0f2dd54 Merge 'raft: do not throw commit_status_unknown from add_entry when possible' from Patryk Jędrzejczak
Previously, when a snapshot load subsumed a committed entry before apply()
was called locally, add_entry would throw commit_status_unknown -- even
though the entry was known to be committed and included in the snapshot.
This was overly pessimistic. Normal state machine implementations
shouldn't care whether an entry was applied via apply() or via a snapshot load.
Unnecessary commit_status_unknown caused flakiness of
test_frequent_snapshotting and unnecessary retries in group0. Raft groups
from strongly consistent tables couldn't hit unnecessary
commit_status_unknown's because they use wait_type::committed and
`enable_forwarding == false`.

Three sites are changed:

1. wait_for_entry (truncation case): the snapshot-term match optimization
   that proved the entry was committed now applies to both wait_type::committed
   and wait_type::applied, not just committed.

2. wait_for_entry (snapshot covers entry): instead of throwing
   commit_status_unknown when the snapshot index >= entry index, return
   successfully. The entry's effects are included in the state machine's
   state via the snapshot.

3. drop_waiters: when called from load_snapshot, pass the snapshot term.
   Waiters whose term matches the snapshot term are resolved successfully
   (set_value) instead of failing with commit_status_unknown, since the
   Log Matching Property guarantees they were committed and included.

This deflakes test_frequent_snapshotting: the test uses aggressive
snapshot settings (snapshot_threshold=1) causing wait_for_entry to
occasionally find the snapshot covering its entry. Previously this
threw commit_status_unknown, failing the test. With this fix,
wait_for_entry returns success. Note that apply() is never actually
skipped in this test -- the leader always applies entries locally
before taking a snapshot.

The nemesis test is updated to handle the new behavior:
call() detects when add_entry succeeded but the output channel was
not written (apply() skipped locally) and returns apply_skipped instead
of hanging. The linearizability checker in basic_generator_test counts
skipped applies separately from failures. basic_generator_test
exercises this path: skipped_applies > 0 occurs in some runs.

Fixes: SCYLLADB-1264

No backport: the changes are quite risky and the test being fixed
fails very rarely.

Closes scylladb/scylladb#29685

* github.com:scylladb/scylladb:
  test/raft: fix duplicate check in connected::operator()
  test/raft: add tests for add_entry snapshot interactions
  raft: do not throw commit_status_unknown from add_entry when possible
  raft: change drop_waiters parameter from index to snapshot descriptor
  raft: server: fix a typo
2026-05-08 16:39:52 +03:00
Dawid Pawlik
b6d5ff344b test/cqlpy: add integration tests for fulltext_index
Add `test_fulltext_index.py` covering the `fulltext_index` custom index:
- Creation on text, varchar, and ascii columns
- Rejection of non-text types (int, blob, vector)
- Validation of analyzer and positions options
- Rejection of unsupported option keys
- Case-insensitive class name lookup
- DESCRIBE INDEX output with and without options
- No backing materialized view in `system_schema.views`
- IF NOT EXISTS idempotent behavior
- Metadata correctness in `system_schema.indexes`
2026-05-08 11:30:08 +02:00
Dawid Pawlik
2076164af9 index: unify custom index description
Move common description logic into a protected helper
`describe_with_target` on `custom_index`, so subclasses can delegate
to it when implementing the `describe()` virtual method.
2026-05-08 11:30:08 +02:00
Dawid Pawlik
fcd15b5cd4 index: add fulltext_index custom index implementation
Introduce `fulltext_index`, a new `custom_index` subclass
for full-text search (FTS).

The index validates that the target column is a text type
(text, varchar, or ascii) and supports two WITH OPTIONS keys:
- 'analyzer': one of standard, english, german, french, spanish,
  italian, portuguese, russian, chinese, japanese, korean, simple,
  whitespace
- 'positions': boolean controlling whether term positions are stored

`view_should_exist()` returns false — no backing materialized view is
created, matching the CDC-backed pattern used by `vector_index`.

Fixes: SCYLLADB-1517
2026-05-08 11:30:08 +02:00
Dawid Pawlik
a396129e5c index: extract option validation helpers
Move `validate_enumerated_option`, `validate_positive_option`,
and `validate_factor_option` into shared index option utilities
under the `secondary_index::util` namespace. These functions were
previously defined as file-local statics in `vector_index.cc` with
hardcoded index names in error messages.

The shared versions take `index_type_name` as a parameter, allowing
each `custom_index` subclass to pass its own name via the virtual
`index_type_name()` method at the call site. The options maps use
`std::bind_front` to bind config params (supported values, limits),
leaving `index_name` as the first unbound argument passed by
`check_index_options()`.

Add `index_type_name()` as a pure virtual method on `custom_index`.
Move the shared utility implementations into `index_option_utils.cc`
and update `vector_index.cc` to use them.
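
The binding pattern can be illustrated with a rough Python analogue, using `functools.partial` in place of `std::bind_front`. The function bodies, the error wording, and the accepted spellings of the boolean `positions` option are assumptions made for the sketch. The analyzer names are the ones listed for `fulltext_index` in this series.

```python
from functools import partial

# Analyzer names as listed in the fulltext_index commit of this series.
FULLTEXT_ANALYZERS = (
    "standard", "english", "german", "french", "spanish", "italian",
    "portuguese", "russian", "chinese", "japanese", "korean", "simple",
    "whitespace",
)


def validate_enumerated_option(supported, index_type_name, option, value):
    """Reject a value that is not one of the supported choices."""
    if value not in supported:
        raise ValueError(
            f"Invalid value '{value}' for option '{option}' of "
            f"{index_type_name}; supported: {', '.join(supported)}")


# Config parameters are bound up front; the index type name stays the
# first unbound argument and is supplied at the call site.
FULLTEXT_OPTIONS = {
    "analyzer": partial(validate_enumerated_option, FULLTEXT_ANALYZERS),
    # Assumption: booleans are spelled "true"/"false" in this sketch.
    "positions": partial(validate_enumerated_option, ("true", "false")),
}


def check_index_options(index_type_name, validators, options):
    """Run the matching validator for every option, rejecting unknown keys."""
    for key, value in options.items():
        validator = validators.get(key)
        if validator is None:
            raise ValueError(
                f"Unsupported option '{key}' for {index_type_name}")
        validator(index_type_name, key, value)
```

Because the type name is passed in rather than hardcoded, the same validators can serve several index types and still produce messages that name the right one.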
2026-05-08 11:28:39 +02:00
Patryk Jędrzejczak
4c3a86c515 test/raft: fix duplicate check in connected::operator()
The operator had a copy-paste bug: it checked
disconnected.contains({id1, id2}) twice instead of checking both
directions ({id1, id2} and {id2, id1}).

Reduce the operator to a single directional check: {id1, id2}. It works
for all current callers, and checking both directions correctly would
break the new block_receive() function.
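
A small Python model shows why the directional check matters. It is a sketch under assumptions: the pair order (sender first, receiver second) and the `block_receive()` signature are hypothetical and are not taken from the test code.

```python
class Connectivity:
    """Hypothetical model of directional link state in a test network."""

    def __init__(self):
        # Assumption: (src, dst) in the set means that messages sent
        # from src to dst are dropped.
        self.disconnected = set()

    def disconnect(self, id1, id2):
        """Cut the link in both directions."""
        self.disconnected.add((id1, id2))
        self.disconnected.add((id2, id1))

    def block_receive(self, receiver, sender):
        """Drop only the messages flowing from sender to receiver."""
        self.disconnected.add((sender, receiver))

    def connected(self, src, dst):
        # Single directional lookup. Checking (dst, src) as well would
        # report a one-way block as a two-way disconnect.
        return (src, dst) not in self.disconnected
```

In this model, full disconnects still work through the single lookup because `disconnect()` records both directions explicitly.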
2026-05-08 11:18:02 +02:00
Patryk Jędrzejczak
ccd92c0b6b test/raft: add tests for add_entry snapshot interactions
Add six tests covering add_entry with wait_type::applied and
wait_type::committed for three snapshot scenarios affected in the
previous commit:

1. Snapshot at the entry's index (wait_for_entry, term_for returns
   snapshot term).

2. Snapshot past the entry's index (wait_for_entry, term_for returns
   nullopt).

3. Follower's waiter is resolved via drop_waiters when a snapshot
   is loaded.

Without the fix in the previous commit, 4 of 6 tests fail:
all 3 wait_type::applied tests and the wait_type::committed
drop_waiters test. The remaining two tests pass because the changes
don't affect them.

We don't write tests covering the scenarios when add_entry should
still throw commit_status_unknown (that is when the entry's term
doesn't match the snapshot's term) because:
- these tests would be very complicated,
- a bug that would make these tests fail should also make the
  nemesis tests fail, as there would be an issue with linearizability.
2026-05-08 11:18:02 +02:00
Patryk Jędrzejczak
a7f204ee45 raft: do not throw commit_status_unknown from add_entry when possible
Previously, when a snapshot load subsumed a committed entry before apply()
was called locally, add_entry would throw commit_status_unknown -- even
though the entry was known to be committed and included in the snapshot.
This was overly pessimistic. Normal state machine implementations
shouldn't care whether an entry was applied via apply() or via a snapshot load.
Unnecessary commit_status_unknown caused flakiness of
test_frequent_snapshotting and unnecessary retries in group0. Raft groups
from strongly consistent tables couldn't hit unnecessary
commit_status_unknown's because they use wait_type::committed and
`enable_forwarding == false`.

Three sites are changed:

1. wait_for_entry (truncation case): the snapshot-term match optimization
   that proved the entry was committed now applies to both wait_type::committed
   and wait_type::applied, not just committed.

2. wait_for_entry (snapshot covers entry): instead of throwing
   commit_status_unknown when the snapshot index >= entry index, return
   successfully. The entry's effects are included in the state machine's
   state via the snapshot.

3. drop_waiters: when called from load_snapshot, pass the snapshot term.
   Waiters whose term matches the snapshot term are resolved successfully
   (set_value) instead of failing with commit_status_unknown, since the
   Log Matching Property guarantees they were committed and included.
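
The drop_waiters decision from item 3 can be sketched in Python as follows. This is a simplified model, not the server code: waiters are reduced to an index-to-term mapping, and the names `SnapshotDescriptor` and `CommitStatusUnknown` are illustrative stand-ins for the C++ types.

```python
from dataclasses import dataclass


class CommitStatusUnknown(Exception):
    """The outcome of the entry cannot be determined on this node."""


@dataclass(frozen=True)
class SnapshotDescriptor:
    idx: int    # last log index covered by the snapshot
    term: int   # term of the entry at that index


def drop_waiters(waiters, snp):
    """Resolve the waiters covered by a loaded snapshot.

    `waiters` maps entry index -> term the entry was submitted in.
    Covered waiters are removed from `waiters`. The result maps each
    removed index to True (resolved successfully) or to a
    CommitStatusUnknown instance.
    """
    outcomes = {}
    for idx in sorted(waiters):
        if idx > snp.idx:
            break  # not covered by the snapshot, keep waiting
        term = waiters.pop(idx)
        if term == snp.term:
            # Same term as the snapshot: the entry belongs to the
            # snapshotted log prefix, so it is known to be committed.
            outcomes[idx] = True
        else:
            outcomes[idx] = CommitStatusUnknown(f"entry {idx}, term {term}")
    return outcomes
```

For example, with waiters `{4: 1, 5: 2, 9: 2}` and a snapshot at index 6 in term 2, the waiter at index 5 succeeds, the waiter at index 4 still gets the unknown status, and the waiter at index 9 is left pending.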

This deflakes test_frequent_snapshotting: the test uses aggressive
snapshot settings (snapshot_threshold=1) causing wait_for_entry to
occasionally find the snapshot covering its entry. Previously this
threw commit_status_unknown, failing the test. With this fix,
wait_for_entry returns success. Note that apply() is never actually
skipped in this test -- the leader always applies entries locally
before taking a snapshot.

The nemesis test is updated to handle the new behavior:
call() detects when add_entry succeeded but the output channel was
not written (apply() skipped locally) and returns apply_skipped instead
of hanging. The linearizability checker in basic_generator_test counts
skipped applies separately from failures. basic_generator_test
exercises this path: skipped_applies > 0 occurs in some runs.

Fixes: SCYLLADB-1264
2026-05-08 11:18:02 +02:00
Patryk Jędrzejczak
e2217c143f raft: change drop_waiters parameter from index to snapshot descriptor
Change drop_waiters(std::optional<index_t> idx) to
drop_waiters(const snapshot_descriptor* snp). The only caller that passes
an index is load_snapshot, which already has the full snapshot descriptor.
Using it directly makes the parameter self-documenting and prepares for the
following commit which will also need the snapshot term (a field of
snapshot_descriptor).
2026-05-08 11:18:02 +02:00
Patryk Jędrzejczak
3219786ab8 raft: server: fix a typo 2026-05-08 11:18:01 +02:00
Botond Dénes
a30ce98bc4 Merge 'test: speed up sstable compaction tests on remote storage (S3/GCS)' from Ernest Zaslavsky
Several sstable_compaction_test cases run prohibitively slowly on S3 and GCS backends — some taking 4+ minutes — because they create hundreds of SSTables sequentially over high-latency HTTP connections and perform redundant validation (checksumming) round-trips on every one. The twcs_reshape_with_disjoint_set S3 variant was even disabled entirely because of this.

The changes apply three complementary optimizations, per-test:

**Skip SSTable validation on remote storage.** The compaction tests verify strategy logic, not data integrity. SSTable validation triggers additional read-back I/O which is cheap on local disk but expensive over HTTP. A `do_validate` flag now conditionally skips validation when the storage backend is not local.

**Parallelize SSTable creation with async coroutines.** A new `make_sstable_containing_async` coroutine overload is added alongside the existing synchronous `make_sstable_containing`. Sequential creation loops are replaced with `parallel_for_each` using coroutine lambdas that call the async overload directly, overlapping S3/GCS uploads without spawning a dedicated Seastar thread per SSTable. The async validation path performs the same content checks as the synchronous version (mutation merging and `is_equal_to_compacted` assertions). Operations that depend on the created SSTables (e.g. `add_sstable_and_update_cache`, `owned_token_ranges` population) remain sequential.

**Reduce SSTable count for remote variants.** Tests like twcs_reshape_with_disjoint_set and stcs_reshape_overlapping used a hardcoded count of 256. The count is now a function parameter (default 256 for local, 64 for S3/GCS), which is sufficient to exercise the compaction strategy logic while avoiding excessive remote I/O.

Infrastructure changes: S3 endpoint max_connections raised from the default to 32 to support the higher upload concurrency, and trace-level logging added for s3, gcp_storage, http, and default_http_retry_strategy to aid future debugging.

The previously disabled twcs_reshape_with_disjoint_set_s3_test is re-enabled with these optimizations.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1428
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1843

No backport needed — this is a test-only performance improvement.

Closes scylladb/scylladb#29416

* github.com:scylladb/scylladb:
  test: optimize compaction_strategy_cleanup_method for remote storage
  test: optimize stcs_reshape_overlapping for remote storage
  test: optimize twcs_reshape_with_disjoint_set for remote storage
  test: parallelize SSTable creation in cleanup_during_offstrategy_incremental
  test: parallelize SSTable creation in run_incremental_compaction_test
  test: parallelize SSTable creation in offstrategy_sstable_compaction
  test: parallelize SSTable creation in twcs_partition_estimate
  test: add trace-level logging for S3 and HTTP in compaction tests
  test: make sstable test utilities natively async
    The original make_memtable used seastar::thread::yield() for preemption, which required all callers to run inside a seastar::thread context. This prevented the utilities from being used directly in coroutines or parallel_for_each lambdas. Make the primary functions (make_memtable, make_sstable_containing, and verify_mutation) return future<> directly. Callers now .get() explicitly when in seastar::thread context, or co_await when in a coroutine. make_memtable now uses coroutine::maybe_yield() instead of seastar::thread::yield(). verify_mutation is converted to coroutines as well.
    Requested in: https://github.com/scylladb/scylladb/pull/29416#pullrequestreview-4112296282
  test: move make_memtable out of external_updater in row_cache_test
  test: increase S3 max connections for compaction tests
2026-05-08 06:40:20 +03:00
Piotr Szymaniak
bc69fd7f11 alternator/streams: remove dead next_iter in get_records
The variable was constructed but never used — the original iterator
is returned instead. Fix the misleading comment to explain the
open-shard semantics of returning the original iterator.
2026-05-07 14:45:42 +02:00
Piotr Szymaniak
744848a85f test/alternator: fix stream wait timeouts to use wall-clock time
Both disable_stream and wait_for_active_stream used time.process_time()
for their timeouts, but process_time measures CPU time, not wall-clock
time. Since these loops spend most of their time sleeping and waiting on
API calls, the timeouts could last far longer than intended. Use
time.time() instead to enforce actual wall-clock deadlines.
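
A minimal sketch of a polling loop with a wall-clock deadline, for illustration. `wait_until` is a hypothetical helper, not the code of `disable_stream` or `wait_for_active_stream`.

```python
import time


def wait_until(predicate, timeout, interval=0.01):
    """Poll predicate() until it is true or `timeout` seconds have passed.

    The deadline uses time.time(), which advances while the loop sleeps.
    time.process_time() counts only CPU time of the process and excludes
    sleep, so a deadline based on it could take far longer to expire.
    """
    deadline = time.time() + timeout
    while True:
        if predicate():
            return True
        if time.time() >= deadline:
            return False
        time.sleep(interval)
```

Where clock adjustments are a concern, `time.monotonic()` is a common alternative for computing deadlines. The commit itself uses `time.time()`.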
2026-05-07 14:45:42 +02:00
Piotr Szymaniak
04b9214cf5 docs/alternator: document stream disable/re-enable behavior 2026-05-07 14:45:42 +02:00
Piotr Szymaniak
38bd068f78 alternator/streams: keep disabled streams usable and purge on re-enable
Previously, disabling Alternator Streams would create a blank
cdc::options with only enabled=false, which also meant losing access
to the stream's stored data (including preimage and postimage).

Now, when a stream is disabled:
- The existing CDC options are preserved (only 'enabled' is flipped to
  false), so StreamViewType remains available.
- DescribeStream enumerates all shards with EndingSequenceNumber set,
  indicating they are closed.
- GetRecords omits NextShardIterator for disabled streams.
- DescribeTable (supplement_table_stream_info) reports the stream ARN
  and StreamEnabled: false when the CDC log table still exists.
- ListStreams uses get_base_table instead of is_log_for_some_table so
  that disabled streams whose log table still exists are listed.

When a stream is re-enabled on an Alternator table that has an existing
(disabled) CDC log table, the old log table is dropped and a fresh one
is created with a new UUID, producing a new StreamArn. This is
Alternator-specific behavior; CQL CDC tables continue to reuse the
existing log table.

The old stream data is lost immediately upon re-enable. DynamoDB keeps
it readable for 24 hours.

Tests:
- test_streams_closed_read, test_streams_disabled_stream: remove xfail
  now that disabled streams are usable.
- test_streams_reenable: new test verifying that re-enabling produces
  a new ARN and the old data is still readable via the old ARN (xfail
  because Scylla currently purges old data on re-enable).

Fixes scylladb/scylladb#7239
2026-05-07 14:45:42 +02:00
Ferenc Szili
f7bc8f5fa7 test: boost: add drain test for forced capacity-based balancing
Add a Boost unit test that forces capacity-based balancing through
configuration and verifies that a drained and excluded node will be
drained of its tablets when tablet size stats are missing.

The test covers the regression where the allocator rejected the plan due
to incomplete tablet stats, even though forced capacity-based balancing
does not depend on tablet sizes.
2026-05-07 13:56:36 +02:00
Ferenc Szili
906d2b817e service: allow draining with forced capacity-based balancing
When force_capacity_based_balancing is enabled, the tablet allocator
balances by node and shard capacity rather than by tablet sizes.

When the data needed for load balancing is incomplete, the balancer
fails and waits until load_stats is available and correct for all the
nodes. An exception to this is when a node is being drained and
excluded: it is unreachable, and will not return. In this case
the balancer has to do its best and ignore the missing data.

This patch fixes a bug where forcing capacity based balancing made the
balancer not ignore missing data in these cases, and instead abort the
balancing.
2026-05-07 13:44:53 +02:00
Wojciech Mitros
ab12083525 test: propagate view update backlog before partition delete
In the test_delete_partition_rows_from_table_with_mv case we perform
a deletion of a large partition to verify that the deletion will
self-throttle when generating many view updates.
Before the deletion, we first build the materialized view, which causes
the view update backlog to grow. The backlog should be back to empty
when the view building finishes, and we do wait for that to happen, but
the information about the backlog drop may not be propagated to the
delete coordinator in time - the gossip interval is 1s and we perform
no other writes between the nodes in the meantime, so we don't make use
of the "piggyback" mechanism of propagating view backlog either. If the
coordinator thinks that the backlog is high on the replica, it may reject
the delete, failing this test.
We change this in this patch - after the view is built, we perform an
extra write from the coordinator. When the write finishes, the coordinator
will have the up-to-date view backlog and can proceed with the DELETE.
Additionally, we enable the "update_backlog_immediately" injection, which
makes the node backlog (the highest backlog across shards) update immediately
after each change.

Fixes: SCYLLADB-1795

Closes scylladb/scylladb#29775
2026-05-07 11:33:13 +03:00
Jenkins Promoter
454a8e6966 Update pgo profiles - aarch64 2026-05-07 10:09:36 +03:00
Andrzej Jackowski
eb241a7048 test: make preemptive abort coverage deterministic
The test used a real-time sleep to move the queued permit into the
preemptive-abort window. If the reactor did not get CPU for long
enough, admission could run only after the permit's timeout had
expired, making the expected abort path flaky.

The test also exhausted memory together with count resources, so the
queued permit could wait for memory. Preemptive abort is intentionally
not applied to permits waiting for memory, so keep enough memory
available and assert that the permit is queued only on count.

Use an immediate preemptive-abort threshold and a long finite timeout
to exercise admission-time abort without relying on scheduler timing.

Fixes: SCYLLADB-1796

Closes scylladb/scylladb#29736
2026-05-07 09:59:53 +03:00
Jenkins Promoter
5385df02ec Update pgo profiles - x86_64 2026-05-07 09:22:20 +03:00
Patryk Jędrzejczak
25fd1001c2 Merge 'alternator: improve CreateTable/UpdateTable schema agreement timeout' from Nadav Har'El
CreateTable and UpdateTable call wait_for_schema_agreement() after announcing the schema change, to ensure all live nodes have applied the new schema before returning to the user. This wait has a hard-coded 10 second timeout, and on some overloaded test machines we saw it not complete in time, causing tests to become flaky.

This patch increases this timeout from 10 seconds to 30 seconds. It's still hard-coded and not configurable via alternator_timeout_in_ms because it is unlikely any user will want to change it - it just needs to be long.

The patch also improves the behavior of a schema-agreement timeout, when it happens:

1. Provide an InternalServerError with more descriptive text.
2. This InternalServerError tells the user that the result of the operation is unknown, so the user will likely repeat the CreateTable and get a ResourceInUseException because the table exists. In that case too, we need to wait for schema agreement, so we added this missing wait.

Fixes SCYLLADB-1804
Refs #5052 (claiming CreateTable shouldn't wait at all)

This patch is only important to improve test stability in extremely slow test machines where schema agreement sometimes (very rarely) takes over 10 seconds. It's not important to backport it to branches that don't run CI very often on slow machines.

Closes scylladb/scylladb#29744

* https://github.com/scylladb/scylladb:
  alternator: improve CreateTable/UpdateTable schema agreement timeout
  migration_manager: unique timeout exception for wait_for_schema_agreement()
2026-05-06 16:56:46 +02:00
Ferenc Szili
ec4b483e88 test: fix flaky test_tablets_split_merge_with_many_tables
In debug mode, this test can time out during tablets merge. While the
test already decreases the number of tables in debug mode (20 tables,
instead of 200 for dev mode), this is not enough, and the test can still
time out during merge. This change reduces the number of tables from 20
to 5 in debug mode.

It also drops the log level for load_balancer to debug. This should make
any potential future problems with this test easier to investigate.

Fixes: SCYLLADB-1717

Closes scylladb/scylladb#29682
2026-05-06 17:02:10 +03:00
Petr Gusev
cab043323d test/cluster: fix test_lwt_fencing_upgrade flakiness during rolling upgrade
Replace the naive host.is_up check with wait_for_cql_and_get_hosts() which
actually executes a query against each host, ensuring the driver's connection
pool is fully re-established before proceeding to stop the last server.

The is_up flag is set asynchronously via gossip and doesn't guarantee the
connection pool has live TCP connections. After a server restart, the flag
may be True while the pool still holds stale connections. When the pool
monitor later discovers them dead it briefly marks the host DOWN, causing
NoHostAvailable if another server is being stopped concurrently.
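
The difference between trusting a status flag and probing each host can be sketched as below. This is not the implementation of wait_for_cql_and_get_hosts(): the function name, the `run_query` callback and the timing parameters are hypothetical, and the sketch is independent of any driver.

```python
import time


def wait_for_hosts_to_answer(hosts, run_query, timeout=60.0, interval=0.1):
    """Return the hosts once run_query(host) has succeeded for each one.

    A host counts as ready only after a real request went through, which
    also proves that a live connection to it exists. A status flag alone
    gives no such guarantee.
    """
    deadline = time.time() + timeout
    pending = list(hosts)
    while pending:
        still_pending = []
        for host in pending:
            try:
                run_query(host)
            except Exception:
                still_pending.append(host)
        pending = still_pending
        if pending:
            if time.time() >= deadline:
                raise TimeoutError(f"hosts not answering: {pending}")
            time.sleep(interval)
    return list(hosts)
```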

Fixes SCYLLADB-1840

Closes scylladb/scylladb#29769
2026-05-06 15:40:09 +03:00
Tomasz Grabiec
d6346e68c1 Merge 'prevent gossiper from marking nodes as down in tests unexpectedly' from Patryk Jędrzejczak
This PR includes two changes that make gossiper much less likely to mark
nodes as down in tests unexpectedly, and cause test flakiness in issues
like SCYLLADB-864:
- fixing false node conviction when echo succeeds,
- increasing the failure_detector_timeout fixture.

Fixes: SCYLLADB-864

No need for backport: related CI failures are rare, and merging #29522
made them even more unlikely (I haven't seen one since then, but it's
still possible to reproduce locally on dev machines).

Closes scylladb/scylladb#29755

* github.com:scylladb/scylladb:
  test/cluster: increase failure_detector_timeout
  gossiper: fix false node conviction when echo succeeds
2026-05-06 14:01:15 +02:00
Piotr Dulikowski
1dccfeb988 Merge 'vector_search: test: fix flaky test_dns_resolving_repeated' from Karol Nowacki
The `vector_store_client_test_dns_resolving_repeated` test was intermittently
timing out on CI. The exact root cause is not fully understood, but the
hypothesis is that a single trigger signal can be lost somewhere (not exactly
known where). This is not an issue for the production code because the refresh
trigger is called repeatedly whenever all configured nodes are
unreachable.

Fixes SCYLLADB-1794

Backport to 2026.1 and 2026.2, as the same CI flakiness can occur on these branches.

Closes scylladb/scylladb#29752

* github.com:scylladb/scylladb:
  vector_search: test: default timeout in test_dns_resolving_repeated
  vector_search: test: fix flaky test_dns_resolving_repeated
2026-05-06 13:46:36 +02:00
Botond Dénes
8d22ef3058 Merge 'commitlog_test.py: Fix size check aliasing, and threshold calc and fix CL chunk size est.' from Calle Wilund
Fixes: SCYLLADB-1815

If we're in a brand new chunk (no buffer yet allocated), we would miscalculate the actual size of an entry to write, possibly causing segment size overshoot. Break out some logic to share between this calc and new_buffer. Also remove redundant (and possibly wrong) constant in oversized allocation.

As for the test:
Checking segment sizes should not use a size filter that rounds (up) sizes.
More importantly, the estimate for what is an acceptable limit for commitlog disk usage should be aligned. Simplified the calc, and also made logging more useful in case of failure.

Closes scylladb/scylladb#29753

* github.com:scylladb/scylladb:
  commitlog_test.py: Fix size check aliasing, and threshold calc.
  commitlog: Fix segment/chunk overhead maybe not included in next_position calculation
2026-05-06 13:48:41 +03:00
Piotr Dulikowski
321006ecbd Merge 'auth: fix crash on ghost rows in role_permissions' from Marcin Maliszkiewicz
The auth cache crashes when it encounters rows in role_permissions that have a live row marker but no permissions column. These “ghost rows” were created by the now-removed auth v2 migration, which used INSERT (creating row markers) instead of UPDATE.

When permissions were later revoked, the row marker remained while the permissions column became null. An empty collection appears as null, since its lifetime is based only on its elements' cells.

As a result, when the cache reloads and expects the permissions column to exist, it hits a missing_column exception.

The series removes dead code that was the primary crash site, adds has() guards to the remaining access paths, and includes a test reproducer.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1816

Backport: all supported versions 2026.1, 2025.4, 2025.1

Closes scylladb/scylladb#29757

* github.com:scylladb/scylladb:
  test: add reproducer for auth cache crash on missing permissions column
  auth: tolerate missing permissions column in authorize()
  auth: add defensive has() guard for role_attributes value column
  auth: remove unused permissions field from cache role_record
2026-05-06 12:00:17 +02:00
Yaron Kaikov
65eabda833 pgo: fix ModuleNotFoundError in exec_cql.py by reverting safe_driver_shutdown
Commit cf237e060a introduced 'from test.pylib.driver_utils import
safe_driver_shutdown' in pgo/exec_cql.py. This module runs during PGO
profile training (a build step) where the test package is not on the
Python path, causing an immediate ModuleNotFoundError on both x86 and
ARM. Revert to plain cluster.shutdown() which is sufficient for the
single-use PGO training scenario.

Fixes: SCYLLADB-1792

Closes scylladb/scylladb#29746
2026-05-06 11:22:23 +02:00
Yaniv Michael Kaul
7557c64f20 test/cqlpy: add tests for hyphenated column names
Verify that double-quoted column names with hyphens (e.g. "my-col")
work correctly for CREATE TABLE, INSERT, and SELECT. Also verify that
unquoted hyphenated names are rejected with a syntax error.
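
For illustration, a small Python helper that decides when a column name needs double quotes. `quote_identifier` is a hypothetical name, and the sketch deliberately ignores reserved keywords, which also require quoting.

```python
import re

# Names that can be written without quotes: a letter followed by letters,
# digits or underscores. Only lowercase is accepted here because unquoted
# names are case-insensitive, so quoting is needed to preserve uppercase.
_UNQUOTED = re.compile(r"[a-z][a-z0-9_]*")


def quote_identifier(name: str) -> str:
    """Return `name` in a form usable as a CQL column name."""
    if _UNQUOTED.fullmatch(name):
        return name
    # Inside a quoted identifier a double quote is escaped by doubling it.
    return '"' + name.replace('"', '""') + '"'


def select_columns(table: str, columns: list) -> str:
    """Build a SELECT statement with safely quoted column names."""
    cols = ", ".join(quote_identifier(c) for c in columns)
    return f"SELECT {cols} FROM {quote_identifier(table)}"
```

A call such as `select_columns("t", ["cust_id", "my-col"])` leaves `cust_id` bare and wraps `my-col` in double quotes.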
2026-05-06 11:32:04 +03:00
Yaniv Michael Kaul
d13a56be2e docs/cql: fix UDT example to use frozen<address>
The 'address' UDT contains a nested collection (map<text, frozen<phone>>),
so it must be frozen when used as a column type. Non-frozen UDTs with
nested non-frozen collections are not supported.
2026-05-06 11:32:04 +03:00
Yaniv Michael Kaul
5c528e4e02 docs/cql: fix CREATE TABLE example with hyphenated column names
Column names containing hyphens must be double-quoted. Also fix
the PRIMARY KEY reference from 'customer_id' (non-existent) to
'cust_id' (the actual column).
2026-05-06 11:32:04 +03:00
Yaniv Michael Kaul
3e2b0f844c docs/cql: fix missing opening quote in ALTER KEYSPACE example
The dc2 key was missing its opening single quote: dc2' should be 'dc2'.
2026-05-06 11:32:04 +03:00
Yaniv Michael Kaul
815aad50af docs/cql: fix INSERT example clause order (IF NOT EXISTS before USING)
The grammar requires IF NOT EXISTS to appear before USING TTL,
not after. The example had 'USING TTL 86400 IF NOT EXISTS' which
produces a syntax error.
2026-05-06 11:32:04 +03:00
Karol Nowacki
20b953ef8c vector_search: test: migrate paging warnings tests to Python
Move the paging warning related tests from C++ vector_store_client_test to
Python test_vector_search_with_vector_store_mock.
2026-05-05 18:23:30 +02:00
Karol Nowacki
84787ce6a5 vector_search: test: migrate local_vector_index to Python
Move the local vector index test from C++ vector_store_client_test to
Python test_vector_search_with_vector_store_mock.

The test creates a local vector index on ((pk1, pk2), embedding) and
verifies that SELECT with partition key restriction and ANN ordering
works correctly.
2026-05-05 18:23:30 +02:00
Karol Nowacki
0bb7e47090 vector_search: test: migrate vector_index_with_additional_filtering_column to Python
Move the SCYLLADB-635 regression test from C++ vector_store_client_test
to Python test_vector_search_with_vector_store_mock.

The test creates a vector index on (embedding, ck1) and verifies that
SELECT with ANN ordering works correctly when additional filtering
columns are included in the index definition.
2026-05-05 18:23:30 +02:00
Karol Nowacki
5a8af3c727 vector_search: test: migrate cql_error_contains_http_error_description to Python
Move the test that verifies HTTP error descriptions from the vector
store are propagated through CQL InvalidRequest messages from the C++
vector_store_client_test to the Python test_vector_search_with_vector_store_mock.

The test configures the mock to return HTTP 404 with 'index does not
exist' and asserts the CQL SELECT raises InvalidRequest containing '404'.
2026-05-05 18:23:30 +02:00
Karol Nowacki
b672972c5f vector_search: test: migrate pk in restriction test to Python
Move vector search (ANN ordered select query) with IN restrictions on
partition key from C++/Boost test suite to pytest (cqlpy).

Add VectorStoreMock server as pytest fixture to simulate vector store
responses.
2026-05-05 18:23:30 +02:00
Karol Nowacki
207de967fb vector_search: test: default timeout in test_dns_resolving_repeated
Replace explicit 1-second timeouts in repeat_until() with the default
STANDARD_WAIT (10s). The 1-second timeout could be too aggressive for
loaded CI environments where lowres_clock granularity (~10ms) combined
with OS scheduling delays and resource contention (-c2 -m2G) could cause
the loop to expire before the DNS refresh task completes its cycle.

This also unifies test timeouts across test cases.
2026-05-05 17:23:39 +02:00
Karol Nowacki
4722be1289 vector_search: test: fix flaky test_dns_resolving_repeated
Move trigger_dns_resolver() inside the repeat_until loop instead of
calling it once before the loop.

The test was intermittently timing out on CI. The exact root cause is not
fully understood, but the hypothesis is that a single trigger signal can
be lost somewhere (not exactly known where). This is not an issue for the
production code because refresh trigger will be called multiple times -
in every query where all configured nodes will be unreachable.

By triggering inside the loop, we ensure the signal is re-sent on
each iteration until the resolver actually performs the refresh and
picks up the new (failing) DNS resolution. This makes the test
resilient to timing-dependent signal loss without changing production
code.
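
The difference between triggering once and triggering inside the loop can be sketched as follows. This is a simplified Python model, not the C++ test itself: `LossyResolver`, its `drop` counter, and this `repeat_until` are hypothetical stand-ins for the real resolver and test helper.

```python
import time


def repeat_until(predicate, timeout_s=10.0, interval_s=0.01):
    """Poll predicate() until it returns True or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if predicate():
            return True
        time.sleep(interval_s)
    return False


class LossyResolver:
    """Hypothetical resolver that silently loses the first `drop` triggers."""

    def __init__(self, drop):
        self.drop = drop
        self.refreshed = False

    def trigger(self):
        if self.drop > 0:
            self.drop -= 1          # the signal is lost
        else:
            self.refreshed = True   # the refresh actually happens


def trigger_once_then_wait(resolver, timeout_s):
    """Old shape: one trigger before the loop."""
    resolver.trigger()
    return repeat_until(lambda: resolver.refreshed, timeout_s=timeout_s)


def trigger_in_loop(resolver, timeout_s):
    """New shape: the trigger is re-sent on every iteration."""
    def attempt():
        resolver.trigger()
        return resolver.refreshed
    return repeat_until(attempt, timeout_s=timeout_s)
```

With a single lost signal the old shape can only time out, while the new shape succeeds as soon as one trigger gets through.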

Fixes: SCYLLADB-1794
2026-05-05 17:23:39 +02:00
Marcin Maliszkiewicz
5c5306c692 test: add reproducer for auth cache crash on missing permissions column 2026-05-05 17:16:25 +02:00
Marcin Maliszkiewicz
df69a5c79b auth: tolerate missing permissions column in authorize()
Ghost rows in role_permissions with a live row marker but no permissions
column can occur when permissions created via INSERT (e.g. by the removed
auth v2 migration) are later revoked. The row marker survives the revoke,
leaving a row visible to queries but with permissions=null.

Add a has() guard before accessing the permissions column, matching the
pattern already used in list_all(). Return NONE permissions for such
ghost rows instead of crashing.
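
A minimal Python sketch of the guard, assuming a row is modelled as a dict in which a ghost row simply lacks the `permissions` key (or holds `None`). The `Permission` flags and `permissions_from_row` are illustrative names, not the actual auth code.

```python
from enum import Flag


class Permission(Flag):
    NONE = 0
    SELECT = 1
    MODIFY = 2


def permissions_from_row(row):
    """Return the permissions stored in `row`, or NONE for a ghost row."""
    names = row.get("permissions")   # missing column and null are treated alike
    if not names:
        return Permission.NONE       # ghost row: live marker, no permissions
    result = Permission.NONE
    for name in names:
        result |= Permission[name]
    return result
```

The point is only that the column is checked for presence before it is read, so a row with a live marker and no regular columns maps to "no permissions" rather than an error.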
2026-05-05 15:50:40 +02:00
Marcin Maliszkiewicz
c44625ebdf auth: add defensive has() guard for role_attributes value column
Add a has() check before accessing the value column in role_attributes
to tolerate ghost rows with missing regular columns. In practice this
is unlikely to be a problem since attributes are not typically revoked,
but the guard is added for consistency and defensive programming.
2026-05-05 15:48:01 +02:00
Marcin Maliszkiewicz
797bc28aae auth: remove unused permissions field from cache role_record
The permissions field in role_record was populated by fetch_role() but
never read. Authorization uses cached_permissions instead, which is
loaded via the permission_loader callback. Remove the dead field and
its fetch code.

The removed code also did not check for missing columns before accessing
the permissions set, which could crash on ghost rows left by the removed
auth v2 migration. The migration used INSERT (creating row markers),
and when permissions were later revoked, the row marker survived while
the permissions column became null.
2026-05-05 15:48:01 +02:00
Marcin Maliszkiewicz
c00fee0316 Merge 'utils: loading_cache: add insert() that is a no-op when caching is disabled' from Dario Mirovic
When `permissions_validity_in_ms` is set to 0, executing a prepared statement under authentication crashes with:
```
    Assertion `caching_enabled()' failed.
        at utils/loading_cache.hh:319
        in authorized_prepared_statements_cache::insert
```

`loading_cache::get_ptr()` asserts when caching is disabled (expiry == 0), but `authorized_prepared_statements_cache::insert()` was using it purely for its side effect of populating the cache, which is meaningless when caching is off.

Add a new `loading_cache::insert(k, load)` method that is a no-op when caching is disabled and otherwise forwards to `get_ptr()`. Switch `authorized_prepared_statements_cache::insert()` to use it. This
completes the disabled-mode safety contract of the cache for the write side, mirroring the fallback that `get()` already provides for the read side.
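
The contract can be illustrated with a small Python model. It is a sketch only: the class below mimics the described behaviour (`get_ptr()` asserts when caching is disabled, `insert()` is a no-op) and does not reflect the real C++ template or its future-based API.

```python
class LoadingCache:
    """Toy model of a loading cache whose expiry may be 0 (caching disabled)."""

    def __init__(self, expiry_ms):
        self._expiry_ms = expiry_ms
        self._entries = {}

    def caching_enabled(self):
        return self._expiry_ms != 0

    def get_ptr(self, key, load):
        # Callers that need a handle into the cache require caching to be on.
        assert self.caching_enabled(), "caching_enabled()"
        if key not in self._entries:
            self._entries[key] = load(key)
        return self._entries[key]

    def insert(self, key, load):
        # Write-only path: silently do nothing when caching is disabled.
        if not self.caching_enabled():
            return
        self.get_ptr(key, load)

    def size(self):
        return len(self._entries)


def insert_and_count_loads(expiry_ms, keys):
    """Insert `keys` and return (number of loader calls, cache size)."""
    calls = []
    cache = LoadingCache(expiry_ms)
    for k in keys:
        cache.insert(k, lambda key: calls.append(key) or key.upper())
    return len(calls), cache.size()
```

With `expiry_ms=0` the loader is never invoked and the cache stays empty; with a non-zero expiry the loader runs once per distinct key.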

Includes a regression test in `test/boost/loading_cache_test.cc` plus a positive test for the new `insert()` overload.

Fixes SCYLLADB-1699

The crash was introduced a long time ago and is present on all live versions, from 2025.1 onward. There are no client tickets, but the fix should be backported.

Closes scylladb/scylladb#29638

* github.com:scylladb/scylladb:
  test: boost: regression test for loading_cache::insert with caching disabled
  utils: loading_cache: add insert() that is a no-op when caching is disabled
2026-05-05 15:33:49 +02:00
Patryk Jędrzejczak
9f692857be test/cluster: increase failure_detector_timeout
Scaling the timeout by build mode (#29522) turned out to be insufficient.
Nodes can still be unexpectedly marked as down, even with a 4s timeout in
dev mode. I managed to reproduce SCYLLADB-864 in such conditions.

Increasing failure_detector_timeout will proportionally slow down tests
that use it. That's bad, but currently these tests' flakiness is a much
bigger problem than the tests' slowness. Also, not many tests use this
fixture, and we hope to make it unneeded eventually (see #28495).
2026-05-05 15:12:33 +02:00
Patryk Jędrzejczak
efe0e39d85 gossiper: fix false node conviction when echo succeeds
failure_detector_loop_for_node() could falsely convict a healthy node
even when the echo succeeded. The code computed diff = now - last
(time since last successful echo) and checked diff > max_duration
unconditionally, regardless of whether the current echo failed or
succeeded.

This caused flakiness in tests that decrease the failure detector
timeout. We currently run #CPUs tests concurrently, and since cluster
tests start multiple nodes with 2 shards, multiple shards contend for
one CPU. As a result, some tasks can become abnormally slow and block
the failure detector loop execution for a few seconds.

Fix by only checking diff > max_duration when the echo actually
failed.
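
The change in the conviction rule can be sketched as two small Python functions. This is a simplified model of the description above, not the gossiper code; the function names and the plain-number timestamps are assumptions.

```python
def convict_old(echo_ok, now, last_ok, max_duration):
    """Old rule: the elapsed time is checked regardless of the echo result."""
    diff = now - last_ok
    return diff > max_duration


def convict_fixed(echo_ok, now, last_ok, max_duration):
    """Fixed rule: a successful echo never convicts the peer."""
    if echo_ok:
        return False
    return (now - last_ok) > max_duration
```

If the loop itself was stalled for 6 seconds with a 4 second limit and the echo then succeeds, the old rule convicts a healthy node while the fixed rule does not.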

Note that we send echo with the timeout equal to `max_duration` anyway,
so the receiver will be marked as down if it really doesn't respond.
2026-05-05 15:12:32 +02:00
Patryk Jędrzejczak
b69d00b0a7 Merge 'Barrier and drain logging' from Gleb Natapov
Add more logging to barrier and drain rpc to try and pinpoint https://github.com/scylladb/scylladb/issues/26281

Backport since we want to have it if it happens in the field.

Fixes: SCYLLADB-1821
Refs: #26281

Closes scylladb/scylladb#29735

* https://github.com/scylladb/scylladb:
  session, raft_topology: add periodic warnings for hung drain and stale version waits
  session: add info-level logging to drain_closing_sessions
  raft_topology: log sub-step progress in local_topology_barrier
  raft_topology: log read_barrier progress in topology cmd handler
2026-05-05 15:04:50 +02:00
Calle Wilund
5cdfdd9ba3 commitlog_test.py: Fix size check aliasing, and threshold calc.
Fixes: SCYLLADB-1815

Checking segment sizes should not use a size filter that rounds
(up) sizes.
More importantly, the estimate of the acceptable limit for
commitlog disk usage should be aligned. Simplified the calc, and
also made logging more useful in case of failure.
2026-05-05 14:42:55 +02:00
Nadav Har'El
b70beb3e13 alternator: improve CreateTable/UpdateTable schema agreement timeout
CreateTable and UpdateTable call wait_for_schema_agreement() after
announcing the schema change, to ensure all live nodes have applied
the new schema before returning to the user. This wait has a hard-
coded 10 second timeout, and on some overloaded test machines we
saw it not completing in time, and causing tests to become flaky.

This patch increases this timeout from 10 seconds to 30 seconds.
It's still hard-coded and not configurable via alternator_timeout_in_ms
because it is unlikely any user will want to change it - it just needs
to be long.

The patch also improves the behavior of a schema-agreement timeout,
when it happens:

1. Provide an InternalServerError with more descriptive text.
2. This InternalServerError tells the user that the result of the
   operation is unknown; So the user will repeat the CreateTable, and
   will get a ResourceInUseException because the table exists. In that
   case too, we need to wait for schema agreement. So we added this
   missing wait.

Fixes SCYLLADB-1804
Refs #5052 (claiming CreateTable shouldn't wait at all)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-05 15:41:06 +03:00
Calle Wilund
8d65a03951 commitlog: Fix segment/chunk overhead maybe not included in next_position calculation
Refs: SCYLLADB-1757
Refs: SCYLLADB-1815

If we're in a brand new chunk (no buffer yet allocated), we would miscalculate the
actual size of an entry to write, possibly causing segment size overshoot.

Break out some logic to share between this calc and new_buffer. Also remove redundant
(and possibly wrong) constant in oversized allocation.
2026-05-05 14:39:06 +02:00
Nadav Har'El
1f15e05946 test: fix replica_read_timeout_no_exception flakiness on slow systems
The test uses a 10ms read timeout to exercise code paths that handle
timed-out reads without throwing C++ exceptions.  As part of setup, it
inserts rows and flushes them to two SSTables, then runs a warm-up
SELECT to populate internal caches (e.g. the auth cache) before the
real test begins.

The reason for this warm-up read was the possibility that the first
read does additional operations (such as reading and caching
authentication) that might throw exceptions internally. I couldn't
verify that such exceptions actually happen in today's code, but
they might (re)appear in the future, so we should keep the warm-up
SELECT.

On slow CI machines (aarch64, debug build), that warm-up SELECT can
take longer than 10ms to read from the two SSTables.  When it does, the
read times out: the coordinator receives 0 responses from the local
replica within the deadline and propagates a read_timeout_exception.
Since the exception is not caught, it escapes the test lambda, is
logged as "cql env callback failed", and causes Boost.Test to report a
C++ failure at the do_with_cql_env_thread call site.  This matches the
CI failure seen in SCYLLADB-1774:

  ERROR ... replica_read_timeout_no_exception: cql env callback failed,
  error: exceptions::read_timeout_exception (Operation timed out for
  replica_read_timeout_no_exception.tbl - received only 0 responses
  from 1 CL=ONE.)

The CI log also shows that only 12 reads were admitted (the warm-up
read plus the 11 reads from the two prepare() calls and CREATE/INSERT
statements made earlier), and the current permit was stuck in
need_cpu state -- the reactor hadn't had a chance to schedule the read
before the 10ms window elapsed.

The fix catches read_timeout_exception from the warm-up SELECT and
retries until the read succeeds. The warm-up is required for
correctness: some lazy-init code paths (e.g. auth cache population)
use C++ exceptions for control flow internally. Those exceptions must
be absorbed before the cxx_exceptions baseline is sampled inside
execute_test(); otherwise they would appear in the delta and cause a
false test failure. Simply ignoring a timed-out warm-up is not safe,
because the lazy-init exceptions would then fire during the 1000 test
reads, inflating cxx_exceptions_after relative to
cxx_exceptions_before.
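
The retry shape is roughly the following. This is a Python sketch of the idea only; the real test is C++, and `ReadTimeout`, `warm_up`, and `flaky_select` are hypothetical names.

```python
class ReadTimeout(Exception):
    """Stand-in for exceptions::read_timeout_exception."""


def warm_up(select):
    """Run the warm-up read until it completes; return the attempt count."""
    attempts = 0
    while True:
        attempts += 1
        try:
            select()
            return attempts
        except ReadTimeout:
            # Slow machine: the read missed the deadline. Retry rather than
            # ignore it, so lazy initialisation really happens before the
            # exception baseline is sampled.
            pass


def flaky_select(timeouts):
    """Build a fake SELECT that times out `timeouts` times, then succeeds."""
    remaining = [timeouts]

    def select():
        if remaining[0] > 0:
            remaining[0] -= 1
            raise ReadTimeout()
        return ["row"]

    return select
```

Only the timeout exception is swallowed; any other failure of the warm-up read still propagates and fails the test.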

No other calls in setup are susceptible to the 10ms read timeout:
- CREATE KEYSPACE, CREATE TABLE, INSERT, and flush use the write
  timeout (10s) and are not reads.
- e.prepare() goes through the query processor without reading table
  data, so it is not subject to the read timeout.
- The semaphore manipulation in Test 2 is internal and has no timeout.
- All 1000 reads in execute_test() are expected to fail, so a timeout
  there is the happy path, not a failure.

The 10ms timeout itself is fine for the test's purpose: it is
deliberately aggressive so that reads reliably time out on the hot path
being tested.  The problem was only that the pre-test warm-up was not
guarded against the same timeout.

Fixes: SCYLLADB-1774

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#29731
2026-05-05 15:13:13 +03:00
Botond Dénes
afd9a55891 Merge 'test/cluster: wait for custom listener readiness' from Piotr Smaron
server_add() defaults to CQL_ALTERNATOR_QUERIED. That proves the regular CQL driver path is queryable, and regular Alternator ports listed in YAML config if any. It does not prove that every custom listener the test will connect to is already accepting raw TCP connections.
test_proxy_protocol_ssl_shard_aware connects directly to the shard-aware TLS proxy-protocol CQL port immediately after server startup. Wait for ServerUpState.SERVING in the fixture so the custom proxy-protocol listener is registered before opening raw sockets.
test_uninitialized_conns_semaphore opens a raw TCP connection to native_shard_aware_transport_port immediately after startup. The default readiness check can succeed through native_transport_port while the shard-aware listener is still being started, because CQL listeners are registered independently. Wait for ServerUpState.SERVING before opening raw sockets.
test_perf_alternator_remote now asks server_add() to wait for SERVING and uses the returned server address directly. This removes the redundant running_servers() plus get_ready_cql() sequence noted in review.

Fixes: SCYLLADB-1797

No backport as of now, only appeared on master.

Closes scylladb/scylladb#29737

* github.com:scylladb/scylladb:
  test/cluster: avoid redundant perf alternator CQL wait
  test/cluster: wait for shard-aware CQL listener
  test/cluster: wait for proxy protocol ports to serve
2026-05-05 14:45:58 +03:00
Nadav Har'El
5895dff03b migration_manager: unique timeout exception for wait_for_schema_agreement()
Before this patch, if wait_for_schema_agreement() timed out, it threw
a generic std::runtime_error, making it inconvenient for callers to
catch this error only. So in this patch we create and use a new exception
type, schema_agreement_timeout, based on seastar::timed_out_error.
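
Why a dedicated type helps callers can be shown with a short Python analogue. The classes below are stand-ins for `seastar::timed_out_error` and the new `schema_agreement_timeout`; `wait_for_schema_agreement` here is a toy polling loop with a hypothetical signature, not the migration_manager function.

```python
class TimedOutError(Exception):
    """Stand-in for seastar::timed_out_error."""


class SchemaAgreementTimeout(TimedOutError):
    """Stand-in for the new schema_agreement_timeout exception type."""


def wait_for_schema_agreement(versions_agree, max_polls):
    """Toy wait: poll `versions_agree` up to `max_polls` times."""
    for _ in range(max_polls):
        if versions_agree():
            return
    raise SchemaAgreementTimeout("schema agreement not reached in time")


def classify(versions_agree, max_polls=3):
    """A caller that reacts to this specific timeout and nothing else."""
    try:
        wait_for_schema_agreement(versions_agree, max_polls)
        return "agreed"
    except SchemaAgreementTimeout:
        return "schema agreement timeout"
    except Exception:
        return "other error"
```

Because the new type derives from the generic timeout error, code that already handles timeouts in general keeps working, while a new caller can catch just this case.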

Although wait_for_schema_agreement(), added in commit a429018a8a, was once
a utility function used in a dozen places, it has become less interesting
after we introduced schema changes over Raft. Over the years most of its
callers were removed, except one in view.cc which uses an infinite timeout
and so doesn't care about the timeout exception type.

In the next patch we want to add a new caller which *does* care about
the timeout exception type - hence this patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-05 10:38:38 +03:00
Piotr Dulikowski
efcc0b6376 Merge 'table_helper: fix use-after-free on prepared-statement invalidation' from Marcin Maliszkiewicz
insert() held no local strong ref to the prepared modification_statement
across the suspension in execute(). On a single shard:

1. Fiber A suspends inside _insert_stmt->execute().
2. DROP TABLE / DROP KEYSPACE on the target, or LRU eviction, removes
   the prepared_statements_cache entry, releasing its strong ref.
3. Fiber B re-enters cache_table_info(), sees _prepared_stmt
   (checked_weak_ptr) invalidated, and runs _insert_stmt = nullptr,
   releasing the last strong ref. The modification_statement is freed.
4. Fiber A resumes inside execute() and touches freed *this.

Pin strong ref to _insert_stmt locally before the suspension.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1667

Backport: all supported branches; it's a long-present memory corruption bug

Closes scylladb/scylladb#29588

* github.com:scylladb/scylladb:
  test/boost: add dummy case to table_helper_test for non-injection modes
  test/boost: add regression test for table_helper insert() UAF
  utils/error_injection: add waiters() API
  table_helper: fix use-after-free on prepared-statement invalidation
2026-05-04 17:21:05 +02:00
Piotr Smaron
a3360ee385 test/nodetool: fix mock server port race by using a fixed port on a unique IP
Symptom: the rest_api_mock subprocess exits with status 1 during fixture
setup, e.g.:

    subprocess.CalledProcessError: Command '[..., 'rest_api_mock.py',
        '127.29.88.1', '34093']' returned non-zero exit status 1

Root cause: aiohttp's TCPSite.start() raises OSError(EADDRINUSE) and the
process exits 1. The bind fails because of how the (ip, port) pair is
chosen across modules within one test.py process:

  * Each test module leases a 127.x.y.z IP from the host registry. The
    registry recycles released IPs, so the same IP is shared across
    modules sequentially.
  * The original code picked the port via random.randint(10000, 65535).
    A previous module on the same IP could have left that port in
    TIME_WAIT (or worse, still actively in use) when a later module
    happened to pick the same port.

SCYLLADB-1275 (PR 29314) tried to fix this by binding a probe socket to
(ip, 0) to obtain an OS-assigned free port, closing the probe, then
launching the mock server which would bind to that port. Two issues
remained:

  1. TOCTOU: between probe close and mock-server bind, any other process
     on the host could grab the just-freed port.
  2. TIME_WAIT could still bite if the host registry recycled an IP and
     the OS reused the same port number for the probe.

Fix: drop port discovery entirely. Use a fixed port (12345, matching the
unshare-namespace path already in this fixture) on the unique IP from
the host registry. Because IPs are unique per test module within one
test.py process, the (ip, 12345) pair is unique to each module, so no
port-collision dance is needed.

reuse_address=True on TCPSite handles the residual TIME_WAIT case when
the host registry recycles an IP within the same test.py process and
the previous mock server's socket has not finished TIME_WAIT yet.
reuse_port=True is dropped, as it was only useful while attempting to
have multiple processes share a single port.

This mirrors the design used in test/cqlpy/run.py: pick a unique IP,
keep the port fixed.

Fixes: SCYLLADB-1718

Closes scylladb/scylladb#29656
2026-05-04 15:33:19 +02:00
Gleb Natapov
d2b695aa64 session, raft_topology: add periodic warnings for hung drain and stale version waits
Add periodic warning timers (every 5 minutes) to help diagnose hangs in
barrier_and_drain:

- drain_closing_sessions(): warn if semaphore acquisition or session gate
  close is taking too long, reporting the gate count to show how many
  guards are still alive.
- local_topology_barrier(): warn if stale_versions_in_use() is taking
  too long, reporting the current stale version trackers.
- session::gate_count(): new public accessor for diagnostic purposes.

These warnings help distinguish between the two possible hang points
in barrier_and_drain (stale versions vs session drain) and provide
ongoing visibility into what's blocking progress.
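
The general pattern (keep waiting, but emit a warning every period) can be sketched with Python's asyncio. This is an analogy under stated assumptions: the real code uses Seastar timers, and `wait_with_periodic_warning`, `warn`, and `run_demo` are hypothetical names.

```python
import asyncio


async def wait_with_periodic_warning(awaitable, period_s, warn):
    """Await `awaitable`; call warn(elapsed_s) every period_s while it is pending."""
    task = asyncio.ensure_future(awaitable)
    waited = 0.0
    while True:
        # asyncio.wait() returns on timeout without cancelling the task.
        done, _pending = await asyncio.wait({task}, timeout=period_s)
        if done:
            return task.result()
        waited += period_s
        warn(waited)


def run_demo():
    """Drive the helper with a slow 'drain' and collect the warnings."""
    warnings = []

    async def slow_drain():
        await asyncio.sleep(0.05)
        return "drained"

    result = asyncio.run(
        wait_with_periodic_warning(slow_drain(), 0.02, warnings.append))
    return result, warnings
```

The wait itself is never abandoned; the warnings only make a long wait visible, which matches the diagnostic intent described above.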
2026-05-04 15:58:45 +03:00
Gleb Natapov
385915c101 session: add info-level logging to drain_closing_sessions
drain_closing_sessions() is called as part of the barrier_and_drain
topology command and can block on two things: acquiring the drain
semaphore (if another drain is in progress) and waiting for individual
sessions to close (which blocks until all session guards are released).

Previously, all logging in this function was at debug level, making it
invisible in production logs. When barrier_and_drain hangs, there is no
way to tell whether the function is waiting for the semaphore, waiting
for a specific session to close, or was never called.

Promote logging to info level and add messages at each blocking point:
before/after semaphore acquisition (with count of sessions to drain),
before/after each individual session close (with session id), and at
function completion. This makes it possible to identify the exact
session blocking a topology operation from the node log alone.
2026-05-04 15:58:45 +03:00
Gleb Natapov
e88ce09372 raft_topology: log sub-step progress in local_topology_barrier
When a node processes a barrier_and_drain topology command, it performs
two potentially long-running operations inside local_topology_barrier():
waiting for stale token metadata versions to be released
(stale_versions_in_use) and draining closing sessions
(drain_closing_sessions). Either of these can hang indefinitely -- for
example, stale_versions_in_use blocks until all references to previous
token metadata versions are released, which depends on in-flight
requests completing.

Previously, the only logging was a single 'done' message at the end,
making it impossible to determine which sub-step was blocking when a
barrier_and_drain RPC appeared stuck on a node. In a recent CI failure,
a node never responded to barrier_and_drain during a removenode
operation, and the logs showed the RPC was received but nothing about
what it was waiting on internally.

Add info-level logging before each blocking sub-step, including the
topology version for correlation. This allows diagnosing hangs by
showing whether the node is stuck waiting for stale metadata versions,
stuck draining sessions, or never reached these steps at all.
2026-05-04 15:58:45 +03:00
Piotr Smaron
0a780d0ea1 test/cluster: avoid redundant perf alternator CQL wait
server_add() already waits for the requested server-up state. For the remote perf-alternator test, request SERVING from server_add() and use the returned server address directly instead of asking for running servers and then calling get_ready_cql() again.

This keeps the listener-readiness intent explicit while removing the redundant CQL readiness probe noted in review.
2026-05-04 14:09:28 +02:00
Piotr Smaron
c90012c22b test/cluster: wait for shard-aware CQL listener
server_add() defaults to CQL_ALTERNATOR_QUERIED. That proves the regular CQL driver path is queryable, and regular Alternator ports listed in YAML config if any. It does not prove that every CQL listener configured for the process is already accepting raw TCP connections.

test_uninitialized_conns_semaphore opens a raw TCP connection to native_shard_aware_transport_port immediately after startup. The default readiness check can succeed through native_transport_port while the shard-aware listener is still being started, because CQL listeners are registered independently.

Wait for ServerUpState.SERVING before opening raw sockets. Scylla sends that notification only after protocol servers are registered, so this closes the startup window without adding sleeps or local retry loops.

Fixes: SCYLLADB-1797
2026-05-04 13:36:43 +02:00
Nadav Har'El
983eb5ab43 test/cluster/auth_cluster: use CREATE ROLE IF NOT EXISTS to fix flaky test
test_create_role_mixed_cluster calls servers_add(2) to bootstrap two old
nodes concurrently, then adds a new node before issuing CREATE ROLE.  The
concurrent bootstraps trigger the well-known Python driver bug
(scylladb/python-driver#317): two on_add notifications race in
update_created_pools, causing a second pool to be created for a host whose
pool was already established.  If CREATE ROLE is in-flight on the old pool
when it is closed, the driver retries on the new pool, executing the
statement twice.  The second execution fails with "Role ... already exists",
making the test flaky.

Fix by using CREATE ROLE IF NOT EXISTS.  This is safe because unique_name()
generates a timestamp+random suffix that is guaranteed to be unique; the
role can "already exist" only due to the driver double-execution bug, never
due to a real conflict.
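
The effect of the double execution, and why IF NOT EXISTS absorbs it, can be modelled in a few lines of Python. `FakeAuth` and `execute_with_driver_retry` are illustrative only and do not use the real driver.

```python
class AlreadyExists(Exception):
    pass


class FakeAuth:
    """Toy role store that mimics CREATE ROLE semantics."""

    def __init__(self):
        self.roles = set()

    def create_role(self, name, if_not_exists=False):
        if name in self.roles:
            if if_not_exists:
                return               # idempotent: second execution is harmless
            raise AlreadyExists(f"Role {name} already exists")
        self.roles.add(name)


def execute_with_driver_retry(statement):
    """Model the driver bug: the same statement is executed twice."""
    try:
        statement()
        statement()
        return "ok"
    except AlreadyExists as e:
        return str(e)
```

Since the role name is unique per test run, the only way to hit the "already exists" branch in this model is the duplicated execution itself.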

This is the same workaround that has been applied many times elsewhere in
our test suite for exactly the same root cause:
- CREATE KEYSPACE was changed to CREATE KEYSPACE IF NOT EXISTS (scylladb#18368,
  later generalised in scylladb#22399 via new_test_keyspace helpers)
- DROP KEYSPACE was changed to DROP KEYSPACE IF EXISTS (scylladb#29487)

Fixes: SCYLLADB-1742

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#29732
2026-05-04 11:47:11 +02:00
Yaniv Michael Kaul
6179406467 raft/group0: fix destroy assertion on startup failure
If start_server_for_group0() successfully registers a server in
_raft_gr._servers but a subsequent step (e.g. enable_in_memory_state_machine())
throws, the server is never destroyed because abort_and_drain()/destroy()
check std::get_if<raft::group_id>(&_group0) which was only set after the
entire with_scheduling_group block completed.

Move _group0.emplace<raft::group_id>() inside the lambda, immediately after
start_server_for_group() succeeds, so that cleanup paths can always find
and destroy the registered server.
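
The ordering problem can be modelled in Python as follows. This is a sketch with hypothetical classes: `group_id` plays the role of the `_group0` variant, and `record_early` switches between the old and the new placement of the assignment.

```python
class GroupRegistry:
    """Toy registry: all servers must be destroyed before stop()."""

    def __init__(self):
        self.servers = {}


class Group0:
    def __init__(self, registry, record_early):
        self.registry = registry
        self.record_early = record_early
        self.group_id = None

    def start(self, gid, later_step):
        self.registry.servers[gid] = object()   # server is now registered
        if self.record_early:
            self.group_id = gid                 # fix: record it immediately
        later_step()                            # a later step may raise
        self.group_id = gid                     # old placement: after everything

    def destroy(self):
        # Cleanup can only find the server if group_id was recorded.
        if self.group_id is not None:
            self.registry.servers.pop(self.group_id, None)
            self.group_id = None


def leaked_servers_after_failed_start(record_early):
    """Start with a failing later step, clean up, and count leftover servers."""
    def failing_step():
        raise RuntimeError("startup step failed")

    registry = GroupRegistry()
    group0 = Group0(registry, record_early)
    try:
        group0.start("group0", failing_step)
    except RuntimeError:
        pass
    group0.destroy()
    return len(registry.servers)
```

With the old placement one server is left registered after cleanup, which corresponds to the "is not destroyed" assertion; with the early assignment nothing is left behind.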

This fixes the assertion:
  "raft_group_registry - stop(): server for group ... is not destroyed"

which manifests during shutdown after an upgrade where topology_state_load()
fails due to netw::unknown_address.

Backport: Yes, to 2026.1, 2026.2, as it causes a crash on upgrades

Refs: SCYLLADB-1217
Refs: CUSTOMER-340
Refs: CUSTOMER-335
Fixes: SCYLLADB-1801
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
AI-assisted: Yes, Opencode/Opus 4.6

Closes scylladb/scylladb#29702
2026-05-04 11:25:46 +02:00
Piotr Smaron
689117f706 test/cluster: wait for proxy protocol ports to serve
server_add()'s default readiness only waits until CQL can be queried, but these tests immediately connect to custom proxy protocol listeners. Wait for SERVING so the shard-aware TLS proxy port is accepting connections before the test starts, matching the Alternator proxy protocol readiness fix.
2026-05-04 10:23:03 +02:00
Nadav Har'El
d33bb6ea00 Merge 'test: fix race window test flakiness from residual re-repair' from Avi Kivity
Fix the persistent flakiness in `test_incremental_repair_race_window_promotes_unrepaired_data` (SCYLLADB-1478, reopened).

After restarting servers[1], the topology coordinator can initiate a **residual re-repair** when it sees tablets stuck in the `repair` stage. This re-repair flushes memtables on all replicas and marks post-repair data as repaired, contaminating the test state and masking the compaction-merge bug the test is designed to detect. The assertion then fails on the *next* retry because the previous attempt's re-repair left behind repaired sstables containing post-repair keys.

Two earlier approaches were tried and did not hold up:

1. **Propagating `current_key` through the exception**: this correctly advanced the key counter on retry, but the contaminated tablet metadata from the prior re-repair (repaired sstables with post-repair keys) was still present, causing assertion failures on the next attempt.

2. **DROP TABLE + CREATE TABLE between retries** — the tablet metadata (sstables_repaired_at, repair stage) is tied to the tablet identity, and recreating the table in the same keyspace still showed residual state issues.

Instead of trying to clean up contaminated state, each retry creates a **completely fresh keyspace** (unique name via `create_new_test_keyspace`). This gives entirely new tablets with no residual repair metadata from prior attempts. Combined with broader detection of coordinator changes and residual re-repairs, the test reliably retries before any contamination can cause false failures.

The detection is now comprehensive:
- **Broadened coordinator check**: any coordinator change (`new_coord != coord`), not just migration to servers[1]
- **Re-repair detection** at three points: post-restart, during the compaction poll, and after injection release — grep for `"Initiating tablet repair host="` in the coordinator log

The series consists of two commits:

1. **`test: extract _setup_table_for_race_window helper`**: a pure code-movement refactor that extracts keyspace+table+data+repair1+data+flush into a reusable helper. Easily verifiable as having no behavioral effect.

2. **`test: fix race window test flakiness from residual re-repair`** — the actual fix: broadened detection logic + re-repair grep at 3 points + fresh-keyspace retry on exception.

Passed 1000 consecutive runs with the fix applied. Without the fix, about 2% flakiness was observed in debug mode.

Fixes: SCYLLADB-1478

So far, we haven't observed flakiness of this test on branches, so not backporting yet. Will backport if seen.

Closes scylladb/scylladb#29721

* github.com:scylladb/scylladb:
  test: fix race window test flakiness from residual re-repair
  test: extract _setup_table_for_race_window helper for race window test
2026-05-03 14:47:19 +03:00
Gleb Natapov
11b838e71e raft_topology: log read_barrier progress in topology cmd handler
When a raft topology command (e.g. barrier_and_drain) is received by a
node, the handler first performs a raft read_barrier to ensure it sees
the latest topology state. This read_barrier can hang indefinitely if
raft cannot achieve quorum, but there was no logging around it, making
it impossible to tell whether the handler was stuck at this step or
somewhere else.

Add info-level logging before and after the read_barrier call in
raft_topology_cmd_handler, including the command type, index, and term.
This allows diagnosing hangs by showing whether the node entered the
read_barrier and whether it completed, narrowing down the root cause
when a topology command RPC appears stuck on the receiver side.
2026-05-03 13:56:25 +03:00
Aleksandr Bykov
8afdae24d2 test: fix flaky test_kill_coordinator_during_op
The test hardcoded the expected number of coordinator elections
(2, 3, 4, 5) for each phase. If a prior phase triggered an extra
election, subsequent phases would wait for a count that was already
reached or would never match.

Fix by reading the current election count before each operation and
expecting exactly one more, making each phase independent of prior
history.
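
The difference between hardcoded and relative targets can be sketched in Python. `FakeCluster` and `run_phases` are hypothetical; the real test reads the election count from the cluster rather than from a counter.

```python
class FakeCluster:
    """Toy cluster that only tracks the coordinator election count."""

    def __init__(self):
        self.elections = 1

    def kill_coordinator(self, spurious=0):
        # One expected election, plus any extra ones triggered by the phase.
        self.elections += 1 + spurious


def run_phases(cluster, spurious_per_phase):
    """Return (relative_target, hardcoded_target, count_before) for each phase."""
    log = []
    for phase, spurious in enumerate(spurious_per_phase):
        before = cluster.elections
        relative_target = before + 1     # new: one more than what we see now
        hardcoded_target = 2 + phase     # old: 2, 3, 4, 5 regardless of history
        log.append((relative_target, hardcoded_target, before))
        cluster.kill_coordinator(spurious)
    return log
```

If the first phase causes one extra election, the hardcoded target for the second phase (3) is already met before the operation starts, whereas the relative target (4) still requires a new election.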

Also add wait_for_no_pending_topology_transition() calls after each
coordinator election to ensure the topology state machine has fully
settled before proceeding with restarts and further operations.

Decrease the failure detector timeout (failure_detector_timeout_in_ms)
to 2000 ms on all test nodes so that coordinator crashes are detected
faster, reducing test wallclock time and timeout-related flakiness.

Enable raft_topology=trace logging on all test nodes to aid
post-failure diagnosis. Add diagnostic logging in
wait_new_coordinator_elected().

Fixes: SCYLLADB-1089

Closes scylladb/scylladb#29284
2026-04-30 21:27:56 +03:00
Avi Kivity
795478fa7a test: fix race window test flakiness from residual re-repair
The test_incremental_repair_race_window_promotes_unrepaired_data test
was still flaky because:

1. Only coordinator changes TO servers[1] were detected, but ANY
   coordinator change can trigger a residual re-repair that flushes
   memtables on all replicas and marks post-repair data as repaired.

2. Even without a coordinator change, the topology coordinator can
   initiate a residual re-repair when it sees tablets stuck in the
   repair stage after the servers[1] restart.  This re-repair
   contaminates the repaired set with post-repair data, masking the
   compaction-merge bug the test detects.

Fix by:
- Broadening the coordinator check from == servers[1] to != coord
- Adding re-repair detection (grep for 'Initiating tablet repair
  host=') at three points: post-restart, during the compaction poll,
  and after injection release
- On retry, creating a completely fresh keyspace+table via
  _setup_table_for_race_window() so the new attempt starts with
  clean tablet metadata uncontaminated by prior re-repairs

Fixes: SCYLLADB-1478
2026-04-30 18:40:18 +03:00
Avi Kivity
12d5e758ed test: extract _setup_table_for_race_window helper for race window test
Move the keyspace+table setup logic for
test_incremental_repair_race_window_promotes_unrepaired_data into a
dedicated helper function _setup_table_for_race_window().  The helper
creates a fresh keyspace (unique name via create_new_test_keyspace),
the table, configures STCS min_threshold=2, inserts baseline keys,
runs repair 1, inserts keys for repair 2, and flushes.

This is a pure refactor with no behavioral change: the test function
now calls the helper once instead of inlining the setup.  The
extraction enables a subsequent commit to call the helper again on
retry when a leadership transfer is detected.
2026-04-30 18:37:42 +03:00
Dario Mirovic
3875d79ac6 test: boost: regression test for loading_cache::insert with caching disabled
Add two test cases for the new loading_cache::insert() method:

 * test_loading_cache_insert verifies that insert() populates the cache
   and invokes the loader exactly once per key when caching is enabled.

 * test_loading_cache_insert_caching_disabled is a regression test for
   SCYLLADB-1699: when the cache is constructed with expiry == 0
   (caching disabled), insert() must be a no-op rather than asserting
   in loading_cache::get_ptr() via caching_enabled(). The loader must
   not be invoked and the cache must remain empty.

Refs SCYLLADB-1699
2026-04-30 16:52:51 +02:00
Dario Mirovic
918130befd utils: loading_cache: add insert() that is a no-op when caching is disabled
When the cache is constructed with expiry == 0 the underlying storage is
never instantiated and get_ptr() asserts via caching_enabled(). This is
fine for callers that need a handle into the cache, but it makes get_ptr()
unusable for write-only insertions on caches whose expiry is configurable
at runtime (e.g. caches driven by a LiveUpdate config option that the
operator may set to 0).

Add a new insert(k, load) method on loading_cache that returns a future<>
and is a no-op when caching is disabled, otherwise forwards to
get_ptr(k, load) and discards the resulting handle. This completes the
disabled-mode safety contract of the cache for the write side, mirroring
the fallback that get() already provides for the read side.
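
The contract can be illustrated with a minimal Python sketch. It is not the C++ implementation: the class below is a hypothetical stand-in that only borrows the names used in this message (get_ptr, get, insert, caching_enabled), and the loader is assumed to be a plain synchronous callable.

```python
class LoadingCache:
    """Toy stand-in for loading_cache; expiry == 0 means caching is disabled."""

    def __init__(self, expiry_ms):
        # As described above, storage is never instantiated when expiry is 0.
        self._storage = {} if expiry_ms > 0 else None

    def caching_enabled(self):
        return self._storage is not None

    def get_ptr(self, key, load):
        # Callers that want a handle into the cache require caching to be on.
        assert self.caching_enabled(), "caching_enabled() failed"
        if key not in self._storage:
            self._storage[key] = load(key)
        return self._storage[key]

    def get(self, key, load):
        # Read side: fall back to calling the loader directly when disabled.
        if not self.caching_enabled():
            return load(key)
        return self.get_ptr(key, load)

    def insert(self, key, load):
        # Write side: a no-op when disabled, otherwise load and drop the handle.
        if not self.caching_enabled():
            return
        self.get_ptr(key, load)

    def size(self):
        return len(self._storage) if self.caching_enabled() else 0
```

With expiry 0, insert() returns without touching the loader, while get_ptr() would still assert.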

Switch authorized_prepared_statements_cache::insert() from
get_ptr().discard_result() to the new insert(), which fixes the crash
'Assertion caching_enabled() failed' in
authorized_prepared_statements_cache::insert() that occurs when
permissions_validity_in_ms is set to 0 and a prepared statement is
executed under authentication.

Fixes SCYLLADB-1699
2026-04-30 16:51:23 +02:00
Marcin Maliszkiewicz
b08e0c67e4 test/boost: add dummy case to table_helper_test for non-injection modes
The only test requires SCYLLA_ENABLE_ERROR_INJECTION. In modes without it
(e.g. release) the suite was empty, so pytest exited with code 5
("no tests collected") and CI failed. Add a no-op case in that branch
so collection always yields at least one test.
2026-04-30 11:45:12 +02:00
Marcin Maliszkiewicz
515b5722fd test/boost: add regression test for table_helper insert() UAF
Deterministic reproducer using an error injection point placed in
table_helper::insert() between cache_table_info() and execute(). The
test parks fiber A at the injection, drops the target table (evicting
the prepared_statements_cache entry), runs fiber B which nulls
_insert_stmt, then releases fiber A. Without the fix this crashes in
execute(); with the fix fiber A holds a local strong ref and proceeds.

Uses the new waiters() API to synchronize with fiber A's entry into
the injection.
2026-04-30 11:45:12 +02:00
Marcin Maliszkiewicz
4d234aaaa5 utils/error_injection: add waiters() API
Returns the number of fibers currently suspended in wait_for_message()
for a named injection. Lets tests synchronize precisely with code parked
on an injection point.
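
A rough Python/asyncio analogue of the idea, assuming a hypothetical ErrorInjection class (the real API lives in the C++ utils/error_injection code and may differ in shape):

```python
import asyncio
from collections import defaultdict


class ErrorInjection:
    """Hypothetical stand-in: named injection points that park callers."""

    def __init__(self):
        self._events = defaultdict(asyncio.Event)
        self._waiters = defaultdict(int)

    async def wait_for_message(self, name):
        # Count the caller as parked for as long as it is suspended here.
        self._waiters[name] += 1
        try:
            await self._events[name].wait()
        finally:
            self._waiters[name] -= 1

    def waiters(self, name):
        # Number of callers currently suspended on this injection point.
        return self._waiters[name]

    def release(self, name):
        self._events[name].set()


async def demo():
    inj = ErrorInjection()

    async def fiber_a():
        await inj.wait_for_message("before_execute")
        return "resumed"

    task = asyncio.ensure_future(fiber_a())
    # Poll waiters() instead of sleeping for an arbitrary amount of time.
    while inj.waiters("before_execute") == 0:
        await asyncio.sleep(0)
    parked = inj.waiters("before_execute")
    inj.release("before_execute")
    return parked, await task, inj.waiters("before_execute")
```

The test only proceeds once the counter shows that fiber A has actually reached the injection point.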
2026-04-30 11:45:12 +02:00
Marcin Maliszkiewicz
aa18c3ed4a table_helper: fix use-after-free on prepared-statement invalidation
insert() held no local strong ref to the prepared modification_statement
across the suspension in execute(). On a single shard:

1. Fiber A suspends inside _insert_stmt->execute().
2. DROP TABLE / DROP KEYSPACE on the target, or LRU eviction, removes
   the prepared_statements_cache entry, releasing its strong ref.
3. Fiber B re-enters cache_table_info(), sees _prepared_stmt
   (checked_weak_ptr) invalidated, and runs _insert_stmt = nullptr,
   releasing the last strong ref. The modification_statement is freed.
4. Fiber A resumes inside execute() and touches freed *this.

Pin strong ref to _insert_stmt locally before the suspension.
2026-04-30 11:45:12 +02:00
Ernest Zaslavsky
1febfbd9b5 test: rename sstable_tablet_streaming.cc to match the naming convention
Apparently, a boost test file name must end with "_test" to be executed by test.py.

Closes scylladb/scylladb#29693
2026-04-30 11:16:39 +03:00
Pavel Emelyanov
1ca97f0c0a Merge 'test: fix disabled test handling and deduplicate CLI test arguments' from Evgeniy Naydanov
- Revert the previous "test.py: fix test collection bug" commit (92c09d10) which worked around broken deduplication by filtering items without `BUILD_MODE` in `pytest_collection_modifyitems`. This approach masked the root cause and is superseded by the proper fixes below.
- Backport pytest 9.0.3's argument normalization algorithm into `test.py` to work around broken deduplication in pytest 8.3.5 ([pytest-dev/pytest#12083](https://github.com/pytest-dev/pytest/issues/12083)). Duplicate or subsumed test paths (e.g. `test/cql` and `test/cql/lua_test.cql`) are collapsed before invoking pytest. Revert when upgrading to pytest 9.x.
- Return a `DisabledFile` collector instead of an empty list in `pytest_collect_file` when all modes are disabled for a file, fixing a bug where subsequent files would not get their stash items set (`REPEATING_FILES`). Restructure `pytest_collect_file` to use a walrus operator (`if repeats := ...`) with a single `remove(file_path)` and `return collectors` at the end, eliminating the early return.
- Add `--keep-duplicates` CLI argument to bypass deduplication and forward to pytest.
- Move `RUN_ID` assignment from `pytest_collect_file` to `modify_pytest_item`. A shared `run_ids` cache (`defaultdict[tuple[str, str], count]`) is created in `pytest_collection_modifyitems` and passed to `modify_pytest_item`, keyed by `(build_mode, nodeid)` so each mode gets independent counters. This ensures unique run IDs even when `--keep-duplicates` causes the same file to be collected multiple times.
- Fix `--repeat` option default from string `"1"` to int `1` — argparse only applies `type=` to CLI-parsed values, not defaults.

pytest normally deduplicates overlapping test arguments: for example, `test/cql test/cql/lua_test.cql` collects `lua_test.cql` only once. The original `test.py` never performed this deduplication, and the pytest version in the toolchain image (8.3.5) has a bug that breaks it ([pytest-dev/pytest#12083](https://github.com/pytest-dev/pytest/issues/12083)).

Since we are moving to bare pytest, `test.py` should match pytest's default behavior: deduplicate. Because we cannot easily upgrade pytest, commit 2 backports the deduplication logic from pytest 9.0.3.

To match pytest's interface, `--keep-duplicates` is added as an opt-out. This lets a user intentionally run overlapping paths — e.g. `./test.py test/blah test/blah/test_foo.py --keep-duplicates` runs `test_foo.py` twice. The flag is forwarded to pytest and also skips the backported deduplication in `test.py`.
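
A simplified sketch of path subsumption, for illustration only. This is not the algorithm backported from pytest 9.0.3: the function name is hypothetical, and `::` node-id suffixes are ignored.

```python
from pathlib import PurePosixPath


def deduplicate_test_args(args, keep_duplicates=False):
    """Drop arguments equal to, or located under, another argument."""
    if keep_duplicates:
        # Opt-out: pass everything through unchanged.
        return list(args)
    kept = []
    for arg in args:
        path = PurePosixPath(arg)
        # Skip the argument if a kept path already covers it.
        if any(path == k or k in path.parents for k in kept):
            continue
        # Drop previously kept paths that this new path subsumes.
        kept = [k for k in kept if path not in k.parents]
        kept.append(path)
    return [str(p) for p in kept]
```

For example, `deduplicate_test_args(["test/cql", "test/cql/lua_test.cql"])` keeps only `test/cql`, regardless of argument order.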

- Revert 92c09d10 which filtered items without `BUILD_MODE` in `pytest_collection_modifyitems` and added an early return in `CppFile.collect()`. This workaround is superseded by the proper deduplication and `DisabledFile` fixes.

- Add `_CollectionArgument` dataclass (`order=True`, `__contains__` for subsumption) and `_deduplicate_test_args()` function, adapted from pytest 9.0.3. Marked with a TODO to remove once we update to pytest 9.x.
- Call `_deduplicate_test_args()` on `options.name` before passing to pytest.

- Add `DisabledFile(pytest.File)` that skips collection with an informative message instead of returning an empty list.
- Restructure `pytest_collect_file` to use walrus operator: `if repeats := ...:` / `else:` — single `remove(file_path)` at end, no early return.

- Add `--keep-duplicates` argument that skips deduplication and is forwarded to pytest.
- Create a shared `run_ids` cache in `pytest_collection_modifyitems` and pass it to `modify_pytest_item`, which assigns unique sequential RUN_IDs via `itertools.count`. The cache is keyed by `(build_mode, nodeid)` so each mode gets independent counters.
- Remove `RUN_ID` from `_STASH_KEYS_TO_COPY` — it is no longer set on collectors.
- Remove `CppFile.run_id` cached_property. `CppTestCase` now reads `RUN_ID` from its own item stash.
- Fix `--repeat` option default from `"1"` to `1` and drop redundant `int()` cast.
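
The shared `run_ids` cache could look roughly like the sketch below. The helper names are hypothetical, and starting the counters at 1 is an assumption rather than something stated in this description.

```python
from collections import defaultdict
from itertools import count


def make_run_ids():
    # One independent counter per (build_mode, nodeid) key.
    # Starting at 1 is an assumption made for this sketch.
    return defaultdict(lambda: count(1))


def assign_run_id(run_ids, build_mode, nodeid):
    # Each call hands out the next sequential ID for that key.
    return next(run_ids[(build_mode, nodeid)])
```

Collecting the same item twice in one mode yields IDs 1 and 2, while the same nodeid in another mode starts again from 1.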

Closes SCYLLADB-1730

Closes scylladb/scylladb#29665

* github.com:scylladb/scylladb:
  test: add --keep-duplicates and assign RUN_ID via shared cache
  test/pylib/runner: fix disabled file collection
  test.py: deduplicate CLI test arguments before passing to pytest
  Revert "test.py: fix test collection bug"
2026-04-30 07:58:25 +03:00
Yaniv Michael Kaul
93722f2c89 gms/gossiper: fix use-after-move in do_send_ack2_msg
The second logger.debug() call accesses ack2_msg after it was moved
via std::move() in the co_await send_gossip_digest_ack2 call.
This is undefined behavior.

Fix by formatting ack2_msg to a string before the move, then using
that cached string in both debug log calls.

FIXES: https://scylladb.atlassian.net/browse/SCYLLADB-1778

Closes scylladb/scylladb#29227
2026-04-30 07:07:39 +03:00
Wojciech Mitros
ebaf536449 replica/database: fix cross-shard deadlock in lock_tables_metadata()
lock_tables_metadata() acquires a write lock on tables_metadata._cf_lock
on every shard.  It used invoke_on_all(), which dispatches lock
acquisitions to all shards in parallel via parallel_for_each +
smp::submit_to.

When two fibers call lock_tables_metadata() concurrently, this can
deadlock.  parallel_for_each starts all iterations unconditionally:
even when the local shard's lock attempt blocks (because the other
fiber already holds it), SMP messages are still sent to remote shards.
Both fibers' lock-acquisition messages land in the per-shard SMP
queues.  The SMP queue itself is FIFO, but process_incoming() drains
it and schedules each item as a reactor task via add_task(), which —
in debug and sanitize builds with SEASTAR_SHUFFLE_TASK_QUEUE — shuffles
each newly added task against all pending tasks in the same scheduling
group's reactor task queue.  This means fiber A's lock acquisition can
be reordered past fiber B's (and past unrelated tasks) on a given shard.
If fiber A wins the lock on shard X while fiber B wins on shard Y, this
creates a classic cross-shard lock-ordering deadlock (circular wait).

In production builds without SEASTAR_SHUFFLE_TASK_QUEUE, the reactor
task queue is FIFO. Still, the SMP queues can reorder messages even in
release builds, so the deadlock remains possible there, though it is
much less likely. In debug and sanitize builds, the task-queue shuffle
makes the deadlock very likely whenever both fibers' lock-acquisition
tasks are pending simultaneously in the reactor task queue on any shard.

This deadlock was exposed by ce00d61917 ("db: implement large_data
virtual tables with feature flag gating", merged as 88a8324e68),
which introduced legacy_drop_table_on_all_shards as a second caller
of lock_tables_metadata().  When LARGE_DATA_VIRTUAL_TABLES is enabled
during topology_state_load (via feature_service::enable), two fibers
can race:

  1. activate_large_data_virtual_tables() — calls
     legacy_drop_table_on_all_shards() which calls
     lock_tables_metadata() synchronously via .get()

  2. reload_schema_in_bg() — fires as a background fiber from
     TABLE_DIGEST_INSENSITIVE_TO_EXPIRY, eventually reaches
     schema_applier::commit() which also calls lock_tables_metadata()

If both reach lock_tables_metadata() while the lock is free on all
shards, the parallel acquisition creates the deadlock opportunity.
The deadlock blocks topology_state_load() from completing, which
prevents the bootstrapping node from finishing its topology state
transitions.  The coordinator's topology coordinator then waits for
the node to reach the expected state, but the node is stuck, so
eventually the read_barrier times out after 300 seconds.

Fix by acquiring the shard 0 lock first before attempting to
acquire any other lock. Whichever fiber wins shard 0 is
guaranteed to acquire all remaining shards before the other fiber
can proceed past shard 0, eliminating the circular-wait condition.
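
The gating idea can be sketched in Python with asyncio locks standing in for the per-shard locks. This is an illustration under that assumption, not the Seastar code; the function names are hypothetical, and cleanup after a partially failed acquisition is omitted.

```python
import asyncio


async def lock_all_shards(locks, hold=0.0):
    """Take shard 0 first, then the remaining shards in parallel."""
    await locks[0].acquire()
    try:
        # Only the caller holding shard 0 gets here, so acquiring the
        # rest in parallel cannot produce a circular wait.
        await asyncio.gather(*(lock.acquire() for lock in locks[1:]))
        try:
            await asyncio.sleep(hold)  # work done under all locks
        finally:
            for lock in locks[1:]:
                lock.release()
    finally:
        locks[0].release()


async def run_two_callers(shards=4):
    locks = [asyncio.Lock() for _ in range(shards)]
    # A deadlock would surface as a timeout here.
    await asyncio.wait_for(
        asyncio.gather(lock_all_shards(locks, 0.01),
                       lock_all_shards(locks, 0.01)),
        timeout=5,
    )
    return all(not lock.locked() for lock in locks)
```

The second caller blocks on shard 0 until the first has released everything, so the two never hold disjoint subsets of the locks.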

Tested manually with 2 approaches:
1. causing different shard locks to be acquired by different
lock_tables_metadata() calls by adding different sleeps depending
on the lock_tables_metadata() call and target shard - this reproduced
the issue consistently
2. matching the time point at which both fibers reach lock_tables_metadata()
by adding a single sleep to one of the fibers - this heavily depends on
the machine so we can't create a universal reproducer this way, but
it did result in the observed failure on my machine after finding the
right sleep time

Also added a unit test for concurrent lock_tables_metadata() calls.

Fixes: SCYLLADB-1694
Fixes: SCYLLADB-1644
Fixes: SCYLLADB-1684

Closes scylladb/scylladb#29678
2026-04-29 21:13:53 +02:00
Patryk Jędrzejczak
15f35577ed Merge 'paxos_state: keep prepared message alive across statement execution' from Petr Gusev
In do_execute_cql_with_timeout(), when the prepared statement was not found in the cache, we called qp.prepare() and stored the returned result_message::prepared in a local variable scoped to the 'if' block. We then extracted ps_ptr (a checked_weak_ptr to the prepared statement) from the message, let the message go out of scope at the end of the 'if', and used ps_ptr after a co_await on st->execute().

Since 3ac4e258e8 ("transport/messages: hold pinned prepared entry in PREPARE result"), result_message::prepared owns a strong pinned reference to the prepared cache entry. While qp.prepare() runs it also holds its own pin on the entry, so on return the entry has at least the pin owned by the returned message. As long as that message is alive, the cache entry cannot be purged and the weak handle inside ps_ptr remains promotable.

The lifetime gap manifested only in debug builds. qp.prepare() returns a ready future on the cache-miss path, so in release builds the co_await resumes synchronously: control flows from the assignment of ps_ptr straight into st->execute() with no opportunity for any other task (in particular, prepared cache invalidation triggered by a concurrent schema change) to run in between. Debug builds, however, force a reactor preemption point on every co_await even when the awaited future is ready. With prepared_msg already destroyed at the end of the 'if' block, the only remaining handle on the cache entry was the weak ps_ptr, and the preemption gave a concurrent cache purge (triggered, for example, by Raft schema changes received during a node restart) the chance to drop the entry. The subsequent execute() then failed when promoting the weak pointer with checked_ptr_is_null_exception.

The exception propagated out of the Paxos prepare path as a generic std::exception with no type information in the log, surfacing on the coordinator as:

  WriteFailure: Failed to prepare ballot ... Replica errors:
  host_id ... -> seastar::rpc::remote_verb_error (std::exception)

Hoist the result_message::prepared into the outer scope so the pinned cache entry stays alive across co_await st->execute(...), closing the window in which a concurrent cache purge could invalidate the weak handle.

Fixes SCYLLADB-1173

backport: the patch is simple, we can backport it to all versions with "LWT over tablets" feature. Note that the problem is only in test runs in debug configuration, production is not affected.

Closes scylladb/scylladb#29675

* https://github.com/scylladb/scylladb:
  table_helper: retry insert prepare on concurrent cache invalidation
  paxos_state: keep prepared message alive across statement execution
2026-04-29 17:57:27 +02:00
Yaron Kaikov
d310e4b27d scylla-gdb: fix compaction-tasks command for intrusive list
Since commit e942c074f2 changed _tasks from std::list<shared_ptr<...>>
to a boost::intrusive_list, iterating yields raw compaction_task_executor
objects rather than shared_ptr wrappers. The GDB script was updated to
use intrusive_list() but still wrapped elements in seastar_shared_ptr(),
causing 'gdb.error: There is no member or method named _p' when
compaction tasks are active.

Move the seastar_shared_ptr unwrapping to the 6.2 compatibility
fallback path only, since the intrusive list path yields objects
directly.

Fixes: SCYLLADB-1762

Closes scylladb/scylladb#29690
2026-04-29 13:11:13 +03:00
Marcin Maliszkiewicz
45b4834ac4 Merge 'audit: fix maintenance socket startup/shutdown ordering' from Andrzej Jackowski
This series addresses three problems in the audit startup/shutdown
sequence:
1. [BUG] Shutdown SIGABRT. During graceful shutdown, deferred stops run in reverse order of construction. With the audit service constructed after the maintenance socket, audit was destroyed first, and in-flight queries on the maintenance socket could hit the destroyed audit service (assertion failure in sharded::local()).
2. [BUG] Startup audit bypass. The maintenance socket opened before audit storage was initialized, allowing queries (e.g. creating a superuser) to bypass auditing in that window.
3. [PROBLEM] Blocks SCYLLADB-1430. The existing order prevents audit configuration from being driven by group0 state, because audit started before group0.

The series is organized as: a test-helper refactor, a test for the audited maintenance-socket flow, a startup-phase split, the construction-order fix and its shutdown-race test, and finally the storage-before-socket fix and its startup-window test.

Fixes SCYLLADB-1615

No backport, bugs don't seem severe enough to justify backporting.

Closes scylladb/scylladb#29539

* github.com:scylladb/scylladb:
  audit: assert storage ordering invariants at runtime
  audit: start maintenance socket after audit storage
  audit: move audit construction before maintenance socket
  audit: split startup into construction and storage phases
  test: audit: verify maintenance socket operations are audited
  test: audit: parameterize source address in audit assertions
2026-04-29 10:37:38 +02:00
Łukasz Paszkowski
7e14ea5ac8 sstables: only wipe TemporaryHashes for sstable formats that have it
Commit 8d34127684 ("sstables: clean up TemporaryHashes file in wipe()")
unconditionally calls filename(..., component_type::TemporaryHashes)
inside filesystem_storage::wipe(). However, the TemporaryHashes
component is only registered in the component map of the 'ms' sstable
format. For older formats (ka, la, mc, md, me) the lookup goes through
sstable_version_constants::get_component_map(version).at(...) and throws
std::out_of_range.

The exception is then swallowed by the outer catch(...) in wipe(), which
just logs and ignores. As a side effect, the subsequent
remove_file(new_toc_name) is never reached and the TemporaryTOC
('*-TOC.txt.tmp') file is left as an orphan on disk after every unlink()
of a non-'ms' sstable.

Guard the lookup with get_component_map(version).contains() so the
cleanup is only attempted for formats that actually define the
component.
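
The failure mode and the guard can be modelled with a small Python sketch. The component maps and file names below are hypothetical placeholders, and a dict lookup raising KeyError stands in for .at() throwing std::out_of_range.

```python
# Hypothetical component maps: only 'ms' defines TemporaryHashes.
COMPONENT_MAPS = {
    "me": {"TOC": "TOC.txt", "Data": "Data.db"},
    "ms": {"TOC": "TOC.txt", "Data": "Data.db",
           "TemporaryHashes": "TemporaryHashes.db.tmp"},
}


def wipe(version, files, guarded=True):
    """Remove temporary files from the set `files`; never raises."""
    components = COMPONENT_MAPS[version]
    try:
        if not guarded or "TemporaryHashes" in components:
            # Unguarded, this lookup raises KeyError for 'me'.
            files.discard(components["TemporaryHashes"])
        # Only reached if the lookup above did not raise.
        files.discard("TOC.txt.tmp")
    except Exception:
        pass  # mirrors the outer catch(...): log and ignore
    return files
```

Without the guard, the swallowed exception skips the TOC cleanup for 'me'; with it, the temporary TOC is removed.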

Add a regression test in test/boost/sstable_directory_test.cc that
creates an 'me'-format sstable, unlinks it and asserts that the sstable
directory is left empty. Without the fix the test fails with a leftover
'me-...-TOC.txt.tmp' file.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1697

Closes scylladb/scylladb#29620
2026-04-29 08:06:36 +03:00
Botond Dénes
809f12f988 Merge 'test/cluster/dtest: fix ScyllaNode state not persisting across nodelist() calls' from Benny Halevy
`ScyllaCluster.nodelist()` creates new `ScyllaNode` objects on every call,
so per-node state set via `set_smp()`, `set_log_level()`, and
`_adjust_smp_and_memory()` was lost. This meant `set_smp()` had no effect
when `cluster.start()` was called after it, since `start_nodes()` calls
`nodelist()` internally which creates fresh nodes with default values.

- Add debug logging for smp/memory in ScyllaNode
- Store per-node settings (smp, memory, log levels) in a
  `ScyllaCluster._node_resources` dict keyed by server_id, so they survive
  `nodelist()` reconstruction. `ScyllaNode` restores its state from this dict
  on construction and saves it back whenever `set_smp()`, `set_log_level()`,
  or `_adjust_smp_and_memory()` modifies it.
- Add a reproducer test verifying `set_smp()` takes effect on restart
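
A stripped-down sketch of the persistence scheme, using hypothetical `Cluster` and `Node` classes rather than the real ccmlib ones; the default smp value of 2 is an assumption made for the example.

```python
class Node:
    def __init__(self, cluster, server_id):
        self._cluster = cluster
        self.server_id = server_id
        # Restore any state saved by an earlier Node for this server.
        saved = cluster._node_resources.get(server_id, {})
        self.smp = saved.get("smp", 2)  # default of 2 is hypothetical

    def set_smp(self, smp):
        self.smp = smp
        # Save back to the cluster so the value outlives this object.
        self._cluster._node_resources.setdefault(self.server_id, {})["smp"] = smp


class Cluster:
    def __init__(self, server_ids):
        self._server_ids = list(server_ids)
        self._node_resources = {}  # keyed by server_id

    def nodelist(self):
        # Fresh Node objects on every call, as in the original problem.
        return [Node(self, sid) for sid in self._server_ids]
```

A value set through one `nodelist()` result is visible on the objects returned by the next call.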

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1629

--

No backport needed: this only fixes dtest infrastructure, no production code
is affected.

Closes scylladb/scylladb#29549

* github.com:scylladb/scylladb:
  test/cluster/dtest: add test for node.set_smp() persistence
  test/cluster/dtest: cache ScyllaNode instances in ScyllaCluster
  test/cluster/dtest/ccmlib/scylla_node: add debug logging
2026-04-29 06:25:36 +03:00
Evgeniy Naydanov
96d3f13245 test: add --keep-duplicates and assign RUN_ID via shared cache
Add --keep-duplicates CLI argument to bypass deduplication and forward
to pytest, allowing duplicate test file arguments to be collected
multiple times.

Move RUN_ID assignment from pytest_collect_file to modify_pytest_item.
All File collectors for the same source file share a single run_ids
dict (via RUN_ID_CACHE stash key), so items from duplicate collection
arguments (e.g. with --keep-duplicates) automatically get unique IDs.

Remove CppFile.run_id cached_property — CppTestCase now reads RUN_ID
from its own item stash, which is set during modify_pytest_item.

Fix --repeat option default from string "1" to int 1 — argparse only
applies type= to CLI-parsed values, not defaults.

Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com>
2026-04-29 02:36:05 +00:00
Evgeniy Naydanov
497bd6b6c9 test/pylib/runner: fix disabled file collection
Return a DisabledFile collector instead of an empty list when all modes
are disabled for a file.  Returning an empty list caused subsequent
files to not get their stash items set because file_path was never
removed from REPEATING_FILES.

Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com>
2026-04-29 02:36:05 +00:00
Evgeniy Naydanov
43f06ed19d test.py: deduplicate CLI test arguments before passing to pytest
Backport the argument normalization algorithm from pytest 9.0.3 to
work around broken deduplication in pytest 8.3.5
(https://github.com/pytest-dev/pytest/issues/12083).

Duplicate or subsumed test paths (e.g. 'test/cql' and
'test/cql/lua_test.cql') are now collapsed before invoking pytest.

Revert this commit when upgrading to pytest 9.x.

Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com>
2026-04-29 02:36:05 +00:00
Evgeniy Naydanov
05f2c53931 Revert "test.py: fix test collection bug"
This reverts commit 92c09d106d.
2026-04-29 02:35:00 +00:00
Andrzej Jackowski
3755c370ac audit: assert storage ordering invariants at runtime
Abort if audit storage fails to start rather than silently
running with an unaudited maintenance socket. Also assert
that storage is already stopped when the audit service is
destroyed, documenting the defer-stack ordering requirement.

Refs SCYLLADB-1615
Refs SCYLLADB-1695
2026-04-28 18:58:49 +02:00
Andrzej Jackowski
543fb6a2db audit: start maintenance socket after audit storage
Without this, there is a window after startup where queries on
the maintenance socket bypass auditing because audit storage
is not yet initialized.

Fixes SCYLLADB-1615
2026-04-28 18:58:49 +02:00
Andrzej Jackowski
b7bc2d89e6 audit: move audit construction before maintenance socket
During graceful shutdown, deferred stops run in reverse order of
construction.  When the audit service was constructed after the
maintenance socket, audit was destroyed first.  A DML query
still in-flight on the maintenance socket could then bypass
auditing entirely.

Move construction as early as possible so the audit service
outlives the maintenance socket on the defer stack, and to
maximise the window in which attempts to use audit before
storage is ready are caught with on_internal_error_noexcept.

Refs SCYLLADB-1615
2026-04-28 18:58:49 +02:00
Andrzej Jackowski
bc67dd0b82 audit: split startup into construction and storage phases
The table-based audit backend needs Raft to create its keyspace,
but the audit service must exist earlier so that CQL paths don't
silently skip auditing.

Split startup into two phases: construction and storage
initialization.  Queries arriving between the two phases are
logged as errors.

This is a refactoring commit and the split sections will be
moved later in this patch series.

Refs SCYLLADB-1615
2026-04-28 18:58:42 +02:00
Andrzej Jackowski
1616c71bf0 test: audit: verify maintenance socket operations are audited
User creation via the maintenance socket should produce audit
entries, as this is the recommended flow for creating the
initial superuser when default credentials are disabled.

The test is parametrized by audit backend (table and syslog).
The maintenance socket source address is "::" because Seastar
returns a zero-initialised in6_addr for AF_UNIX sockets.

Test time in dev: 0.6s

Refs SCYLLADB-1615
2026-04-28 18:42:39 +02:00
Avi Kivity
c4de2b3c9d Merge 'test: fix flaky tablets test by using read barrier' from Aleksandra Martyniuk
Some tests in test_tablets.py read system_schema.keyspaces from an arbitrary node that may not have applied the latest schema change yet. Pin the read to a specific node and issue a read barrier before querying, ensuring the node has up-to-date data.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1700

Test fix; no backport

Closes scylladb/scylladb#29655

* github.com:scylladb/scylladb:
  test: fix flaky rack list conversion tests by using read barrier
  test: fix flaky test_enforce_rack_list_option by using read barrier
2026-04-28 17:15:59 +03:00
Petr Gusev
e6137ab11b table_helper: retry insert prepare on concurrent cache invalidation
table_helper::insert() retrieves the prepared statement via
cache_table_info() and then dereferences _prepared_stmt to read
bound_names. _prepared_stmt is a checked_weak_ptr into the prepared
statements cache and can be invalidated at any time by a concurrent
purge (for example, on a schema change).

cache_table_info() (re-)prepares the statement and assigns
_prepared_stmt before returning, and the strong pin held by the
result_message::prepared returned from qp.prepare() keeps the cache
entry alive only for the duration of try_prepare(). After try_prepare()
returns, the pin is gone and _prepared_stmt is the only remaining
handle on the entry.

In release builds this is fine: the chain of ready-future co_awaits
between try_prepare() finishing and _prepared_stmt->bound_names being
read resumes synchronously, so no other task -- in particular, no
cache purge -- can run in that window. In debug builds, however,
Seastar inserts a reactor preemption point on every co_await even when
the awaited future is ready. That preemption window is wide enough for
a concurrent invalidation to drop the freshly installed cache entry,
turning _prepared_stmt into a null weak handle and crashing the
subsequent dereference with checked_ptr_is_null_exception.

Wrap the cache_table_info() call in a loop that re-attempts the
preparation until a synchronous post-resume check finds _prepared_stmt
still valid. The check runs in the same task immediately after the
co_await resumes, with no co_await between the check and the
dereference, so a purge cannot slip in. _insert_stmt is a strong
shared_ptr to the statement object and is not affected by cache
invalidation, so it remains safe to use across the final co_await on
execute().

The other caller of cache_table_info(),
trace_keyspace_helper::apply_events_mutation(), accesses only the
strong _insert_stmt via insert_stmt() and never dereferences the weak
_prepared_stmt, so it is unaffected.

Refs SCYLLADB-1173
2026-04-28 16:03:06 +02:00
Ernest Zaslavsky
a97502920b test: optimize compaction_strategy_cleanup_method for remote storage
Parallelize SSTable creation using parallel_for_each. The file
count is made a parameter with a default of 64, allowing future
S3/GCS variants to use a smaller count if needed.
2026-04-28 16:59:38 +03:00
Ernest Zaslavsky
0b9a2844bd test: optimize stcs_reshape_overlapping for remote storage
Parallelize SSTable creation using parallel_for_each and reduce
the SSTable count from 256 to 64 for S3/GCS variants. The local
test variant retains the original 256 count.
2026-04-28 16:59:38 +03:00
Ernest Zaslavsky
ac89cffc9f test: optimize twcs_reshape_with_disjoint_set for remote storage
Parallelize SSTable creation across all sub-tests using
parallel_for_each and reduce the SSTable count from 256 to 64 for
S3/GCS variants.
Re-enable the S3 test variant that was previously disabled due to
taking 4+ minutes. With parallel creation and reduced count, the
test now completes in a reasonable time.
2026-04-28 16:59:37 +03:00
Ernest Zaslavsky
01b4292f87 test: parallelize SSTable creation in cleanup_during_offstrategy_incremental
Pre-extract mutation pairs and use parallel_for_each with
make_sstable_containing_async to create SSTables concurrently
instead of sequentially. The post-creation loop still runs serially
to collect token ranges and generations.
2026-04-28 16:59:37 +03:00
Ernest Zaslavsky
923ff9abc9 test: parallelize SSTable creation in run_incremental_compaction_test
Pre-extract mutation pairs and use parallel_for_each with
make_sstable_containing_async to create SSTables concurrently
instead of sequentially. The post-creation loop still runs serially
to collect token ranges and generations that depend on SSTable order.
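
The pattern, concurrent creation followed by an ordered serial pass, can be sketched with asyncio as a stand-in for Seastar. The `make_sstable` coroutine here is a hypothetical placeholder, not make_sstable_containing_async.

```python
import asyncio


async def make_sstable(index, mutations):
    # Placeholder for an I/O-bound creation step; the varying delay
    # makes the tasks finish out of order.
    await asyncio.sleep(0.001 * (3 - index % 3))
    return {"generation": index, "keys": sorted(mutations)}


async def create_all(mutation_sets):
    # Create all SSTables concurrently; gather() returns results in
    # argument order regardless of completion order.
    sstables = await asyncio.gather(
        *(make_sstable(i, m) for i, m in enumerate(mutation_sets))
    )
    # Serial post-creation pass over the ordered results.
    generations = []
    for sst in sstables:
        generations.append(sst["generation"])
    return generations
```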
2026-04-28 16:59:37 +03:00
Ernest Zaslavsky
6a25f52473 test: parallelize SSTable creation in offstrategy_sstable_compaction
Use parallel_for_each with make_sstable_containing_async to create
SSTables concurrently instead of sequentially, reducing wall-clock
time on remote storage backends (S3/GCS).
2026-04-28 16:59:37 +03:00
Ernest Zaslavsky
baca685629 test: parallelize SSTable creation in twcs_partition_estimate
Use parallel_for_each with make_sstable_containing_async to create
SSTables concurrently instead of sequentially, reducing wall-clock
time on remote storage backends (S3/GCS).
2026-04-28 16:59:37 +03:00
Ernest Zaslavsky
716202b839 test: add trace-level logging for S3 and HTTP in compaction tests
Raise log levels for s3 and gcp_storage from debug to trace, and add
trace-level logging for http and default_http_retry_strategy modules.
This provides better visibility into storage backend interactions
when debugging slow or failing compaction tests on remote storage.
2026-04-28 16:59:37 +03:00
Ernest Zaslavsky
a4ebe16517 test: make sstable test utilities natively async
The original make_memtable used seastar::thread::yield() for
preemption, which required all callers to run inside a
seastar::thread context. This prevented the utilities from being
used directly in coroutines or parallel_for_each lambdas.
Make the primary functions — make_memtable, make_sstable_containing,
and verify_mutation — return future<> directly. Callers now .get()
explicitly when in seastar::thread context, or co_await when in
a coroutine.
make_memtable now uses coroutine::maybe_yield() instead of
seastar::thread::yield(). verify_mutation is converted to
coroutines as well.
Requested in:
https://github.com/scylladb/scylladb/pull/29416#pullrequestreview-4112296282
2026-04-28 16:59:37 +03:00
Ernest Zaslavsky
4b637226a7 test: move make_memtable out of external_updater in row_cache_test
test_exception_safety_of_update_from_memtable called make_memtable
inside the row_cache::external_updater callback. external_updater
runs as a synchronous execute() call that must not yield, but
make_memtable calls seastar::thread::yield() every 10th mutation.

The bug was latent because the test only inserted 5 mutations, so
the yield was never reached. Move the call before the callback.

Prerequisite for the next patch, which changes make_memtable to
call make_memtable_async().get() -- that would yield on every
mutation via coroutine::maybe_yield(), making this bug visible.
2026-04-28 16:59:37 +03:00
Ernest Zaslavsky
7c09f35ddf test: increase S3 max connections for compaction tests
Increase max_connections from the default to 32 for the S3 endpoint
used in tests. This allows more concurrent HTTP connections to the S3
backend, which is needed to benefit from parallel SSTable creation
that will be introduced in subsequent commits.
2026-04-28 16:59:37 +03:00
Taras Veretilnyk
784127c40b sstables_loader: synchronously unlink streamed sstables before returning
mark_for_deletion() only set an in-memory flag; the actual file
deletion ran lazily when the last shared_sstable reference dropped,
leaving a window in which a follow-up scan of the upload directory
(e.g. a second 'nodetool refresh --load-and-stream') could observe a
partially-deleted sstable and fail with malformed_sstable_exception.

Force the unlink to complete before stream() returns. For tablet
streaming, partially-contained sstables span multiple per-tablet
batches, so a defer_unlinking flag postpones the unlink until after
all sstables are streamed; with vnodes, and for fully-contained sstables,
each sstable is streamed only once and can be removed right after being streamed.

Also add a FIXME on object_storage_base::wipe and strengthen the doc on storage::wipe to
make the never-fails contract explicit.
2026-04-28 14:52:28 +02:00
Patryk Jędrzejczak
d9dd3bfe53 Merge 'topology_coordinator: join tablet load stats refresh in stop()' from Andrzej Jackowski
Commit 2b7aa32 (topology_coordinator: Refresh load stats after
table is created or altered) registered topology_coordinator as a
schema change listener and added on_create_column_family which
fire-and-forgets _tablet_load_stats_refresh.trigger(). The
triggered task runs on the gossip scheduling group via
with_scheduling_group and accesses the topology_coordinator via
'this'.

stop() unregisters the listener but does not wait for any
in-flight refresh task. If a notification fires between
_tablet_load_stats_refresh.join() in run() and unregister_listener
in stop(), the scheduled task can outlive the topology_coordinator
and access freed memory after run_topology_coordinator's coroutine
frame is destroyed.

Wait for the refresh to complete in stop() after unregistering the
listener, ensuring no task can fire after destruction.

Fixes SCYLLADB-1728

Backport to 2026.1 and 2026.2, because the issue was introduced in 2b7aa32

Closes scylladb/scylladb#29653

* https://github.com/scylladb/scylladb:
  test: tablet_stats: reproduce shutdown refresh race
  topology_coordinator: join tablet load stats refresh in stop()
2026-04-28 12:54:28 +02:00
Benny Halevy
5eaa979f35 test/cluster/dtest: add test for node.set_smp() persistence
Add a test that reproduces SCYLLADB-1629: set_smp() had no effect
because nodelist() created new ScyllaNode objects on every call,
losing the _smp_set_during_test value. The test fails without the
fix in the previous patch.
2026-04-28 12:34:08 +03:00
Benny Halevy
7430c1efd7 test/cluster/dtest: cache ScyllaNode instances in ScyllaCluster
ScyllaCluster.nodelist() was creating new ScyllaNode objects on every
call, so per-node state set via set_smp(), set_log_level(), and
_adjust_smp_and_memory() was lost between calls.

Fix by caching ScyllaNode instances in a list populated by
_add_nodes() using the list returned by servers_add() in populate().
Nodes are assigned monotonically increasing names (node1, node2, ...).
nodelist() simply returns the cached list.
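
The caching idea can be sketched as follows. This is a minimal illustration only: `Node` and `Cluster` are hypothetical stand-ins, not the real ScyllaNode/ScyllaCluster classes, and `add_nodes` is an assumed simplification of `_add_nodes()`.

```python
class Node:
    """Hypothetical stand-in for ScyllaNode; holds per-node test state."""

    def __init__(self, name):
        self.name = name
        self.smp = None

    def set_smp(self, smp):
        self.smp = smp


class Cluster:
    """Hypothetical stand-in for ScyllaCluster that caches node objects."""

    def __init__(self):
        self._nodes = []

    def add_nodes(self, count):
        # Names increase monotonically: node1, node2, ...
        start = len(self._nodes) + 1
        new_nodes = [Node(f"node{i}") for i in range(start, start + count)]
        self._nodes.extend(new_nodes)
        return new_nodes

    def nodelist(self):
        # Return the cached objects instead of constructing fresh ones,
        # so state set on a node survives between calls.
        return self._nodes
```

With this shape, `cluster.nodelist()[0].set_smp(4)` is still visible on the next `nodelist()` call, which is the property the fix restores.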
2026-04-28 12:34:06 +03:00
Marcin Maliszkiewicz
b0f988afc4 Merge 'auth: fix shutdown and startup races in LDAP cache pruner' from Andrzej Jackowski
The LDAP role manager's `_cache_pruner` background fiber periodically calls cache::reload_all_permissions(). Two races cause it to hit SCYLLA_ASSERT(_permission_loader):
- Cross-shard race: The pruner used `_cache.container().invoke_on_all()` to reload permissions on every shard. Since both `service::start()` and `sharded<service>::stop()` execute per-shard in parallel, the pruner on one shard could call reload_all_permissions() on another shard before that shard set its loader (startup) or after it cleared its loader (shutdown). Each shard runs its own pruner instance, so reloading locally is sufficient; this also removes redundant N² reload calls.
- Intra-shard race: `service::stop()` cleared the permission loader and stopped the role manager concurrently (via when_all_succeed). A mid-reload pruner could yield and then call the now-null loader. Fixed by stopping the role manager first so the pruner is fully drained before the loader is cleared.

Fixes SCYLLADB-1679
Backport to 2026.2, introduced in 7eedf50c12

Closes scylladb/scylladb#29605

* github.com:scylladb/scylladb:
  auth: make shutdown the exact reverse of startup
  test: ldap: add test for pruner crash during shutdown
  auth: start authorizer and set permission loader before role manager
  auth: stop role manager before clearing permission loader
  auth: reload LDAP permission cache on local shard only
2026-04-28 11:16:07 +02:00
Botond Dénes
a7e9c0e6d2 Merge 'test.py: fix test collection bug' from Andrei Chekun
In certain circumstances the current way of collecting tests is error-prone. Collection can stop when the first file is skipped in the given mode, leaving the rest of the files passed on the CLI uncollected.
Another issue is that if a file is specified twice, once via its directory and once explicitly, an incorrect CppFile is stored in the stash, causing a KeyError.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1714

No backport, test framework bug fix only.

Closes scylladb/scylladb#29634

* github.com:scylladb/scylladb:
  test.py: fix framework test
  test.py: fix test collection bug
2026-04-28 11:52:35 +03:00
Petr Gusev
e39267b55f paxos_state: keep prepared message alive across statement execution
In do_execute_cql_with_timeout(), when the prepared statement was not
found in the cache, we called qp.prepare() and stored the returned
result_message::prepared in a local variable scoped to the 'if' block.
We then extracted ps_ptr (a checked_weak_ptr to the prepared statement)
from the message, let the message go out of scope at the end of the
'if', and used ps_ptr after a co_await on st->execute().

Since 3ac4e258e8 ("transport/messages: hold pinned prepared entry in
PREPARE result"), result_message::prepared owns a strong pinned
reference to the prepared cache entry. While qp.prepare() runs it also
holds its own pin on the entry, so on return the entry has at least
the pin owned by the returned message. As long as that message is
alive, the cache entry cannot be purged and the weak handle inside
ps_ptr remains promotable.

The lifetime gap manifested only in debug builds. qp.prepare() returns
a ready future on the cache-miss path, so in release builds the
co_await resumes synchronously: control flows from the assignment of
ps_ptr straight into st->execute() with no opportunity for any other
task (in particular, prepared cache invalidation triggered by a
concurrent schema change) to run in between. Debug builds, however,
force a reactor preemption point on every co_await even when the
awaited future is ready. With prepared_msg already destroyed at the
end of the 'if' block, the only remaining handle on the cache entry
was the weak ps_ptr, and the preemption gave a concurrent cache purge
- triggered, for example, by Raft schema changes received during a
node restart - the chance to drop the entry. The subsequent execute()
then failed when promoting the weak pointer with
checked_ptr_is_null_exception.

The exception propagated out of the Paxos prepare path as a generic
std::exception with no type information in the log, surfacing on the
coordinator as:

  WriteFailure: Failed to prepare ballot ... Replica errors:
  host_id ... -> seastar::rpc::remote_verb_error (std::exception)

Hoist the result_message::prepared into the outer scope so the pinned
cache entry stays alive across co_await st->execute(...), closing the
window in which a concurrent cache purge could invalidate the weak
handle.
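
The lifetime issue can be illustrated with a rough Python analogy. This is not the Seastar code: `PreparedStatement`, `PreparedMessage`, `prepare` and `run` are hypothetical names, a plain dict stands in for the prepared cache, and the sketch relies on CPython's immediate reference counting.

```python
import weakref


class PreparedStatement:
    """Hypothetical cache entry."""

    def __init__(self, text):
        self.text = text


class PreparedMessage:
    """Hypothetical result message that pins the cache entry."""

    def __init__(self, entry):
        self.entry = entry


def prepare(cache, text):
    entry = PreparedStatement(text)
    cache[text] = entry
    return PreparedMessage(entry)


def run(keep_message_alive):
    cache = {}
    message = prepare(cache, "SELECT 1")
    weak = weakref.ref(message.entry)  # analogous to the weak ps_ptr
    if not keep_message_alive:
        del message  # the message leaves scope, dropping its pin
    cache.clear()  # a concurrent purge drops the cache's reference
    # The weak handle is usable only if something still pins the entry.
    return weak() is not None
```

Here `run(False)` models the old code, where the purge leaves the weak handle dangling, and `run(True)` models the hoisted message that keeps the entry alive.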

Fixes SCYLLADB-1173
2026-04-28 10:42:13 +02:00
Botond Dénes
3ea4af1c8c Merge 'test/cluster/test_incremental_repair: fix flaky coordinator-change scenario' from Avi Kivity
- Ensure servers[1] is not the topology coordinator before restarting it, preventing the leader death + re-election + re-repair sequence that masked the compaction-merge bug
- Add a retry loop that detects post-restart leadership transfer to servers[1] via direct coordinator query, retrying up to 5 times

Fixes: SCYLLADB-1478

Backporting to 2026.2, which sees the failure regularly.

Closes scylladb/scylladb#29671

* github.com:scylladb/scylladb:
  test/cluster/test_incremental_repair: add retry for residual leadership race
  test/cluster/test_incremental_repair: fix flaky coordinator-change scenario
2026-04-28 09:05:02 +03:00
Andrzej Jackowski
459e3970cd test: tablet_stats: reproduce shutdown refresh race
The coordinator can receive a schema-change notification after run()
finishes but before stop() unregisters listeners. The test pins that
window with error injections and verifies stop() waits for the refresh
instead of letting it outlive the coordinator.

Test time in dev: 9.51s

Refs SCYLLADB-1728
2026-04-28 08:00:54 +02:00
Andrzej Jackowski
8756f7c068 topology_coordinator: join tablet load stats refresh in stop()
Commit 2b7aa3211d made schema changes trigger tablet load stats
refreshes in the background. A notification can still arrive after
run() stops the periodic refresher and before the coordinator object
is destroyed.

Move lifecycle subscription cleanup to stop() and join the serialized
refresh there after unregistering refresh trigger sources. This keeps
the coordinator alive until notification-triggered refresh work has
completed.

Fixes SCYLLADB-1728
2026-04-28 07:37:28 +02:00
Avi Kivity
2615d0e8d8 test/cluster/test_incremental_repair: add retry for residual leadership race
There is a small race window where Raft leadership could transfer back
to servers[1] between the ensure_group0_leader_on() check and the
actual restart.  If this happens, the new coordinator re-initiates
repair and masks the compaction-merge bug.

Extract the core test logic into _do_race_window_promotes_unrepaired_data()
which directly checks get_topology_coordinator() after restart and raises
_LeadershipTransferred if servers[1] became coordinator.  The test
function calls this helper in a retry loop (up to 5 attempts).
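
A minimal sketch of such a retry loop, assuming a helper that raises when leadership moved. `LeadershipTransferred` and `run_with_retries` are illustrative names, not the test's actual code.

```python
class LeadershipTransferred(Exception):
    """Raised by an attempt when the restarted server became coordinator."""


def run_with_retries(attempt_fn, max_attempts=5):
    # Re-run the whole scenario only when the attempt was invalidated by a
    # leadership transfer; any other failure propagates immediately.
    for attempt in range(1, max_attempts + 1):
        try:
            return attempt_fn()
        except LeadershipTransferred:
            if attempt == max_attempts:
                raise
```

Retrying only on the dedicated exception keeps genuine assertion failures visible on the first attempt.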

Refs: SCYLLADB-1478
2026-04-27 21:11:06 +03:00
Avi Kivity
914b70c75b test/cluster/test_incremental_repair: fix flaky coordinator-change scenario
The test_incremental_repair_race_window_promotes_unrepaired_data test
was flaky because it hardcoded servers[1] as the restart target but did
not ensure servers[1] was NOT the topology coordinator.

When servers[1] happened to be the Raft group0 leader (topology
coordinator), restarting it killed the leader, forced a new election,
and the new coordinator re-initiated tablet repair.  This re-repair
flushes memtables on all replicas via take_storage_snapshot() and marks
the resulting sstables as repaired -- causing post-repair keys to appear
in repaired sstables on servers[0] and servers[2].  The test then hit
the wrong assertion (servers[0]/[2] contaminated).

Fix: before starting the repair, check whether servers[1] is the
topology coordinator.  If so, move leadership to another server via
ensure_group0_leader_on() so that restarting servers[1] only kills a
follower -- which does not trigger an election or coordinator change.

Reproducibility was confirmed by forcing leadership to servers[1] via
ensure_group0_leader_on() and observing deterministic failure with all
three servers showing post-repair keys in repaired sstables (confirming
the re-repair scenario), then verifying the fix passes reliably.

Fixes: SCYLLADB-1478
2026-04-27 21:08:12 +03:00
Aleksandra Martyniuk
6b7ce5e244 test: fix flaky rack list conversion tests by using read barrier
test_numeric_rf_to_rack_list_conversion and
test_numeric_rf_to_rack_list_conversion_abort were reading
system_schema.keyspaces from an arbitrary node that may not have
applied the latest schema change yet. Pin the read to a specific node
and issue a read barrier before querying, ensuring the node has
up-to-date data.
2026-04-27 15:19:09 +02:00
Aleksandra Martyniuk
9d3d424d58 test: fix flaky test_enforce_rack_list_option by using read barrier
The test was reading system_schema.keyspaces from an arbitrary node
that may not have applied the latest schema change yet. Pin the read
to a specific node and issue a read barrier before querying, ensuring
the node has up-to-date data.
2026-04-27 14:44:38 +02:00
Ferenc Szili
6b3e18c4a9 test: verify load balancer handles dropped tables gracefully
Add test_load_balancing_with_dropped_table that simulates the race between
DROP TABLE and the load balancer by capturing a token metadata snapshot
before dropping the table, then passing the stale snapshot to
balance_tablets(). Verifies it completes without aborting and produces no
migrations for the dropped table.
2026-04-27 10:33:56 +02:00
Ferenc Szili
4987204f71 tablet_allocator: handle dropped tables gracefully in get_schema_and_rs
The load balancer's get_schema_and_rs() would trigger on_internal_error when
a table present in the token metadata snapshot had been concurrently dropped
from the live schema. This race is possible because the balancer coroutine
yields between building the candidate list and checking replication
constraints, allowing a DROP TABLE schema mutation to be applied by another
fiber in the meantime.

Change get_schema_and_rs() to return {nullptr, nullptr} for dropped tables
instead of aborting. Update all callers to skip dropped tables:
- make_sizing_plan: continue to next table
- make_resize_plan: continue to next table (merge suppression is moot)
- check_constraints: return skip_info with empty viable targets
- get_rs: return nullptr, checked by check_constraints
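
The skip-on-dropped-table pattern can be sketched in Python as below. This is an illustration under assumptions: a dict stands in for the live schema, `None` for nullptr, and `tables_to_plan` is a hypothetical caller, not one of the real planning functions.

```python
def get_schema_and_rs(live_schema, table_id):
    # Return (None, None) for a table that was dropped after the snapshot
    # was taken, instead of treating it as an internal error.
    entry = live_schema.get(table_id)
    if entry is None:
        return None, None
    return entry


def tables_to_plan(snapshot_tables, live_schema):
    planned = []
    for table_id in snapshot_tables:
        schema, rs = get_schema_and_rs(live_schema, table_id)
        if schema is None:
            continue  # dropped concurrently: skip, produce no migrations
        planned.append(table_id)
    return planned
```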
2026-04-27 10:33:53 +02:00
Andrei Chekun
f2f4915e09 test.py: fix framework test
The framework test was not skipping the unit directory where the C++ tests
are located. With the bug fix, it started to fail. Ignore this
directory as well.
2026-04-25 18:04:55 +02:00
Andrei Chekun
92c09d106d test.py: fix test collection bug
In certain circumstances the current way of collecting tests is error-prone.
Collection can stop when the first file is skipped in the given mode, leaving
the rest of the files passed on the CLI uncollected.
Another issue is that if a file is specified twice, once via its directory
and once explicitly, an incorrect CppFile is stored in the stash, causing
a KeyError.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1714
2026-04-24 17:57:11 +02:00
Dimitrios Symonidis
c40842f60a db, sstables: add node_owner to sstables registry primary key
Add a node_owner column (locator::host_id) to system.sstables and
make it part of the partition key, so the primary key becomes
  PRIMARY KEY ((table_id, node_owner), generation).

This is the first step toward moving the sstables registry into
system_distributed: once distributed, each node's startup scan
must read only the rows it owns, which requires the owning node
to be part of the partition key. Partitioning by (table_id,
node_owner) turns that scan into a single-partition read of
exactly the local node's rows.

The new column is populated via sstables_manager::get_local_host_id().
No backward compatibility is preserved; the feature is experimental
and gated by keyspace-storage-options.
2026-04-24 16:41:09 +02:00
Dimitrios Symonidis
ce78c5113e db, sstables: rename sstables registry column owner to table_id
The partition-key column in system.sstables named 'owner' actually
holds a table_id. Rename the CQL column and the matching C++
parameter and member names so the identifier describes what it
stores. No behavior change.

This prepares the schema for an upcoming node_owner partition-key
column (the local host id), which needs a free name.
2026-04-24 16:24:07 +02:00
Pavel Emelyanov
71b9704464 storage_proxy: Use shared updateable_timeout_config for CAS contention timeout
The cas_contention_timeout_in_ms option is already exposed via the
shared updateable_timeout_config as cas_timeout_in_ms. Read it from
there instead of going through db::config, dropping another use of
database as a db::config proxy.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-24 16:24:32 +03:00
Pavel Emelyanov
33cd3b5d68 alternator: Use shared updateable_timeout_config by reference
Pass sharded<updateable_timeout_config>& into alternator::controller
and through to alternator::server, which now stores a reference
instead of constructing its own updateable_timeout_config from
proxy.data_dictionary().get_config(). This removes the last
creator of a per-owner updateable_timeout_config copy and completes
the consolidation onto the single sharded<updateable_timeout_config>
instance built in main.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-24 15:29:39 +03:00
Pavel Emelyanov
1a045d0cdd cql_transport: Use shared updateable_timeout_config by reference
Pass sharded<updateable_timeout_config>& into cql_transport::controller,
which feeds the shard-local instance as a reference into
cql_server_config::timeout_config. This drops the per-shard local
updateable_timeout_config constructed from db::config inside the
controller's sharded_parameter lambda, replacing it with a reference
into the shared sharded instance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-24 15:21:31 +03:00
Pavel Emelyanov
aa99c1fd6e storage_proxy: Use shared updateable_timeout_config by reference
Drop storage_proxy's own updateable_timeout_config member built from
db::config and take a reference to the shared sharded instance
introduced by the previous patch. Both main and cql_test_env pass
std::ref(timeout_cfg) into storage_proxy::start so each shard's
storage_proxy references its shard-local updateable_timeout_config.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-24 15:07:21 +03:00
Pavel Emelyanov
7b7295fde0 main: Introduce sharded<updateable_timeout_config>
Build a single sharded updateable_timeout_config from db::config in
both main and cql_test_env, sitting next to sharded<cql_config>.
Subsequent patches migrate storage_proxy, the CQL transport controller
and alternator server from their per-owner updateable_timeout_config
copies to references into this shared instance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-24 15:03:35 +03:00
Andrzej Jackowski
8855e77465 auth: make shutdown the exact reverse of startup
The previous parallel stop of the authenticator and authorizer
was a micro-optimization that obscured the lifecycle invariant
that shutdown should reverse startup.
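
The invariant can be sketched with a small stack-based helper. This is only an illustration: `Lifecycle` is a hypothetical class, and the component names are labels rather than the real auth service objects.

```python
class Lifecycle:
    """Hypothetical helper that stops components in reverse start order."""

    def __init__(self):
        self._started = []
        self.log = []

    def start(self, name):
        self.log.append(f"start {name}")
        self._started.append(name)

    def stop_all(self):
        # Pop from the end so the last component started is stopped first.
        while self._started:
            self.log.append(f"stop {self._started.pop()}")
```

Starting "authorizer", "permission_loader" and "role_manager" in that order and then calling `stop_all()` stops the role manager first, so nothing that depends on the loader is still running when the loader goes away.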

Refs SCYLLADB-1679
2026-04-24 13:34:09 +02:00
Andrzej Jackowski
adf1e26bab test: ldap: add test for pruner crash during shutdown
Verify that service::stop() drains the LDAP pruner before
clearing the permission loader. The test installs a slow
permission loader and confirms the pruner is actively
reloading when teardown begins.

Refs SCYLLADB-1679
2026-04-24 13:34:09 +02:00
Andrzej Jackowski
37a547604f auth: start authorizer and set permission loader before role manager
LDAP role manager starts a pruner fiber that calls
reload_all_permissions() which asserts _permission_loader is set.
The permission loader calls _authorizer->authorize(), so the
authorizer must be started before the loader is set.

Start authorizer, then set the permission loader, then start the
role manager, ensuring both dependencies are satisfied before the
pruner can fire.

Fixes SCYLLADB-1679
2026-04-24 13:34:09 +02:00
Andrzej Jackowski
c3e5285d45 auth: stop role manager before clearing permission loader
service::stop() cleared the permission loader and stopped
the role manager concurrently (via when_all_succeed). The
LDAP pruner could be mid-reload at a yield point when the
loader was set to null, causing it to call a null function.

Stop the role manager first so the pruner is fully drained
before the loader is cleared.

Fixes SCYLLADB-1679
2026-04-24 13:34:09 +02:00
Pavel Emelyanov
7ca8a863d9 storage_proxy: Keep own updateable_timeout_config
Storage_proxy was reading read_request_timeout_in_ms and
write_request_timeout_in_ms directly from db::config via
database::get_config() at four call sites. Give storage_proxy its own
updateable_timeout_config member (built from db::config the same way
cql transport controller and alternator server do) and use its
read_timeout_in_ms / write_timeout_in_ms observers instead.

Storage_proxy no longer needs database::get_config() for coordinator
timeout values. A later refactor may turn these per-owner copies into
references to a single shared updateable_timeout_config.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-24 14:27:09 +03:00
Andrzej Jackowski
f75e5ac65b auth: reload LDAP permission cache on local shard only
The LDAP role manager's _cache_pruner fiber used
invoke_on_all() to reload permissions on every shard.
Since auth::service::start() runs on all shards in
parallel via invoke_on_all(), the pruner on shard X
could call reload_all_permissions() on shard Y before
shard Y finished start() and set its permission loader,
hitting SCYLLA_ASSERT(_permission_loader). The same
cross-shard race existed during shutdown.

Each shard runs its own pruner instance, so reloading
locally is sufficient — all shards are still covered.
This also removes redundant N-squared reload calls.

Refs SCYLLADB-1679
2026-04-24 13:06:58 +02:00
Pavel Emelyanov
111165d9de view: Turn calculate_view_update_throttling_delay into node_update_backlog member
The free function calculate_view_update_throttling_delay() took the
view_flow_control_delay_limit_in_ms as a parameter, which forced its
two callers (storage_proxy and view_update_generator) to fish the
option out of db::config via database::get_config(). Now that the
option lives on node_update_backlog, make the throttling calculation a
member of node_update_backlog and have the callers invoke it on their
node_update_backlog reference.

This removes two database::get_config() call sites.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-24 13:52:12 +03:00
Pavel Emelyanov
855372db3c view: Place view_flow_control_delay_limit_in_ms on node_update_backlog
Store the view_flow_control_delay_limit_in_ms config option as an
updateable_value on node_update_backlog. The value is threaded from
main.cc into the backlog object at construction time. Existing call
sites (tests) that construct node_update_backlog without the option
continue to work via a default argument.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-24 13:47:54 +03:00
Pavel Emelyanov
ec2339e635 view: Add node_update_backlog reference to view_update_generator
Pass node_update_backlog explicitly to view_update_generator via its
constructor and start() call. This is plumbing only; no behavior change.
A subsequent patch will use this reference to compute view update
throttling delays without going through database::get_config().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-24 13:45:46 +03:00
Benny Halevy
6cb4c27f8c test/cluster/dtest/ccmlib/scylla_node: add debug logging
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-04-23 09:21:06 +03:00
Andrzej Jackowski
2503546251 test: audit: parameterize source address in audit assertions
Maintenance socket connections report a different source address
than regular CQL connections. Make the source field configurable
in the audit test helpers so that upcoming maintenance socket
tests can verify the correct address.

Also fix the syslog backend address parser to handle IPv6
addresses formatted as [ip]:port.

Refs SCYLLADB-1615
2026-04-23 07:02:02 +02:00
Piotr Szymaniak
9a86044c63 test: Stop providing alternator-streams experimental flag
Now that alternator-streams is no longer an experimental feature,
stop passing it in test configurations.
2026-04-22 15:25:37 +02:00
Piotr Szymaniak
870013b437 alternator: Graduate Alternator Streams from experimental
Alternator Streams were experimental until 2026.2, when they became GA.
Stop requiring `--experimental-features=alternator-streams` by:

- Removing ALTERNATOR_STREAMS from the experimental feature enum
- Mapping "alternator-streams" to UNUSED for backward compatibility
- Removing the gating that disabled the ALTERNATOR_STREAMS gossip
  feature when the experimental flag was absent
- Removing the runtime guard that rejected StreamSpecification requests
  without the feature flag
- Updating config_test to reflect the new UNUSED mapping

The gms::feature alternator_streams is kept for rolling upgrade
compatibility with older nodes.

Fixes SCYLLADB-1680
2026-04-22 15:22:15 +02:00
Michał Jadwiszczak
2b29962583 test/strong_consistency: verify metrics
This patch adds simple asserts to an existing `test_basic_write_read`
to verify that strong consistency metrics are correctly collected.
2026-04-22 10:06:49 +02:00
Michał Jadwiszczak
7352b37048 test/cluster/test_view_building_coordinator: add reproducer for tombstone threshold warning 2026-04-22 09:10:14 +02:00
Michał Jadwiszczak
396d4b17a0 docs: document tombstone avoidance in view_building_tasks 2026-04-22 09:10:14 +02:00
Michał Jadwiszczak
1162fd315e view_building: add task_uuid_generator to view_building_task_mutation_builder
Following the previous commit, use the generator in the view building task mutation builder.
2026-04-22 09:10:14 +02:00
Michał Jadwiszczak
b64f2d2e90 view_building: introduce task_uuid_generator
With the new `min_alive_uuid` saved in the group0 table,
we need to make sure that all new tasks are created with time uuid
greater than the value saved in `min_alive_uuid`.

This patch introduces the `task_uuid_generator` which ensures that
when we are generating multiple tasks in one group0 command, each task
will have a unique time uuid and each time uuid will be greater than
`min_alive_uuid`.
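
The two guarantees (uniqueness within a batch, and every id above the stored minimum) can be sketched as follows. This is a simplification under assumptions: plain integers stand in for timeuuid timestamps, and `TaskIdGenerator` is a hypothetical name, not the real `task_uuid_generator` interface.

```python
import time


class TaskIdGenerator:
    """Hypothetical generator of strictly increasing ids above a minimum."""

    def __init__(self, min_alive_id, clock=time.time_ns):
        self._last = min_alive_id
        self._clock = clock

    def next(self):
        # Use the clock when it is ahead; otherwise bump the last id by one,
        # so ids are unique and always greater than min_alive_id.
        self._last = max(self._clock(), self._last + 1)
        return self._last
```

Even if the clock returns the same value for every call in one batch, or a value below the stored minimum, consecutive ids still increase.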
2026-04-22 09:10:14 +02:00
Michał Jadwiszczak
e5a6ed72b9 view_building: store min_alive_uuid in view building state
Because we now limit the range we read from the view building
tasks table, we need to make sure that new tasks are created with a larger
uuid than the `min_alive_uuid`.

To do that, we need to be able to see the current `min_alive_uuid`
while creating new tasks.
2026-04-22 09:10:14 +02:00
Michał Jadwiszczak
8d0943ce35 view_building: set min_task_id when GC-ing finished tasks
When VIEW_BUILDING_TASKS_MIN_TASK_ID feature is active, write min_task_id
alongside the range tombstone in the same Raft batch. min_task_id is set
to min_alive_uuid so subsequent get_view_building_tasks() scans start
exactly at the first alive row, skipping all tombstoned rows.

When all tasks are deleted, min_task_id is set to a freshly generated UUID
to ensure future tasks (which will have larger timeuuids) are not skipped.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-22 09:10:14 +02:00
Michał Jadwiszczak
b689de0414 view_building: add min_task_id support to view_building_task_mutation_builder
Add set_min_task_id(id) which writes the min_task_id static cell to the main
"view_building" partition. The static cell is written as part of the same
mutation as the range tombstone, keeping everything in one Raft batch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-22 09:10:14 +02:00
Michał Jadwiszczak
8670111cd4 view_building: add min_task_id static column and bounded scan to system_keyspace
Add a min_task_id timeuuid static column to system.view_building_tasks.

When VIEW_BUILDING_TASKS_MIN_TASK_ID feature is active, get_view_building_tasks()
reads min_task_id first using a static-only partition slice (empty _row_ranges +
always_return_static_content). This makes the SSTable reader stop immediately
after the static row before processing any clustering tombstones, so the read
never triggers tombstone_warn_threshold warnings.

min_task_id is then used as AND id >= ? lower bound for the main task scan,
skipping all tombstoned rows below the boundary.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-22 09:10:14 +02:00
Michał Jadwiszczak
8f741b462b view_building: use range tombstone when GC-ing finished tasks
Instead of issuing one row tombstone per finished task, collect all tasks
to delete, find the smallest timeuuid among alive tasks (min_alive_uuid),
then emit a single range tombstone [before_all, min_alive_uuid) covering
all tasks below that boundary. Tasks above the boundary (rare: finished
task interleaved with alive tasks) still get individual row tombstones.

When no alive tasks remain, del_all_tasks() covers the entire partition
with a single range tombstone.
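
The deletion plan can be sketched as below. This is illustrative only: integers stand in for timeuuids, and `plan_gc` and its result dict are hypothetical, not the coordinator's real data structures.

```python
def plan_gc(tasks):
    """tasks maps task id -> True if the task is finished and can be deleted."""
    finished = sorted(tid for tid, done in tasks.items() if done)
    alive = [tid for tid, done in tasks.items() if not done]
    if not finished:
        return None  # nothing to delete
    if not alive:
        # No alive tasks: one tombstone covers the whole partition.
        return {"delete_all": True, "range_end": None, "row_tombstones": []}
    min_alive = min(alive)
    has_below = any(tid < min_alive for tid in finished)
    return {
        "delete_all": False,
        # One range tombstone covers every id in [before_all, min_alive).
        "range_end": min_alive if has_below else None,
        # Finished tasks above the boundary still get individual tombstones.
        "row_tombstones": [tid for tid in finished if tid > min_alive],
    }
```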

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-22 09:10:13 +02:00
Michał Jadwiszczak
91697d597c view_building: add range tombstone support to view_building_task_mutation_builder
Add del_tasks_before(id) which emits a range tombstone [before_all, id)
and del_all_tasks() which covers the entire clustering range. These will
be used by the coordinator to delete finished tasks in bulk instead of
issuing one row tombstone per task.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-22 09:10:13 +02:00
Michał Jadwiszczak
e0942bb45a view_building: introduce VIEW_BUILDING_TASKS_MIN_TASK_ID cluster feature
This feature will be used to gate the use of min_task_id static column
in system.view_building_tasks, which will be added in a subsequent commit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-22 09:10:12 +02:00
Michał Jadwiszczak
f77c258c8e strong_consistency: wire up metrics to operations
Track write and read latency using latency_counter in
coordinator::mutate() and coordinator::query().

Count commit_status_unknown errors in coordinator::mutate().

Count node and shard bounces in redirect_statement(), passing the
coordinator's stats from both modification_statement and
select_statement.
2026-04-22 08:59:59 +02:00
Michał Jadwiszczak
55293c34f8 strong_consistency: add stats struct and metrics registration
Introduce per-shard metrics infrastructure for strong consistency
operations under the "strong_consistency_coordinator" metrics category.

The stats struct contains latency histograms/summaries for reads and
writes (using timed_rate_moving_average_summary_and_histogram, same as
storage_proxy uses for eventual consistency), and uint64_t counters for
write_status_unknown, node bounces, and shard bounces.

Metrics are registered in the coordinator constructor but are not yet
wired to actual operations — all counters remain at zero.
2026-04-22 08:58:38 +02:00
Taras Veretilnyk
7cdf215999 sstables: make sstable::unlink() idempotent
Avoid duplicate work when unlink() is called more than once on the
same sstable. This happens when a caller invokes unlink() explicitly
on an sstable that is also marked for deletion: the destructor's
close_files() path would otherwise call unlink() again, re-firing
_on_delete, double-counting _stats.on_delete() and double-invoking
_manager.on_unlink().
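
The idempotency guard can be sketched in Python as follows. This is a simplified analogy: `SSTable` here is a hypothetical class with a single callback, not the real C++ sstable with its stats and manager hooks.

```python
class SSTable:
    """Hypothetical sstable whose unlink() runs its side effects only once."""

    def __init__(self, on_unlink):
        self._on_unlink = on_unlink
        self._unlinked = False
        self.marked_for_deletion = False

    def unlink(self):
        if self._unlinked:
            return  # already unlinked: do not re-fire callbacks or counters
        self._unlinked = True
        self._on_unlink()

    def close(self):
        # Mirrors the destructor path: unlink if marked for deletion.
        if self.marked_for_deletion:
            self.unlink()
```

An explicit `unlink()` followed by `close()` on a table marked for deletion invokes the callback once.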
2026-04-21 22:41:02 +02:00
Ernest Zaslavsky
9faaf1f09c test: extract object storage helpers to test/pylib/object_storage.py
Move S3/GCS server classes (S3Server, MinioWrapper, GSFront, GSServer),
factory functions (create_s3_server, create_gs_server), CQL helpers
(format_tuples, keyspace_options), bucket naming (_make_bucket_name),
and the s3_server fixture from test/cluster/object_store/conftest.py
into a shared module at test/pylib/object_storage.py.
The conftest.py is now a thin wrapper that re-exports symbols and
defines only the fixtures specific to the object_store suite
(object_storage, s3_storage).  All external importers are updated.
Old class names (S3_Server, GSServer) are kept as aliases for
backward compatibility.
2026-04-21 19:08:57 +03:00
Ernest Zaslavsky
e9724f52a9 test: add per-test bucket isolation to object_store fixtures
Create a unique S3/GCS bucket for each test function using the pytest
test name (from request.node.name), sanitized into a valid bucket name.
This ensures tests do not share state through a common bucket and makes
bucket names meaningful for debugging (e.g. test-basic-s3-a1b2c3d4).
Each fixture now calls create_test_bucket() on setup and
destroy_test_bucket() on teardown.
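
A minimal sketch of deriving a bucket name from a test name. It assumes the common S3 naming rules (lowercase letters, digits and hyphens, 3 to 63 characters); `make_bucket_name` is a hypothetical function and may differ from the actual helper in the test suite.

```python
import re
import uuid


def make_bucket_name(test_name, suffix=None):
    # A short random suffix keeps names unique across runs.
    suffix = suffix or uuid.uuid4().hex[:8]
    # Lowercase, then collapse every run of other characters into a hyphen.
    base = re.sub(r"[^a-z0-9]+", "-", test_name.lower()).strip("-")
    # Leave room for "-<suffix>" within the 63-character limit.
    max_base = 63 - len(suffix) - 1
    base = base[:max_base].rstrip("-") or "test"
    return f"{base}-{suffix}"
```

For example, `make_bucket_name("test_basic[s3]", "a1b2c3d4")` yields `test-basic-s3-a1b2c3d4`.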
2026-04-21 19:08:57 +03:00
Ernest Zaslavsky
8e02e99c36 s3: add client::make overload with custom retry strategy
Add a client::make overload that accepts a custom retry strategy,
allowing callers to override the default exponential backoff.
Use this in s3_test.cc with a test_retry_strategy that sleeps only
1ms between retries instead of exponential backoff, significantly
reducing test runtime for tests that encounter transient errors
during bucket creation/deletion.
2026-04-21 19:08:57 +03:00
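As a rough illustration of why a pluggable retry strategy shortens test runtime, the sketch below compares the total wait of an exponential backoff with a fixed 1ms delay. The interface, the class names, and the 100ms base and 10s cap are assumptions made for this example; they are not the actual s3::client or retry strategy API.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>

using ms = std::chrono::milliseconds;

// Hypothetical interface: how long to wait before retry number `attempt` (0-based).
struct retry_strategy {
    virtual ~retry_strategy() = default;
    virtual ms delay_before_retry(unsigned attempt) const = 0;
};

// Assumed default: delay doubles on each attempt, up to a cap.
struct exponential_backoff final : retry_strategy {
    ms base{100};
    ms cap{10000};
    ms delay_before_retry(unsigned attempt) const override {
        const auto shift = std::min(attempt, 20u);   // keep the shift well-defined
        return std::min(ms(base.count() << shift), cap);
    }
};

// Test-oriented strategy: always sleep the same short interval.
struct fixed_delay final : retry_strategy {
    ms delay{1};
    ms delay_before_retry(unsigned) const override { return delay; }
};

// Sums the delays a caller would sleep through for `retries` consecutive retries.
ms total_wait(const retry_strategy& s, unsigned retries) {
    ms total{0};
    for (unsigned i = 0; i < retries; ++i) {
        total += s.delay_before_retry(i);
    }
    return total;
}
```

Under these assumed parameters, five retries cost 3100ms with the backoff and 5ms with the fixed delay, which is the kind of saving the commit message refers to.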
Ernest Zaslavsky
e175088db5 test: add s3_test_fixture and migrate tests to per-bucket isolation
Add s3_test_fixture, an RAII class that creates a unique S3 bucket
on construction and tears down everything (delete all objects, delete
bucket, close client) on destruction. Bucket names are derived from
the Boost test name, pid, and a counter to guarantee uniqueness
across concurrent test processes. Names are sanitized to comply with
S3 bucket naming rules (lowercase, hyphens, 3-63 chars).
Migrate all S3 tests that create objects to use the fixture, removing
manual bucket name construction, deferred_delete_object cleanup, and
per-test deferred_close calls. The fixture owns the client lifecycle.
Tests with special semaphore requirements (broken semaphore for
fallback test, small semaphore for abort test, 1MiB for memory
test) create the fixture with a separate normal-sized semaphore and
use their own constrained client for the test operation.
The upload_file tests are converted from SEASTAR_TEST_CASE
(coroutine) to SEASTAR_THREAD_TEST_CASE since the fixture requires
thread context for .get() calls.
Broaden the minio policy to allow the test user to create and delete
arbitrary buckets (s3:CreateBucket, s3:DeleteBucket, s3:ListAllMyBuckets
on arn:aws:s3:::*), and operate on objects in any bucket.
2026-04-21 19:08:57 +03:00
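The bucket naming rule mentioned above (lowercase, hyphens, 3-63 chars, unique via pid and counter) could be sketched as follows. The function name make_bucket_name and the exact layout of the suffix are assumptions for illustration; the fixture's real implementation may differ.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical helper: derive an S3-compatible bucket name from a test name,
// a process id and a per-process counter.
std::string make_bucket_name(const std::string& test_name, long pid, unsigned counter) {
    std::string name;
    for (unsigned char c : test_name) {
        if (std::isalnum(c)) {
            name += static_cast<char>(std::tolower(c));
        } else if (!name.empty() && name.back() != '-') {
            name += '-';   // collapse any run of other characters into one hyphen
        }
    }
    while (!name.empty() && name.back() == '-') {
        name.pop_back();
    }
    const std::string suffix = "-" + std::to_string(pid) + "-" + std::to_string(counter);
    const std::size_t max_len = 63;
    if (name.size() + suffix.size() > max_len) {
        name.resize(max_len - suffix.size());   // truncate the readable part, keep the suffix
    }
    while (!name.empty() && name.back() == '-') {
        name.pop_back();
    }
    if (name.empty()) {
        name = "test";   // fallback so the name never starts with a hyphen
    }
    return name + suffix;
}
```

Keeping the pid and counter in the suffix, and truncating only the readable prefix, preserves uniqueness even for very long test names.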
Ernest Zaslavsky
cc0b9791c7 s3: add create_bucket and delete_bucket to client
Add create_bucket (PUT /<bucket>) and delete_bucket (DELETE /<bucket>)
methods to s3::client, following the same make_request pattern used by
existing object operations.
These will be used by the test infrastructure to create per-test
isolated buckets.
2026-04-21 19:08:57 +03:00
Gleb Natapov
133768a1f0 db/system_distributed_keyspace: remove system_distributed_everywhere since it is unused 2026-04-20 12:52:25 +03:00
Gleb Natapov
66b3fc4e2c db/system_distributed_keyspace: drop CDC_TOPOLOGY_DESCRIPTION and CDC_GENERATIONS_V2
They are used only to prevent permission changes, but since the tables
are unused, even if they exist there is no problem with changing their
permissions, so there is no point keeping the definitions just for that.
2026-04-16 14:11:01 +03:00
Gleb Natapov
b55748ec54 db/system_distributed_keyspace: remove unused code
The system_distributed_keyspace::CDC_TOPOLOGY_DESCRIPTION table has not
been created since scylla-4.6.0, but we still have code to mark it as sync.
2026-04-15 15:48:49 +03:00
Gleb Natapov
8713eda271 db/system_distributed_keyspace: drop old cdc_generation_descriptions_v2 table
Generation management moved to raft and the old table is no longer
used.
2026-04-15 15:48:48 +03:00
Gleb Natapov
24171ce62b db/system_distributed_keyspace: drop old service_levels table
Service level management moved to raft and the old table is no longer
supported.
2026-04-15 15:48:48 +03:00
Gleb Natapov
6a03768b35 fix indent after the previous patch 2026-04-15 15:48:48 +03:00
Gleb Natapov
0ef06a34ed group0: call setup_group0 only when needed
setup_group0 and setup_group0_if_exist have hidden conditions inside that
make them no-ops. It is not clear at the call site that the functions may
do nothing. Change the code to check the conditions at the call site
instead.
2026-04-15 15:48:48 +03:00
Botond Dénes
9999e7b642 test/perf/perf_simple_query: add --collection=N
Defaults to 0. When N > 0, adds a map<blob, blob> collection column to
the schema. Each row will have a collection cell with N elements.
Allows benchmarking collection handling.
2026-04-15 09:46:54 +03:00
Botond Dénes
4ff96f0092 test/boost/frozen_mutation_test: add freeze/unfreeze test for large collections 2026-04-15 09:46:54 +03:00
Botond Dénes
4eef5e1c65 mutation/mutation_partition_view: use read_from_collection_cell_view() to read collections
This cuts back on the number of allocations required for deserializing
collections, from O(num_cells) to O(1).

The visitor now receives an rvalue, so update all callers of
read_and_visit_row(), patching their visitors to take advantage of this
and move the serialized collection instead of copying it.
2026-04-15 09:46:54 +03:00
Botond Dénes
1bb04824a8 mutation/collection_mutation: introduce read_from_collection_cell_view()
Reads a collection_mutation directly from the IDL representation of a
collection. This cuts down the number of allocations required
drastically compared to the current method of:

    IDL -> collection_mutation_description -> collection_mutation

Intended to be used in frozen_mutation::unfreeze() and similar use-cases.
2026-04-15 09:46:54 +03:00
Botond Dénes
5f2c003445 mutation/atomic_cell: atomic_cell_type: add write*() and *serialized_size()
atomic_cell_type has various static make_*() methods which create a
serialized cell based on the parameters. This patch adds write_*()
methods which mirror the existing make_*() ones, with the exception that
the write methods write into caller-provided buffer. The make methods
are refactored to call the appropriate write overload.
*_serialized_size() methods are added as well, to calculate how many
bytes the serialized data will take after the appropriate write call.
This allows code to write cells directly into a pre-arranged buffer,
perhaps even multiple ones into the same one.

Since the intended use-case this patch prepares for is serializing an
entire collection directly into a single buffer, only make variants
which are legal in collections are handled. I.e. counters are not.
2026-04-15 09:46:54 +03:00
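The size-then-write pattern this commit describes can be sketched as below. The cell layout (one flag byte, an 8-byte timestamp, then the value) and all function names are invented for illustration; they are not the real atomic_cell_type format or API.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>
#include <vector>

// Assumed toy layout: 1 flag byte + 8-byte timestamp + value bytes.
constexpr std::size_t cell_header_size = 1 + sizeof(std::int64_t);

// Number of bytes write_live_cell() will produce for this value.
std::size_t live_cell_serialized_size(std::string_view value) {
    return cell_header_size + value.size();
}

// Writes one cell into a caller-provided buffer and returns the position past it.
std::uint8_t* write_live_cell(std::uint8_t* out, std::int64_t timestamp, std::string_view value) {
    *out++ = 0x01;                                   // "live" flag in this toy format
    std::memcpy(out, &timestamp, sizeof(timestamp));
    out += sizeof(timestamp);
    if (!value.empty()) {
        std::memcpy(out, value.data(), value.size());
    }
    return out + value.size();
}

// The make variant is expressed in terms of the write variant.
std::vector<std::uint8_t> make_live_cell(std::int64_t timestamp, std::string_view value) {
    std::vector<std::uint8_t> buf(live_cell_serialized_size(value));
    write_live_cell(buf.data(), timestamp, value);
    return buf;
}

// Serializes several cells back to back into a single allocation.
std::vector<std::uint8_t> pack_cells(const std::vector<std::string_view>& values, std::int64_t timestamp) {
    std::size_t total = 0;
    for (auto v : values) {
        total += live_cell_serialized_size(v);
    }
    std::vector<std::uint8_t> buf(total);
    std::uint8_t* out = buf.data();
    for (auto v : values) {
        out = write_live_cell(out, timestamp, v);
    }
    return buf;
}
```

Computing the total size first is what lets pack_cells() allocate once, instead of once per cell as a make-then-copy approach would.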
Botond Dénes
5b5cb94115 mutation/collection_mutation: generalize serialize_collection_mutation
This is already a template on Iterator, but generalize it further by
adding an Adaptor template which adapts the Iterator::value_type to the
requirements of the method. This allows passing Iterators with
value_type other than atomic_cell[_view].
2026-04-15 09:46:54 +03:00
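A minimal sketch of the Iterator-plus-Adaptor idea: the algorithm stays generic over the iterator, and a small adaptor converts each element into the type the algorithm needs. The names total_size, identity_adaptor and second_of_pair are hypothetical, and std::string stands in for a serialized cell.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Default adaptor: the iterator already yields the required type.
struct identity_adaptor {
    template <typename T>
    const T& operator()(const T& v) const { return v; }
};

// Example adaptor: pick the payload out of a (key, payload) pair.
struct second_of_pair {
    template <typename K, typename V>
    const V& operator()(const std::pair<K, V>& kv) const { return kv.second; }
};

// Generic over both the iterator and the adaptor applied to each element.
template <typename Iterator, typename Adaptor = identity_adaptor>
std::size_t total_size(Iterator begin, Iterator end, Adaptor adapt = {}) {
    std::size_t total = 0;
    for (auto it = begin; it != end; ++it) {
        const std::string& cell = adapt(*it);   // adapted view of the element
        total += cell.size();
    }
    return total;
}
```

Existing callers keep working through the default adaptor, while a container with a different value_type only needs to supply its own adaptor.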
Botond Dénes
17ac9da5d2 mutation/mutation_partition_view: avoid copying collection 2026-04-15 09:46:54 +03:00
Botond Dénes
aab336eb77 mutation/mutation_partition_view: accept collection_mutation in the consume API
Instead of collection_mutation_view. This follows suit with the atomic_cell
overloads, which already accept a value, to allow the caller to move the
value along. The current interface forces collections to be copied.
2026-04-15 09:46:54 +03:00
Botond Dénes
652676e563 partition_builder: add move variant of accept_*_cell() collection overloads
Atomic cell overloads already have it, add it for the collection ones
too. Will be used to help avoid copying collections unnecessarily.
2026-04-15 09:46:53 +03:00
316 changed files with 8326 additions and 3105 deletions

@@ -4,6 +4,8 @@ on:
milestone:
types: [created, closed]
permissions: {}
jobs:
sync-milestone-to-jira:
uses: scylladb/github-automation/.github/workflows/main_sync_milestone_to_jira_release.yml@main

.gitignore

@@ -36,4 +36,6 @@ compile_commands.json
clang_build
.idea/
nuke
rust/target
rust/**/target
rust/**/Cargo.lock
test/resource/wasm/rust/target

@@ -299,6 +299,7 @@ target_sources(scylla-main
serializer.cc
service/direct_failure_detector/failure_detector.cc
sstables_loader.cc
sstables_loader_helpers.cc
table_helper.cc
tasks/task_handler.cc
tasks/task_manager.cc

@@ -247,6 +247,18 @@ bool check_CONTAINS(const rjson::value* v1, const rjson::value& v2, bool v1_from
if (!v1) {
return false;
}
if (!v1->IsObject() || v1->MemberCount() != 1) {
if (v1_from_query) {
throw api_error::serialization("CONTAINS operator encountered malformed AttributeValue");
}
return false;
}
if (!v2.IsObject() || v2.MemberCount() != 1) {
if (v2_from_query) {
throw api_error::serialization("CONTAINS operator encountered malformed AttributeValue");
}
return false;
}
const auto& kv1 = *v1->MemberBegin();
const auto& kv2 = *v2.MemberBegin();
if (kv1.name == "S" && kv2.name == "S") {
@@ -265,9 +277,17 @@ bool check_CONTAINS(const rjson::value* v1, const rjson::value& v2, bool v1_from
}
}
} else if (kv1.name == "L") {
if (!kv1.value.IsArray()) {
if (v1_from_query) {
throw api_error::serialization("CONTAINS operator received a malformed list");
}
return false;
}
for (auto i = kv1.value.Begin(); i != kv1.value.End(); ++i) {
if (!i->IsObject() || i->MemberCount() != 1) {
clogger.error("check_CONTAINS received a list whose element is malformed");
if (v1_from_query) {
throw api_error::serialization("CONTAINS operator received a list whose element is malformed");
}
return false;
}
const auto& el = *i->MemberBegin();
@@ -681,7 +701,7 @@ static bool calculate_primitive_condition(const parsed::primitive_condition& con
case parsed::primitive_condition::type::VALUE:
if (calculated_values.size() != 1) {
// Shouldn't happen unless we have a bug in the parser
throw std::logic_error(format("Unexpected values in primitive_condition", cond._values.size()));
throw std::logic_error(format("Unexpected values {} in primitive_condition", cond._values.size()));
}
// Unwrap the boolean wrapped as the value (if it is a boolean)
if (calculated_values[0].IsObject() && calculated_values[0].MemberCount() == 1) {

@@ -38,6 +38,7 @@ controller::controller(
sharded<auth::service>& auth_service,
sharded<qos::service_level_controller>& sl_controller,
sharded<vector_search::vector_store_client>& vsc,
sharded<updateable_timeout_config>& timeout_config,
const db::config& config,
seastar::scheduling_group sg)
: protocol_server(sg)
@@ -52,6 +53,7 @@ controller::controller(
, _auth_service(auth_service)
, _sl_controller(sl_controller)
, _vsc(vsc)
, _timeout_config(timeout_config)
, _config(config)
{
}
@@ -99,7 +101,7 @@ future<> controller::start_server() {
_executor.start(std::ref(_gossiper), std::ref(_proxy), std::ref(_ss), std::ref(_mm), std::ref(_sys_dist_ks), std::ref(_sys_ks),
sharded_parameter(get_cdc_metadata, std::ref(_cdc_gen_svc)), std::ref(_vsc), _ssg.value(),
sharded_parameter(get_timeout_in_ms, std::ref(_config))).get();
_server.start(std::ref(_executor), std::ref(_proxy), std::ref(_gossiper), std::ref(_auth_service), std::ref(_sl_controller)).get();
_server.start(std::ref(_executor), std::ref(_proxy), std::ref(_gossiper), std::ref(_auth_service), std::ref(_sl_controller), std::ref(_timeout_config)).get();
// Note: from this point on, if start_server() throws for any reason,
// it must first call stop_server() to stop the executor and server
// services we just started - or Scylla will cause an assertion

@@ -48,6 +48,8 @@ namespace vector_search {
class vector_store_client;
}
class updateable_timeout_config;
namespace alternator {
// This is the official DynamoDB API version.
@@ -72,6 +74,7 @@ class controller : public protocol_server {
sharded<auth::service>& _auth_service;
sharded<qos::service_level_controller>& _sl_controller;
sharded<vector_search::vector_store_client>& _vsc;
sharded<updateable_timeout_config>& _timeout_config;
const db::config& _config;
std::vector<socket_address> _listen_addresses;
@@ -92,6 +95,7 @@ public:
sharded<auth::service>& auth_service,
sharded<qos::service_level_controller>& sl_controller,
sharded<vector_search::vector_store_client>& vsc,
sharded<updateable_timeout_config>& timeout_config,
const db::config& config,
seastar::scheduling_group sg);

@@ -1362,6 +1362,33 @@ static int get_dimensions(const rjson::value& vector_attribute, std::string_view
return dimensions_v->GetInt();
}
// As noted in issue #5052, in Alternator the CreateTable and UpdateTable are
// currently synchronous - they return only after the operation is complete.
// After announce() of the new schema finished, the schema change is committed
// and a majority of nodes know it - but it's possible that some live nodes
// have not yet applied the new schema. If we return to the user now, and the
// user sends a node request that relies on the new schema, it might fail.
// So before returning, we must verify that *all* nodes have applied the new
// schema. This is what wait_for_schema_agreement_after_ddl() does.
//
// Note that wait_for_schema_agreement_after_ddl() has a timeout (currently
// hard-coded to 30 seconds). If the timeout is reached an InternalServerError
// is returned. The user, who doesn't know if the CreateTable succeeded or not,
// can retry the request and will get a ResourceInUseException and know the
// table already exists. So a CreateTable that returns a ResourceInUseException
// should also call wait_for_schema_agreement_after_ddl().
//
// When issue #5052 is resolved, this function can be removed - we will need
// to check if we reached schema agreement, but not to *wait* for it.
static future<> wait_for_schema_agreement_after_ddl(service::migration_manager& mm, const replica::database& db) {
static constexpr auto schema_agreement_seconds = 30;
try {
co_await mm.wait_for_schema_agreement(db, db::timeout_clock::now() + std::chrono::seconds(schema_agreement_seconds), nullptr);
} catch (const service::migration_manager::schema_agreement_timeout&) {
throw api_error::internal(fmt::format("The operation was successful, but unable to confirm cluster-wide schema agreement after {} seconds. Please retry the operation, and wait for the retry to report an error since the operation was already done.", schema_agreement_seconds));
}
}
future<executor::request_return_type> executor::create_table_on_shard0(service::client_state&& client_state, tracing::trace_state_ptr trace_state, rjson::value request, bool enforce_authorization, bool warn_authorization,
const db::tablets_mode_t::mode tablets_mode, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
throwing_assert(this_shard_id() == 0);
@@ -1695,13 +1722,26 @@ future<executor::request_return_type> executor::create_table_on_shard0(service::
}
}
}
bool table_already_exists = false;
try {
schema_mutations = service::prepare_new_keyspace_announcement(_proxy.local_db(), ksm, ts);
} catch (exceptions::already_exists_exception&) {
if (_proxy.data_dictionary().has_schema(keyspace_name, table_name)) {
co_return api_error::resource_in_use(fmt::format("Table {} already exists", table_name));
table_already_exists = true;
}
}
if (table_already_exists) {
// The user may have retried a CreateTable operation after it timed
// out in wait_for_schema_agreement_after_ddl(). So before we may
// return ResourceInUseException (which can lead the user to start
// using the table which it now knows exists), we need to wait for
// schema agreement, just like the original CreateTable did. Again
// we fail with InternalServerError if schema agreement still cannot
// be reached. We can release group0_guard before waiting.
release_guard(std::move(group0_guard));
co_await wait_for_schema_agreement_after_ddl(_mm, _proxy.local_db());
co_return api_error::resource_in_use(fmt::format("Table {} already exists", table_name));
}
if (_proxy.data_dictionary().try_find_table(schema->id())) {
// This should never happen, the ID is supposed to be unique
co_return api_error::internal(format("Table with ID {} already exists", schema->id()));
@@ -1750,7 +1790,7 @@ future<executor::request_return_type> executor::create_table_on_shard0(service::
}
}
co_await _mm.wait_for_schema_agreement(_proxy.local_db(), db::timeout_clock::now() + 10s, nullptr);
co_await wait_for_schema_agreement_after_ddl(_mm, _proxy.local_db());
rjson::value status = rjson::empty_object();
executor::supplement_table_info(request, *schema, _proxy);
rjson::add(status, "TableDescription", std::move(request));
@@ -1860,7 +1900,7 @@ future<executor::request_return_type> executor::update_table(client_state& clien
rjson::value* stream_specification = rjson::find(request, "StreamSpecification");
if (stream_specification && stream_specification->IsObject()) {
empty_request = false;
if (add_stream_options(*stream_specification, builder, p.local())) {
if (add_stream_options(*stream_specification, builder, p.local(), tab->cdc_options())) {
validate_cdc_log_name_length(builder.cf_name());
// On tablet tables, defer stream enablement and block
// tablet merges (see defer_enabling_streams_block_tablet_merges).
@@ -1875,6 +1915,23 @@ future<executor::request_return_type> executor::update_table(client_state& clien
if (tab->cdc_options().enabled() || tab->cdc_options().enable_requested()) {
co_return api_error::validation("Table already has an enabled stream: TableName: " + tab->cf_name());
}
// When re-enabling streams on an Alternator table, drop the old
// CDC log table first as a separate schema change, so the
// subsequent UpdateTable creates a fresh one with a new UUID
// (= new StreamArn). See #7239.
auto logname = cdc::log_name(tab->cf_name());
auto& local_db = p.local().local_db();
if (local_db.has_schema(tab->ks_name(), logname)
&& cdc::is_log_schema(*local_db.find_schema(tab->ks_name(), logname))) {
auto drop_m = co_await service::prepare_column_family_drop_announcement(
p.local(), tab->ks_name(), logname,
group0_guard.write_timestamp());
co_await mm.announce(std::move(drop_m), std::move(group0_guard),
format("alternator-executor: drop old CDC log for {}", tab->cf_name()));
co_await mm.wait_for_schema_agreement(
p.local().local_db(), db::timeout_clock::now() + 10s, nullptr);
continue;
}
}
else if (!tab->cdc_options().enabled() && !tab->cdc_options().enable_requested()) {
co_return api_error::validation("Table has no stream to disable: TableName: " + tab->cf_name());
@@ -2189,7 +2246,7 @@ future<executor::request_return_type> executor::update_table(client_state& clien
throw;
}
}
co_await mm.wait_for_schema_agreement(p.local().local_db(), db::timeout_clock::now() + 10s, nullptr);
co_await wait_for_schema_agreement_after_ddl(mm, p.local().local_db());
rjson::value status = rjson::empty_object();
supplement_table_info(request, *schema, p.local());

@@ -30,6 +30,7 @@
#include "utils/updateable_value.hh"
#include "tracing/trace_state.hh"
#include "cdc/cdc_options.hh"
namespace db {
@@ -199,7 +200,7 @@ private:
tracing::trace_state_ptr trace_state, service_permit permit);
public:
static bool add_stream_options(const rjson::value& stream_spec, schema_builder&, service::storage_proxy& sp);
static bool add_stream_options(const rjson::value& stream_spec, schema_builder&, service::storage_proxy& sp, const cdc::options& existing_cdc_opts = {});
static void supplement_table_info(rjson::value& descr, const schema& schema, service::storage_proxy& sp);
static void supplement_table_stream_info(rjson::value& descr, const schema& schema, const service::storage_proxy& sp);
};

@@ -485,7 +485,7 @@ std::optional<bytes> unwrap_bytes(const rjson::value& value, bool from_query) {
return rjson::base64_decode(value);
} catch (...) {
if (from_query) {
throw api_error::serialization(format("Invalid base64 data"));
throw api_error::serialization("Invalid base64 data");
}
return std::nullopt;
}

@@ -835,7 +835,7 @@ void server::set_routes(routes& r) {
//FIXME: A way to immediately invalidate the cache should be considered,
// e.g. when the system table which stores the keys is changed.
// For now, this propagation may take up to 1 minute.
server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gossiper, auth::service& auth_service, qos::service_level_controller& sl_controller)
server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gossiper, auth::service& auth_service, qos::service_level_controller& sl_controller, updateable_timeout_config& timeout_config)
: _http_server("http-alternator")
, _https_server("https-alternator")
, _executor(exec)
@@ -847,7 +847,7 @@ server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gos
, _max_users_query_size_in_trace_output(1024)
, _enabled_servers{}
, _pending_requests("alternator::server::pending_requests")
, _timeout_config(_proxy.data_dictionary().get_config())
, _timeout_config(timeout_config)
, _callbacks{
{"CreateTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.create_table(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);

@@ -16,6 +16,7 @@
#include <seastar/net/tls.hh>
#include <optional>
#include "alternator/auth.hh"
#include "timeout_config.hh"
#include "service/qos/service_level_controller.hh"
#include "utils/small_vector.hh"
#include "utils/updateable_value.hh"
@@ -53,8 +54,8 @@ class server : public peering_sharded_service<server> {
named_gate _pending_requests;
// In some places we will need a CQL updateable_timeout_config object even
// though it isn't really relevant for Alternator which defines its own
// timeouts separately. We can create this object only once.
updateable_timeout_config _timeout_config;
// timeouts separately.
updateable_timeout_config& _timeout_config;
client_options_cache_type _connection_options_keys_and_values;
alternator_callbacks_map _callbacks;
@@ -98,7 +99,7 @@ class server : public peering_sharded_service<server> {
utils::scoped_item_list<ongoing_request> _ongoing_requests;
public:
server(executor& executor, service::storage_proxy& proxy, gms::gossiper& gossiper, auth::service& service, qos::service_level_controller& sl_controller);
server(executor& executor, service::storage_proxy& proxy, gms::gossiper& gossiper, auth::service& service, qos::service_level_controller& sl_controller, updateable_timeout_config& timeout_config);
future<> init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port,
std::optional<uint16_t> port_proxy_protocol, std::optional<uint16_t> https_port_proxy_protocol,

@@ -243,7 +243,10 @@ future<alternator::executor::request_return_type> alternator::executor::list_str
if (!is_alternator_keyspace(ks_name)) {
continue;
}
if (cdc::is_log_for_some_table(db.real_database(), ks_name, cf_name)) {
// Use get_base_table instead of is_log_for_some_table because the
// latter requires CDC to be enabled, but we want to list streams
// that have been disabled but whose log table still exists (#7239).
if (cdc::get_base_table(db.real_database(), ks_name, cf_name)) {
rjson::value new_entry = rjson::empty_object();
auto arn = stream_arn{ i->schema(), cdc::get_base_table(db.real_database(), *i->schema()) };
@@ -392,7 +395,7 @@ std::istream& operator>>(std::istream& is, stream_view_type& type) {
return is;
}
static stream_view_type cdc_options_to_steam_view_type(const cdc::options& opts) {
static stream_view_type cdc_options_to_stream_view_type(const cdc::options& opts) {
stream_view_type type = stream_view_type::KEYS_ONLY;
if (opts.preimage() && opts.postimage()) {
type = stream_view_type::NEW_AND_OLD_IMAGES;
@@ -838,6 +841,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
auto& opts = bs->cdc_options();
auto status = "DISABLED";
bool stream_disabled = !opts.enabled();
if (opts.enabled()) {
if (!_cdc_metadata.streams_available()) {
@@ -853,7 +857,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
rjson::add(stream_desc, "StreamStatus", rjson::from_string(status));
stream_view_type type = cdc_options_to_steam_view_type(opts);
stream_view_type type = cdc_options_to_stream_view_type(opts);
rjson::add(stream_desc, "StreamArn", stream_arn);
rjson::add(stream_desc, "StreamViewType", type);
@@ -861,10 +865,9 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
describe_key_schema(stream_desc, *bs);
if (!opts.enabled()) {
rjson::add(ret, "StreamDescription", std::move(stream_desc));
co_return rjson::print(std::move(ret));
}
// For disabled streams, we still fall through to enumerate shards
// below. All shards will have EndingSequenceNumber set, indicating
// they are closed. See issue #7239.
// TODO: label
// TODO: creation time
@@ -947,6 +950,12 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
auto expired = [&]() -> std::optional<db_clock::time_point> {
auto j = std::next(i);
if (j == e) {
// For a disabled stream, all shards are closed (#7239).
// Use "now" as the ending sequence number for the last
// generation's shards.
if (stream_disabled) {
return db_clock::now();
}
return std::nullopt;
}
// add this so we sort of match potential
@@ -1297,7 +1306,7 @@ future<executor::request_return_type> executor::get_records(client_state& client
| std::ranges::to<query::column_id_vector>()
;
stream_view_type type = cdc_options_to_steam_view_type(base->cdc_options());
stream_view_type type = cdc_options_to_stream_view_type(base->cdc_options());
auto selection = cql3::selection::selection::for_columns(schema, std::move(columns));
auto partition_slice = query::partition_slice(
@@ -1481,17 +1490,17 @@ future<executor::request_return_type> executor::get_records(client_state& client
auto& shard = iter.shard;
if (shard.time < ts && ts < high_ts) {
if (!base->cdc_options().enabled()) {
// Stream is disabled -- all shards are closed (#7239).
// Don't return NextShardIterator.
} else if (shard.time < ts && ts < high_ts) {
// The DynamoDB documentation states that when a shard is
// closed, reading it until the end has NextShardIterator
// "set to null". Our test test_streams_closed_read
// confirms that by "null" they meant not set at all.
} else {
// We could have return the same iterator again, but we did
// a search from it until high_ts and found nothing, so we
// can also start the next search from high_ts.
// TODO: but why? It's simpler just to leave the iterator be.
shard_iterator next_iter(iter.table, iter.shard, utils::UUID_gen::min_time_UUID(high_ts.time_since_epoch()), true);
// Shard is still open with no records in the scanned window.
// Return the original iterator so the client can poll again.
rjson::add(ret, "NextShardIterator", iter);
}
_stats.api_operations.get_records_latency.mark(std::chrono::steady_clock::now() - start_time);
@@ -1501,17 +1510,13 @@ future<executor::request_return_type> executor::get_records(client_state& client
co_return rjson::print(std::move(ret));
}
bool executor::add_stream_options(const rjson::value& stream_specification, schema_builder& builder, service::storage_proxy& sp) {
bool executor::add_stream_options(const rjson::value& stream_specification, schema_builder& builder, service::storage_proxy& sp, const cdc::options& existing_cdc_opts) {
auto stream_enabled = rjson::find(stream_specification, "StreamEnabled");
if (!stream_enabled || !stream_enabled->IsBool()) {
throw api_error::validation("StreamSpecification needs boolean StreamEnabled");
}
if (stream_enabled->GetBool()) {
if (!sp.features().alternator_streams) {
throw api_error::validation("StreamSpecification: alternator streams feature not enabled in cluster.");
}
cdc::options opts;
opts.enabled(true);
opts.tablet_merge_blocked(true);
@@ -1537,8 +1542,13 @@ bool executor::add_stream_options(const rjson::value& stream_specification, sche
builder.with_cdc_options(opts);
return true;
} else {
cdc::options opts;
// When disabling, preserve the existing CDC options (preimage,
// postimage, ttl, etc.) so that DescribeStream can still report
// the correct StreamViewType on a disabled stream.
cdc::options opts = existing_cdc_opts;
opts.enabled(false);
opts.enable_requested(false);
opts.tablet_merge_blocked(false);
builder.with_cdc_options(opts);
return false;
}
@@ -1546,33 +1556,36 @@ bool executor::add_stream_options(const rjson::value& stream_specification, sche
void executor::supplement_table_stream_info(rjson::value& descr, const schema& schema, const service::storage_proxy& sp) {
auto& opts = schema.cdc_options();
if (opts.enabled()) {
auto db = sp.data_dictionary();
auto cf = db.find_table(schema.ks_name(), cdc::log_name(schema.cf_name()));
stream_arn arn(cf.schema(), cdc::get_base_table(db.real_database(), *cf.schema()));
// Report stream info when:
// 1. Log table exists (covers both enabled and disabled-but-readable).
// 2. enable_requested (ENABLING state, log not yet created).
auto db = sp.data_dictionary();
auto log_name = cdc::log_name(schema.cf_name());
auto log_cf = db.try_find_table(schema.ks_name(), log_name);
if (log_cf) {
auto log_schema = log_cf->schema();
stream_arn arn(log_schema, cdc::get_base_table(db.real_database(), *log_schema));
rjson::add(descr, "LatestStreamArn", arn);
rjson::add(descr, "LatestStreamLabel", rjson::from_string(stream_label(*cf.schema())));
} else if (!opts.enable_requested()) {
return;
}
// For both enabled() and enable_requested():
// DynamoDB returns StreamEnabled=true in StreamSpecification even when
// the stream status is ENABLING (not yet fully active). We mirror this
// behavior: enable_requested means the user asked for streams but CDC
// is not yet finalized, so we still report StreamEnabled=true.
auto stream_desc = rjson::empty_object();
rjson::add(stream_desc, "StreamEnabled", true);
rjson::add(descr, "LatestStreamLabel", rjson::from_string(stream_label(*log_schema)));
auto mode = stream_view_type::KEYS_ONLY;
if (opts.preimage() && opts.postimage()) {
mode = stream_view_type::NEW_AND_OLD_IMAGES;
} else if (opts.preimage()) {
mode = stream_view_type::OLD_IMAGE;
} else if (opts.postimage()) {
mode = stream_view_type::NEW_IMAGE;
auto stream_desc = rjson::empty_object();
rjson::add(stream_desc, "StreamEnabled", opts.enabled());
stream_view_type mode = cdc_options_to_stream_view_type(opts);
rjson::add(stream_desc, "StreamViewType", mode);
rjson::add(descr, "StreamSpecification", std::move(stream_desc));
} else if (opts.enable_requested()) {
// DynamoDB returns StreamEnabled=true in StreamSpecification even when
// the stream status is ENABLING (not yet fully active). We mirror this
// behavior: enable_requested means the user asked for streams but CDC
// is not yet finalized, so we still report StreamEnabled=true.
auto stream_desc = rjson::empty_object();
rjson::add(stream_desc, "StreamEnabled", true);
stream_view_type mode = cdc_options_to_stream_view_type(opts);
rjson::add(stream_desc, "StreamViewType", mode);
rjson::add(descr, "StreamSpecification", std::move(stream_desc));
}
rjson::add(stream_desc, "StreamViewType", mode);
rjson::add(descr, "StreamSpecification", std::move(stream_desc));
}
} // namespace alternator

@@ -974,6 +974,54 @@
}
]
},
{
"path":"/storage_service/tablets/restore",
"operations":[
{
"method":"POST",
"summary":"Starts copying SSTables from a designated bucket in object storage to a specified keyspace",
"type":"string",
"nickname":"tablet_aware_restore",
"produces":[
"application/json"
],
"parameters":[
{
"name":"keyspace",
"description":"Name of a keyspace to copy SSTables to",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"table",
"description":"Name of a table to copy SSTables to",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"snapshot",
"description":"Name of the snapshot to restore from",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"backup_location",
"description":"JSON array of backup location objects. Each object must contain: 'datacenter' (string), 'endpoint' (string), 'bucket' (string), and 'manifests' (array of strings). Currently, the array must contain exactly one entry.",
"required":true,
"allowMultiple":false,
"type":"array",
"paramType":"body"
}
]
}
]
},
{
"path":"/storage_service/keyspace_compaction/{keyspace}",
"operations":[

@@ -527,11 +527,56 @@ void set_sstables_loader(http_context& ctx, routes& r, sharded<sstables_loader>&
co_return json::json_return_type(fmt::to_string(task_id));
});
ss::tablet_aware_restore.set(r, [&ctx, &sst_loader](std::unique_ptr<http::request> req) -> future<json_return_type> {
std::string keyspace = req->get_query_param("keyspace");
std::string table = req->get_query_param("table");
std::string snapshot = req->get_query_param("snapshot");
rjson::chunked_content content = co_await util::read_entire_stream(*req->content_stream);
rjson::value parsed = rjson::parse(std::move(content));
if (!parsed.IsArray()) {
throw httpd::bad_param_exception("backup locations (in body) must be a JSON array");
}
const auto& locations = parsed.GetArray();
if (locations.Size() != 1) {
throw httpd::bad_param_exception("backup locations array (in body) must contain exactly one entry");
}
const auto& location = locations[0];
if (!location.IsObject()) {
throw httpd::bad_param_exception("backup location (in body) must be a JSON object");
}
auto endpoint = rjson::to_string_view(location["endpoint"]);
auto bucket = rjson::to_string_view(location["bucket"]);
auto dc = rjson::to_string_view(location["datacenter"]);
if (!location.HasMember("manifests") || !location["manifests"].IsArray()) {
throw httpd::bad_param_exception("backup location entry must have 'manifests' array");
}
auto manifests = location["manifests"].GetArray() |
std::views::transform([] (const auto& m) { return sstring(rjson::to_string_view(m)); }) |
std::ranges::to<utils::chunked_vector<sstring>>();
if (manifests.empty()) {
throw httpd::bad_param_exception("backup location 'manifests' array must not be empty");
}
apilog.info("Tablet restore for {}:{} called. Parameters: snapshot={} datacenter={} endpoint={} bucket={} manifests_count={}",
keyspace, table, snapshot, dc, endpoint, bucket, manifests.size());
auto table_id = validate_table(ctx.db.local(), keyspace, table);
auto task_id = co_await sst_loader.local().restore_tablets(table_id, keyspace, table, snapshot, sstring(endpoint), sstring(bucket), std::move(manifests));
co_return json::json_return_type(fmt::to_string(task_id));
});
}
void unset_sstables_loader(http_context& ctx, routes& r) {
ss::load_new_ss_tables.unset(r);
ss::start_restore.unset(r);
ss::tablet_aware_restore.unset(r);
}
void set_view_builder(http_context& ctx, routes& r, sharded<db::view::view_builder>& vb, sharded<gms::gossiper>& g) {
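
The handler above reads `keyspace`, `table` and `snapshot` from the query string and expects the body to be a JSON array with exactly one location object whose `manifests` array is non-empty. A minimal client-side sketch of a request that satisfies those checks follows. The helper name `build_restore_request` and the sample values are hypothetical; only the path, the parameter names and the body fields come from the diff.

```python
import json
from urllib.parse import urlencode


def build_restore_request(keyspace, table, snapshot,
                          datacenter, endpoint, bucket, manifests):
    """Return (path_with_query, json_body) for the tablet-aware restore call.

    Mirrors the server-side checks shown in the diff: one location object,
    with a non-empty 'manifests' array.
    """
    manifests = list(manifests)
    if not manifests:
        raise ValueError("'manifests' must not be empty")
    query = urlencode({"keyspace": keyspace, "table": table, "snapshot": snapshot})
    # The handler rejects anything but a one-element array of objects.
    body = json.dumps([{
        "datacenter": datacenter,
        "endpoint": endpoint,
        "bucket": bucket,
        "manifests": manifests,
    }])
    return f"/storage_service/tablets/restore?{query}", body
```

The returned pair would be sent as a POST by whatever HTTP client is in use; the response is the task id as a string, per the `co_return` in the handler.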

@@ -194,22 +194,36 @@ future<> audit::start_audit(const db::config& cfg, sharded<locator::shared_token
std::move(audited_keyspaces),
std::move(audited_tables),
std::move(audited_categories),
std::cref(cfg))
.then([&cfg] {
if (!audit_instance().local_is_initialized()) {
return make_ready_future<>();
}
return audit_instance().invoke_on_all([&cfg] (audit& local_audit) {
return local_audit.start(cfg);
std::cref(cfg));
}
future<> audit::start_storage(const db::config& cfg) {
if (!audit_instance().local_is_initialized()) {
return make_ready_future<>();
}
return audit_instance().invoke_on_all([&cfg] (audit& local_audit) {
return local_audit._storage_helper_ptr->start(cfg).then([&local_audit] {
local_audit._storage_running = true;
});
});
}
future<> audit::stop_storage() {
if (!audit_instance().local_is_initialized()) {
return make_ready_future<>();
}
return audit_instance().invoke_on_all([] (audit& local_audit) {
local_audit._storage_running = false;
return local_audit._storage_helper_ptr->stop();
});
}
future<> audit::stop_audit() {
if (!audit_instance().local_is_initialized()) {
return make_ready_future<>();
}
return audit::audit::audit_instance().invoke_on_all([] (auto& local_audit) {
SCYLLA_ASSERT(!local_audit._storage_running);
return local_audit.shutdown();
}).then([] {
return audit::audit::audit_instance().stop();
@@ -223,14 +237,6 @@ audit_info_ptr audit::create_audit_info(statement_category cat, const sstring& k
return std::make_unique<audit_info>(cat, keyspace, table, batch);
}
future<> audit::start(const db::config& cfg) {
return _storage_helper_ptr->start(cfg);
}
future<> audit::stop() {
return _storage_helper_ptr->stop();
}
future<> audit::shutdown() {
return make_ready_future<>();
}
@@ -241,6 +247,12 @@ future<> audit::log(const audit_info& audit_info, const service::client_state& c
const sstring& username = client_state.user() ? client_state.user()->name.value_or(anonymous_username) : no_username;
socket_address client_ip = client_state.get_client_address().addr();
socket_address node_ip = _token_metadata.get()->get_topology().my_address().addr();
if (!_storage_running) {
on_internal_error_noexcept(logger, fmt::format("Audit log dropped (storage not ready): node_ip {} category {} cl {} error {} keyspace {} query '{}' client_ip {} table {} username {}",
node_ip, audit_info.category_string(), cl, error, audit_info.keyspace(),
audit_info.query(), client_ip, audit_info.table(), username));
return make_ready_future<>();
}
if (logger.is_enabled(logging::log_level::debug)) {
logger.debug("Log written: node_ip {} category {} cl {} error {} keyspace {} query '{}' client_ip {} table {} username {}",
node_ip, audit_info.category_string(), cl, error, audit_info.keyspace(),
@@ -286,6 +298,11 @@ future<> inspect(const audit_info_alternator& ai, const service::client_state& c
future<> audit::log_login(const sstring& username, socket_address client_ip, bool error) noexcept {
socket_address node_ip = _token_metadata.get()->get_topology().my_address().addr();
if (!_storage_running) {
on_internal_error_noexcept(logger, fmt::format("Audit login log dropped (storage not ready): node_ip {} client_ip {} username {} error {}",
node_ip, client_ip, username, error ? "true" : "false"));
return make_ready_future<>();
}
if (logger.is_enabled(logging::log_level::debug)) {
logger.debug("Login log written: node_ip {}, client_ip {}, username {}, error {}",
node_ip, client_ip, username, error ? "true" : "false");
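
The audit change above splits storage start/stop from audit start/stop and guards logging with a `_storage_running` flag. A rough single-shard model of that lifecycle follows, as a sketch only: the class and attribute names are hypothetical, and the real code reports dropped entries through `on_internal_error_noexcept` rather than collecting them in a list.

```python
class AuditLifecycleSketch:
    """Toy model of the storage_running guard from the diff (one shard)."""

    def __init__(self):
        self.storage_running = False
        self.written = []   # entries that reached storage
        self.dropped = []   # entries logged while storage was not ready

    def start_storage(self):
        # In the diff the flag is set only after the storage helper has started.
        self.storage_running = True

    def stop_storage(self):
        # In the diff the flag is cleared before the storage helper is stopped.
        self.storage_running = False

    def log(self, entry):
        if not self.storage_running:
            self.dropped.append(entry)
            return
        self.written.append(entry)

    def stop_audit(self):
        # Mirrors SCYLLA_ASSERT(!local_audit._storage_running).
        assert not self.storage_running, "stop_storage() must run before stop_audit()"
```

The ordering constraint this models is that `stop_storage()` has to precede `stop_audit()`, and that any log call outside the start/stop window is dropped instead of reaching a storage helper that is not running.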

@@ -141,6 +141,7 @@ private:
category_set _audited_categories;
std::unique_ptr<storage_helper> _storage_helper_ptr;
bool _storage_running = false;
const db::config& _cfg;
utils::observer<sstring> _cfg_keyspaces_observer;
@@ -163,6 +164,8 @@ public:
return audit_instance().local();
}
static future<> start_audit(const db::config& cfg, sharded<locator::shared_token_metadata>& stm, sharded<cql3::query_processor>& qp, sharded<service::migration_manager>& mm);
static future<> start_storage(const db::config& cfg);
static future<> stop_storage();
static future<> stop_audit();
static audit_info_ptr create_audit_info(statement_category cat, const sstring& keyspace, const sstring& table, bool batch = false);
audit(locator::shared_token_metadata& stm,
@@ -174,8 +177,6 @@ public:
category_set&& audited_categories,
const db::config& cfg);
~audit();
future<> start(const db::config& cfg);
future<> stop();
future<> shutdown();
bool should_log(const audit_info& audit_info) const;
bool will_log(statement_category cat, std::string_view keyspace = {}, std::string_view table = {}) const;

@@ -185,24 +185,14 @@ future<lw_shared_ptr<cache::role_record>> cache::fetch_role(const role_name_t& r
static const sstring q = format("SELECT role, name, value FROM {}.{} WHERE role = ?", db::system_keyspace::NAME, ROLE_ATTRIBUTES_CF);
auto rs = co_await fetch(q);
for (const auto& r : *rs) {
if (!r.has("value")) {
continue;
}
rec->attributes[r.get_as<sstring>("name")] =
r.get_as<sstring>("value");
co_await coroutine::maybe_yield();
}
}
// permissions
{
static const sstring q = format("SELECT role, resource, permissions FROM {}.{} WHERE role = ?", db::system_keyspace::NAME, PERMISSIONS_CF);
auto rs = co_await fetch(q);
for (const auto& r : *rs) {
auto resource = r.get_as<sstring>("resource");
auto perms_strings = r.get_set<sstring>("permissions");
std::unordered_set<sstring> perms_set(perms_strings.begin(), perms_strings.end());
auto pset = permissions::from_strings(perms_set);
rec->permissions[std::move(resource)] = std::move(pset);
co_await coroutine::maybe_yield();
}
}
co_return rec;
}

@@ -44,7 +44,6 @@ public:
std::unordered_set<role_name_t> members;
sstring salted_hash;
std::unordered_map<sstring, sstring, sstring_hash, sstring_eq> attributes;
std::unordered_map<sstring, permission_set, sstring_hash, sstring_eq> permissions;
private:
friend cache;
// cached permissions include effects of role's inheritance

@@ -76,7 +76,11 @@ default_authorizer::authorize(const role_or_anonymous& maybe_role, const resourc
if (results->empty()) {
co_return permissions::NONE;
}
co_return permissions::from_strings(results->one().get_set<sstring>(PERMISSIONS_NAME));
const auto& row = results->one();
if (!row.has(PERMISSIONS_NAME)) {
co_return permissions::NONE;
}
co_return permissions::from_strings(row.get_set<sstring>(PERMISSIONS_NAME));
}
future<>

@@ -258,13 +258,11 @@ future<> ldap_role_manager::start() {
} catch (const seastar::sleep_aborted&) {
co_return; // ignore
}
co_await _cache.container().invoke_on_all([] (cache& c) -> future<> {
try {
co_await c.reload_all_permissions();
} catch (...) {
mylog.warn("Cache reload all permissions failed: {}", std::current_exception());
}
});
try {
co_await _cache.reload_all_permissions();
} catch (...) {
mylog.warn("Cache reload all permissions failed: {}", std::current_exception());
}
}
});
return _std_mgr.start();

@@ -157,15 +157,12 @@ future<> service::start(::service::migration_manager& mm, db::system_keyspace& s
return create_legacy_keyspace_if_missing(mm);
});
}
co_await _role_manager->start();
if (this_shard_id() == 0) {
// Role manager and password authenticator have this odd startup
// mechanism where they asynchronously create the superuser role
// in the background. Correct password creation depends on role
// creation therefore we need to wait here.
co_await _role_manager->ensure_superuser_is_created();
}
co_await when_all_succeed(_authorizer->start(), _authenticator->start()).discard_result();
// Authorizer must be started before the permission loader is set,
// because the loader calls _authorizer->authorize().
// The loader must be set before starting the role manager, because
// LDAP role manager starts a pruner fiber that calls
// reload_all_permissions() which asserts _permission_loader is set.
co_await _authorizer->start();
if (!_used_by_maintenance_socket) {
// Maintenance socket mode can't cache permissions because it has
// different authorizer. We can't mix cached permissions, they could be
@@ -174,12 +171,27 @@ future<> service::start(::service::migration_manager& mm, db::system_keyspace& s
&service::get_uncached_permissions,
this, std::placeholders::_1, std::placeholders::_2));
}
co_await _role_manager->start();
if (this_shard_id() == 0) {
// Role manager and password authenticator have this odd startup
// mechanism where they asynchronously create the superuser role
// in the background. Correct password creation depends on role
// creation therefore we need to wait here.
co_await _role_manager->ensure_superuser_is_created();
}
// Authenticator must be started after ensure_superuser_is_created()
// because password_authenticator queries system.roles for the
// superuser entry created by the role manager.
co_await _authenticator->start();
}
future<> service::stop() {
_as.request_abort();
// Reverse of start() order.
co_await _authenticator->stop();
co_await _role_manager->stop();
_cache.set_permission_loader(nullptr);
return when_all_succeed(_role_manager->stop(), _authorizer->stop(), _authenticator->stop()).discard_result();
co_await _authorizer->stop();
}
future<> service::ensure_superuser_is_created() {
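
The comments in this hunk describe a strict start order and a reversed stop order. The sequence can be written out as plain data, which makes the dependency chain easier to check. This is a sketch: the step labels are illustrative strings, not real function names, and `cache.clear_permission_loader` stands in for `_cache.set_permission_loader(nullptr)`.

```python
def start_order(used_by_maintenance_socket=False, shard_id=0):
    """Order of steps in service::start() after the change, per the diff."""
    steps = ["authorizer.start"]
    if not used_by_maintenance_socket:
        # The loader calls the authorizer, so it is set only after that started.
        steps.append("cache.set_permission_loader")
    # The LDAP role manager may call reload_all_permissions(), which needs the loader.
    steps.append("role_manager.start")
    if shard_id == 0:
        steps.append("role_manager.ensure_superuser_is_created")
    # The password authenticator reads the superuser row created above.
    steps.append("authenticator.start")
    return steps


def stop_order():
    """Order of steps in service::stop() after the change (reverse of start)."""
    return [
        "authenticator.stop",
        "role_manager.stop",
        "cache.clear_permission_loader",
        "authorizer.stop",
    ]
```

Compared with the removed code, which stopped all three components concurrently via `when_all_succeed`, the new `stop()` is sequential, so the authorizer outlives everything that may still call into it.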

@@ -267,7 +267,7 @@ struct extract_row_visitor {
visit_collection(v);
},
[&] (const abstract_type& o) {
throw std::runtime_error(format("extract_changes: unknown collection type:", o.name()));
throw std::runtime_error(format("extract_changes: unknown collection type: {}", o.name()));
}
));
}
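
The one-line fix above adds the missing `{}` placeholder so that the type name actually appears in the error message. The same class of bug can be shown with Python's `str.format`, which ignores surplus positional arguments without complaint. This is only an analogy; how the C++ `format` helper treats the surplus argument is not shown in the diff.

```python
def describe_unknown_type(type_name, with_placeholder=True):
    """Build the error text with or without the '{}' placeholder."""
    template = "extract_changes: unknown collection type: {}" if with_placeholder \
        else "extract_changes: unknown collection type:"
    # Without a placeholder the argument is accepted but never rendered.
    return template.format(type_name)
```

Here `describe_unknown_type("my_type", with_placeholder=False)` returns the message with nothing after the colon, which is why such a mistake tends to go unnoticed until someone needs the missing detail.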

@@ -137,6 +137,24 @@ endfunction()
option(Scylla_WITH_DEBUG_INFO "Enable debug info" OFF)
# Time trace profiling: adds -ftime-trace to all C++ compilations (Clang only).
# Each .o produces a companion .json file in the build directory that can be
# analyzed with ClangBuildAnalyzer or loaded in chrome://tracing.
#
# Usage:
# cmake -DScylla_TIME_TRACE=ON ...
# ninja
# # Analyze results (requires ClangBuildAnalyzer):
# ClangBuildAnalyzer --all <build-dir> capture.bin
# ClangBuildAnalyzer --analyze capture.bin
option(Scylla_TIME_TRACE "Enable Clang -ftime-trace for build profiling" OFF)
if(Scylla_TIME_TRACE)
if(NOT CMAKE_CXX_COMPILER_ID STREQUAL "Clang")
message(FATAL_ERROR "Scylla_TIME_TRACE requires Clang (found ${CMAKE_CXX_COMPILER_ID})")
endif()
add_compile_options(-ftime-trace)
endif()
macro(update_build_flags config)
cmake_parse_arguments (
parsed_args

@@ -1088,7 +1088,7 @@ void compaction_manager::register_metrics() {
sm::make_gauge("normalized_backlog", [this] { return _last_backlog / available_memory(); },
sm::description("Holds the sum of normalized compaction backlog for all tables in the system. Backlog is normalized by dividing backlog by shard's available memory.")),
sm::make_counter("validation_errors", [this] { return _validation_errors; },
sm::description("Holds the number of encountered validation errors.")),
sm::description("Holds the number of encountered validation errors.")).set_skip_when_empty(),
});
}

@@ -285,8 +285,12 @@ def generate_compdb(compdb, ninja, buildfile, modes):
os.symlink(compdb_target, compdb)
except FileExistsError:
# if there is already a valid compile_commands.json link in the
# source root, we are done.
pass
# source root, we are done. if it's a stale link, update it.
if os.path.islink(compdb):
current_target = os.readlink(compdb)
if not os.path.exists(current_target):
os.unlink(compdb)
os.symlink(compdb_target, compdb)
return
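
The hunk above replaces a stale `compile_commands.json` symlink instead of silently keeping it. A standalone sketch of the same idea follows, with one deliberate difference that should be read as an assumption rather than the author's method: it tests `os.path.exists(link)`, which follows the link itself, instead of `os.path.exists(os.readlink(link))`, which would resolve a relative link target against the current working directory. The function name `refresh_symlink` is hypothetical.

```python
import os


def refresh_symlink(target, link):
    """Create link -> target; if a dangling symlink is in the way, replace it."""
    try:
        os.symlink(target, link)
    except FileExistsError:
        # os.path.exists() follows symlinks, so it is False for a dangling link.
        if os.path.islink(link) and not os.path.exists(link):
            os.unlink(link)
            os.symlink(target, link)
        # A valid link (or a regular file) is left untouched.
```

A link that still resolves is kept as-is, matching the "we are done" branch of the original comment; only a link whose target has disappeared is re-pointed.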
@@ -560,6 +564,7 @@ scylla_tests = set([
'test/boost/crc_test',
'test/boost/dict_trainer_test',
'test/boost/dirty_memory_manager_test',
'test/boost/tablet_aware_restore_test',
'test/boost/double_decker_test',
'test/boost/duration_test',
'test/boost/dynamic_bitset_test',
@@ -593,6 +598,7 @@ scylla_tests = set([
'test/boost/linearizing_input_stream_test',
'test/boost/lister_test',
'test/boost/locator_topology_test',
'test/boost/lock_tables_metadata_test',
'test/boost/log_heap_test',
'test/boost/logalloc_standard_allocator_segment_pool_backend_test',
'test/boost/logalloc_test',
@@ -853,6 +859,10 @@ arg_parser.add_argument('--coverage', action = 'store_true', help = 'Compile scy
arg_parser.add_argument('--build-dir', action='store', default='build',
help='Build directory path')
arg_parser.add_argument('--disable-precompiled-header', action='store_true', default=False, help='Disable precompiled header for scylla binary')
arg_parser.add_argument('--time-trace', action='store_true', default=False,
help='Enable Clang -ftime-trace for build profiling. '
'Each .o produces a .json file analyzable with '
'ClangBuildAnalyzer or chrome://tracing')
arg_parser.add_argument('-h', '--help', action='store_true', help='show this help message and exit')
args = arg_parser.parse_args()
if args.help:
@@ -1163,6 +1173,8 @@ scylla_core = (['message/messaging_service.cc',
'index/secondary_index_manager.cc',
'index/secondary_index.cc',
'index/vector_index.cc',
'index/fulltext_index.cc',
'index/index_option_utils.cc',
'utils/UUID_gen.cc',
'utils/i_filter.cc',
'utils/bloom_filter.cc',
@@ -1325,6 +1337,7 @@ scylla_core = (['message/messaging_service.cc',
'ent/ldap/ldap_connection.cc',
'reader_concurrency_semaphore.cc',
'sstables_loader.cc',
'sstables_loader_helpers.cc',
'utils/utf8.cc',
'utils/ascii.cc',
'utils/like_matcher.cc',
@@ -1464,6 +1477,7 @@ idls = ['idl/gossip_digest.idl.hh',
'idl/frozen_mutation.idl.hh',
'idl/reconcilable_result.idl.hh',
'idl/streaming.idl.hh',
'idl/sstables_loader.idl.hh',
'idl/paging_state.idl.hh',
'idl/frozen_schema.idl.hh',
'idl/repair.idl.hh',
@@ -1659,6 +1673,7 @@ deps['test/boost/combined_tests'] += [
'test/boost/auth_cache_test.cc',
'test/boost/auth_test.cc',
'test/boost/batchlog_manager_test.cc',
'test/boost/table_helper_test.cc',
'test/boost/cache_algorithm_test.cc',
'test/boost/castas_fcts_test.cc',
'test/boost/cdc_test.cc',
@@ -1710,7 +1725,7 @@ deps['test/boost/combined_tests'] += [
'test/boost/sstable_compression_config_test.cc',
'test/boost/sstable_directory_test.cc',
'test/boost/sstable_set_test.cc',
'test/boost/sstable_tablet_streaming.cc',
'test/boost/sstable_tablet_streaming_test.cc',
'test/boost/statement_restrictions_test.cc',
'test/boost/storage_proxy_test.cc',
'test/boost/tablets_test.cc',
@@ -1965,6 +1980,9 @@ user_cflags += ' -fextend-variable-liveness=none'
if args.target != '':
user_cflags += ' -march=' + args.target
if args.time_trace:
user_cflags += ' -ftime-trace'
for mode in modes:
# Those flags are passed not only to Scylla objects, but also to libraries
# that we compile ourselves.
@@ -2457,6 +2475,9 @@ def write_build_file(f,
command = reloc/build_deb.sh --reloc-pkg $in --builddir $out
rule unified
command = unified/build_unified.sh --build-dir $builddir/$mode --unified-pkg $out
rule collect_pkgs
command = rm -rf $out && mkdir -p $out && cp $pkgs $out/
description = COLLECT $out
rule rust_header
command = cxxbridge --include rust/cxx.h --header $in > $out
description = RUST_HEADER $out
@@ -2942,6 +2963,8 @@ def write_build_file(f,
build dist-tar: phony dist-unified-tar dist-server-tar dist-python3-tar dist-cqlsh-tar
build dist: phony dist-unified dist-server dist-python3 dist-cqlsh
build collect-dist: phony {' '.join([f'collect-dist-{mode}' for mode in default_modes])}
'''))
f.write(textwrap.dedent(f'''\
@@ -2949,7 +2972,28 @@ def write_build_file(f,
rule dist-check
command = ./tools/testing/dist-check/dist-check.sh --mode $mode
'''))
deb_arch = {'x86_64': 'amd64', 'aarch64': 'arm64'}[arch]
deb_ver = f'{scylla_version}-{scylla_release}-1'
rpm_ver = f'{scylla_version}-{scylla_release}'
for mode in build_modes:
server_rpms_dir = f'$builddir/dist/{mode}/redhat/RPMS/{arch}'
server_rpms = [f'{server_rpms_dir}/{scylla_product}{suffix}-{rpm_ver}.{arch}.rpm'
for suffix in ['', '-server', '-server-debuginfo', '-conf', '-kernel-conf', '-node-exporter']]
cqlsh_rpms = [f'tools/cqlsh/build/redhat/RPMS/{arch}/{scylla_product}-cqlsh-{rpm_ver}.{arch}.rpm']
python3_rpms = [f'tools/python3/build/redhat/RPMS/{arch}/{scylla_product}-python3-{rpm_ver}.{arch}.rpm']
all_rpms = server_rpms + cqlsh_rpms + python3_rpms
server_deb_dir = f'$builddir/dist/{mode}/debian'
server_debs = [f'{server_deb_dir}/{scylla_product}{suffix}_{deb_ver}_{deb_arch}.deb'
for suffix in ['', '-server', '-server-dbg', '-conf', '-kernel-conf', '-node-exporter']]
server_debs += [f'{server_deb_dir}/scylla-enterprise{suffix}_{deb_ver}_all.deb'
for suffix in ['', '-server', '-conf', '-kernel-conf', '-node-exporter']]
cqlsh_debs = [f'tools/cqlsh/build/debian/{scylla_product}-cqlsh_{deb_ver}_{deb_arch}.deb',
f'tools/cqlsh/build/debian/scylla-enterprise-cqlsh_{deb_ver}_all.deb']
python3_debs = [f'tools/python3/build/debian/{scylla_product}-python3_{deb_ver}_{deb_arch}.deb',
f'tools/python3/build/debian/scylla-enterprise-python3_{deb_ver}_all.deb']
all_debs = server_debs + cqlsh_debs + python3_debs
f.write(textwrap.dedent(f'''\
build $builddir/{mode}/dist/tar/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz: copy tools/python3/build/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz
build $builddir/{mode}/dist/tar/{scylla_product}-python3-package.tar.gz: copy tools/python3/build/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz
@@ -2957,6 +3001,11 @@ def write_build_file(f,
build $builddir/{mode}/dist/tar/{scylla_product}-cqlsh-{scylla_version}-{scylla_release}.{arch}.tar.gz: copy tools/cqlsh/build/{scylla_product}-cqlsh-{scylla_version}-{scylla_release}.{arch}.tar.gz
build $builddir/{mode}/dist/tar/{scylla_product}-cqlsh-package.tar.gz: copy tools/cqlsh/build/{scylla_product}-cqlsh-{scylla_version}-{scylla_release}.{arch}.tar.gz
build $builddir/{mode}/dist/rpm: collect_pkgs | {' '.join(all_rpms)} $builddir/dist/{mode}/redhat dist-cqlsh-rpm dist-python3-rpm
pkgs = {' '.join(all_rpms)}
build $builddir/{mode}/dist/deb: collect_pkgs | {' '.join(all_debs)} $builddir/dist/{mode}/debian dist-cqlsh-deb dist-python3-deb
pkgs = {' '.join(all_debs)}
build collect-dist-{mode}: phony $builddir/{mode}/dist/rpm $builddir/{mode}/dist/deb
build {mode}-dist: phony dist-server-{mode} dist-server-debuginfo-{mode} dist-python3-{mode} dist-unified-{mode} dist-cqlsh-{mode}
build dist-{mode}: phony {mode}-dist
build dist-check-{mode}: dist-check

@@ -136,9 +136,9 @@ public:
{}
future<> insert(auth::authenticated_user user, cql3::prepared_cache_key_type prep_cache_key, value_type v) noexcept {
return _cache.get_ptr(key_type(std::move(user), std::move(prep_cache_key)), [v = std::move(v)] (const cache_key_type&) mutable {
return _cache.insert(key_type(std::move(user), std::move(prep_cache_key)), [v = std::move(v)] (const cache_key_type&) mutable {
return make_ready_future<value_type>(std::move(v));
}).discard_result();
});
}
value_ptr find(const auth::authenticated_user& user, const cql3::prepared_cache_key_type& prep_cache_key) {

@@ -1070,7 +1070,7 @@ try_prepare_count_rows(const expr::function_call& fc, data_dictionary::database
.args = {},
};
} else {
throw exceptions::invalid_request_exception(format("count() expects a column or the literal 1 as an argument", fc.args[0]));
throw exceptions::invalid_request_exception(format("count() expects a column or the literal 1 as an argument, got {}", fc.args[0]));
}
}
}

@@ -13,6 +13,7 @@
#include "cql3/prepare_context.hh"
#include "cql3/expr/expr-utils.hh"
#include "types/list.hh"
#include "types/tuple.hh"
#include <iterator>
#include <ranges>
@@ -116,6 +117,34 @@ void validate_token_relation(const std::vector<const column_definition*> column_
}
}
void validate_tuples_size(const expression& rhs, size_t valid_size) {
auto coll = as_if<collection_constructor>(&rhs);
if (!coll) {
// Pre-prepare, the IN list arrives as a collection_constructor.
// After prepare it would be a constant of list type whose elements
// are serialized; arity validation has already happened earlier in
// that case, so nothing to do here.
return;
}
for (const auto& expr : coll->elements) {
size_t expr_size = 0;
if (auto tuple = as_if<tuple_constructor>(&expr)) {
expr_size = tuple->elements.size();
} else {
auto the_const = as_if<constant>(&expr);
if (the_const && the_const->type->without_reversed().is_tuple()) {
const tuple_type_impl* const_tuple = dynamic_cast<const tuple_type_impl*>(&the_const->type->without_reversed());
expr_size = const_tuple->size();
} else {
continue; // not a tuple; perhaps we need to set expr_size to 1 here when #12554 is fixed
}
}
if (expr_size != valid_size) {
throw exceptions::invalid_request_exception(format("Expected {} elements in value tuple, but got {}: {}", valid_size, expr_size, expr));
}
}
}
void preliminary_binop_vaidation_checks(const binary_operator& binop) {
if (binop.op == oper_t::NEQ) {
throw exceptions::invalid_request_exception(format("Unsupported \"!=\" relation: {:user}", binop));
@@ -142,6 +171,10 @@ void preliminary_binop_vaidation_checks(const binary_operator& binop) {
throw exceptions::invalid_request_exception("LIKE cannot be used for Multi-column relations");
}
if (binop.op == oper_t::IN) {
validate_tuples_size(binop.rhs, lhs_tup->elements.size());
}
if (auto rhs_tup = as_if<tuple_constructor>(&binop.rhs)) {
if (lhs_tup->elements.size() != rhs_tup->elements.size()) {
throw exceptions::invalid_request_exception(

@@ -343,102 +343,102 @@ to_predicates(
auto cdef = col.col;
auto type = &cdef->type->without_reversed();
if (oper.op == oper_t::IS_NOT) {
return to_vector(predicate{
.solve_for = nullptr,
.filter = oper,
.on = on_column{col.col},
.is_not_null_single_column = is_null_constant(oper.rhs),
.op = oper.op,
});
return to_vector(predicate{
.solve_for = nullptr,
.filter = oper,
.on = on_column{col.col},
.is_not_null_single_column = is_null_constant(oper.rhs),
.op = oper.op,
});
}
if (is_compare(oper.op)) {
auto solve = [oper] (const query_options& options) {
managed_bytes_opt val = evaluate(oper.rhs, options).to_managed_bytes_opt();
if (!val) {
return empty_value_set; // All NULL comparisons fail; no column values match.
}
return oper.op == oper_t::EQ ? value_set(value_list{*val})
: to_range(oper.op, std::move(*val));
};
return to_vector(predicate{
.solve_for = std::move(solve),
.filter = oper,
.on = on_column{col.col},
.is_singleton = (oper.op == oper_t::EQ),
.equality = (oper.op == oper_t::EQ),
.is_slice = expr::is_slice(oper.op),
.is_upper_bound = (oper.op == oper_t::LT || oper.op == oper_t::LTE),
.is_lower_bound = (oper.op == oper_t::GT || oper.op == oper_t::GTE),
.order = oper.order,
.op = oper.op,
});
auto solve = [oper] (const query_options& options) {
managed_bytes_opt val = evaluate(oper.rhs, options).to_managed_bytes_opt();
if (!val) {
return empty_value_set; // All NULL comparisons fail; no column values match.
}
return oper.op == oper_t::EQ ? value_set(value_list{*val})
: to_range(oper.op, std::move(*val));
};
return to_vector(predicate{
.solve_for = std::move(solve),
.filter = oper,
.on = on_column{col.col},
.is_singleton = (oper.op == oper_t::EQ),
.equality = (oper.op == oper_t::EQ),
.is_slice = expr::is_slice(oper.op),
.is_upper_bound = (oper.op == oper_t::LT || oper.op == oper_t::LTE),
.is_lower_bound = (oper.op == oper_t::GT || oper.op == oper_t::GTE),
.order = oper.order,
.op = oper.op,
});
} else if (oper.op == oper_t::IN) {
auto solve = [oper, type, cdef] (const query_options& options) {
return get_IN_values(oper.rhs, options, type->as_less_comparator(), cdef->name_as_text());
};
return to_vector(predicate{
.solve_for = std::move(solve),
.filter = oper,
.on = on_column{col.col},
.is_singleton = false,
.is_in = true,
.order = oper.order,
.op = oper.op,
});
auto solve = [oper, type, cdef] (const query_options& options) {
return get_IN_values(oper.rhs, options, type->as_less_comparator(), cdef->name_as_text());
};
return to_vector(predicate{
.solve_for = std::move(solve),
.filter = oper,
.on = on_column{col.col},
.is_singleton = false,
.is_in = true,
.order = oper.order,
.op = oper.op,
});
} else if (oper.op == oper_t::CONTAINS || oper.op == oper_t::CONTAINS_KEY) {
auto solve = [oper] (const query_options& options) {
managed_bytes_opt val = evaluate(oper.rhs, options).to_managed_bytes_opt();
if (!val) {
return empty_value_set; // All NULL comparisons fail; no column values match.
}
return value_set(value_list{*val});
};
return to_vector(predicate{
.solve_for = std::move(solve),
.filter = oper,
.on = on_column{col.col},
.is_singleton = false,
.order = oper.order,
.op = oper.op,
});
auto solve = [oper] (const query_options& options) {
managed_bytes_opt val = evaluate(oper.rhs, options).to_managed_bytes_opt();
if (!val) {
return empty_value_set; // All NULL comparisons fail; no column values match.
}
return value_set(value_list{*val});
};
return to_vector(predicate{
.solve_for = std::move(solve),
.filter = oper,
.on = on_column{col.col},
.is_singleton = false,
.order = oper.order,
.op = oper.op,
});
}
return cannot_solve_on_column(oper, col.col);
},
[&] (const subscript& s) -> std::vector<predicate> {
const column_value& col = get_subscripted_column(s);
if (oper.op == oper_t::EQ) {
auto solve = [s, oper] (const query_options& options) {
managed_bytes_opt sval = evaluate(s.sub, options).to_managed_bytes_opt();
if (!sval) {
return empty_value_set; // NULL can't be a map key
}
if (oper.op == oper_t::EQ) {
auto solve = [s, oper] (const query_options& options) {
managed_bytes_opt sval = evaluate(s.sub, options).to_managed_bytes_opt();
if (!sval) {
return empty_value_set; // NULL can't be a map key
}
managed_bytes_opt rval = evaluate(oper.rhs, options).to_managed_bytes_opt();
if (!rval) {
return empty_value_set; // All NULL comparisons fail; no column values match.
}
managed_bytes_opt elements[] = {sval, rval};
managed_bytes val = tuple_type_impl::build_value_fragmented(elements);
return value_set(value_list{val});
};
return to_vector(predicate{
.solve_for = std::move(solve),
.filter = oper,
.on = on_column{col.col},
.is_singleton = true,
.equality = true,
.order = oper.order,
.op = oper.op,
.is_subscript = true,
});
}
return cannot_solve_on_column(oper, col.col);
managed_bytes_opt rval = evaluate(oper.rhs, options).to_managed_bytes_opt();
if (!rval) {
return empty_value_set; // All NULL comparisons fail; no column values match.
}
managed_bytes_opt elements[] = {sval, rval};
managed_bytes val = tuple_type_impl::build_value_fragmented(elements);
return value_set(value_list{val});
};
return to_vector(predicate{
.solve_for = std::move(solve),
.filter = oper,
.on = on_column{col.col},
.is_singleton = true,
.equality = true,
.order = oper.order,
.op = oper.op,
.is_subscript = true,
});
}
return cannot_solve_on_column(oper, col.col);
},
[&] (const tuple_constructor& tuple) -> std::vector<predicate> {
auto columns = tuple.elements
| std::views::transform([] (const expression& e) { return as<column_value>(e).col; })
| std::ranges::to<std::vector>();
| std::views::transform([] (const expression& e) { return as<column_value>(e).col; })
| std::ranges::to<std::vector>();
for (unsigned i = 0; i < columns.size(); ++i) {
if (!columns[i]->is_clustering_key() || columns[i]->position() != i) {
on_internal_error(rlogger, "to_predicates: multi-column relation not on a clustering key prefix");
@@ -481,42 +481,42 @@ to_predicates(
if (!(oper.op == oper_t::EQ || is_slice(oper.op))) {
return cannot_solve(oper);
}
auto solve = [oper] (const query_options& options) -> value_set {
auto val = evaluate(oper.rhs, options).to_managed_bytes_opt();
if (!val) {
return empty_value_set; // All NULL comparisons fail; no token values match.
}
if (oper.op == oper_t::EQ) {
return value_list{*val};
} else if (oper.op == oper_t::GT) {
return interval<managed_bytes>::make_starting_with(interval_bound(std::move(*val), exclusive));
} else if (oper.op == oper_t::GTE) {
return interval<managed_bytes>::make_starting_with(interval_bound(std::move(*val), inclusive));
}
static const managed_bytes MININT = managed_bytes(serialized(std::numeric_limits<int64_t>::min())),
MAXINT = managed_bytes(serialized(std::numeric_limits<int64_t>::max()));
// Undocumented feature: when the user types `token(...) < MININT`, we interpret
// that as MAXINT for some reason.
const auto adjusted_val = (*val == MININT) ? MAXINT : *val;
if (oper.op == oper_t::LT) {
return interval<managed_bytes>::make_ending_with(interval_bound(std::move(adjusted_val), exclusive));
} else if (oper.op == oper_t::LTE) {
return interval<managed_bytes>::make_ending_with(interval_bound(std::move(adjusted_val), inclusive));
}
throw std::logic_error(format("get_token_interval unexpected operator {}", oper.op));
};
return to_vector(predicate{
.solve_for = std::move(solve),
.filter = oper,
.on = on_partition_key_token{table_schema_opt},
.is_singleton = (oper.op == oper_t::EQ),
.equality = (oper.op == oper_t::EQ),
.is_slice = expr::is_slice(oper.op),
.is_upper_bound = (oper.op == oper_t::LT || oper.op == oper_t::LTE),
.is_lower_bound = (oper.op == oper_t::GT || oper.op == oper_t::GTE),
.order = oper.order,
.op = oper.op,
});
auto solve = [oper] (const query_options& options) -> value_set {
auto val = evaluate(oper.rhs, options).to_managed_bytes_opt();
if (!val) {
return empty_value_set; // All NULL comparisons fail; no token values match.
}
if (oper.op == oper_t::EQ) {
return value_list{*val};
} else if (oper.op == oper_t::GT) {
return interval<managed_bytes>::make_starting_with(interval_bound(std::move(*val), exclusive));
} else if (oper.op == oper_t::GTE) {
return interval<managed_bytes>::make_starting_with(interval_bound(std::move(*val), inclusive));
}
static const managed_bytes MININT = managed_bytes(serialized(std::numeric_limits<int64_t>::min())),
MAXINT = managed_bytes(serialized(std::numeric_limits<int64_t>::max()));
// Undocumented feature: when the user types `token(...) < MININT`, we interpret
// that as MAXINT for some reason.
const auto adjusted_val = (*val == MININT) ? MAXINT : *val;
if (oper.op == oper_t::LT) {
return interval<managed_bytes>::make_ending_with(interval_bound(std::move(adjusted_val), exclusive));
} else if (oper.op == oper_t::LTE) {
return interval<managed_bytes>::make_ending_with(interval_bound(std::move(adjusted_val), inclusive));
}
throw std::logic_error(format("get_token_interval unexpected operator {}", oper.op));
};
return to_vector(predicate{
.solve_for = std::move(solve),
.filter = oper,
.on = on_partition_key_token{table_schema_opt},
.is_singleton = (oper.op == oper_t::EQ),
.equality = (oper.op == oper_t::EQ),
.is_slice = expr::is_slice(oper.op),
.is_upper_bound = (oper.op == oper_t::LT || oper.op == oper_t::LTE),
.is_lower_bound = (oper.op == oper_t::GT || oper.op == oper_t::GTE),
.order = oper.order,
.op = oper.op,
});
},
[&] (const binary_operator&) -> std::vector<predicate> {
return cannot_solve(oper);
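
Reviewer note: the upper-bound branch in the hunk above remaps a right-hand side of MININT to MAXINT before building the interval. A minimal sketch of that adjustment on plain `int64_t` values follows. The real code compares serialized `managed_bytes`; the `upper_bound` struct, the `oper` enum and `token_upper_bound` are hypothetical stand-ins, not Scylla types.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical stand-in for the upper half of interval<managed_bytes>.
struct upper_bound {
    int64_t value;
    bool inclusive;
};

// Hypothetical stand-in for the LT/LTE cases of oper_t.
enum class oper { LT, LTE };

// Mirrors the adjustment in the hunk: an upper bound equal to MININT
// is interpreted as MAXINT; every other value is used unchanged.
inline upper_bound token_upper_bound(int64_t val, oper op) {
    constexpr int64_t MININT = std::numeric_limits<int64_t>::min();
    constexpr int64_t MAXINT = std::numeric_limits<int64_t>::max();
    const int64_t adjusted = (val == MININT) ? MAXINT : val;
    return upper_bound{adjusted, op == oper::LTE};
}
```

Under these assumptions, `token(...) < MININT` ends at MAXINT (exclusive) rather than producing an empty range.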
@@ -555,7 +555,7 @@ to_predicates(
return cannot_solve(oper);
},
}, oper.lhs);
},
},
[] (const column_value& cv) -> std::vector<predicate> {
return cannot_solve(cv);
},
@@ -806,26 +806,26 @@ bool is_empty_restriction(const expression& e) {
static
std::function<bytes_opt (const query_options&)>
build_value_for_fn(const column_definition& cdef, const expression& e, const schema& s) {
auto ac = to_predicate_on_column(e, &cdef, &s);
return [ac] (const query_options& options) -> bytes_opt {
value_set possible_vals = solve(ac, options);
return std::visit(overloaded_functor {
[&](const value_list& val_list) -> bytes_opt {
if (val_list.empty()) {
return std::nullopt;
}
auto ac = to_predicate_on_column(e, &cdef, &s);
return [ac] (const query_options& options) -> bytes_opt {
value_set possible_vals = solve(ac, options);
return std::visit(overloaded_functor {
[&](const value_list& val_list) -> bytes_opt {
if (val_list.empty()) {
return std::nullopt;
}
if (val_list.size() != 1) {
on_internal_error(expr_logger, format("expr::value_for - multiple possible values for column: {}", ac.filter));
}
if (val_list.size() != 1) {
on_internal_error(expr_logger, format("expr::value_for - multiple possible values for column: {}", ac.filter));
}
return to_bytes(val_list.front());
},
[&](const interval<managed_bytes>&) -> bytes_opt {
on_internal_error(expr_logger, format("expr::value_for - possible values are a range: {}", ac.filter));
}
}, possible_vals);
};
return to_bytes(val_list.front());
},
[&](const interval<managed_bytes>&) -> bytes_opt {
on_internal_error(expr_logger, format("expr::value_for - possible values are a range: {}", ac.filter));
}
}, possible_vals);
};
}
bool contains_multi_column_restriction(const expression& e) {
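
Reviewer note: `build_value_for_fn` in the hunk above visits a `value_set` and accepts only an empty list or a single value. The pattern can be sketched in isolation as below. This is an illustration only: `value_list`, `value_range`, `value_set` and `single_value` are simplified, hypothetical types built on `std::string`, and exceptions stand in for `on_internal_error`.

```cpp
#include <optional>
#include <stdexcept>
#include <string>
#include <variant>
#include <vector>

// Standard "overloaded" helper for std::visit.
template <class... Fs> struct overloaded : Fs... { using Fs::operator()...; };
template <class... Fs> overloaded(Fs...) -> overloaded<Fs...>;

using value_list = std::vector<std::string>;          // hypothetical
struct value_range { std::string lo, hi; };           // hypothetical
using value_set = std::variant<value_list, value_range>;

// Returns the only possible value, nullopt for an empty list,
// and throws when the set holds several values or a range.
inline std::optional<std::string> single_value(const value_set& vs) {
    return std::visit(overloaded{
        [](const value_list& l) -> std::optional<std::string> {
            if (l.empty()) {
                return std::nullopt;
            }
            if (l.size() != 1) {
                throw std::logic_error("multiple possible values");
            }
            return l.front();
        },
        [](const value_range&) -> std::optional<std::string> {
            throw std::logic_error("possible values are a range");
        }
    }, vs);
}
```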
@@ -1337,11 +1337,11 @@ statement_restrictions::ck_restrictions_need_filtering() const {
}
return has_partition_key_unrestricted_components()
|| clustering_key_restrictions_need_filtering()
// If token restrictions are present in an indexed query, then all other restrictions need to be filtered.
// A single token restriction can have multiple matching partition key values.
// Because of this we can't create a clustering prefix with more than token restriction.
|| (_uses_secondary_indexing && has_token_restrictions());
|| clustering_key_restrictions_need_filtering()
// If token restrictions are present in an indexed query, then all other restrictions need to be filtered.
// A single token restriction can have multiple matching partition key values.
// Because of this we can't create a clustering prefix with more than token restriction.
|| (_uses_secondary_indexing && has_token_restrictions());
}
bool
@@ -1705,28 +1705,28 @@ dht::partition_range_vector statement_restrictions::get_partition_key_ranges(con
get_partition_key_ranges_fn_t
statement_restrictions::build_partition_key_ranges_fn() const {
return std::visit(overloaded_functor{
[&] (const no_partition_range_restrictions&) -> get_partition_key_ranges_fn_t {
return [] (const query_options& options) -> dht::partition_range_vector{
return {dht::partition_range::make_open_ended_both_sides()};
};
},
[&] (const token_range_restrictions& r) -> get_partition_key_ranges_fn_t {
return [&] (const query_options& options) -> dht::partition_range_vector {
return partition_ranges_from_token(r.token_restrictions, options, *_schema);
};
},
[&] (const single_column_partition_range_restrictions& r) -> get_partition_key_ranges_fn_t {
if (_partition_range_is_simple) {
return [&] (const query_options& options) {
// Special case to avoid extra allocations required for a Cartesian product.
return partition_ranges_from_EQs(r.per_column_restrictions, options, *_schema);
[&] (const no_partition_range_restrictions&) -> get_partition_key_ranges_fn_t {
return [] (const query_options& options) -> dht::partition_range_vector{
return {dht::partition_range::make_open_ended_both_sides()};
};
} else {
return [&] (const query_options& options) {
return partition_ranges_from_singles(r.per_column_restrictions, options, *_schema);
},
[&] (const token_range_restrictions& r) -> get_partition_key_ranges_fn_t {
return [&] (const query_options& options) -> dht::partition_range_vector {
return partition_ranges_from_token(r.token_restrictions, options, *_schema);
};
}
}}, _partition_range_restrictions);
},
[&] (const single_column_partition_range_restrictions& r) -> get_partition_key_ranges_fn_t {
if (_partition_range_is_simple) {
return [&] (const query_options& options) {
// Special case to avoid extra allocations required for a Cartesian product.
return partition_ranges_from_EQs(r.per_column_restrictions, options, *_schema);
};
} else {
return [&] (const query_options& options) {
return partition_ranges_from_singles(r.per_column_restrictions, options, *_schema);
};
}
}}, _partition_range_restrictions);
}
namespace {
@@ -1970,28 +1970,28 @@ build_get_multi_column_clustering_bounds_fn(
}
});
}
return [schema, range_builders, all_natural, all_reverse] (const query_options& options) -> std::vector<query::clustering_range> {
multi_column_range_accumulator acc;
for (auto& builder : range_builders) {
builder(acc, options);
}
auto bounds = std::move(acc.ranges);
return [schema, range_builders, all_natural, all_reverse] (const query_options& options) -> std::vector<query::clustering_range> {
multi_column_range_accumulator acc;
for (auto& builder : range_builders) {
builder(acc, options);
}
auto bounds = std::move(acc.ranges);
if (!all_natural && !all_reverse) {
std::vector<query::clustering_range> bounds_in_clustering_order;
for (const auto& b : bounds) {
const auto eqv = get_equivalent_ranges(b, *schema);
bounds_in_clustering_order.insert(bounds_in_clustering_order.end(), eqv.cbegin(), eqv.cend());
if (!all_natural && !all_reverse) {
std::vector<query::clustering_range> bounds_in_clustering_order;
for (const auto& b : bounds) {
const auto eqv = get_equivalent_ranges(b, *schema);
bounds_in_clustering_order.insert(bounds_in_clustering_order.end(), eqv.cbegin(), eqv.cend());
}
return bounds_in_clustering_order;
}
return bounds_in_clustering_order;
}
if (all_reverse) {
for (auto& crange : bounds) {
crange = query::clustering_range(crange.end(), crange.start());
if (all_reverse) {
for (auto& crange : bounds) {
crange = query::clustering_range(crange.end(), crange.start());
}
}
}
return bounds;
};
return bounds;
};
}
/// Reverses the range if the type is reversed. Why don't we have interval::reverse()??
@@ -2288,17 +2288,17 @@ build_range_from_raw_bounds_fn(
std::vector<std::function<query::clustering_range (const query_options&)>> range_builders;
for (const auto& e : exprs | std::views::transform(&predicate::filter)) {
if (auto b = find_clustering_order(e)) {
range_builders.emplace_back([bb = *b, &schema] (const query_options& options) {
auto* b = &bb;
cql3::raw_value tup_val = expr::evaluate(b->rhs, options);
if (tup_val.is_null()) {
on_internal_error(rlogger, format("range_from_raw_bounds: unexpected atom {}", *b));
}
range_builders.emplace_back([bb = *b, &schema] (const query_options& options) {
auto* b = &bb;
cql3::raw_value tup_val = expr::evaluate(b->rhs, options);
if (tup_val.is_null()) {
on_internal_error(rlogger, format("range_from_raw_bounds: unexpected atom {}", *b));
}
const auto r = to_range(
const auto r = to_range(
b->op, clustering_key_prefix::from_optional_exploded(schema, expr::get_tuple_elements(tup_val, *type_of(b->rhs))));
return r;
});
return r;
});
}
}
return [range_builders] (const query_options& options) -> std::vector<query::clustering_range> {
@@ -2322,9 +2322,9 @@ build_range_from_raw_bounds_fn(
get_clustering_bounds_fn_t
statement_restrictions::build_get_clustering_bounds_fn() const {
if (_clustering_prefix_restrictions.empty()) {
return [&] (const query_options& options) -> std::vector<query::clustering_range> {
return {query::clustering_range::make_open_ended_both_sides()};
};
return [&] (const query_options& options) -> std::vector<query::clustering_range> {
return {query::clustering_range::make_open_ended_both_sides()};
};
}
if (_clustering_prefix_restrictions[0].is_multi_column) {
bool all_natural = true, all_reverse = true; ///< Whether column types are reversed or natural.
@@ -2342,14 +2342,14 @@ statement_restrictions::build_get_clustering_bounds_fn() const {
}
}
}
return build_get_multi_column_clustering_bounds_fn(_schema, _clustering_prefix_restrictions,
all_natural, all_reverse);
} else {
return [&] (const query_options& options) -> std::vector<query::clustering_range> {
return get_single_column_clustering_bounds(options, *_schema, _clustering_prefix_restrictions);
};
return build_get_multi_column_clustering_bounds_fn(_schema, _clustering_prefix_restrictions,
all_natural, all_reverse);
} else {
return [&] (const query_options& options) -> std::vector<query::clustering_range> {
return get_single_column_clustering_bounds(options, *_schema, _clustering_prefix_restrictions);
};
}
}
}
std::vector<query::clustering_range> statement_restrictions::get_clustering_bounds(const query_options& options) const {
return _get_clustering_bounds_fn(options);
@@ -2475,11 +2475,11 @@ void statement_restrictions::prepare_indexed_global(const schema& idx_tbl_schema
_idx_tbl_ck_prefix->reserve(_idx_tbl_ck_prefix->size() + idx_tbl_schema.clustering_key_size());
auto *single_column_partition_key_restrictions = std::get_if<single_column_partition_range_restrictions>(&_partition_range_restrictions);
if (single_column_partition_key_restrictions) {
for (const auto& e : single_column_partition_key_restrictions->per_column_restrictions) {
const auto col = require_on_single_column(e);
const auto pos = _schema->position(*col) + 1;
(*_idx_tbl_ck_prefix)[pos] = replace_column_def(e, &idx_tbl_schema.clustering_column_at(pos));
}
for (const auto& e : single_column_partition_key_restrictions->per_column_restrictions) {
const auto col = require_on_single_column(e);
const auto pos = _schema->position(*col) + 1;
(*_idx_tbl_ck_prefix)[pos] = replace_column_def(e, &idx_tbl_schema.clustering_column_at(pos));
}
}
if (std::ranges::any_of(*_idx_tbl_ck_prefix | std::views::drop(1) | std::views::transform(&predicate::filter), is_empty_restriction)) {
@@ -2621,10 +2621,10 @@ statement_restrictions::build_get_global_index_clustering_ranges_fn() const {
return {};
}
return [&] (const query_options& options) {
// Multi column restrictions are not added to _idx_tbl_ck_prefix, they are handled later by filtering.
return get_single_column_clustering_bounds(options, *_view_schema, *_idx_tbl_ck_prefix);
};
return [&] (const query_options& options) {
// Multi column restrictions are not added to _idx_tbl_ck_prefix, they are handled later by filtering.
return get_single_column_clustering_bounds(options, *_view_schema, *_idx_tbl_ck_prefix);
};
}
std::vector<query::clustering_range> statement_restrictions::get_global_index_clustering_ranges(
@@ -2643,14 +2643,14 @@ statement_restrictions::build_get_global_index_token_clustering_ranges_fn() cons
// In old indexes the token column was of type blob.
// This causes problems with sorting and must be handled separately.
if (token_column.type != long_type) {
return [&] (const query_options& options) {
return get_index_v1_token_range_clustering_bounds(options, token_column, _idx_tbl_ck_prefix->at(0));
};
return [&] (const query_options& options) {
return get_index_v1_token_range_clustering_bounds(options, token_column, _idx_tbl_ck_prefix->at(0));
};
}
return [&] (const query_options& options) {
return get_single_column_clustering_bounds(options, *_view_schema, *_idx_tbl_ck_prefix);
};
return [&] (const query_options& options) {
return get_single_column_clustering_bounds(options, *_view_schema, *_idx_tbl_ck_prefix);
};
}
std::vector<query::clustering_range> statement_restrictions::get_global_index_token_clustering_ranges(
@@ -2664,10 +2664,10 @@ statement_restrictions::build_get_local_index_clustering_ranges_fn() const {
return {};
}
return [&] (const query_options& options) {
// Multi column restrictions are not added to _idx_tbl_ck_prefix, they are handled later by filtering.
return get_single_column_clustering_bounds(options, *_view_schema, *_idx_tbl_ck_prefix);
};
return [&] (const query_options& options) {
// Multi column restrictions are not added to _idx_tbl_ck_prefix, they are handled later by filtering.
return get_single_column_clustering_bounds(options, *_view_schema, *_idx_tbl_ck_prefix);
};
}
std::vector<query::clustering_range> statement_restrictions::get_local_index_clustering_ranges(


@@ -351,6 +351,9 @@ public:
if (agg.state_to_result_function) {
ret.push_back(agg.state_to_result_function);
}
if (agg.state_reduction_function) {
ret.push_back(agg.state_reduction_function);
}
}
}
return false;


@@ -71,7 +71,7 @@ future<shared_ptr<result_message>> modification_statement::execute_without_check
using namespace service::strong_consistency;
if (const auto* redirect = get_if<need_redirect>(&mutate_result)) {
bool is_write = true;
co_return co_await redirect_statement(qp, options, redirect->target, timeout, is_write);
co_return co_await redirect_statement(qp, options, redirect->target, timeout, is_write, coordinator.get().get_stats());
}
utils::get_local_injector().inject("sc_modification_statement_timeout", [&] {
throw exceptions::mutation_write_timeout_exception{"", "", options.get_consistency(), 0, 0, db::write_type::SIMPLE};


@@ -47,7 +47,7 @@ future<::shared_ptr<result_message>> select_statement::do_execute(query_processo
using namespace service::strong_consistency;
if (const auto* redirect = get_if<need_redirect>(&query_result)) {
bool is_write = false;
co_return co_await redirect_statement(qp, options, redirect->target, timeout, is_write);
co_return co_await redirect_statement(qp, options, redirect->target, timeout, is_write, coordinator.get().get_stats());
}
co_return co_await process_results(get<lw_shared_ptr<query::result>>(std::move(query_result)),


@@ -12,19 +12,23 @@
#include "cql3/query_processor.hh"
#include "replica/database.hh"
#include "locator/tablet_replication_strategy.hh"
#include "service/strong_consistency/coordinator.hh"
namespace cql3::statements::strong_consistency {
future<::shared_ptr<cql_transport::messages::result_message>> redirect_statement(query_processor& qp,
const query_options& options,
const locator::tablet_replica& target,
db::timeout_clock::time_point timeout,
bool is_write)
bool is_write,
service::strong_consistency::stats& stats)
{
auto&& func_values_cache = const_cast<cql3::query_options&>(options).take_cached_pk_function_calls();
const auto my_host_id = qp.db().real_database().get_token_metadata().get_topology().my_host_id();
if (target.host != my_host_id) {
++(is_write ? stats.write_node_bounces : stats.read_node_bounces);
co_return qp.bounce_to_node(target, std::move(func_values_cache), timeout, is_write);
}
++(is_write ? stats.write_shard_bounces : stats.read_shard_bounces);
co_return qp.bounce_to_shard(target.shard, std::move(func_values_cache));
}
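
Reviewer note: the new `stats&` parameter lets `redirect_statement` count bounces by destination (node or shard) and by direction (read or write). A minimal sketch of that bookkeeping follows. The four counter names are taken from the hunk above, but the counter type, the `stats` layout and the `count_bounce` helper are assumptions made for illustration.

```cpp
#include <cstdint>

// Assumed layout; only the four counter names appear in the diff.
struct stats {
    uint64_t write_node_bounces = 0;
    uint64_t read_node_bounces = 0;
    uint64_t write_shard_bounces = 0;
    uint64_t read_shard_bounces = 0;
};

// Hypothetical helper: bumps exactly one counter per redirect.
// The conditional operator yields an lvalue here because both
// operands are lvalues of the same type, so ++ applies directly.
inline void count_bounce(stats& s, bool target_is_remote_host, bool is_write) {
    if (target_is_remote_host) {
        ++(is_write ? s.write_node_bounces : s.read_node_bounces);
    } else {
        ++(is_write ? s.write_shard_bounces : s.read_shard_bounces);
    }
}
```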


@@ -11,6 +11,8 @@
#include "cql3/cql_statement.hh"
#include "locator/tablets.hh"
namespace service::strong_consistency { struct stats; }
namespace cql3::statements::strong_consistency {
future<::shared_ptr<cql_transport::messages::result_message>> redirect_statement(
@@ -18,7 +20,8 @@ future<::shared_ptr<cql_transport::messages::result_message>> redirect_statement
const query_options& options,
const locator::tablet_replica& target,
db::timeout_clock::time_point timeout,
bool is_write);
bool is_write,
service::strong_consistency::stats& stats);
bool is_strongly_consistent(data_dictionary::database db, std::string_view ks_name);


@@ -339,7 +339,7 @@ static storage_options::object_storage object_storage_from_map(std::string_view
}
if (values.size() > allowed_options.size()) {
throw std::runtime_error(fmt::format("Extraneous options for {}: {}; allowed: {}",
fmt::join(values | std::views::keys, ","), type,
type, fmt::join(values | std::views::keys, ","),
fmt::join(allowed_options | std::views::keys, ",")));
}
options.type = std::string(type);


@@ -776,7 +776,7 @@ class db::commitlog::segment : public enable_shared_from_this<segment>, public c
friend std::ostream& operator<<(std::ostream&, const segment&);
friend class segment_manager;
size_t sector_overhead(size_t size) const {
constexpr size_t sector_overhead(size_t size) const {
return (size / (_alignment - detail::sector_overhead_size)) * detail::sector_overhead_size;
}
@@ -1028,18 +1028,21 @@ public:
co_return me;
}
/**
* Allocate a new buffer
*/
void new_buffer(size_t s) {
SCYLLA_ASSERT(_buffer.empty());
std::tuple<size_t, size_t> buffer_usage_size(size_t s) const {
auto overhead = segment_overhead_size;
if (_file_pos == 0) {
overhead += descriptor_header_size;
}
s += overhead;
return {s + overhead, overhead};
}
/**
* Allocate a new buffer
*/
void new_buffer(size_t size_in) {
SCYLLA_ASSERT(_buffer.empty());
auto [s, overhead] = buffer_usage_size(size_in);
// add bookkeep data reqs.
auto a = align_up(s + sector_overhead(s), _alignment);
auto k = std::max(a, default_size);
@@ -1427,6 +1430,9 @@ public:
position_type next_position(size_t size) const {
auto used = _buffer_ostream_size - _buffer_ostream.size();
if (used == 0) { // new chunk/segment
std::tie(size, std::ignore) = buffer_usage_size(size);
}
used += size;
return _file_pos + used + sector_overhead(used);
}
@@ -1570,7 +1576,6 @@ future<> db::commitlog::segment_manager::oversized_allocation(entry_writer& writ
clogger.debug("Attempting oversized alloc of {} entry writer", writer.num_entries);
auto size = writer.size();
auto max_file_size = cfg.commitlog_segment_size_in_mb * 1024 * 1024;
// check if this cannot be written at all...
if (!cfg.allow_going_over_size_limit) {
@@ -1579,11 +1584,11 @@ future<> db::commitlog::segment_manager::oversized_allocation(entry_writer& writ
// more worst case
auto size_with_meta_overhead = size_with_sector_overhead
+ (1 + size_with_sector_overhead/max_mutation_size) * (segment::entry_overhead_size + segment::fragmented_entry_overhead_size + segment::segment_overhead_size)
* (1 + size_with_sector_overhead/max_file_size) * segment::descriptor_header_size
* (1 + size_with_sector_overhead/max_size) * segment::descriptor_header_size
;
// this is not really true. We could have some space in current segment,
// but again, lets be conservative.
auto max_file_size_avail = max_disk_size - max_file_size;
auto max_file_size_avail = max_disk_size - max_size;
if (size_with_meta_overhead > max_file_size_avail) {
throw std::invalid_argument(fmt::format("Mutation of {} bytes is too large for potentially available disk space of {}", size, max_file_size_avail));
@@ -1770,11 +1775,13 @@ future<> db::commitlog::segment_manager::oversized_allocation(entry_writer& writ
co_await s->close();
s = co_await get_segment();
}
// bytes not counting overhead
auto buf_rem = std::min(max_size - s->position(), s->_buffer_ostream.size());
// bytes not counting overhead
auto pos = s->position();
auto max = std::max<size_t>(pos, max_size);
auto buf_rem = std::min(max_size - max, s->_buffer_ostream.size());
size_t avail;
if (buf_rem > align) {
if (buf_rem >= align) {
auto rem2 = buf_rem - (1 + buf_rem/sector_size) * detail::sector_overhead_size;
avail = std::min(rem2, max_mutation_size)
- segment::entry_overhead_size
@@ -1784,7 +1791,7 @@ future<> db::commitlog::segment_manager::oversized_allocation(entry_writer& writ
} else {
co_await s->cycle();
auto pos = s->position();
auto max = std::max<size_t>(pos, max_file_size);
auto max = std::max<size_t>(pos, max_size);
auto file_rem = max - pos;
if (file_rem < align) {


@@ -217,7 +217,7 @@ future<> db::commitlog_replayer::impl::process(stats* s, commitlog::buffer_and_r
if (cm_it == local_cm.end()) {
if (!cer.get_column_mapping()) {
rlogger.debug("replaying at {} v={} at {}", fm.column_family_id(), fm.schema_version(), rp);
throw std::runtime_error(format("unknown schema version {}, table=", fm.schema_version(), fm.column_family_id()));
throw std::runtime_error(format("unknown schema version {}, table={}", fm.schema_version(), fm.column_family_id()));
}
rlogger.debug("new schema version {} in entry {}", fm.schema_version(), rp);
cm_it = local_cm.emplace(fm.schema_version(), *cer.get_column_mapping()).first;
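
Reviewer note: the fix above adds the missing `{}` so that the second argument is actually rendered. A small check of this kind can be sketched as follows. `count_placeholders` is a hypothetical helper written for this note, not part of fmt or Seastar; it counts replacement fields in a `{}`-style format string so the count can be compared with the number of arguments passed.

```cpp
#include <cstddef>
#include <string_view>

// Counts replacement fields such as "{}" or "{:x}" in a format string.
// The escaped sequence "{{" is skipped and does not count as a field.
inline std::size_t count_placeholders(std::string_view f) {
    std::size_t n = 0;
    for (std::size_t i = 0; i < f.size(); ++i) {
        if (f[i] != '{') {
            continue;
        }
        if (i + 1 < f.size() && f[i + 1] == '{') {
            ++i;  // escaped brace, skip both characters
            continue;
        }
        ++n;  // start of a replacement field
    }
    return n;
}
```

With this helper, the old string `"unknown schema version {}, table="` yields one field for two arguments, while the corrected string yields two.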


@@ -1429,6 +1429,13 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, enable_shard_aware_drivers(this, "enable_shard_aware_drivers", value_status::Used, true, "Enable native transport drivers to use connection-per-shard for better performance.")
, enable_ipv6_dns_lookup(this, "enable_ipv6_dns_lookup", value_status::Used, false, "Use IPv6 address resolution")
, abort_on_internal_error(this, "abort_on_internal_error", liveness::LiveUpdate, value_status::Used, false, "Abort the server instead of throwing exception when internal invariants are violated.")
, abort_on_malformed_sstable_error(this, "abort_on_malformed_sstable_error", liveness::LiveUpdate, value_status::Used,
#if defined(DEBUG) || defined(DEVEL)
true,
#else
false,
#endif
"Abort the server and generate a coredump instead of throwing an exception when any sstable parse error is detected (malformed_sstable_exception, bufsize_mismatch_exception, parse_assert() failures, or BTI parse errors). Intended for debugging memory corruption that may manifest as sstable corruption. Defaults to true in debug and dev builds.")
, max_partition_key_restrictions_per_query(this, "max_partition_key_restrictions_per_query", liveness::LiveUpdate, value_status::Used, 100,
"Maximum number of distinct partition keys restrictions per query. This limit places a bound on the size of IN tuples, "
"especially when multiple partition key columns have IN restrictions. Increasing this value can result in server instability.")
@@ -1921,7 +1928,7 @@ std::map<sstring, db::experimental_features_t::feature> db::experimental_feature
{"lwt", feature::UNUSED},
{"udf", feature::UDF},
{"cdc", feature::UNUSED},
{"alternator-streams", feature::ALTERNATOR_STREAMS},
{"alternator-streams", feature::UNUSED},
{"alternator-ttl", feature::UNUSED },
{"consistent-topology-changes", feature::UNUSED},
{"broadcast-tables", feature::BROADCAST_TABLES},


@@ -115,7 +115,6 @@ struct experimental_features_t {
enum class feature {
UNUSED,
UDF,
ALTERNATOR_STREAMS,
BROADCAST_TABLES,
KEYSPACE_STORAGE_OPTIONS,
STRONGLY_CONSISTENT_TABLES,
@@ -457,6 +456,7 @@ public:
named_value<bool> enable_shard_aware_drivers;
named_value<bool> enable_ipv6_dns_lookup;
named_value<bool> abort_on_internal_error;
named_value<bool> abort_on_malformed_sstable_error;
named_value<uint32_t> max_partition_key_restrictions_per_query;
named_value<uint32_t> max_clustering_key_restrictions_per_query;
named_value<uint64_t> max_memory_for_unlimited_query_soft_limit;


@@ -327,7 +327,7 @@ redistribute(const std::vector<float>& p, unsigned me, unsigned k) {
}
}
hr_logger.trace(" pp after1=", pp);
hr_logger.trace(" pp after1={}", pp);
if (d.first == me) {
// We only care what "me" sends, and only the elements in
// the sorted list earlier than me could have forced it to


@@ -29,6 +29,9 @@ class large_data_handler {
public:
struct stats {
int64_t partitions_bigger_than_threshold = 0; // number of large partition updates exceeding threshold_bytes
int64_t rows_bigger_than_threshold = 0; // number of large row updates exceeding row_threshold_bytes
int64_t cells_bigger_than_threshold = 0; // number of large cell updates exceeding cell_threshold_bytes
int64_t collections_bigger_than_threshold = 0; // number of large collection updates exceeding collection_elements_count_threshold
};
private:
@@ -82,6 +85,7 @@ public:
const clustering_key_prefix* clustering_key, uint64_t row_size) {
SCYLLA_ASSERT(running());
if (row_size > _row_threshold_bytes) [[unlikely]] {
++_stats.rows_bigger_than_threshold;
return with_sem([&sst, &partition_key, clustering_key, row_size, this] {
return record_large_rows(sst, partition_key, clustering_key, row_size);
}).then([] {
@@ -102,6 +106,8 @@ public:
const clustering_key_prefix* clustering_key, const column_definition& cdef, uint64_t cell_size, uint64_t collection_elements) {
SCYLLA_ASSERT(running());
above_threshold_result above_threshold{.size = cell_size > _cell_threshold_bytes, .elements = collection_elements > _collection_elements_count_threshold};
_stats.cells_bigger_than_threshold += above_threshold.size;
_stats.collections_bigger_than_threshold += above_threshold.elements;
if (above_threshold.size || above_threshold.elements) [[unlikely]] {
return with_sem([&sst, &partition_key, clustering_key, &cdef, cell_size, collection_elements, this] {
return record_large_cells(sst, partition_key, clustering_key, cdef, cell_size, collection_elements);
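
Reviewer note: the new cell and collection counters are updated by adding the boolean threshold results directly, which adds 1 when the flag is true and 0 otherwise. A minimal sketch of that bookkeeping follows. The two counter names and `above_threshold_result` come from the hunk above; `record_cell` and its explicit threshold parameters are hypothetical.

```cpp
#include <cstdint>

struct stats {
    int64_t cells_bigger_than_threshold = 0;
    int64_t collections_bigger_than_threshold = 0;
};

struct above_threshold_result {
    bool size;
    bool elements;
};

// Hypothetical helper: evaluates both thresholds and updates the counters.
// A bool converts to 0 or 1, so += counts only the exceeded thresholds.
inline above_threshold_result record_cell(stats& s,
                                          uint64_t cell_size,
                                          uint64_t collection_elements,
                                          uint64_t cell_threshold_bytes,
                                          uint64_t elements_threshold) {
    above_threshold_result r{cell_size > cell_threshold_bytes,
                             collection_elements > elements_threshold};
    s.cells_bigger_than_threshold += r.size;
    s.collections_bigger_than_threshold += r.elements;
    return r;
}
```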


@@ -17,7 +17,6 @@
#include "db/snapshot-ctl.hh"
#include "db/snapshot/backup_task.hh"
#include "schema/schema_fwd.hh"
#include "sstables/exceptions.hh"
#include "sstables/sstables.hh"
#include "sstables/sstable_directory.hh"
#include "sstables/sstables_manager.hh"
@@ -164,22 +163,23 @@ future<> backup_task_impl::process_snapshot_dir() {
auto file_path = _snapshot_dir / name;
auto st = co_await file_stat(directory, name);
total += st.size;
try {
auto desc = sstables::parse_path(file_path, "", "");
const auto& gen = desc.generation;
_sstable_comps[gen].emplace_back(name);
_sstables_in_snapshot.insert(desc.generation);
++num_sstable_comps;
// When the SSTable is only linked-to by the snapshot directory,
// it is already deleted from the table's base directory, and
// therefore it better be uploaded earlier to free-up its capacity.
if (desc.component == sstables::component_type::Data && st.number_of_links == 1) {
snap_log.debug("backup_task: SSTable with generation {} is already deleted from the table", gen);
_deleted_sstables.push_back(gen);
}
} catch (const sstables::malformed_sstable_exception&) {
auto result = sstables::parse_path(file_path, "", "");
if (!result) {
_files.emplace_back(name);
continue;
}
auto desc = std::move(*result);
const auto& gen = desc.generation;
_sstable_comps[gen].emplace_back(name);
_sstables_in_snapshot.insert(desc.generation);
++num_sstable_comps;
// When the SSTable is only linked-to by the snapshot directory,
// it is already deleted from the table's base directory, and
// therefore it better be uploaded earlier to free-up its capacity.
if (desc.component == sstables::component_type::Data && st.number_of_links == 1) {
snap_log.debug("backup_task: SSTable with generation {} is already deleted from the table", gen);
_deleted_sstables.push_back(gen);
}
}
_total_progress.total = total;
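
Reviewer note: the hunk above replaces a try/catch around `sstables::parse_path` with a check on a result that can be empty. The control flow can be sketched with `std::optional` as below. The actual return type of `parse_path` is not shown in this diff, so `std::optional` is an assumption; the `<generation>-<component>` name format, `parse_name` and `classify` are likewise invented for the example.

```cpp
#include <optional>
#include <string>
#include <string_view>
#include <vector>

struct descriptor {
    std::string generation;
    std::string component;
};

// Hypothetical parser for names of the form "<generation>-<component>".
// Returns nullopt instead of throwing when the name does not match.
inline std::optional<descriptor> parse_name(std::string_view name) {
    const auto dash = name.find('-');
    if (dash == std::string_view::npos || dash == 0 || dash + 1 == name.size()) {
        return std::nullopt;
    }
    return descriptor{std::string(name.substr(0, dash)),
                      std::string(name.substr(dash + 1))};
}

struct classified {
    std::vector<std::string> sstable_components;
    std::vector<std::string> other_files;
};

// Same shape as the loop body in the diff: unparsable names are kept
// as plain files, everything else is treated as an sstable component.
inline classified classify(const std::vector<std::string>& names) {
    classified out;
    for (const auto& n : names) {
        auto result = parse_name(n);
        if (!result) {
            out.other_files.push_back(n);
            continue;
        }
        out.sstable_components.push_back(n);
    }
    return out;
}
```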


@@ -13,7 +13,6 @@
#include "replica/database.hh"
#include "db/consistency_level_type.hh"
#include "db/system_keyspace.hh"
#include "db/config.hh"
#include "schema/schema_builder.hh"
#include "timeout_config.hh"
#include "types/types.hh"
@@ -22,8 +21,6 @@
#include "cdc/generation.hh"
#include "cql3/query_processor.hh"
#include "service/storage_proxy.hh"
#include "gms/feature_service.hh"
#include "service/migration_manager.hh"
#include "locator/host_id.hh"
@@ -41,27 +38,10 @@ static logging::logger dlogger("system_distributed_keyspace");
extern logging::logger cdc_log;
namespace db {
namespace {
const auto set_wait_for_sync_to_commitlog = schema_builder::register_schema_initializer([](schema_builder& builder) {
if ((builder.ks_name() == system_distributed_keyspace::NAME_EVERYWHERE && builder.cf_name() == system_distributed_keyspace::CDC_GENERATIONS_V2) ||
(builder.ks_name() == system_distributed_keyspace::NAME && builder.cf_name() == system_distributed_keyspace::CDC_TOPOLOGY_DESCRIPTION))
{
builder.set_wait_for_sync_to_commitlog(true);
}
});
}
extern thread_local data_type cdc_streams_set_type;
thread_local data_type cdc_streams_set_type = set_type_impl::get_instance(bytes_type, false);
/* See `token_range_description` struct */
thread_local data_type cdc_streams_list_type = list_type_impl::get_instance(bytes_type, false);
thread_local data_type cdc_token_range_description_type = tuple_type_impl::get_instance(
{ long_type // dht::token token_range_end;
, cdc_streams_list_type // std::vector<stream_id> streams;
, byte_type // uint8_t sharding_ignore_msb;
});
thread_local data_type cdc_generation_description_type = list_type_impl::get_instance(cdc_token_range_description_type, false);
schema_ptr view_build_status() {
static thread_local auto schema = [] {
@@ -77,42 +57,6 @@ schema_ptr view_build_status() {
return schema;
}
/* An internal table used by nodes to exchange CDC generation data. */
schema_ptr cdc_generations_v2() {
thread_local auto schema = [] {
auto id = generate_legacy_id(system_distributed_keyspace::NAME_EVERYWHERE, system_distributed_keyspace::CDC_GENERATIONS_V2);
return schema_builder(system_distributed_keyspace::NAME_EVERYWHERE, system_distributed_keyspace::CDC_GENERATIONS_V2, {id})
/* The unique identifier of this generation. */
.with_column("id", uuid_type, column_kind::partition_key)
/* The generation describes a mapping from all tokens in the token ring to a set of stream IDs.
* This mapping is built from a bunch of smaller mappings, each describing how tokens in a subrange
* of the token ring are mapped to stream IDs; these subranges together cover the entire token ring.
* Each such range-local mapping is represented by a row of this table.
* The clustering key of the row is the end of the range being described by this row.
* The start of this range is the range_end of the previous row (in the clustering order, which is the integer order)
* or of the last row of this partition if this is the first the first row. */
.with_column("range_end", long_type, column_kind::clustering_key)
/* The set of streams mapped to in this range.
* The number of streams mapped to a single range in a CDC generation is bounded from above by the number
* of shards on the owner of that range in the token ring.
* In other words, the number of elements of this set is bounded by the maximum of the number of shards
* over all nodes. The serialized size is obtained by counting about 20B for each stream.
* For example, if all nodes in the cluster have at most 128 shards,
* the serialized size of this set will be bounded by ~2.5 KB. */
.with_column("streams", cdc_streams_set_type)
/* The value of the `ignore_msb` sharding parameter of the node which was the owner of this token range
* when the generation was first created. Together with the set of streams above it fully describes
* the mapping for this particular range. */
.with_column("ignore_msb", byte_type)
/* Column used for sanity checking.
* For a given generation it's equal to the number of ranges in this generation;
* thus, after the generation is fully inserted, it must be equal to the number of rows in the partition. */
.with_column("num_ranges", int32_type, column_kind::static_column)
.with_hash_version()
.build();
}();
return schema;
}
/* A user-facing table providing identifiers of the streams used in CDC generations. */
schema_ptr cdc_desc() {
@@ -155,14 +99,43 @@ static const sstring CDC_TIMESTAMPS_KEY = "timestamps";
schema_ptr service_levels() {
static thread_local auto schema = [] {
auto id = generate_legacy_id(system_distributed_keyspace::NAME, system_distributed_keyspace::SERVICE_LEVELS);
auto builder = schema_builder(system_distributed_keyspace::NAME, system_distributed_keyspace::SERVICE_LEVELS, std::make_optional(id))
return schema_builder(system_distributed_keyspace::NAME, system_distributed_keyspace::SERVICE_LEVELS, std::make_optional(id))
.with_column("service_level", utf8_type, column_kind::partition_key)
.with_column("shares", int32_type);
if (utils::get_local_injector().is_enabled("service_levels_v1_table_without_shares")) {
builder.remove_column("shares");
}
.with_column("timeout", duration_type)
.with_column("workload_type", utf8_type)
.with_column("shares", int32_type)
.with_hash_version()
.build();
}();
return schema;
}
return builder
schema_ptr snapshot_sstables() {
static thread_local auto schema = [] {
auto id = generate_legacy_id(system_distributed_keyspace::NAME, system_distributed_keyspace::SNAPSHOT_SSTABLES);
return schema_builder(system_distributed_keyspace::NAME, system_distributed_keyspace::SNAPSHOT_SSTABLES, std::make_optional(id))
// Name of the snapshot
.with_column("snapshot_name", utf8_type, column_kind::partition_key)
// Keyspace where the snapshot was taken
.with_column("keyspace", utf8_type, column_kind::partition_key)
// Table within the keyspace
.with_column("table", utf8_type, column_kind::partition_key)
// Datacenter where this SSTable is located
.with_column("datacenter", utf8_type, column_kind::partition_key)
// Rack where this SSTable is located
.with_column("rack", utf8_type, column_kind::partition_key)
// First token in the token range covered by this SSTable
.with_column("first_token", long_type, column_kind::clustering_key)
// Unique identifier for the SSTable (UUID)
.with_column("sstable_id", uuid_type, column_kind::clustering_key)
// Last token in the token range covered by this SSTable
.with_column("last_token", long_type)
// TOC filename of the SSTable
.with_column("toc_name", utf8_type)
// Prefix path in object storage where the SSTable was backed up
.with_column("prefix", utf8_type)
// Flag if the SSTable was downloaded already
.with_column("downloaded", boolean_type)
.with_hash_version()
.build();
}();
@@ -182,19 +155,15 @@ schema_ptr service_levels() {
static std::vector<schema_ptr> ensured_tables() {
return {
view_build_status(),
cdc_generations_v2(),
cdc_desc(),
cdc_timestamps(),
service_levels(),
snapshot_sstables(),
};
}
std::vector<schema_ptr> system_distributed_keyspace::all_distributed_tables() {
return {view_build_status(), cdc_desc(), cdc_timestamps(), service_levels()};
}
std::vector<schema_ptr> system_distributed_keyspace::all_everywhere_tables() {
return {cdc_generations_v2()};
return {view_build_status(), cdc_desc(), cdc_timestamps(), service_levels(), snapshot_sstables()};
}
system_distributed_keyspace::system_distributed_keyspace(cql3::query_processor& qp, service::migration_manager& mm, service::storage_proxy& sp)
@@ -203,36 +172,6 @@ system_distributed_keyspace::system_distributed_keyspace(cql3::query_processor&
, _sp(sp) {
}
static std::vector<std::pair<std::string_view, data_type>> new_service_levels_columns(bool workload_prioritization_enabled) {
std::vector<std::pair<std::string_view, data_type>> new_columns {{"timeout", duration_type}, {"workload_type", utf8_type}};
if (workload_prioritization_enabled) {
new_columns.push_back({"shares", int32_type});
}
return new_columns;
};
static schema_ptr get_current_service_levels(data_dictionary::database db) {
return db.has_schema(system_distributed_keyspace::NAME, system_distributed_keyspace::SERVICE_LEVELS)
? db.find_schema(system_distributed_keyspace::NAME, system_distributed_keyspace::SERVICE_LEVELS)
: service_levels();
}
static schema_ptr get_updated_service_levels(data_dictionary::database db, bool workload_prioritization_enabled) {
SCYLLA_ASSERT(this_shard_id() == 0);
auto schema = get_current_service_levels(db);
schema_builder b(schema);
for (const auto& col : new_service_levels_columns(workload_prioritization_enabled)) {
auto& [col_name, col_type] = col;
bytes options_name = to_bytes(col_name.data());
if (schema->get_column_definition(options_name)) {
continue;
}
b.with_column(options_name, col_type, column_kind::regular_column);
}
b.with_hash_version();
return b.build();
}
future<> system_distributed_keyspace::create_tables(std::vector<schema_ptr> tables) {
if (this_shard_id() != 0) {
_started = true;
@@ -243,11 +182,9 @@ future<> system_distributed_keyspace::create_tables(std::vector<schema_ptr> tabl
while (true) {
// Check if there is any work to do before taking the group 0 guard.
bool workload_prioritization_enabled = _sp.features().workload_prioritization;
bool keyspaces_setup = db.has_keyspace(NAME) && db.has_keyspace(NAME_EVERYWHERE);
bool keyspaces_setup = db.has_keyspace(NAME);
bool tables_setup = std::all_of(tables.begin(), tables.end(), [db] (schema_ptr t) { return db.has_schema(t->ks_name(), t->cf_name()); } );
bool service_levels_up_to_date = get_current_service_levels(db)->equal_columns(*get_updated_service_levels(db, workload_prioritization_enabled));
if (keyspaces_setup && tables_setup && service_levels_up_to_date) {
if (keyspaces_setup && tables_setup) {
dlogger.info("system_distributed(_everywhere) keyspaces and tables are up-to-date. Not creating");
_started = true;
co_return;
@@ -258,51 +195,25 @@ future<> system_distributed_keyspace::create_tables(std::vector<schema_ptr> tabl
utils::chunked_vector<mutation> mutations;
sstring description;
auto sd_ksm = keyspace_metadata::new_keyspace(
auto ksm = keyspace_metadata::new_keyspace(
NAME,
"org.apache.cassandra.locator.SimpleStrategy",
{{"replication_factor", "3"}},
std::nullopt, std::nullopt);
if (!db.has_keyspace(NAME)) {
mutations = service::prepare_new_keyspace_announcement(db.real_database(), sd_ksm, ts);
mutations = service::prepare_new_keyspace_announcement(db.real_database(), ksm, ts);
description += format(" create {} keyspace;", NAME);
} else {
dlogger.info("{} keyspace is already present. Not creating", NAME);
}
auto sde_ksm = keyspace_metadata::new_keyspace(
NAME_EVERYWHERE,
"org.apache.cassandra.locator.EverywhereStrategy",
{},
std::nullopt, std::nullopt);
if (!db.has_keyspace(NAME_EVERYWHERE)) {
auto sde_mutations = service::prepare_new_keyspace_announcement(db.real_database(), sde_ksm, ts);
std::move(sde_mutations.begin(), sde_mutations.end(), std::back_inserter(mutations));
description += format(" create {} keyspace;", NAME_EVERYWHERE);
} else {
dlogger.info("{} keyspace is already present. Not creating", NAME_EVERYWHERE);
}
// Get mutations for creating and updating tables.
// Get mutations for creating tables.
auto num_keyspace_mutations = mutations.size();
co_await coroutine::parallel_for_each(ensured_tables(),
[this, &mutations, db, ts, sd_ksm, sde_ksm, workload_prioritization_enabled] (auto&& table) -> future<> {
auto ksm = table->ks_name() == NAME ? sd_ksm : sde_ksm;
// Ensure that the service_levels table contains new columns.
if (table->cf_name() == SERVICE_LEVELS) {
table = get_updated_service_levels(db, workload_prioritization_enabled);
}
[this, &mutations, db, ts, ksm] (auto&& table) -> future<> {
if (!db.has_schema(table->ks_name(), table->cf_name())) {
co_return co_await service::prepare_new_column_family_announcement(mutations, _sp, *ksm, std::move(table), ts);
}
// The service_levels table exists. Update it if it lacks new columns.
if (table->cf_name() == SERVICE_LEVELS && !get_current_service_levels(db)->equal_columns(*table)) {
auto update_mutations = co_await service::prepare_column_family_update_announcement(_sp, table, std::vector<view_ptr>(), ts);
std::move(update_mutations.begin(), update_mutations.end(), std::back_inserter(mutations));
}
});
if (mutations.size() > num_keyspace_mutations) {
description += " create and update system_distributed(_everywhere) tables";
@@ -324,15 +235,6 @@ future<> system_distributed_keyspace::create_tables(std::vector<schema_ptr> tabl
}
}
future<> system_distributed_keyspace::start_workload_prioritization() {
if (this_shard_id() != 0) {
co_return;
}
if (_qp.db().features().workload_prioritization) {
co_await create_tables({get_updated_service_levels(_qp.db(), true)});
}
}
future<> system_distributed_keyspace::start() {
if (this_shard_id() != 0) {
_started = true;
@@ -375,90 +277,6 @@ static db::consistency_level quorum_if_many(size_t num_token_owners) {
return num_token_owners > 1 ? db::consistency_level::QUORUM : db::consistency_level::ONE;
}
future<>
system_distributed_keyspace::insert_cdc_generation(
utils::UUID id,
const cdc::topology_description& desc,
context ctx) {
using namespace std::chrono_literals;
const size_t concurrency = 10;
const size_t num_replicas = ctx.num_token_owners;
// To insert the data quickly and efficiently we send it in batches of multiple rows
// (each batch represented by a single mutation). We also send multiple such batches concurrently.
// However, we need to limit the memory consumption of the operation.
// I assume that the memory consumption grows linearly with the number of replicas
// (we send to all replicas ``at the same time''), with the batch size (the data must
// be copied for each replica?) and with concurrency. These assumptions may be too conservative
// but that won't hurt in a significant way (it may hurt the efficiency of the operation a little).
// Thus, if we want to limit the memory consumption to L, it should be true that
// mutation_size * num_replicas * concurrency <= L, hence
// mutation_size <= L / (num_replicas * concurrency).
// For example, say L = 10MB, concurrency = 10, num_replicas = 100; we get
// mutation_size <= 10MB / 1000 = 10KB.
// On the other hand we must have mutation_size >= size of a single row,
// so we will use mutation_size <= max(size of single row, L/(num_replicas*concurrency)).
// It has been tested that sending 1MB batches to 3 replicas with concurrency 20 works OK,
// which would correspond to L ~= 60MB. Hence that's the limit we use here.
const size_t L = 60'000'000;
const auto mutation_size_threshold = std::max(size_t(1), L / (num_replicas * concurrency));
auto s = _qp.db().real_database().find_schema(
system_distributed_keyspace::NAME_EVERYWHERE, system_distributed_keyspace::CDC_GENERATIONS_V2);
auto ms = co_await cdc::get_cdc_generation_mutations_v2(s, id, desc, mutation_size_threshold, api::new_timestamp());
co_await max_concurrent_for_each(ms, concurrency, [&] (mutation& m) -> future<> {
co_await _sp.mutate(
{ std::move(m) },
db::consistency_level::ALL,
db::timeout_clock::now() + 60s,
nullptr, // trace_state
empty_service_permit(),
db::allow_per_partition_rate_limit::no,
false // raw_counters
);
});
}
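
The batch-size reasoning in the comments of `insert_cdc_generation` can be reproduced in isolation. The sketch below is a standalone restatement of that formula, not the Scylla code itself: `mutation_size_threshold` is written here as a hypothetical free function, and it assumes `num_replicas` and `concurrency` are both non-zero.

```cpp
#include <algorithm>
#include <cstddef>

// Largest mutation size (in bytes) that keeps
//   mutation_size * num_replicas * concurrency <= memory_limit,
// clamped to at least 1 so the threshold never becomes zero.
// Assumes num_replicas > 0 and concurrency > 0.
constexpr std::size_t mutation_size_threshold(std::size_t memory_limit,
                                              std::size_t num_replicas,
                                              std::size_t concurrency) {
    return std::max<std::size_t>(1, memory_limit / (num_replicas * concurrency));
}

// The example from the comment: L = 10 MB, 100 replicas, concurrency 10 -> 10 KB.
static_assert(mutation_size_threshold(10'000'000, 100, 10) == 10'000);
// The limit actually used (L = 60 MB) with 3 replicas and concurrency 10 -> 2 MB.
static_assert(mutation_size_threshold(60'000'000, 3, 10) == 2'000'000);
```

With a very large replica count the quotient drops to zero, which is why the clamp to 1 is needed.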
future<std::optional<cdc::topology_description>>
system_distributed_keyspace::read_cdc_generation(utils::UUID id) {
utils::chunked_vector<cdc::token_range_description> entries;
size_t num_ranges = 0;
co_await _qp.query_internal(
// This should be a local read so 20s should be more than enough
format("SELECT range_end, streams, ignore_msb, num_ranges FROM {}.{} WHERE id = ? USING TIMEOUT 20s", NAME_EVERYWHERE, CDC_GENERATIONS_V2),
db::consistency_level::ONE, // we wrote the generation with ALL so ONE must see it (or there's something really wrong)
{ id },
1000, // for ~1KB rows, ~1MB page size
[&] (const cql3::untyped_result_set_row& row) {
std::vector<cdc::stream_id> streams;
row.get_list_data<bytes>("streams", std::back_inserter(streams));
entries.push_back(cdc::token_range_description{
dht::token::from_int64(row.get_as<int64_t>("range_end")),
std::move(streams),
uint8_t(row.get_as<int8_t>("ignore_msb"))});
num_ranges = row.get_as<int32_t>("num_ranges");
return make_ready_future<stop_iteration>(stop_iteration::no);
});
if (entries.empty()) {
co_return std::nullopt;
}
// Paranoid sanity check. Partial reads should not happen since generations should be retrieved only after they
// were written successfully with CL=ALL. But nobody uses EverywhereStrategy tables so they weren't ever properly
// tested, so just in case...
if (entries.size() != num_ranges) {
throw std::runtime_error(format(
"read_cdc_generation: wrong number of rows. The `num_ranges` column claimed {} rows,"
" but reading the partition returned {}.", num_ranges, entries.size()));
}
co_return std::optional{cdc::topology_description(std::move(entries))};
}
static future<utils::chunked_vector<mutation>> get_cdc_streams_descriptions_v2_mutation(
const replica::database& db,
db_clock::time_point time,
@@ -630,65 +448,83 @@ system_distributed_keyspace::cdc_current_generation_timestamp(context ctx) {
co_return timestamp_cql->one().get_as<db_clock::time_point>("time");
}
future<qos::service_levels_info> system_distributed_keyspace::get_service_levels(qos::query_context ctx) const {
return qos::get_service_levels(_qp, NAME, SERVICE_LEVELS, db::consistency_level::ONE, ctx);
future<> system_distributed_keyspace::insert_snapshot_sstable(sstring snapshot_name, sstring ks, sstring table, sstring dc, sstring rack, sstables::sstable_id sstable_id, dht::token first_token, dht::token last_token, sstring toc_name, sstring prefix, db::consistency_level cl) {
// Not inserting the downloaded column so that re-populating on restore
// retry doesn't overwrite downloaded=true set by a previous attempt
static const sstring query = format("INSERT INTO {}.{} (snapshot_name, \"keyspace\", \"table\", datacenter, rack, first_token, sstable_id, last_token, toc_name, prefix) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?) USING TTL {}", NAME, SNAPSHOT_SSTABLES, SNAPSHOT_SSTABLES_TTL_SECONDS);
return _qp.execute_internal(
query,
cl,
internal_distributed_query_state(),
{ std::move(snapshot_name), std::move(ks), std::move(table), std::move(dc), std::move(rack),
dht::token::to_int64(first_token), sstable_id.uuid(), dht::token::to_int64(last_token), std::move(toc_name), std::move(prefix) },
cql3::query_processor::cache_internal::yes).discard_result();
}
future<qos::service_levels_info> system_distributed_keyspace::get_service_level(sstring service_level_name) const {
return qos::get_service_level(_qp, NAME, SERVICE_LEVELS, service_level_name, db::consistency_level::ONE);
}
future<utils::chunked_vector<snapshot_sstable_entry>>
system_distributed_keyspace::get_snapshot_sstables(sstring snapshot_name, sstring ks, sstring table, sstring dc, sstring rack, db::consistency_level cl, std::optional<dht::token> start_token, std::optional<dht::token> end_token) const {
utils::chunked_vector<snapshot_sstable_entry> sstables;
future<> system_distributed_keyspace::set_service_level(sstring service_level_name, qos::service_level_options slo) const {
static sstring prepared_query = format("INSERT INTO {}.{} (service_level) VALUES (?);", NAME, SERVICE_LEVELS);
co_await _qp.execute_internal(prepared_query, db::consistency_level::ONE, internal_distributed_query_state(), {service_level_name}, cql3::query_processor::cache_internal::no);
auto to_data_value = [&] (const qos::service_level_options::timeout_type& tv) {
return std::visit(overloaded_functor {
[&] (const qos::service_level_options::unset_marker&) {
return data_value::make_null(duration_type);
},
[&] (const qos::service_level_options::delete_marker&) {
return data_value::make_null(duration_type);
},
[&] (const lowres_clock::duration& d) {
return data_value(cql_duration(months_counter{0},
days_counter{0},
nanoseconds_counter{std::chrono::duration_cast<std::chrono::nanoseconds>(d).count()}));
},
}, tv);
static const sstring base_query = format("SELECT toc_name, prefix, sstable_id, first_token, last_token, downloaded FROM {}.{}"
" WHERE snapshot_name = ? AND \"keyspace\" = ? AND \"table\" = ? AND datacenter = ? AND rack = ?", NAME, SNAPSHOT_SSTABLES);
auto read_row = [&] (const cql3::untyped_result_set_row& row) {
sstables.emplace_back(sstables::sstable_id(row.get_as<utils::UUID>("sstable_id")), dht::token::from_int64(row.get_as<int64_t>("first_token")), dht::token::from_int64(row.get_as<int64_t>("last_token")), row.get_as<sstring>("toc_name"), row.get_as<sstring>("prefix"), is_downloaded(row.get_or<bool>("downloaded", false)));
return make_ready_future<stop_iteration>(stop_iteration::no);
};
auto to_data_value_g = [&] <typename T> (const std::variant<qos::service_level_options::unset_marker, qos::service_level_options::delete_marker, T>& v) {
return std::visit(overloaded_functor {
[&] (const qos::service_level_options::unset_marker&) {
return data_value::make_null(data_type_for<T>());
},
[&] (const qos::service_level_options::delete_marker&) {
return data_value::make_null(data_type_for<T>());
},
[&] (const T& v) {
return data_value(v);
},
}, v);
};
data_value workload = slo.workload == qos::service_level_options::workload_type::unspecified
? data_value::make_null(utf8_type)
: data_value(qos::service_level_options::to_string(slo.workload));
co_await _qp.execute_internal(format("UPDATE {}.{} SET timeout = ?, workload_type = ? WHERE service_level = ?;", NAME, SERVICE_LEVELS),
db::consistency_level::ONE,
internal_distributed_query_state(),
{to_data_value(slo.timeout),
workload,
service_level_name},
cql3::query_processor::cache_internal::no);
co_await _qp.execute_internal(format("UPDATE {}.{} SET shares = ? WHERE service_level = ?;", NAME, SERVICE_LEVELS),
db::consistency_level::ONE,
internal_distributed_query_state(),
{to_data_value_g(slo.shares), service_level_name},
cql3::query_processor::cache_internal::no);
if (start_token && end_token) {
co_await _qp.query_internal(
base_query + " AND first_token >= ? AND first_token <= ?",
cl,
{ snapshot_name, ks, table, dc, rack, dht::token::to_int64(*start_token), dht::token::to_int64(*end_token) },
1000,
read_row);
} else if (start_token) {
co_await _qp.query_internal(
base_query + " AND first_token >= ?",
cl,
{ snapshot_name, ks, table, dc, rack, dht::token::to_int64(*start_token) },
1000,
read_row);
} else if (end_token) {
co_await _qp.query_internal(
base_query + " AND first_token <= ?",
cl,
{ snapshot_name, ks, table, dc, rack, dht::token::to_int64(*end_token) },
1000,
read_row);
} else {
co_await _qp.query_internal(
base_query,
cl,
{ snapshot_name, ks, table, dc, rack },
1000,
read_row);
}
co_return sstables;
}
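
`get_snapshot_sstables` picks one of four query shapes depending on which token bounds are present. A minimal sketch of that selection, outside of Scylla, is shown below. The helper name `first_token_bounds` is hypothetical and plain `std::int64_t` values stand in for `dht::token`. It returns only the suffix that would be appended to the base `SELECT`.

```cpp
#include <cstdint>
#include <optional>
#include <string_view>

// Returns the extra WHERE-clause suffix for the optional bounds on first_token.
// Both bounds are inclusive, matching the >= / <= operators in the queries above.
constexpr std::string_view first_token_bounds(std::optional<std::int64_t> start,
                                              std::optional<std::int64_t> end) {
    if (start && end) {
        return " AND first_token >= ? AND first_token <= ?";
    }
    if (start) {
        return " AND first_token >= ?";
    }
    if (end) {
        return " AND first_token <= ?";
    }
    return ""; // no bounds: scan the whole partition
}

static_assert(first_token_bounds(std::nullopt, std::nullopt).empty());
static_assert(first_token_bounds(std::int64_t{-5}, std::nullopt) == " AND first_token >= ?");
```

The number of bound values passed to the query has to match the number of `?` markers in the chosen suffix, which is why the original code repeats the call once per branch.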
future<> system_distributed_keyspace::drop_service_level(sstring service_level_name) const {
static sstring prepared_query = format("DELETE FROM {}.{} WHERE service_level= ?;", NAME, SERVICE_LEVELS);
return _qp.execute_internal(prepared_query, db::consistency_level::ONE, internal_distributed_query_state(), {service_level_name}, cql3::query_processor::cache_internal::no).discard_result();
future<> system_distributed_keyspace::update_sstable_download_status(sstring snapshot_name,
sstring ks,
sstring table,
sstring dc,
sstring rack,
sstables::sstable_id sstable_id,
dht::token start_token,
is_downloaded downloaded) const {
static const sstring update_query = format("UPDATE {}.{} USING TTL {} SET downloaded = ? WHERE snapshot_name = ? AND \"keyspace\" = ? AND \"table\" = ? AND "
"datacenter = ? AND rack = ? AND first_token = ? AND sstable_id = ?",
NAME,
SNAPSHOT_SSTABLES,
SNAPSHOT_SSTABLES_TTL_SECONDS);
co_await _qp.execute_internal(update_query,
consistency_level::ONE,
internal_distributed_query_state(),
{downloaded == is_downloaded::yes ? true : false, snapshot_name, ks, table, dc, rack, dht::token::to_int64(start_token), sstable_id.uuid()},
cql3::query_processor::cache_internal::no);
}
}
} // namespace db


@@ -9,14 +9,17 @@
#pragma once
#include "schema/schema_fwd.hh"
#include "service/qos/qos_common.hh"
#include "utils/UUID.hh"
#include "cdc/generation_id.hh"
#include "utils/chunked_vector.hh"
#include "db/consistency_level_type.hh"
#include "locator/host_id.hh"
#include "dht/token.hh"
#include "sstables/types.hh"
#include <seastar/core/future.hh>
#include <seastar/core/sstring.hh>
#include <seastar/util/bool_class.hh>
#include <optional>
#include <unordered_map>
namespace cql3 {
@@ -24,7 +27,6 @@ class query_processor;
}
namespace cdc {
class stream_id;
class topology_description;
class streams_version;
} // namespace cdc
@@ -34,23 +36,27 @@ namespace service {
class migration_manager;
}
namespace db {
using is_downloaded = bool_class<class is_downloaded_tag>;
struct snapshot_sstable_entry {
sstables::sstable_id sstable_id;
dht::token first_token;
dht::token last_token;
sstring toc_name;
sstring prefix;
is_downloaded downloaded{is_downloaded::no};
};
class system_distributed_keyspace {
public:
static constexpr auto NAME = "system_distributed";
static constexpr auto NAME_EVERYWHERE = "system_distributed_everywhere";
static constexpr auto VIEW_BUILD_STATUS = "view_build_status";
static constexpr auto SERVICE_LEVELS = "service_levels";
/* Nodes use this table to communicate new CDC stream generations to other nodes. */
static constexpr auto CDC_TOPOLOGY_DESCRIPTION = "cdc_generation_descriptions";
/* Nodes use this table to communicate new CDC stream generations to other nodes.
* Resides in system_distributed_everywhere. */
static constexpr auto CDC_GENERATIONS_V2 = "cdc_generation_descriptions_v2";
/* This table is used by CDC clients to learn about available CDC streams. */
static constexpr auto CDC_DESC_V2 = "cdc_streams_descriptions_v2";
@@ -62,6 +68,12 @@ public:
* in the old table also appear in the new table, if necessary. */
static constexpr auto CDC_DESC_V1 = "cdc_streams_descriptions";
/* This table is used by the backup and restore code to store per-sstable metadata.
* The data the coordinator node puts in this table comes from the snapshot manifests. */
static constexpr auto SNAPSHOT_SSTABLES = "snapshot_sstables";
static constexpr uint64_t SNAPSHOT_SSTABLES_TTL_SECONDS = std::chrono::seconds(std::chrono::days(3)).count();
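
For reference, the TTL expression above works out to three days expressed in seconds. The sketch below checks that value at compile time. It is illustrative only: `ttl_seconds` is a hypothetical helper, and a local `days` alias stands in for C++20 `std::chrono::days` so the snippet also builds as C++17.

```cpp
#include <chrono>
#include <cstdint>
#include <ratio>

// Local stand-in for C++20 std::chrono::days (one tick = 86400 seconds).
using days = std::chrono::duration<std::int64_t, std::ratio<86400>>;

// Converts a whole number of days to a TTL in seconds.
constexpr std::uint64_t ttl_seconds(days d) {
    return std::chrono::seconds(d).count();
}

// 3 days * 86400 s/day = 259200 s.
static_assert(ttl_seconds(days(3)) == 259200);
```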
/* Information required to modify/query some system_distributed tables, passed from the caller. */
struct context {
/* How many different token owners (endpoints) are there in the token ring? */
@@ -77,19 +89,14 @@ private:
public:
static std::vector<schema_ptr> all_distributed_tables();
static std::vector<schema_ptr> all_everywhere_tables();
system_distributed_keyspace(cql3::query_processor&, service::migration_manager&, service::storage_proxy&);
future<> start();
future<> start_workload_prioritization();
future<> stop();
bool started() const { return _started; }
future<> insert_cdc_generation(utils::UUID, const cdc::topology_description&, context);
future<std::optional<cdc::topology_description>> read_cdc_generation(utils::UUID);
future<> create_cdc_desc(db_clock::time_point, const cdc::topology_description&, context);
future<bool> cdc_desc_exists(db_clock::time_point, context);
@@ -105,10 +112,25 @@ public:
// NOTE: currently used only by alternator
future<db_clock::time_point> cdc_current_generation_timestamp(context);
future<qos::service_levels_info> get_service_levels(qos::query_context ctx) const;
future<qos::service_levels_info> get_service_level(sstring service_level_name) const;
future<> set_service_level(sstring service_level_name, qos::service_level_options slo) const;
future<> drop_service_level(sstring service_level_name) const;
/* Inserts a single SSTable entry for a given snapshot, keyspace, table, datacenter,
* and rack. The row is written with a TTL of `SNAPSHOT_SSTABLES_TTL_SECONDS`. The `downloaded`
* column is left unset. Uses consistency level `EACH_QUORUM` by default. */
future<> insert_snapshot_sstable(sstring snapshot_name, sstring ks, sstring table, sstring dc, sstring rack, sstables::sstable_id sstable_id, dht::token first_token, dht::token last_token, sstring toc_name, sstring prefix, db::consistency_level cl = db::consistency_level::EACH_QUORUM);
/* Retrieves all SSTable entries for a given snapshot, keyspace, table, datacenter, and rack.
* If `start_token` and/or `end_token` are provided, only entries whose `first_token` is at or above `start_token`
* and at or below `end_token` will be returned.
* Returns a vector of `snapshot_sstable_entry` structs containing `sstable_id`, `first_token`, `last_token`,
* `toc_name`, `prefix`, and `downloaded`. Uses consistency level `LOCAL_QUORUM` by default. */
future<utils::chunked_vector<snapshot_sstable_entry>> get_snapshot_sstables(sstring snapshot_name, sstring ks, sstring table, sstring dc, sstring rack, db::consistency_level cl = db::consistency_level::LOCAL_QUORUM, std::optional<dht::token> start_token = std::nullopt, std::optional<dht::token> end_token = std::nullopt) const;
future<> update_sstable_download_status(sstring snapshot_name,
sstring ks,
sstring table,
sstring dc,
sstring rack,
sstables::sstable_id sstable_id,
dht::token start_token,
is_downloaded downloaded) const;
private:
future<> create_tables(std::vector<schema_ptr> tables);


@@ -1146,7 +1146,8 @@ schema_ptr system_keyspace::sstables_registry() {
static thread_local auto schema = [] {
auto id = generate_legacy_id(NAME, SSTABLES_REGISTRY);
return schema_builder(NAME, SSTABLES_REGISTRY, id)
.with_column("owner", uuid_type, column_kind::partition_key)
.with_column("table_id", uuid_type, column_kind::partition_key)
.with_column("node_owner", uuid_type, column_kind::partition_key)
.with_column("generation", timeuuid_type, column_kind::clustering_key)
.with_column("status", utf8_type)
.with_column("state", utf8_type)
@@ -1309,6 +1310,7 @@ schema_ptr system_keyspace::view_building_tasks() {
return schema_builder(NAME, VIEW_BUILDING_TASKS, std::make_optional(id))
.with_column("key", utf8_type, column_kind::partition_key)
.with_column("id", timeuuid_type, column_kind::clustering_key)
.with_column("min_task_id", timeuuid_type, column_kind::static_column)
.with_column("type", utf8_type)
.with_column("aborted", boolean_type)
.with_column("base_id", uuid_type)
@@ -2749,12 +2751,36 @@ future<mutation> system_keyspace::make_remove_view_build_status_on_host_mutation
static constexpr auto VIEW_BUILDING_KEY = "view_building";
future<db::view::building_tasks> system_keyspace::get_view_building_tasks() {
static const sstring query = format("SELECT id, type, aborted, base_id, view_id, last_token, host_id, shard FROM {}.{} WHERE key = '{}'", NAME, VIEW_BUILDING_TASKS, VIEW_BUILDING_KEY);
future<std::pair<db::view::building_tasks, std::optional<utils::UUID>>> system_keyspace::get_view_building_tasks() {
using namespace db::view;
// When the VIEW_BUILDING_TASKS_MIN_TASK_ID feature is active, read the static
// column min_task_id first and use it as a lower bound for the clustering row
// scan. This skips tombstoned rows below the boundary, avoiding dead-cell
// warnings from the tombstone_warn_threshold check.
std::optional<utils::UUID> min_task_id;
if (_db.features().view_building_tasks_min_task_id) {
auto schema = view_building_tasks();
auto pk = partition_key::from_single_value(*schema, data_value(VIEW_BUILDING_KEY).serialize_nonnull());
auto dk = dht::decorate_key(*schema, pk);
auto col_id = schema->get_column_definition("min_task_id")->id;
query::partition_slice slice(
query::clustering_row_ranges{},
{col_id},
{},
query::partition_slice::option_set::of<query::partition_slice::option::always_return_static_content>());
auto cmd = query::read_command(schema->id(), schema->version(), slice,
_db.get_query_max_result_size(), query::tombstone_limit::max);
auto [qr, _cache_temp] = co_await _db.query(schema, cmd, query::result_options::only_result(),
{dht::partition_range::make_singular(dk)}, nullptr, db::no_timeout);
auto rs = query::result_set::from_raw_result(schema, slice, *qr);
if (!rs.empty()) {
min_task_id = rs.row(0).get<utils::UUID>("min_task_id");
}
}
building_tasks tasks;
co_await _qp.query_internal(query, [&] (const cql3::untyped_result_set_row& row) -> future<stop_iteration> {
auto process_row = [&] (const cql3::untyped_result_set_row& row) -> future<stop_iteration> {
auto id = row.get_as<utils::UUID>("id");
auto type = task_type_from_string(row.get_as<sstring>("type"));
auto aborted = row.get_as<bool>("aborted");
@@ -2779,8 +2805,18 @@ future<db::view::building_tasks> system_keyspace::get_view_building_tasks() {
break;
}
co_return stop_iteration::no;
});
co_return tasks;
};
if (min_task_id) {
static const sstring bounded_query = format("SELECT id, type, aborted, base_id, view_id, last_token, host_id, shard FROM {}.{} WHERE key = '{}' AND id >= ?",
NAME, VIEW_BUILDING_TASKS, VIEW_BUILDING_KEY);
co_await _qp.query_internal(bounded_query, db::consistency_level::LOCAL_ONE, {*min_task_id}, 1000, std::move(process_row));
} else {
static const sstring full_query = format("SELECT id, type, aborted, base_id, view_id, last_token, host_id, shard FROM {}.{} WHERE key = '{}'",
NAME, VIEW_BUILDING_TASKS, VIEW_BUILDING_KEY);
co_await _qp.query_internal(full_query, std::move(process_row));
}
co_return std::pair{std::move(tasks), std::move(min_task_id)};
}
future<mutation> system_keyspace::make_view_building_task_mutation(api::timestamp_type ts, const db::view::view_building_task& task) {
@@ -3473,37 +3509,37 @@ system_keyspace::read_cdc_generation_opt(utils::UUID id) {
co_return cdc::topology_description{std::move(entries)};
}
future<> system_keyspace::sstables_registry_create_entry(table_id owner, sstring status, sstables::sstable_state state, sstables::entry_descriptor desc) {
static const auto req = format("INSERT INTO system.{} (owner, generation, status, state, version, format) VALUES (?, ?, ?, ?, ?, ?)", SSTABLES_REGISTRY);
slogger.trace("Inserting {}.{} into {}", owner, desc.generation, SSTABLES_REGISTRY);
co_await execute_cql(req, owner.id, desc.generation, status, sstables::state_to_dir(state), fmt::to_string(desc.version), fmt::to_string(desc.format)).discard_result();
future<> system_keyspace::sstables_registry_create_entry(table_id tid, locator::host_id node_owner, sstring status, sstables::sstable_state state, sstables::entry_descriptor desc) {
static const auto req = format("INSERT INTO system.{} (table_id, node_owner, generation, status, state, version, format) VALUES (?, ?, ?, ?, ?, ?, ?)", SSTABLES_REGISTRY);
slogger.trace("Inserting {}.{}.{} into {}", tid, node_owner, desc.generation, SSTABLES_REGISTRY);
co_await execute_cql(req, tid.id, node_owner.uuid(), desc.generation, status, sstables::state_to_dir(state), fmt::to_string(desc.version), fmt::to_string(desc.format)).discard_result();
}
future<> system_keyspace::sstables_registry_update_entry_status(table_id owner, sstables::generation_type gen, sstring status) {
static const auto req = format("UPDATE system.{} SET status = ? WHERE owner = ? AND generation = ?", SSTABLES_REGISTRY);
slogger.trace("Updating {}.{} -> status={} in {}", owner, gen, status, SSTABLES_REGISTRY);
co_await execute_cql(req, status, owner.id, gen).discard_result();
future<> system_keyspace::sstables_registry_update_entry_status(table_id tid, locator::host_id node_owner, sstables::generation_type gen, sstring status) {
static const auto req = format("UPDATE system.{} SET status = ? WHERE table_id = ? AND node_owner = ? AND generation = ?", SSTABLES_REGISTRY);
slogger.trace("Updating {}.{}.{} -> status={} in {}", tid, node_owner, gen, status, SSTABLES_REGISTRY);
co_await execute_cql(req, status, tid.id, node_owner.uuid(), gen).discard_result();
}
future<> system_keyspace::sstables_registry_update_entry_state(table_id owner, sstables::generation_type gen, sstables::sstable_state state) {
static const auto req = format("UPDATE system.{} SET state = ? WHERE owner = ? AND generation = ?", SSTABLES_REGISTRY);
future<> system_keyspace::sstables_registry_update_entry_state(table_id tid, locator::host_id node_owner, sstables::generation_type gen, sstables::sstable_state state) {
static const auto req = format("UPDATE system.{} SET state = ? WHERE table_id = ? AND node_owner = ? AND generation = ?", SSTABLES_REGISTRY);
auto new_state = sstables::state_to_dir(state);
slogger.trace("Updating {}.{} -> state={} in {}", owner, gen, new_state, SSTABLES_REGISTRY);
co_await execute_cql(req, new_state, owner.id, gen).discard_result();
slogger.trace("Updating {}.{}.{} -> state={} in {}", tid, node_owner, gen, new_state, SSTABLES_REGISTRY);
co_await execute_cql(req, new_state, tid.id, node_owner.uuid(), gen).discard_result();
}
future<> system_keyspace::sstables_registry_delete_entry(table_id owner, sstables::generation_type gen) {
static const auto req = format("DELETE FROM system.{} WHERE owner = ? AND generation = ?", SSTABLES_REGISTRY);
slogger.trace("Removing {}.{} from {}", owner, gen, SSTABLES_REGISTRY);
co_await execute_cql(req, owner.id, gen).discard_result();
future<> system_keyspace::sstables_registry_delete_entry(table_id tid, locator::host_id node_owner, sstables::generation_type gen) {
static const auto req = format("DELETE FROM system.{} WHERE table_id = ? AND node_owner = ? AND generation = ?", SSTABLES_REGISTRY);
slogger.trace("Removing {}.{}.{} from {}", tid, node_owner, gen, SSTABLES_REGISTRY);
co_await execute_cql(req, tid.id, node_owner.uuid(), gen).discard_result();
}
future<> system_keyspace::sstables_registry_list(table_id owner, sstable_registry_entry_consumer consumer) {
static const auto req = format("SELECT status, state, generation, version, format FROM system.{} WHERE owner = ?", SSTABLES_REGISTRY);
slogger.trace("Listing {} entries from {}", owner, SSTABLES_REGISTRY);
future<> system_keyspace::sstables_registry_list(table_id tid, locator::host_id node_owner, sstable_registry_entry_consumer consumer) {
static const auto req = format("SELECT status, state, generation, version, format FROM system.{} WHERE table_id = ? AND node_owner = ?", SSTABLES_REGISTRY);
slogger.trace("Listing {}.{} entries from {}", tid, node_owner, SSTABLES_REGISTRY);
co_await _qp.query_internal(req, db::consistency_level::ONE, { owner.id }, 1000, [ consumer = std::move(consumer) ] (const cql3::untyped_result_set::row& row) -> future<stop_iteration> {
co_await _qp.query_internal(req, db::consistency_level::ONE, { tid.id, node_owner.uuid() }, 1000, [ consumer = std::move(consumer) ] (const cql3::untyped_result_set::row& row) -> future<stop_iteration> {
auto status = row.get_as<sstring>("status");
auto state = sstables::state_from_dir(row.get_as<sstring>("state"));
auto gen = sstables::generation_type(row.get_as<utils::UUID>("generation"));

@@ -572,7 +572,7 @@ public:
future<mutation> make_remove_view_build_status_on_host_mutation(api::timestamp_type ts, system_keyspace_view_name view_name, locator::host_id host_id);
// system.view_building_tasks
future<db::view::building_tasks> get_view_building_tasks();
future<std::pair<db::view::building_tasks, std::optional<utils::UUID>>> get_view_building_tasks();
future<mutation> make_view_building_task_mutation(api::timestamp_type ts, const db::view::view_building_task& task);
future<mutation> make_remove_view_building_task_mutation(api::timestamp_type ts, utils::UUID id);
@@ -671,12 +671,12 @@ public:
future<mutation> make_view_builder_version_mutation(api::timestamp_type ts, view_builder_version_t version);
future<view_builder_version_t> get_view_builder_version();
future<> sstables_registry_create_entry(table_id owner, sstring status, sstables::sstable_state state, sstables::entry_descriptor desc);
future<> sstables_registry_update_entry_status(table_id owner, sstables::generation_type gen, sstring status);
future<> sstables_registry_update_entry_state(table_id owner, sstables::generation_type gen, sstables::sstable_state state);
future<> sstables_registry_delete_entry(table_id owner, sstables::generation_type gen);
future<> sstables_registry_create_entry(table_id tid, locator::host_id node_owner, sstring status, sstables::sstable_state state, sstables::entry_descriptor desc);
future<> sstables_registry_update_entry_status(table_id tid, locator::host_id node_owner, sstables::generation_type gen, sstring status);
future<> sstables_registry_update_entry_state(table_id tid, locator::host_id node_owner, sstables::generation_type gen, sstables::sstable_state state);
future<> sstables_registry_delete_entry(table_id tid, locator::host_id node_owner, sstables::generation_type gen);
using sstable_registry_entry_consumer = sstables::sstables_registry::entry_consumer;
future<> sstables_registry_list(table_id owner, sstable_registry_entry_consumer consumer);
future<> sstables_registry_list(table_id tid, locator::host_id node_owner, sstable_registry_entry_consumer consumer);
future<std::optional<sstring>> load_group0_upgrade_state();
future<> save_group0_upgrade_state(sstring);

@@ -15,24 +15,24 @@ class system_keyspace_sstables_registry : public sstables::sstables_registry {
public:
system_keyspace_sstables_registry(system_keyspace& keyspace) : _keyspace(keyspace.shared_from_this()) {}
virtual seastar::future<> create_entry(table_id owner, sstring status, sstables::sstable_state state, sstables::entry_descriptor desc) override {
return _keyspace->sstables_registry_create_entry(owner, status, state, desc);
virtual seastar::future<> create_entry(table_id tid, locator::host_id node_owner, sstring status, sstables::sstable_state state, sstables::entry_descriptor desc) override {
return _keyspace->sstables_registry_create_entry(tid, node_owner, status, state, desc);
}
virtual seastar::future<> update_entry_status(table_id owner, sstables::generation_type gen, sstring status) override {
return _keyspace->sstables_registry_update_entry_status(owner, gen, status);
virtual seastar::future<> update_entry_status(table_id tid, locator::host_id node_owner, sstables::generation_type gen, sstring status) override {
return _keyspace->sstables_registry_update_entry_status(tid, node_owner, gen, status);
}
virtual seastar::future<> update_entry_state(table_id owner, sstables::generation_type gen, sstables::sstable_state state) override {
return _keyspace->sstables_registry_update_entry_state(owner, gen, state);
virtual seastar::future<> update_entry_state(table_id tid, locator::host_id node_owner, sstables::generation_type gen, sstables::sstable_state state) override {
return _keyspace->sstables_registry_update_entry_state(tid, node_owner, gen, state);
}
virtual seastar::future<> delete_entry(table_id owner, sstables::generation_type gen) override {
return _keyspace->sstables_registry_delete_entry(owner, gen);
virtual seastar::future<> delete_entry(table_id tid, locator::host_id node_owner, sstables::generation_type gen) override {
return _keyspace->sstables_registry_delete_entry(tid, node_owner, gen);
}
virtual seastar::future<> sstables_registry_list(table_id owner, entry_consumer consumer) override {
return _keyspace->sstables_registry_list(owner, std::move(consumer));
virtual seastar::future<> sstables_registry_list(table_id tid, locator::host_id node_owner, entry_consumer consumer) override {
return _keyspace->sstables_registry_list(tid, node_owner, std::move(consumer));
}
};

@@ -10,6 +10,7 @@
#include "db/view/view_update_backlog.hh"
#include "utils/error_injection.hh"
#include "utils/updateable_value.hh"
#include <seastar/core/cacheline.hh>
#include <seastar/core/future.hh>
@@ -41,13 +42,16 @@ class node_update_backlog {
std::chrono::milliseconds _interval;
std::atomic<clock::time_point> _last_update;
std::atomic<update_backlog> _max;
utils::updateable_value<uint32_t> _view_flow_control_delay_limit_in_ms;
public:
explicit node_update_backlog(size_t shards, std::chrono::milliseconds interval)
explicit node_update_backlog(size_t shards, std::chrono::milliseconds interval,
utils::updateable_value<uint32_t> view_flow_control_delay_limit_in_ms = utils::updateable_value<uint32_t>(1000))
: _backlogs(shards)
, _interval(interval)
, _last_update(clock::now() - _interval)
, _max(update_backlog::no_backlog()) {
, _max(update_backlog::no_backlog())
, _view_flow_control_delay_limit_in_ms(std::move(view_flow_control_delay_limit_in_ms)) {
if (utils::get_local_injector().enter("update_backlog_immediately")) {
_interval = std::chrono::milliseconds(0);
_last_update = clock::now();
@@ -59,6 +63,9 @@ public:
update_backlog fetch_shard(unsigned shard);
seastar::future<std::optional<update_backlog>> fetch_if_changed();
std::chrono::microseconds calculate_throttling_delay(update_backlog backlog,
db::timeout_clock::time_point timeout) const;
// Exposed for testing only.
update_backlog load() const {
return _max.load(std::memory_order_relaxed);

@@ -150,14 +150,14 @@ row_locker::unlock(const dht::decorated_key* pk, bool partition_exclusive,
auto pli = _two_level_locks.find(*pk);
if (pli == _two_level_locks.end()) {
// This shouldn't happen... We can't unlock this lock if we can't find it...
mylog.error("column_family::local_base_lock_holder::~local_base_lock_holder() can't find lock for partition", *pk);
mylog.error("column_family::local_base_lock_holder::~local_base_lock_holder() can't find lock for partition {}", *pk);
return;
}
SCYLLA_ASSERT(&pli->first == pk);
if (cpk) {
auto rli = pli->second._row_locks.find(*cpk);
if (rli == pli->second._row_locks.end()) {
mylog.error("column_family::local_base_lock_holder::~local_base_lock_holder() can't find lock for row", *cpk);
mylog.error("column_family::local_base_lock_holder::~local_base_lock_holder() can't find lock for row {}", *cpk);
return;
}
SCYLLA_ASSERT(&rli->first == cpk);

@@ -45,6 +45,7 @@
#include "db/view/view_builder.hh"
#include "db/view/view_updating_consumer.hh"
#include "db/view/view_update_generator.hh"
#include "db/view/node_view_update_backlog.hh"
#include "db/view/regular_column_transformation.hh"
#include "db/system_keyspace_view_types.hh"
#include "db/system_keyspace.hh"
@@ -3492,18 +3493,27 @@ future<> delete_ghost_rows_visitor::do_accept_new_row(partition_key pk, clusteri
}
}
std::chrono::microseconds calculate_view_update_throttling_delay(db::view::update_backlog backlog,
db::timeout_clock::time_point timeout,
uint32_t view_flow_control_delay_limit_in_ms) {
// View updates are asynchronous, and because of this limiting their concurrency requires
// a special approach. The current algorithm places all of the pending view updates in the backlog
// and artificially slows down new responses to coordinator requests based on how full the backlog is.
// This function calculates how much a request should be slowed down based on the backlog's fullness.
// The equation is basically: delay(in seconds) = view_fullness_ratio^3
// The more full the backlog gets the more aggressively the requests are slowed down.
// The delay is limited to the amount of time left until timeout.
// After the timeout the request fails, so there's no point in waiting longer than that.
// The second argument defines this timeout point - we can't delay the request more than this time point.
// See: https://www.scylladb.com/2018/12/04/worry-free-ingestion-flow-control/
std::chrono::microseconds node_update_backlog::calculate_throttling_delay(update_backlog backlog,
db::timeout_clock::time_point timeout) const {
auto adjust = [] (float x) { return x * x * x; };
auto budget = std::max(service::storage_proxy::clock_type::duration(0),
timeout - service::storage_proxy::clock_type::now());
std::chrono::microseconds ret(uint32_t(adjust(backlog.relative_size()) * view_flow_control_delay_limit_in_ms * 1000));
auto budget = std::max(db::timeout_clock::duration(0),
timeout - db::timeout_clock::now());
std::chrono::microseconds ret(uint32_t(adjust(backlog.relative_size()) * _view_flow_control_delay_limit_in_ms() * 1000));
// "budget" has millisecond resolution and can potentially be long
// in the future so converting it to microseconds may overflow.
// So to compare budget and ret we need to convert both to the lower
// resolution.
if (std::chrono::duration_cast<service::storage_proxy::clock_type::duration>(ret) < budget) {
if (std::chrono::duration_cast<db::timeout_clock::duration>(ret) < budget) {
return ret;
} else {
// budget is small (< ret) so can be converted to microseconds
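
The throttling rule described in the comment above (delay grows with the cube of the backlog fullness and is capped by the time left until timeout) can be sketched in isolation. This is a minimal standalone sketch, not the ScyllaDB implementation: `throttling_delay`, `fullness`, `delay_limit_ms` and `budget` are hypothetical names standing in for `relative_size()`, the `view_flow_control_delay_limit_in_ms` option and the remaining time until `timeout`. The final branch, which returns the budget itself, is an assumption based on the truncated comment at the end of the hunk.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical standalone version of the throttling formula (not the ScyllaDB API).
// fullness:       backlog fullness ratio, expected in [0, 1]
// delay_limit_ms: delay applied when the backlog is completely full
// budget:         time left until the request times out
std::chrono::microseconds throttling_delay(float fullness,
                                           uint32_t delay_limit_ms,
                                           std::chrono::milliseconds budget) {
    // A timeout that has already passed leaves no budget at all.
    budget = std::max(budget, std::chrono::milliseconds(0));
    // delay = fullness^3 * limit, expressed in microseconds.
    const float cubed = fullness * fullness * fullness;
    const std::chrono::microseconds wanted(
        static_cast<uint32_t>(cubed * delay_limit_ms * 1000));
    // Compare at millisecond resolution: a far-away timeout converted to
    // microseconds could overflow, while truncating `wanted` cannot.
    if (std::chrono::duration_cast<std::chrono::milliseconds>(wanted) < budget) {
        return wanted;
    }
    // Assumed fallback: budget is the smaller value here, so converting it
    // to microseconds is safe.
    return std::chrono::duration_cast<std::chrono::microseconds>(budget);
}
```

With a 1000 ms limit, a half-full backlog asks for 125 ms, while a full backlog asks for the whole second and is cut down to whatever budget remains.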

@@ -11,6 +11,7 @@
#include <exception>
#include <ranges>
#include <seastar/core/abort_source.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include <seastar/coroutine/parallel_for_each.hh>
#include <seastar/core/on_internal_error.hh>
#include "db/view/view_building_coordinator.hh"
@@ -179,7 +180,10 @@ future<> view_building_coordinator::clean_finished_tasks() {
co_return;
}
view_building_task_mutation_builder builder(guard.write_timestamp());
view_building_task_mutation_builder builder(guard.write_timestamp(), _vb_sm.building_state.make_task_uuid_generator(guard.write_timestamp()));
// Collect tasks eligible for deletion: must still be in state and not aborted.
std::vector<utils::UUID> tasks_to_delete;
for (auto& [replica, tasks]: _finished_tasks) {
for (auto& task_id: tasks) {
// The task might be aborted in the meantime. In this case we cannot remove it because we need it to create a new task.
@@ -189,15 +193,65 @@ future<> view_building_coordinator::clean_finished_tasks() {
// If yes, we can just remove it instead of aborting it.
auto task_opt = _vb_sm.building_state.get_task(*_vb_sm.building_state.currently_processed_base_table, replica, task_id);
if (task_opt && !task_opt->get().aborted) {
builder.del_task(task_id);
vbc_logger.debug("Removing finished task with ID: {}", task_id);
tasks_to_delete.push_back(task_id);
}
}
}
co_await commit_mutations(std::move(guard), {builder.build()}, "remove finished view building tasks");
for (auto& [_, tasks_set]: _finished_tasks) {
tasks_set.clear();
if (!tasks_to_delete.empty()) {
// Find the minimum UUID (by timeuuid ordering) among tasks that are NOT being
// deleted — i.e., alive tasks that must remain in the table.
// Everything strictly below this boundary is safe to cover with one range tombstone.
const std::unordered_set<utils::UUID> to_delete_set(tasks_to_delete.begin(), tasks_to_delete.end());
std::optional<utils::UUID> min_alive_uuid;
for (auto& [base_id, base_tasks] : _vb_sm.building_state.tasks_state) {
for (auto& [replica, rep_tasks] : base_tasks) {
auto check = [&](const utils::UUID& id) {
if (!to_delete_set.contains(id)
&& (!min_alive_uuid || timeuuid_tri_compare(id, *min_alive_uuid) < 0)) {
min_alive_uuid = id;
}
};
for (auto& [id, task] : rep_tasks.staging_tasks) {
check(id);
}
for (auto& [view_id, task_m] : rep_tasks.view_tasks) {
for (auto& [id, task] : task_m) {
check(id);
}
}
co_await coroutine::maybe_yield();
}
}
if (min_alive_uuid) {
vbc_logger.debug("Removing finished tasks before ID: {} using range tombstone", *min_alive_uuid);
builder.del_tasks_before(*min_alive_uuid);
for (auto& task_id : tasks_to_delete) {
// Tasks below min_alive_uuid are already covered by the range tombstone.
if (timeuuid_tri_compare(task_id, *min_alive_uuid) < 0) {
continue;
}
vbc_logger.debug("Removing finished task with ID: {}", task_id);
builder.del_task(task_id);
}
} else {
// No alive tasks remain — one range tombstone covers everything.
vbc_logger.debug("No alive tasks remain, removing all finished tasks using range tombstone");
builder.del_all_tasks();
}
if (_db.features().view_building_tasks_min_task_id) {
// If min_alive_uuid == std::nullopt, set min_task_id to a fresh UUID,
// so future scans start past all the just-deleted rows (new tasks created
// later will have larger UUIDs).
builder.set_min_task_id(min_alive_uuid ? *min_alive_uuid : utils::UUID_gen::get_time_UUID());
}
co_await commit_mutations(std::move(guard), {builder.build()}, "remove finished view building tasks");
for (auto& [_, tasks_set]: _finished_tasks) {
tasks_set.clear();
}
}
}
@@ -533,7 +587,7 @@ void view_building_coordinator::generate_tablet_migration_updates(utils::chunked
}
auto last_token = tmap.get_last_token(gid.tablet);
view_building_task_mutation_builder builder(guard.write_timestamp());
view_building_task_mutation_builder builder(guard.write_timestamp(), _vb_sm.building_state.make_task_uuid_generator(guard.write_timestamp()));
auto create_task_copy_on_pending_replica = [&] (const view_building_task& task) {
auto new_id = builder.new_id();
@@ -601,7 +655,7 @@ void view_building_coordinator::generate_tablet_resize_updates(utils::chunked_ve
return;
}
bool is_split = old_tmap.tablet_count() < new_tmap.tablet_count();
view_building_task_mutation_builder builder(guard.write_timestamp());
view_building_task_mutation_builder builder(guard.write_timestamp(), _vb_sm.building_state.make_task_uuid_generator(guard.write_timestamp()));
auto create_task_copy = [&] (const view_building_task& task, dht::token last_token) -> utils::UUID {
auto new_id = builder.new_id();
@@ -671,7 +725,7 @@ void view_building_coordinator::abort_tasks(utils::chunked_vector<canonical_muta
}
vbc_logger.debug("Generating abort mutations for tasks for table {}", table_id);
view_building_task_mutation_builder builder(guard.write_timestamp());
view_building_task_mutation_builder builder(guard.write_timestamp(), _vb_sm.building_state.make_task_uuid_generator(guard.write_timestamp()));
auto abort_task_map = [&] (const task_map& task_map) {
for (auto& [id, _]: task_map) {
vbc_logger.debug("Aborting task {}", id);
@@ -700,7 +754,7 @@ void abort_view_building_tasks(const view_building_state_machine& vb_sm,
}
vbc_logger.debug("Generating abort mutations for tasks for table {} on replica {} and last token {}", table_id, replica, last_token);
view_building_task_mutation_builder builder(write_timestamp);
view_building_task_mutation_builder builder(write_timestamp, vb_sm.building_state.make_task_uuid_generator(write_timestamp));
auto abort_task_map = [&] (const task_map& task_map) {
for (auto& [id, task]: task_map) {
if (task.last_token == last_token) {
@@ -742,7 +796,7 @@ void view_building_coordinator::rollback_aborted_tasks(utils::chunked_vector<can
return;
}
view_building_task_mutation_builder builder(guard.write_timestamp());
view_building_task_mutation_builder builder(guard.write_timestamp(), _vb_sm.building_state.make_task_uuid_generator(guard.write_timestamp()));
auto& base_tasks = _vb_sm.building_state.tasks_state.at(table_id);
for (auto& [_, replica_tasks]: base_tasks) {
for (auto& [_, building_task_map]: replica_tasks.view_tasks) {
@@ -759,7 +813,7 @@ void view_building_coordinator::rollback_aborted_tasks(utils::chunked_vector<can
return;
}
view_building_task_mutation_builder builder(guard.write_timestamp());
view_building_task_mutation_builder builder(guard.write_timestamp(), _vb_sm.building_state.make_task_uuid_generator(guard.write_timestamp()));
auto& replica_tasks = _vb_sm.building_state.tasks_state.at(table_id).at(replica);
for (auto& [_, building_task_map]: replica_tasks.view_tasks) {
rollback_task_map(builder, building_task_map);
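
The deletion strategy in the `clean_finished_tasks()` hunk above splits finished tasks into those covered by a single range tombstone and those that still need individual deletes. The sketch below illustrates that split under simplifying assumptions: plain integers stand in for timeuuids (so `<` replaces `timeuuid_tri_compare`), and `deletion_plan` and `plan_deletions` are hypothetical names that do not exist in the patch.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <set>
#include <vector>

// Hypothetical result type: what a mutation builder would be asked to do.
struct deletion_plan {
    std::optional<uint64_t> range_upper_bound; // delete every id strictly below this
    bool delete_all = false;                   // one tombstone over the whole range
    std::vector<uint64_t> point_deletes;       // ids needing individual deletes
};

// all_ids:   every task id currently in the state
// to_delete: finished, non-aborted task ids (a subset of all_ids)
deletion_plan plan_deletions(const std::set<uint64_t>& all_ids,
                             const std::set<uint64_t>& to_delete) {
    deletion_plan plan;
    if (to_delete.empty()) {
        return plan; // nothing to commit
    }
    // Smallest id that must stay alive; std::set iterates in ascending order.
    std::optional<uint64_t> min_alive;
    for (uint64_t id : all_ids) {
        if (!to_delete.count(id)) {
            min_alive = id;
            break;
        }
    }
    if (!min_alive) {
        plan.delete_all = true; // no alive tasks remain
        return plan;
    }
    plan.range_upper_bound = *min_alive;
    for (uint64_t id : to_delete) {
        if (id < *min_alive) {
            continue; // already covered by the range tombstone
        }
        plan.point_deletes.push_back(id);
    }
    return plan;
}
```

For ids `{1, 2, 3, 4, 5}` with `{1, 2, 4}` finished, the plan is one range deletion below 3 plus a point delete for 4, so the number of tombstones tracks the number of holes rather than the number of finished tasks.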

@@ -8,6 +8,7 @@
*/
#include "db/view/view_building_state.hh"
#include "utils/UUID_gen.hh"
namespace db {
@@ -22,9 +23,10 @@ view_building_task::view_building_task(utils::UUID id, task_type type, bool abor
, replica(replica)
, last_token(last_token) {}
view_building_state::view_building_state(building_tasks tasks_state, std::optional<table_id> processed_base_table)
view_building_state::view_building_state(building_tasks tasks_state, std::optional<table_id> processed_base_table, std::optional<utils::UUID> min_alive_uuid)
: tasks_state(std::move(tasks_state))
, currently_processed_base_table(std::move(processed_base_table)) {}
, currently_processed_base_table(std::move(processed_base_table))
, min_alive_uuid(std::move(min_alive_uuid)) {}
views_state::views_state(std::map<table_id, std::vector<table_id>> views_per_base, view_build_status_map status_map)
: views_per_base(std::move(views_per_base))
@@ -127,6 +129,24 @@ std::map<dht::token, std::vector<view_building_task>> view_building_state::colle
return tasks;
}
task_uuid_generator::task_uuid_generator(api::timestamp_type base_ts)
: _next_ts(base_ts) {}
utils::UUID task_uuid_generator::operator()() {
return utils::UUID_gen::get_random_time_UUID_from_micros(
std::chrono::microseconds{_next_ts++});
}
task_uuid_generator view_building_state::make_task_uuid_generator(api::timestamp_type ts) const {
if (min_alive_uuid) {
auto lower_bound = utils::UUID_gen::micros_timestamp(*min_alive_uuid);
if (ts <= lower_bound) {
ts = lower_bound + 1;
}
}
return task_uuid_generator{ts};
}
}
}
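
The generator added above hands out IDs whose timestamps increase by one microsecond per call, and `make_task_uuid_generator()` pushes the starting point past `min_alive_uuid` so that new tasks never sort below an existing range tombstone boundary. A minimal sketch of just the timestamp logic follows; raw `int64_t` microsecond values replace timeuuids, and `ts_generator` and `make_ts_generator` are hypothetical names used only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>

// Hypothetical stand-in for task_uuid_generator: yields strictly increasing
// microsecond timestamps instead of full timeuuids.
class ts_generator {
    int64_t _next_ts;
public:
    explicit ts_generator(int64_t base_ts) : _next_ts(base_ts) {}
    int64_t operator()() { return _next_ts++; }
};

// Start from write_ts, but never at or below the timestamp of the smallest
// alive task (min_alive_ts), mirroring the lower-bound clamp in the patch.
ts_generator make_ts_generator(int64_t write_ts, std::optional<int64_t> min_alive_ts) {
    if (min_alive_ts && write_ts <= *min_alive_ts) {
        write_ts = *min_alive_ts + 1;
    }
    return ts_generator{write_ts};
}
```

Calling `make_ts_generator(100, 150)` yields 151, then 152, whereas a write timestamp already above the bound is used unchanged.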

@@ -14,6 +14,7 @@
#include "db/view/view_build_status.hh"
#include "locator/host_id.hh"
#include "locator/tablets.hh"
#include "mutation/timestamp.hh"
#include "utils/UUID.hh"
#include <fmt/base.h>
#include "schema/schema_fwd.hh"
@@ -64,6 +65,16 @@ struct replica_tasks {
using base_table_tasks = std::map<locator::tablet_replica, replica_tasks>;
using building_tasks = std::map<table_id, base_table_tasks>;
// Generates unique timeuuids with strictly increasing microsecond timestamps.
// Each call to operator() returns a new timeuuid whose timestamp is one
// microsecond greater than the previous one.
class task_uuid_generator {
api::timestamp_type _next_ts;
public:
explicit task_uuid_generator(api::timestamp_type base_ts);
utils::UUID operator()();
};
// Represents cluster-wide view building state (only for tablet-based views).
// The state stores all unfinished view building tasks for all tablet-based views
// and table_id of currently processed base table by view building coordinator.
@@ -73,14 +84,22 @@ using building_tasks = std::map<table_id, base_table_tasks>;
struct view_building_state {
building_tasks tasks_state;
std::optional<table_id> currently_processed_base_table;
std::optional<utils::UUID> min_alive_uuid;
view_building_state(building_tasks tasks_state, std::optional<table_id> processed_base_table);
view_building_state(building_tasks tasks_state, std::optional<table_id> processed_base_table, std::optional<utils::UUID> min_alive_uuid);
view_building_state() = default;
std::optional<std::reference_wrapper<const view_building_task>> get_task(table_id base_id, locator::tablet_replica replica, utils::UUID id) const;
std::vector<std::reference_wrapper<const view_building_task>> get_tasks_for_host(table_id base_id, locator::host_id host) const;
std::map<dht::token, std::vector<view_building_task>> collect_tasks_by_last_token(table_id base_table_id) const;
std::map<dht::token, std::vector<view_building_task>> collect_tasks_by_last_token(table_id base_table_id, const locator::tablet_replica& replica) const;
// Creates a generator that produces unique timeuuids suitable for view
// building task IDs. The generated uuids have strictly increasing
// microsecond timestamps starting from write_timestamp. If min_alive_uuid
// is set, all generated uuids are guaranteed to be greater than
// *min_alive_uuid in timeuuid order.
task_uuid_generator make_task_uuid_generator(api::timestamp_type write_timestamp) const;
};
// Represents global state of tablet-based views.

@@ -14,7 +14,7 @@ namespace db {
namespace view {
utils::UUID view_building_task_mutation_builder::new_id() {
return utils::UUID_gen::get_time_UUID();
return _uuid_gen();
}
clustering_key view_building_task_mutation_builder::get_ck(utils::UUID id) {
@@ -52,6 +52,30 @@ view_building_task_mutation_builder& view_building_task_mutation_builder::del_ta
return *this;
}
view_building_task_mutation_builder& view_building_task_mutation_builder::del_tasks_before(utils::UUID id) {
auto ck = get_ck(id);
range_tombstone rt(
position_in_partition::before_all_clustered_rows(),
position_in_partition_view(ck, bound_weight::before_all_prefixed),
tombstone{_ts, gc_clock::now()});
_m.partition().apply_row_tombstone(*_s, std::move(rt));
return *this;
}
view_building_task_mutation_builder& view_building_task_mutation_builder::del_all_tasks() {
range_tombstone rt(
position_in_partition::before_all_clustered_rows(),
position_in_partition::after_all_clustered_rows(),
tombstone{_ts, gc_clock::now()});
_m.partition().apply_row_tombstone(*_s, std::move(rt));
return *this;
}
view_building_task_mutation_builder& view_building_task_mutation_builder::set_min_task_id(utils::UUID id) {
_m.set_static_cell("min_task_id", data_value(id), _ts);
return *this;
}
}
}

@@ -8,6 +8,7 @@
#pragma once
#include "db/view/view_building_state.hh"
#include "mutation/mutation.hh"
#include "db/system_keyspace.hh"
#include "mutation/timestamp.hh"
@@ -19,17 +20,19 @@ namespace view {
// Factory for mutations to `system.view_building_tasks` table.
class view_building_task_mutation_builder {
api::timestamp_type _ts;
task_uuid_generator _uuid_gen;
schema_ptr _s;
mutation _m;
public:
view_building_task_mutation_builder(api::timestamp_type ts)
view_building_task_mutation_builder(api::timestamp_type ts, task_uuid_generator uuid_gen)
: _ts(ts)
, _uuid_gen(std::move(uuid_gen))
, _s(db::system_keyspace::view_building_tasks())
, _m(_s, partition_key::from_single_value(*_s, data_value("view_building").serialize_nonnull()))
{ }
static utils::UUID new_id();
utils::UUID new_id();
view_building_task_mutation_builder& set_type(utils::UUID id, db::view::view_building_task::task_type type);
view_building_task_mutation_builder& set_aborted(utils::UUID id, bool aborted);
@@ -38,6 +41,12 @@ public:
view_building_task_mutation_builder& set_last_token(utils::UUID id, dht::token last_token);
view_building_task_mutation_builder& set_replica(utils::UUID id, const locator::tablet_replica& replica);
view_building_task_mutation_builder& del_task(utils::UUID id);
// Deletes all tasks with clustering key < id using a range tombstone.
view_building_task_mutation_builder& del_tasks_before(utils::UUID id);
// Deletes all tasks using a range tombstone covering the entire clustering range.
view_building_task_mutation_builder& del_all_tasks();
// Sets the static column min_task_id to `id`.
view_building_task_mutation_builder& set_min_task_id(utils::UUID id);
mutation build() {
return std::move(_m);

@@ -275,11 +275,12 @@ future<> view_building_worker::create_staging_sstable_tasks() {
utils::chunked_vector<canonical_mutation> cmuts;
auto guard = co_await _group0.client().start_operation(_as);
auto uuid_gen = _vb_state_machine.building_state.make_task_uuid_generator(guard.write_timestamp());
auto my_host_id = _db.get_token_metadata().get_topology().my_host_id();
for (auto& [table_id, sst_infos]: _sstables_to_register) {
for (auto& sst_info: sst_infos) {
view_building_task task {
utils::UUID_gen::get_time_UUID(), view_building_task::task_type::process_staging, false,
uuid_gen(), view_building_task::task_type::process_staging, false,
table_id, ::table_id{}, {my_host_id, sst_info.shard}, sst_info.last_token
};
auto mut = co_await _sys_ks.make_view_building_task_mutation(guard.write_timestamp(), task);
@@ -715,7 +716,7 @@ future<> view_building_worker::do_build_range(table_id base_id, std::vector<tabl
vbw_logger.info("Building range {} for base table {} and views {} was aborted.", range, base_id, views_ids);
} catch (...) {
eptr = std::current_exception();
vbw_logger.warn("Error during processing range {} for base table {} and views {}: ", range, base_id, views_ids, eptr);
vbw_logger.warn("Error during processing range {} for base table {} and views {}: {}", range, base_id, views_ids, eptr);
}
reader.close().get();

@@ -43,7 +43,7 @@ public:
// Returns the number of bytes in the backlog divided by the maximum number of bytes
// that the backlog can hold before employing admission control. While the backlog
// is below the threshold, the coordinator will slow down the view updates up to
// calculate_view_update_throttling_delay()::delay_limit_us. Above the threshold,
// node_update_backlog::calculate_throttling_delay()::delay_limit_us. Above the threshold,
// the coordinator will reject the writes that would increase the backlog. On the
// replica, the writes will start failing only after reaching the hard limit '_max'.
float relative_size() const {
@@ -70,18 +70,4 @@ public:
}
};
// View updates are asynchronous, and because of this limiting their concurrency requires
// a special approach. The current algorithm places all of the pending view updates in the backlog
// and artificially slows down new responses to coordinator requests based on how full the backlog is.
// This function calculates how much a request should be slowed down based on the backlog's fullness.
// The equation is basically: delay(in seconds) = view_fullness_ratio^3
// The more full the backlog gets the more aggressively the requests are slowed down.
// The delay is limited to the amount of time left until timeout.
// After the timeout the request fails, so there's no point in waiting longer than that.
// The second argument defines this timeout point - we can't delay the request more than this time point.
// See: https://www.scylladb.com/2018/12/04/worry-free-ingestion-flow-control/
std::chrono::microseconds calculate_view_update_throttling_delay(
update_backlog backlog,
db::timeout_clock::time_point timeout,
uint32_t view_flow_control_delay_limit_in_ms);
}

@@ -7,6 +7,7 @@
*/
#include "db/view/view_update_backlog.hh"
#include "db/view/node_view_update_backlog.hh"
#include <seastar/core/timed_out_error.hh>
#include "gms/inet_address.hh"
#include <seastar/util/defer.hh>
@@ -95,9 +96,10 @@ public:
}
};
view_update_generator::view_update_generator(replica::database& db, sharded<service::storage_proxy>& proxy, abort_source& as)
view_update_generator::view_update_generator(replica::database& db, sharded<service::storage_proxy>& proxy, node_update_backlog& node_backlog, abort_source& as)
: _db(db)
, _proxy(proxy)
, _node_update_backlog(node_backlog)
, _progress_tracker(std::make_unique<progress_tracker>())
, _early_abort_subscription(as.subscribe([this] () noexcept { do_abort(); }))
{
@@ -112,7 +114,7 @@ future<> view_update_generator::start() {
_started = seastar::async([this]() mutable {
auto drop_sstable_references = defer([&] () noexcept {
// Clear sstable references so sstables_manager::stop() doesn't hang.
vug_logger.info("leaving {} unstaged sstables unprocessed",
vug_logger.info("leaving {} unstaged sstables and {} sstables with tables unprocessed",
_sstables_to_move.size(), _sstables_with_tables.size());
_sstables_to_move.clear();
_sstables_with_tables.clear();
@@ -498,7 +500,7 @@ future<> view_update_generator::generate_and_propagate_view_updates(const replic
// the one which limits the number of incoming client requests by delaying the response to the client.
if (batch_num > 0) {
update_backlog local_backlog = _db.get_view_update_backlog();
std::chrono::microseconds throttle_delay = calculate_view_update_throttling_delay(local_backlog, timeout, _db.get_config().view_flow_control_delay_limit_in_ms());
std::chrono::microseconds throttle_delay = _node_update_backlog.calculate_throttling_delay(local_backlog, timeout);
co_await seastar::sleep(throttle_delay);

@@ -52,6 +52,7 @@ using allow_hints = bool_class<allow_hints_tag>;
namespace db::view {
class node_update_backlog;
class stats;
struct wait_for_all_updates_tag {};
using wait_for_all_updates = bool_class<wait_for_all_updates_tag>;
@@ -63,6 +64,7 @@ public:
private:
replica::database& _db;
sharded<service::storage_proxy>& _proxy;
node_update_backlog& _node_update_backlog;
seastar::abort_source _as;
future<> _started = make_ready_future<>();
seastar::condition_variable _pending_sstables;
@@ -75,7 +77,7 @@ private:
optimized_optional<abort_source::subscription> _early_abort_subscription;
void do_abort() noexcept;
public:
view_update_generator(replica::database& db, sharded<service::storage_proxy>& proxy, abort_source& as);
view_update_generator(replica::database& db, sharded<service::storage_proxy>& proxy, node_update_backlog& node_backlog, abort_source& as);
~view_update_generator();
future<> start();

68
dist/CMakeLists.txt vendored
@@ -141,4 +141,72 @@ add_dependencies(dist
dist-python3
dist-server)
set(dist_rpm_dir "${CMAKE_BINARY_DIR}/$<CONFIG>/dist/rpm")
set(dist_deb_dir "${CMAKE_BINARY_DIR}/$<CONFIG>/dist/deb")
# Map system processor to Debian architecture names
if(CMAKE_SYSTEM_PROCESSOR STREQUAL "x86_64")
set(deb_arch "amd64")
elseif(CMAKE_SYSTEM_PROCESSOR STREQUAL "aarch64")
set(deb_arch "arm64")
else()
message(FATAL_ERROR "Unsupported architecture: ${CMAKE_SYSTEM_PROCESSOR}")
endif()
set(rpm_ver "${Scylla_VERSION}-${Scylla_RELEASE}")
set(deb_ver "${Scylla_VERSION}-${Scylla_RELEASE}-1")
set(rpm_arch "${CMAKE_SYSTEM_PROCESSOR}")
set(server_rpms_dir "${CMAKE_CURRENT_BINARY_DIR}/$<CONFIG>/redhat/RPMS/${rpm_arch}")
set(server_rpms
"${server_rpms_dir}/${Scylla_PRODUCT}-${rpm_ver}.${rpm_arch}.rpm"
"${server_rpms_dir}/${Scylla_PRODUCT}-server-${rpm_ver}.${rpm_arch}.rpm"
"${server_rpms_dir}/${Scylla_PRODUCT}-server-debuginfo-${rpm_ver}.${rpm_arch}.rpm"
"${server_rpms_dir}/${Scylla_PRODUCT}-conf-${rpm_ver}.${rpm_arch}.rpm"
"${server_rpms_dir}/${Scylla_PRODUCT}-kernel-conf-${rpm_ver}.${rpm_arch}.rpm"
"${server_rpms_dir}/${Scylla_PRODUCT}-node-exporter-${rpm_ver}.${rpm_arch}.rpm")
set(cqlsh_rpms
"${CMAKE_SOURCE_DIR}/tools/cqlsh/build/redhat/RPMS/${rpm_arch}/${Scylla_PRODUCT}-cqlsh-${rpm_ver}.${rpm_arch}.rpm")
set(python3_rpms
"${CMAKE_SOURCE_DIR}/tools/python3/build/redhat/RPMS/${rpm_arch}/${Scylla_PRODUCT}-python3-${rpm_ver}.${rpm_arch}.rpm")
set(server_debs_dir "${CMAKE_CURRENT_BINARY_DIR}/$<CONFIG>/debian")
set(server_debs
"${server_debs_dir}/${Scylla_PRODUCT}_${deb_ver}_${deb_arch}.deb"
"${server_debs_dir}/${Scylla_PRODUCT}-server_${deb_ver}_${deb_arch}.deb"
"${server_debs_dir}/${Scylla_PRODUCT}-server-dbg_${deb_ver}_${deb_arch}.deb"
"${server_debs_dir}/${Scylla_PRODUCT}-conf_${deb_ver}_${deb_arch}.deb"
"${server_debs_dir}/${Scylla_PRODUCT}-kernel-conf_${deb_ver}_${deb_arch}.deb"
"${server_debs_dir}/${Scylla_PRODUCT}-node-exporter_${deb_ver}_${deb_arch}.deb"
"${server_debs_dir}/scylla-enterprise_${deb_ver}_all.deb"
"${server_debs_dir}/scylla-enterprise-server_${deb_ver}_all.deb"
"${server_debs_dir}/scylla-enterprise-conf_${deb_ver}_all.deb"
"${server_debs_dir}/scylla-enterprise-kernel-conf_${deb_ver}_all.deb"
"${server_debs_dir}/scylla-enterprise-node-exporter_${deb_ver}_all.deb")
set(cqlsh_debs
"${CMAKE_SOURCE_DIR}/tools/cqlsh/build/debian/${Scylla_PRODUCT}-cqlsh_${deb_ver}_${deb_arch}.deb"
"${CMAKE_SOURCE_DIR}/tools/cqlsh/build/debian/scylla-enterprise-cqlsh_${deb_ver}_all.deb")
set(python3_debs
"${CMAKE_SOURCE_DIR}/tools/python3/build/debian/${Scylla_PRODUCT}-python3_${deb_ver}_${deb_arch}.deb"
"${CMAKE_SOURCE_DIR}/tools/python3/build/debian/scylla-enterprise-python3_${deb_ver}_all.deb")
add_custom_target(collect-dist-rpm
COMMAND ${CMAKE_COMMAND} -E rm -rf ${dist_rpm_dir}
COMMAND ${CMAKE_COMMAND} -E make_directory ${dist_rpm_dir}
COMMAND ${CMAKE_COMMAND} -E copy ${server_rpms} ${cqlsh_rpms} ${python3_rpms} ${dist_rpm_dir}/
DEPENDS dist
WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
COMMENT "Collecting RPMs into ${dist_rpm_dir}")
add_custom_target(collect-dist-deb
COMMAND ${CMAKE_COMMAND} -E rm -rf ${dist_deb_dir}
COMMAND ${CMAKE_COMMAND} -E make_directory ${dist_deb_dir}
COMMAND ${CMAKE_COMMAND} -E copy ${server_debs} ${cqlsh_debs} ${python3_debs} ${dist_deb_dir}/
DEPENDS dist
WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
COMMENT "Collecting DEBs into ${dist_deb_dir}")
add_custom_target(collect-dist
DEPENDS collect-dist-rpm collect-dist-deb)
add_subdirectory(debuginfo)


@@ -9,6 +9,22 @@ for f in "$etcdir"/scylla.d/*.conf; do
done
if is_privileged; then
# Override pipe-based core_pattern that may not work inside a container
# (e.g. Ubuntu host's apport). File-based patterns resolve inside the
# container's mount namespace, so coredumps land in the right place.
# Derive workdir from scylla.yaml, matching the Python entrypoint logic.
_workdir=$(python3 -c "import yaml; cfg=yaml.safe_load(open('/etc/scylla/scylla.yaml')); print(cfg.get('workdir') or '/var/lib/scylla')" 2>/dev/null || echo "/var/lib/scylla")
_coredump_dir="${_workdir}/coredump"
core_pattern=$(cat /proc/sys/kernel/core_pattern 2>/dev/null || true)
if [[ "$core_pattern" == "|"* ]]; then
if ! mkdir -p "$_coredump_dir" 2>/dev/null; then
echo "WARNING: could not create coredump directory $_coredump_dir" >&2
elif echo "${_coredump_dir}/core.%e.%p.%t" > /proc/sys/kernel/core_pattern 2>/dev/null; then
echo "kernel.core_pattern overridden to file-based pattern: ${_coredump_dir}/core.%e.%p.%t" >&2
else
echo "WARNING: pipe-based core_pattern detected but could not override. Coredumps may be lost." >&2
fi
fi
"$scriptsdir"/scylla_prepare
fi
execsudo /usr/bin/env SCYLLA_HOME=$SCYLLA_HOME SCYLLA_CONF=$SCYLLA_CONF "$bindir"/scylla $SCYLLA_ARGS $SEASTAR_IO $DEV_MODE $CPUSET $SCYLLA_DOCKER_ARGS


@@ -24,6 +24,7 @@ try:
setup.developerMode()
setup.cpuSet()
setup.io()
setup.coredumpSetup()
setup.cqlshrc()
setup.write_rackdc_properties()
setup.arguments()


@@ -3,6 +3,7 @@ import logging
import yaml
import os
import socket
import errno
def is_bind_mount(path):
# Check if the file or its parent is a mount point (bind mount or otherwise)
@@ -47,6 +48,7 @@ class ScyllaSetup:
self._dc = arguments.dc
self._rack = arguments.rack
self._blocked_reactor_notify_ms = arguments.blocked_reactor_notify_ms
self._coredump_dir = None
def _run(self, *args, **kwargs):
logging.info('running: {}'.format(args))
@@ -132,6 +134,70 @@ class ScyllaSetup:
f.write(f"dc={dc}\n")
f.write(f"rack={rack}\n")
CORE_PATTERN_PATH = '/proc/sys/kernel/core_pattern'
def _get_coredump_dir(self):
"""Return the coredump directory, deriving it from scylla.yaml workdir if needed."""
if self._coredump_dir is not None:
return self._coredump_dir
conf_dir = "/etc/scylla"
try:
with open(os.path.join(conf_dir, "scylla.yaml")) as f:
cfg = yaml.safe_load(f) or {}
except Exception:
cfg = {}
workdir = cfg.get('workdir') or '/var/lib/scylla'
self._coredump_dir = os.path.join(workdir, 'coredump')
return self._coredump_dir
def coredumpSetup(self):
"""Configure coredump handling for containers.
The host's kernel.core_pattern may pipe core dumps to a handler
(e.g. Ubuntu's apport) that does not exist or work correctly
inside the container. This method tries to switch to a file-based
core_pattern so that coredumps are written directly to disk.
Writing to /proc/sys/kernel/core_pattern requires privileges
(root with CAP_SYS_ADMIN). When the container lacks permission
a warning is logged with guidance for the operator.
"""
coredump_dir = self._get_coredump_dir()
try:
os.makedirs(coredump_dir, exist_ok=True)
except OSError as e:
logging.warning('Could not create coredump directory %s: %s',
coredump_dir, e)
return
try:
with open(self.CORE_PATTERN_PATH) as f:
current = f.read().strip()
except Exception as e:
logging.debug('Could not read %s: %s', self.CORE_PATTERN_PATH, e)
return
if not current.startswith('|'):
return
desired = f'{coredump_dir}/core.%e.%p.%t'
try:
with open(self.CORE_PATTERN_PATH, 'w') as f:
f.write(desired + '\n')
logging.info('kernel.core_pattern set to %s', desired)
except OSError as e:
if e.errno in (errno.EACCES, errno.EPERM, errno.EROFS):
logging.warning(
'kernel.core_pattern pipes to a program that may not work '
'inside the container, and we lack permission to override it. '
'To fix this, either run with --privileged or set on the host: '
'sysctl -w kernel.core_pattern="%s"', desired)
else:
logging.debug('Unexpected OSError setting core_pattern: %s', e)
except Exception as e:
logging.debug('Unexpected error in coredumpSetup: %s', e)
def arguments(self):
args = []
if self._memory is not None:
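
The decision made by `coredumpSetup()` above can be isolated as a pure function, which makes it testable without touching `/proc`. This is only a sketch: `choose_core_pattern` and `DEFAULT_WORKDIR` are hypothetical names that are not part of the patch, and the function covers just the "override only pipe-based patterns" rule.

```python
import posixpath

# Fallback used by the patch when scylla.yaml has no 'workdir' entry.
DEFAULT_WORKDIR = "/var/lib/scylla"


def choose_core_pattern(current, workdir=None):
    """Return the file-based pattern to write, or None to leave it unchanged.

    Hypothetical helper mirroring the logic in the diff: only a pipe-based
    core_pattern (one that starts with '|') is overridden.
    """
    if not current.strip().startswith("|"):
        return None
    coredump_dir = posixpath.join(workdir or DEFAULT_WORKDIR, "coredump")
    return f"{coredump_dir}/core.%e.%p.%t"
```

A caller would still need to create the directory and handle `EACCES`, `EPERM`, and `EROFS` when writing the result, as the patched method does.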


@@ -324,6 +324,13 @@ experimental:
stream events. Without this option, such no-op operations may still
generate spurious stream events.
<https://github.com/scylladb/scylladb/issues/28368>
* When a stream is disabled, no new records are written but the existing
stream data is preserved and remains readable through its original
StreamArn. The data expires via TTL after 24 hours. Re-enabling the
stream purges the old data immediately and produces a new StreamArn.
In contrast, DynamoDB keeps the old stream and its data readable for
24 hours through the old StreamArn even after re-enabling.
<https://scylladb.atlassian.net/browse/SCYLLADB-1873>
## Unimplemented API features


@@ -1,5 +1,11 @@
# Alternator Vector Search
```{admonition} Availability
:class: important
The Vector Search feature is only available in [ScyllaDB Cloud](https://cloud.docs.scylladb.com/) - a fully managed DBaaS running ScyllaDB.
```
## Introduction
Alternator vector search is a ScyllaDB extension to the DynamoDB-compatible


@@ -415,7 +415,7 @@ An empty list is allowed, and it's equivalent to numeric replication factor of 0
.. code-block:: cql
ALTER KEYSPACE Excelsior
WITH replication = { 'class' : 'NetworkTopologyStrategy', dc2' : []};
WITH replication = { 'class' : 'NetworkTopologyStrategy', 'dc2' : []};
Altering from a rack list to a numeric replication factor is not supported.
@@ -1017,11 +1017,11 @@ For example:
CREATE TABLE customer_data (
cust_id uuid,
cust_first-name text,
cust_last-name text,
"cust_first-name" text,
"cust_last-name" text,
cust_phone text,
cust_get-sms text,
PRIMARY KEY (customer_id)
"cust_get-sms" text,
PRIMARY KEY (cust_id)
) WITH cdc = { 'enabled' : 'true', 'preimage' : 'true' };
.. _cql-caching-options:


@@ -24,7 +24,8 @@ For example:
INSERT INTO NerdMovies (movie, director, main_actor, year)
VALUES ('Serenity', 'Joss Whedon', 'Nathan Fillion', 2005)
USING TTL 86400 IF NOT EXISTS;
IF NOT EXISTS
USING TTL 86400;
The ``INSERT`` statement writes one or more columns for a given row in a table. Note that since a row is identified by
its ``PRIMARY KEY``, at least the columns composing it must be specified. The list of columns to insert to must be


@@ -71,7 +71,7 @@ used. If it is used, the statement will be a no-op if the materialized view alre
MV Select Statement
...................
The select statement of a materialized view creation defines which of the base table is included in the view. That
The select statement of a materialized view creation defines which of the base table columns are included in the view. That
statement is limited in a number of ways:
- The :ref:`selection <selection-clause>` is limited to those that only select columns of the base table. In other


@@ -507,7 +507,7 @@ For example::
CREATE TABLE superheroes (
name frozen<full_name> PRIMARY KEY,
home address
home frozen<address>
);
.. note::


@@ -167,6 +167,11 @@ All tables in a keyspace are uploaded, the destination object names will look li
or
`gs://bucket/some/prefix/to/store/data/.../sstable`
# System tables
The object-storage code relies on a few system tables in order to operate.
* [system_distributed.snapshot_sstables](docs/dev/snapshot_sstables.md) - Used during restore by worker nodes to get the list of SSTables that need to be downloaded from object storage and restored locally.
* [system.sstables](docs/dev/system_keyspace.md#systemsstables) - Used to keep track of SSTables on object storage when a keyspace is created with object storage storage_options.
# Manipulating S3 data
This section intends to give an overview of where, when and how we store data in S3 and provide a quick set of commands


@@ -0,0 +1,52 @@
# system\_distributed.snapshot\_sstables
## Purpose
This table is used during tablet-aware restore to exchange per-SSTable metadata between
the coordinator and worker nodes. When the restore process starts, the coordinator node
populates this table with information about each SSTable extracted from the snapshot
manifests. Worker nodes then read from this table to determine which SSTables need to
be downloaded from object storage and restored locally.
Rows are inserted with a TTL so that stale restore metadata is automatically cleaned up.
## Schema
~~~
CREATE TABLE system_distributed.snapshot_sstables (
snapshot_name text,
"keyspace" text,
"table" text,
datacenter text,
rack text,
first_token bigint,
sstable_id uuid,
last_token bigint,
toc_name text,
prefix text,
PRIMARY KEY ((snapshot_name, "keyspace", "table", datacenter, rack), first_token, sstable_id)
)
~~~
Column descriptions:
| Column | Type | Description |
|--------|------|-------------|
| `snapshot_name` | text (partition key) | Name of the snapshot |
| `keyspace` | text (partition key) | Keyspace the snapshot was taken from |
| `table` | text (partition key) | Table within the keyspace |
| `datacenter` | text (partition key) | Datacenter where the SSTable is located |
| `rack` | text (partition key) | Rack where the SSTable is located |
| `first_token` | bigint (clustering key) | First token in the token range covered by this SSTable |
| `sstable_id` | uuid (clustering key) | Unique identifier for the SSTable |
| `last_token` | bigint | Last token in the token range covered by this SSTable |
| `toc_name` | text | TOC filename of the SSTable (e.g. `me-3gdq_0bki_2cvk01yl83nj0tp5gh-big-TOC.txt`) |
| `prefix` | text | Prefix path in object storage where the SSTable was backed up |
## APIs
The following C++ APIs are provided in `db::system_distributed_keyspace`:
- insert\_snapshot\_sstable
- get\_snapshot\_sstables
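
As a rough illustration of how a worker might use these rows, the sketch below picks the SSTables whose token range falls inside a tablet's range. It is not the C++ implementation: `SnapshotSSTable` and `sstables_for_tablet` are hypothetical names, and the inclusive-bounds containment check is an assumption about the selection rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnapshotSSTable:
    """Subset of the columns of system_distributed.snapshot_sstables."""
    first_token: int
    last_token: int
    toc_name: str
    prefix: str


def sstables_for_tablet(rows, tablet_first, tablet_last):
    """Return rows whose [first_token, last_token] lies within the tablet range.

    Both bounds are treated as inclusive, which is an assumption of this sketch.
    """
    return [
        r for r in rows
        if tablet_first <= r.first_token and r.last_token <= tablet_last
    ]
```

Because `first_token` is the first clustering column, a real query could also narrow the scan with a range restriction on it before applying the `last_token` check on the client side.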


@@ -274,6 +274,8 @@ globally driven by the topology change coordinator and serialized per-tablet. Tr
- repair - tablet replicas are repaired
- restore - tablet replicas download SSTables from object storage during cluster-wide backup restore
Each tablet has its own state machine for keeping state of transition stored in group0 which is part of the tablet state. It involves
these properties of a tablet:
@@ -390,6 +392,9 @@ stateDiagram-v2
The repair tablet transition kind is different. It passes only through the repair and end_repair stages because token ownership does not change.
The restore tablet transition kind is also simple. It uses a single `restore` stage and does not change token
ownership. See the [Tablet-aware restore](#tablet-aware-restore) section below for details.
The behavioral difference between "migration" and "intranode_migration" transitions is in the way "streaming" stage
is performed. In case of intra-node migration, streaming is done by fast duplication of data by creating hard links to
sstable files on the destination shard. Original sstable files on the source shard will be removed by the standard "cleanup" stage.
@@ -984,3 +989,18 @@ Losing a committed entry can be observed by external systems. For example, the l
schema version in the cluster can go back in time from the driver's perspective. This
is outside the scope of the recovery procedure, though, and it shouldn't cause
problems in practice.
# Tablet restore transition
The `restore` tablet transition kind is used by the tablet-aware restore to download SSTables
from object storage. The transition contains `restore_config` with snapshot name, endpoint and
bucket.
Like `repair`, the `restore` transition does not change token ownership — replicas remain intact.
The topology coordinator processes a tablet in this stage by calling the `RESTORE_TABLET` RPC on
all tablet replicas. Each replica then downloads and attaches the SSTables that are contained in
the tablet's token range. Whether the operation succeeds or fails, the transition is cleared;
a failure to download SSTables is propagated back to the user by the API handler itself.
Restore transitions are serialized per-tablet like any other transition (invariant [INV-TABL-2]),
so they do not run concurrently with migrations or repairs on the same tablet.
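
The control flow described in this section can be sketched as follows. This is an illustrative model only, not the topology coordinator's C++ code: `run_restore_stage`, `restore_tablet`, and `clear_transition` are hypothetical names standing in for the `RESTORE_TABLET` RPC and the group0 update.

```python
import asyncio


async def run_restore_stage(replicas, restore_tablet, clear_transition):
    """Call the restore RPC on every replica, then clear the transition.

    Returns the list of exceptions raised by replicas (empty on success),
    so that a caller such as an API handler can report them.
    """
    results = await asyncio.gather(
        *(restore_tablet(r) for r in replicas), return_exceptions=True
    )
    # The transition is cleared whether or not any replica failed.
    await clear_transition()
    return [r for r in results if isinstance(r, Exception)]
```

Collecting errors instead of raising on the first one keeps the "always clear the transition" rule in a single place and leaves error reporting to the caller.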


@@ -106,6 +106,7 @@ The most important table is `system.view_building_tasks`, which stores all unfin
CREATE TABLE system.view_building_tasks (
key text,
id timeuuid,
min_task_id timeuuid STATIC, -- lower bound for task scans; see "Tombstone avoidance" below
type text,
aborted boolean,
base_id uuid,
@@ -117,6 +118,26 @@ CREATE TABLE system.view_building_tasks (
)
```
### Tombstone avoidance
`system.view_building_tasks` is a single partition. When `finished_task_gc_fiber()` removes
finished tasks in batches, the deleted rows remain as tombstones in SSTables until compaction,
causing `tombstone_warn_threshold` warnings on subsequent reloads in large clusters.
Two mechanisms address this:
**Range tombstone on GC.** Instead of one row tombstone per deleted task, the coordinator emits
a single range tombstone `[before_all, min_alive_uuid)` where `min_alive_uuid` is the smallest
timeuuid among surviving tasks. Tasks above the boundary (rare) still get individual row tombstones.
When all tasks are deleted, a single full-partition range tombstone is used.
**Bounded scan on reload.** Physical rows remain until compaction and are still counted as dead cells.
After each GC batch, `min_task_id = min_alive_uuid` is written atomically as a static cell (same Raft
batch as the range tombstone). On reload, `min_task_id` is read using a **static-only partition slice**
(empty `_row_ranges` + `always_return_static_content`) — this makes the SSTable reader stop immediately
after the static row, before any clustering tombstones, so zero dead cells are counted. The value is
then used as `AND id >= min_task_id` to skip all tombstoned rows in the main scan.
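
The GC planning rule can be illustrated with a small model. This is a sketch under stated assumptions, not the coordinator's code: `plan_task_gc` is a hypothetical name, plain integers stand in for timeuuids (only their ordering matters), and what is written to `min_task_id` when every task is deleted is not specified above, so the sketch leaves it as `None`.

```python
def plan_task_gc(all_ids, finished_ids):
    """Plan the tombstones for one GC batch.

    Integers stand in for timeuuids; only their relative order is used.
    """
    finished = set(finished_ids)
    alive = sorted(t for t in all_ids if t not in finished)
    if not alive:
        # Every task is finished: one tombstone covers the whole partition.
        return {
            "full_partition": True,
            "range_end": None,
            "row_tombstones": [],
            "min_task_id": None,
        }
    min_alive = alive[0]
    # Finished tasks above the boundary are not covered by the range tombstone.
    stragglers = sorted(t for t in finished if t > min_alive)
    return {
        "full_partition": False,
        "range_end": min_alive,        # range tombstone [before_all, min_alive)
        "row_tombstones": stragglers,  # individual row tombstones
        "min_task_id": min_alive,      # static cell written in the same batch
    }
```

For tasks `10, 20, 30, 40, 50` with `10, 20, 40` finished, the plan is one range tombstone ending at `30`, one row tombstone for `40`, and `min_task_id = 30`, so a reload scanning `id >= 30` skips the rows covered by the range tombstone.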
The view building coordinator stores the base table currently being processed in `system.scylla_local`
under `view_building_processing_base` key.
The entry is managed by group0.


@@ -45,7 +45,7 @@ Example:
.. code-block:: console
nodetool removenode 675ed9f4-6564-6dbd-can8-43fddce952gy
nodetool removenode 675ed9f4-6564-6dbd-ca08-43fddce952de
To only mark the node as permanently down without doing actual removal, use :doc:`nodetool excludenode </operating-scylla/nodetool-commands/excludenode>`:
@@ -79,6 +79,6 @@ Example:
.. code-block:: console
nodetool removenode --ignore-dead-nodes 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c,125ed9f4-7777-1dbn-mac8-43fddce9123e 675ed9f4-6564-6dbd-can8-43fddce952gy
nodetool removenode --ignore-dead-nodes 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c,125ed9f4-7777-1db0-aac8-43fddce9123e 675ed9f4-6564-6dbd-ca08-43fddce952de
.. include:: nodetool-index.rst


@@ -74,7 +74,7 @@ Procedure
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.1.201 112.82 KB 256 32.7% 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c B1
UN 192.168.1.202 91.11 KB 256 32.9% 125ed9f4-7777-1dbn-mac8-43fddce9123e B1
UJ 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-can8-43fddce952gy B1
UJ 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-ca08-43fddce952de B1
Nodes in the cluster finished streaming data to the new node:
@@ -86,7 +86,7 @@ Procedure
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.1.201 112.82 KB 256 32.7% 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c B1
UN 192.168.1.202 91.11 KB 256 32.9% 125ed9f4-7777-1dbn-mac8-43fddce9123e B1
UN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-can8-43fddce952gy B1
UN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-ca08-43fddce952de B1
#. When the new node status is Up Normal (UN), run the :doc:`nodetool cleanup </operating-scylla/nodetool-commands/cleanup>` command on all nodes in the cluster except for the new node that has just been added. Cleanup removes keys that were streamed to the newly added node and are no longer owned by the node.


@@ -192,7 +192,7 @@ Adding new nodes
-- Address Load Tokens Owns Host ID Rack
UN 192.168.1.10 500 MB 256 33.3% 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c RACK0
UN 192.168.1.11 500 MB 256 33.3% 125ed9f4-7777-1dbn-mac8-43fddce9123e RACK1
UN 192.168.1.12 500 MB 256 33.3% 675ed9f4-6564-6dbd-can8-43fddce952gy RACK2
UN 192.168.1.12 500 MB 256 33.3% 675ed9f4-6564-6dbd-ca08-43fddce952de RACK2
UJ 192.168.2.10 250 MB 256 ? a1b2c3d4-5678-90ab-cdef-112233445566 RACK0
**Example output after bootstrap completes:**
@@ -205,7 +205,7 @@ Adding new nodes
-- Address Load Tokens Owns Host ID Rack
UN 192.168.1.10 400 MB 256 25.0% 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c RACK0
UN 192.168.1.11 400 MB 256 25.0% 125ed9f4-7777-1dbn-mac8-43fddce9123e RACK1
UN 192.168.1.12 400 MB 256 25.0% 675ed9f4-6564-6dbd-can8-43fddce952gy RACK2
UN 192.168.1.12 400 MB 256 25.0% 675ed9f4-6564-6dbd-ca08-43fddce952de RACK2
UN 192.168.2.10 400 MB 256 25.0% a1b2c3d4-5678-90ab-cdef-112233445566 RACK0
#. For tablets-enabled clusters, wait for tablet load balancing to complete.


@@ -163,5 +163,5 @@ This example shows how to install and configure a three-node cluster using Gossi
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.1.201 112.82 KB 256 32.7% 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c 43
UN 192.168.1.202 91.11 KB 256 32.9% 125ed9f4-7777-1dbn-mac8-43fddce9123e 44
UN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-can8-43fddce952gy 45
UN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-ca08-43fddce952de 45


@@ -19,7 +19,7 @@ Prerequisites
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.1.201 112.82 KB 256 32.7% 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c B1
UN 192.168.1.202 91.11 KB 256 32.9% 125ed9f4-7777-1dbn-lac8-23fddce9123e B1
UN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-can8-43fddce952gy B1
UN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-ca08-43fddce952de B1
Datacenter: ASIA-DC
Status=Up/Down
@@ -165,7 +165,7 @@ Procedure
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.1.201 112.82 KB 256 32.7% 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c B1
UN 192.168.1.202 91.11 KB 256 32.9% 125ed9f4-7777-1dbn-mac8-43fddce9123e B1
UN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-can8-43fddce952gy B1
UN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-ca08-43fddce952de B1
Datacenter: EUROPE-DC
Status=Up/Down


@@ -18,7 +18,7 @@ Removing a Running Node
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.1.201 112.82 KB 256 32.7% 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c B1
UN 192.168.1.202 91.11 KB 256 32.9% 125ed9f4-7777-1dbn-mac8-43fddce9123e B1
UN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-can8-43fddce952gy B1
UN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-ca08-43fddce952de B1
#. If the node status is **Up Normal (UN)**, run the :doc:`nodetool decommission </operating-scylla/nodetool-commands/decommission>` command
to remove the node you are connected to. Using ``nodetool decommission`` is the recommended method for cluster scale-down operations. It prevents data loss
@@ -75,7 +75,7 @@ command providing the Host ID of the node you are removing. See :doc:`nodetool r
.. code-block:: console
nodetool removenode 675ed9f4-6564-6dbd-can8-43fddce952gy
nodetool removenode 675ed9f4-6564-6dbd-ca08-43fddce952de
The ``nodetool removenode`` command notifies other nodes that the token range it owns needs to be moved and
the nodes should redistribute the data using streaming. Using the command does not guarantee the consistency of the rebalanced data if


@@ -23,7 +23,7 @@ Prerequisites
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.1.201 112.82 KB 256 32.7% 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c B1
DN 192.168.1.202 91.11 KB 256 32.9% 125ed9f4-7777-1dbn-mac8-43fddce9123e B1
DN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-can8-43fddce952gy B1
DN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-ca08-43fddce952de B1
Log in to one of the nodes in the cluster with (UN) status and collect the following information from the node:


@@ -29,7 +29,7 @@ Down (DN), and the node can be replaced.
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.1.201 112.82 KB 256 32.7% 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c B1
UN 192.168.1.202 91.11 KB 256 32.9% 125ed9f4-7777-1dbn-mac8-43fddce9123e B1
DN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-can8-43fddce952gy B1
DN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-ca08-43fddce952de B1
Remove the Data
==================
@@ -72,7 +72,7 @@ Procedure
For example (using the Host ID of the failed node from above):
``replace_node_first_boot: 675ed9f4-6564-6dbd-can8-43fddce952gy``
``replace_node_first_boot: 675ed9f4-6564-6dbd-ca08-43fddce952de``
#. Start the new node.
@@ -90,7 +90,7 @@ Procedure
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.1.201 112.82 KB 256 32.7% 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c B1
UN 192.168.1.202 91.11 KB 256 32.9% 125ed9f4-7777-1dbn-mac8-43fddce9123e B1
DN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-can8-43fddce952gy B1
DN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-ca08-43fddce952de B1
``192.168.1.203`` is the dead node.
@@ -121,7 +121,7 @@ Procedure
/192.168.1.203
generation:1553759866
heartbeat:2147483647
HOST_ID:675ed9f4-6564-6dbd-can8-43fddce952gy
HOST_ID:675ed9f4-6564-6dbd-ca08-43fddce952de
STATUS:shutdown,true
RELEASE_VERSION:3.0.8
X3:3
@@ -178,7 +178,7 @@ In this case, the node's data will be cleaned after restart. To remedy this, you
.. code-block:: none
echo 'replace_node_first_boot: 675ed9f4-6564-6dbd-can8-43fddce952gy' | sudo tee --append /etc/scylla/scylla.yaml
echo 'replace_node_first_boot: 675ed9f4-6564-6dbd-ca08-43fddce952de' | sudo tee --append /etc/scylla/scylla.yaml
#. Run the following command to re-setup RAID


@@ -1,5 +1,5 @@
Migrate a Keyspace from Vnodes to Tablets
==========================================
Migrate a Keyspace from Vnodes to Tablets :label-caution:`Experimental`
=========================================================================
This procedure describes how to migrate an existing keyspace from vnodes
to tablets. Tablets are designed to be the long-term replacement for vnodes,
@@ -8,6 +8,9 @@ balancing, automatic cleanups, and improved streaming performance. Migrating to
tablets is strongly recommended. See :doc:`Data Distribution with Tablets </architecture/tablets/>`
for details.
This feature is experimental and will change in future releases, including
the removal of current limitations.
.. note::
The migration is an online operation. This means that the keyspace remains


@@ -16,7 +16,7 @@ Cluster and Node Limits
* - Nodes per cluster
- Low hundreds
* - Node size
- 256 vcpu
- 4096 CPUs
See :ref:`Hardware Requirements <system-requirements-hardware>` for storage
and memory requirements and limits.


@@ -4,7 +4,7 @@ Upgrade ScyllaDB
.. toctree::
ScyllaDB 2025.x to ScyllaDB 2026.1 <upgrade-guide-from-2025.x-to-2026.1/index>
ScyllaDB 2026.1 to ScyllaDB 2026.2 <upgrade-guide-from-2026.1-to-2026.2/index>
ScyllaDB 2026.x Patch Upgrades <upgrade-guide-from-2026.x.y-to-2026.x.z>
ScyllaDB Image <ami-upgrade>


@@ -1,13 +0,0 @@
==========================================================
Upgrade - ScyllaDB 2025.x to ScyllaDB 2026.1
==========================================================
.. toctree::
:maxdepth: 2
:hidden:
Upgrade ScyllaDB <upgrade-guide-from-2025.x-to-2026.1>
Metrics Update <metric-update-2025.x-to-2026.1>
* :doc:`Upgrade from ScyllaDB 2025.x to ScyllaDB 2026.1 <upgrade-guide-from-2025.x-to-2026.1>`
* :doc:`Metrics Update Between 2025.x and 2026.1 <metric-update-2025.x-to-2026.1>`


@@ -1,82 +0,0 @@
.. |SRC_VERSION| replace:: 2025.x
.. |NEW_VERSION| replace:: 2026.1
.. |PRECEDING_VERSION| replace:: 2025.4
================================================================
Metrics Update Between |SRC_VERSION| and |NEW_VERSION|
================================================================
.. toctree::
:maxdepth: 2
:hidden:
ScyllaDB |NEW_VERSION| Dashboards are available as part of the latest |mon_root|.
New Metrics in |NEW_VERSION|
--------------------------------------
The following metrics are new in ScyllaDB |NEW_VERSION| compared to |PRECEDING_VERSION|.
.. list-table::
:widths: 25 150
:header-rows: 1
* - Metric
- Description
* - scylla_alternator_operation_size_kb
- Histogram of item sizes involved in a request.
* - scylla_column_family_total_disk_space_before_compression
- Hypothetical total disk space used if data files weren't compressed
* - scylla_group_name_auto_repair_enabled_nr
- Number of tablets with auto repair enabled.
* - scylla_group_name_auto_repair_needs_repair_nr
- Number of tablets with auto repair enabled that currently need repair.
* - scylla_lsa_compact_time_ms
- Total time spent on segment compaction that was not accounted under ``reclaim_time_ms``.
* - scylla_lsa_evict_time_ms
- Total time spent on evicting objects that was not accounted under ``reclaim_time_ms``.
* - scylla_lsa_reclaim_time_ms
- Total time spent in reclaiming LSA memory back to std allocator.
* - scylla_object_storage_memory_usage
- Total number of bytes consumed by the object storage client.
* - scylla_tablet_ops_failed
- Number of failed tablet auto repair attempts.
* - scylla_tablet_ops_succeeded
- Number of successful tablet auto repair attempts.
Renamed Metrics in |NEW_VERSION|
--------------------------------------
The following metrics are renamed in ScyllaDB |NEW_VERSION| compared to |PRECEDING_VERSION|.
.. list-table::
:widths: 25 150
:header-rows: 1
* - Metric Name in |PRECEDING_VERSION|
- Metric Name in |NEW_VERSION|
* - scylla_s3_memory_usage
- scylla_object_storage_memory_usage
Removed Metrics in |NEW_VERSION|
--------------------------------------
The following metrics are removed in ScyllaDB |NEW_VERSION|.
* scylla_redis_current_connections
* scylla_redis_op_latency
* scylla_redis_operation
* scylla_redis_operation
* scylla_redis_requests_latency
* scylla_redis_requests_served
* scylla_redis_requests_serving
New and Updated Metrics in Previous Releases
-------------------------------------------------------
* `Metrics Update Between 2025.3 and 2025.4 <https://docs.scylladb.com/manual/branch-2025.4/upgrade/upgrade-guides/upgrade-guide-from-2025.x-to-2025.4/metric-update-2025.x-to-2025.4.html>`_
* `Metrics Update Between 2025.2 and 2025.3 <https://docs.scylladb.com/manual/branch-2025.3/upgrade/upgrade-guides/upgrade-guide-from-2025.2-to-2025.3/metric-update-2025.2-to-2025.3.html>`_
* `Metrics Update Between 2025.1 and 2025.2 <https://docs.scylladb.com/manual/branch-2025.2/upgrade/upgrade-guides/upgrade-guide-from-2025.1-to-2025.2/metric-update-2025.1-to-2025.2.html>`_


@@ -0,0 +1,13 @@
==========================================================
Upgrade - ScyllaDB 2026.1 to ScyllaDB 2026.2
==========================================================
.. toctree::
:maxdepth: 2
:hidden:
Upgrade ScyllaDB <upgrade-guide-from-2026.1-to-2026.2>
Metrics Update <metric-update-2026.1-to-2026.2>
* :doc:`Upgrade from ScyllaDB 2026.1 to ScyllaDB 2026.2 <upgrade-guide-from-2026.1-to-2026.2>`
* :doc:`Metrics Update Between 2026.1 and 2026.2 <metric-update-2026.1-to-2026.2>`


@@ -0,0 +1,126 @@
.. |SRC_VERSION| replace:: 2026.1
.. |NEW_VERSION| replace:: 2026.2
.. |PRECEDING_VERSION| replace:: 2026.1
================================================================
Metrics Update Between |SRC_VERSION| and |NEW_VERSION|
================================================================
.. toctree::
:maxdepth: 2
:hidden:
ScyllaDB |NEW_VERSION| Dashboards are available as part of the latest |mon_root|.
New Metrics in |NEW_VERSION|
--------------------------------------
The following metrics are new in ScyllaDB |NEW_VERSION| compared to |PRECEDING_VERSION|.
.. list-table::
:widths: 25 150
:header-rows: 1
* - Metric
- Description
* - scylla_auth_cache_permissions
- Total number of permission sets currently cached across all roles.
* - scylla_auth_cache_roles
- Number of roles currently cached.
* - scylla_cql_forwarded_requests
- Counts the total number of attempts to forward CQL requests to other nodes.
One request may be forwarded multiple times, particularly when a write is
handled by a non-replica node.
* - scylla_cql_write_consistency_levels_disallowed_violations
- Counts the number of write_consistency_levels_disallowed guardrail violations,
i.e. attempts to write with a forbidden consistency level.
* - scylla_cql_write_consistency_levels_warned_violations
- Counts the number of write_consistency_levels_warned guardrail violations,
i.e. attempts to write with a discouraged consistency level.
* - scylla_cql_writes_per_consistency_level
- Counts the number of writes for each consistency level.
* - scylla_io_queue_integrated_disk_queue_length
- Length of the integrated disk queue.
* - scylla_io_queue_integrated_queue_length
- Length of the integrated queue.
* - scylla_logstor_sm_bytes_freed
- Counts the number of data bytes freed.
* - scylla_logstor_sm_bytes_read
- Counts the number of bytes read from the disk.
* - scylla_logstor_sm_bytes_written
- Counts the number of bytes written to the disk.
* - scylla_logstor_sm_compaction_bytes_written
- Counts the number of bytes written to the disk by compaction.
* - scylla_logstor_sm_compaction_data_bytes_written
- Counts the number of data bytes written to the disk by compaction.
* - scylla_logstor_sm_compaction_records_rewritten
- Counts the number of records rewritten during compaction.
* - scylla_logstor_sm_compaction_records_skipped
- Counts the number of records skipped during compaction.
* - scylla_logstor_sm_compaction_segments_freed
- Counts the number of segments freed by compaction.
* - scylla_logstor_sm_disk_usage
- Total disk usage.
* - scylla_logstor_sm_free_segments
- Counts the number of free segments currently available.
* - scylla_logstor_sm_segment_pool_compaction_segments_get
- Counts the number of segments taken from the segment pool for compaction.
* - scylla_logstor_sm_segment_pool_normal_segments_get
- Counts the number of segments taken from the segment pool for normal writes.
* - scylla_logstor_sm_segment_pool_normal_segments_wait
- Counts the number of times normal writes had to wait for a segment to become
available in the segment pool.
* - scylla_logstor_sm_segment_pool_segments_put
- Counts the number of segments returned to the segment pool.
* - scylla_logstor_sm_segment_pool_separator_segments_get
- Counts the number of segments taken from the segment pool for separator writes.
* - scylla_logstor_sm_segment_pool_size
- Counts the number of segments in the segment pool.
* - scylla_logstor_sm_segments_allocated
- Counts the number of segments allocated.
* - scylla_logstor_sm_segments_compacted
- Counts the number of segments compacted.
* - scylla_logstor_sm_segments_freed
- Counts the number of segments freed.
* - scylla_logstor_sm_segments_in_use
- Counts the number of segments currently in use.
* - scylla_logstor_sm_separator_buffer_flushed
- Counts the number of times the separator buffer has been flushed.
* - scylla_logstor_sm_separator_bytes_written
- Counts the number of bytes written to the separator.
* - scylla_logstor_sm_separator_data_bytes_written
- Counts the number of data bytes written to the separator.
* - scylla_logstor_sm_separator_flow_control_delay
- Current delay, in microseconds, applied to writes to control separator debt.
* - scylla_logstor_sm_separator_segments_freed
- Counts the number of segments freed by the separator.
* - scylla_transport_cql_pending_response_memory
- Holds the total memory in bytes consumed by responses waiting to be sent.
* - scylla_transport_cql_request_histogram_bytes
- A histogram of received bytes in CQL messages of a specific kind and
specific scheduling group.
* - scylla_transport_cql_requests_serving
- Holds the number of requests that are being processed right now.
* - scylla_transport_cql_response_histogram_bytes
- A histogram of sent bytes in CQL messages of a specific kind and
specific scheduling group.
* - scylla_transport_requests_forwarded_failed
- Counts the number of requests that were forwarded to another replica
but failed to execute there.
* - scylla_transport_requests_forwarded_prepared_not_found
- Counts the number of requests that were forwarded to another replica
but failed there because the statement was not prepared on the target.
When this happens, the coordinator performs an additional remote call
to prepare the statement on the replica and retries the EXECUTE request
afterwards.
* - scylla_transport_requests_forwarded_redirected
- Counts the number of requests that were forwarded to another replica
but that replica responded with a redirect to another node. This can
happen when the replica has stale information about the cluster topology or
when the request is handled by a node that is not a replica for the data
being accessed by the request.
* - scylla_transport_requests_forwarded_successfully
- Counts the number of requests that were forwarded to another replica
and executed successfully there.


@@ -1,13 +1,13 @@
.. |SCYLLA_NAME| replace:: ScyllaDB
-.. |SRC_VERSION| replace:: 2025.x
-.. |NEW_VERSION| replace:: 2026.1
+.. |SRC_VERSION| replace:: 2026.1
+.. |NEW_VERSION| replace:: 2026.2
.. |ROLLBACK| replace:: rollback
.. _ROLLBACK: ./#rollback-procedure
-.. |SCYLLA_METRICS| replace:: ScyllaDB Metrics Update - ScyllaDB 2025.x to 2026.1
-.. _SCYLLA_METRICS: ../metric-update-2025.x-to-2026.1
+.. |SCYLLA_METRICS| replace:: ScyllaDB Metrics Update - ScyllaDB 2026.1 to 2026.2
+.. _SCYLLA_METRICS: ../metric-update-2026.1-to-2026.2
=======================================================================================
Upgrade from |SCYLLA_NAME| |SRC_VERSION| to |SCYLLA_NAME| |NEW_VERSION|


@@ -289,8 +289,8 @@ private:
sstring _host;
host_options& _options;
-output_stream<char> _output;
-input_stream<char> _input;
+std::optional<output_stream<char>> _output;
+std::optional<input_stream<char>> _input;
seastar::connected_socket _socket;
std::optional<temporary_buffer<char>> _in_buffer;
std::optional<future<>> _pending;
@@ -347,8 +347,8 @@ future<> kmip_host::impl::connection::connect() {
// #998 Set keepalive to try avoiding connection going stale in between commands.
s.set_keepalive_parameters(net::tcp_keepalive_params{60s, 60s, 10});
s.set_keepalive(true);
-_input = s.input();
-_output = s.output();
+_input.emplace(s.input());
+_output.emplace(s.output());
});
});
});
@@ -367,9 +367,9 @@ int kmip_host::impl::connection::send(void* data, unsigned int len, unsigned int
}
kmip_log.trace("{}: Sending {} bytes", *this, len);
-auto f = _output.write(reinterpret_cast<char *>(data), len).then([this] {
+auto f = _output->write(reinterpret_cast<char *>(data), len).then([this] {
kmip_log.trace("{}: send done. flushing...", *this);
-return _output.flush();
+return _output->flush();
});
// if the call failed already, we still want to
// drop back to "wait_for_io()", because we cannot throw
@@ -405,7 +405,7 @@ int kmip_host::impl::connection::recv(void* data, unsigned int len, unsigned int
}
kmip_log.trace("{}: issue read", *this);
-auto f = _input.read().then([this](temporary_buffer<char> buf) {
+auto f = _input->read().then([this](temporary_buffer<char> buf) {
kmip_log.trace("{}: got {} bytes", *this, buf.size());
_in_buffer = std::move(buf);
});
@@ -462,8 +462,8 @@ void kmip_host::impl::connection::attach(KMIP_CMD* cmd) {
}
future<> kmip_host::impl::connection::close() {
-return _output.close().finally([this] {
-return _input.close();
+return _output->close().finally([this] {
+return _input->close();
});
}
@@ -598,7 +598,7 @@ future<int> kmip_host::impl::do_cmd(KMIP_CMD* cmd, con_ptr cp, Func& f, bool ret
template<typename Func>
future<kmip_host::impl::kmip_cmd> kmip_host::impl::do_cmd(kmip_cmd cmd_in, Func && f) {
-kmip_log.trace("{}: begin do_cmd", *this, cmd_in);
+kmip_log.trace("{}: begin do_cmd {}", *this, cmd_in);
KMIP_CMD* cmd = cmd_in;
// #998 Need to do retry loop, because we can have either timed out connection,
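Review note on the `_input` / `_output` hunks above: the diff does not say why the stream members became `std::optional`. One common reason for this pattern is that the member type is awkward or impossible to assign to after default construction, so it is constructed in place with `emplace()` once the socket exists. The sketch below illustrates only that general pattern. `byte_stream` and `connection` are hypothetical stand-ins, not Seastar or ScyllaDB types.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <utility>

// Hypothetical stand-in for a stream type that can be move-constructed
// but not assigned to once it exists.
class byte_stream {
    std::string _name;
public:
    explicit byte_stream(std::string name) : _name(std::move(name)) {}
    byte_stream(byte_stream&&) = default;
    byte_stream& operator=(byte_stream&&) = delete;
    const std::string& name() const { return _name; }
};

// Hypothetical connection that creates its streams only in connect().
class connection {
    std::optional<byte_stream> _input;   // empty until connect()
    std::optional<byte_stream> _output;  // empty until connect()
public:
    bool connected() const { return _input.has_value() && _output.has_value(); }

    void connect() {
        // emplace() constructs the value in place; no assignment operator is needed.
        _input.emplace(byte_stream("in"));
        _output.emplace(byte_stream("out"));
    }

    const std::string& output_name() const {
        // Dereferencing an empty optional is undefined behavior, so check first.
        assert(_output.has_value());
        return _output->name();
    }
};
```

The cost of the pattern is that every use site must go through `->` and has to be reached only after the streams were emplaced, which matches the `_output->write(...)` and `_input->read()` rewrites in the hunks above.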


@@ -616,7 +616,7 @@ future<rjson::value> encryption::kms_host::impl::do_post(std::string_view target
static auto get_xml_node = [](node_type* node, const char* what) {
auto res = node->first_node(what);
if (!res) {
-throw malformed_response_error(fmt::format("XML parse error", what));
+throw malformed_response_error(fmt::format("XML parse error: {}", what));
}
return res;
};


@@ -109,6 +109,7 @@ std::set<std::string_view> feature_service::supported_feature_set() const {
"UUID_SSTABLE_IDENTIFIERS"sv,
"GROUP0_SCHEMA_VERSIONING"sv,
"VIEW_BUILD_STATUS_ON_GROUP0"sv,
+"CDC_GENERATIONS_V2"sv,
};
if (is_test_only_feature_deprecated()) {


@@ -83,7 +83,6 @@ public:
gms::feature alternator_ttl { *this, "ALTERNATOR_TTL"sv };
gms::feature cql_row_ttl { *this, "CQL_ROW_TTL"sv };
gms::feature range_scan_data_variant { *this, "RANGE_SCAN_DATA_VARIANT"sv };
-gms::feature cdc_generations_v2 { *this, "CDC_GENERATIONS_V2"sv };
gms::feature user_defined_aggregates { *this, "UDA"sv };
// Historically max_result_size contained only two fields: soft_limit and
// hard_limit. It was somehow obscure because for normal paged queries both
@@ -183,6 +182,7 @@ public:
gms::feature arbitrary_tablet_boundaries { *this, "ARBITRARY_TABLET_BOUNDARIES"sv };
gms::feature large_data_virtual_tables { *this, "LARGE_DATA_VIRTUAL_TABLES"sv };
gms::feature keyspace_multi_rf_change { *this, "KEYSPACE_MULTI_RF_CHANGE"sv };
+gms::feature view_building_tasks_min_task_id { *this, "VIEW_BUILDING_TASKS_MIN_TASK_ID"sv };
public:
const std::unordered_map<sstring, std::reference_wrapper<feature>>& registered_features() const;


@@ -399,9 +399,10 @@ future<> gossiper::do_send_ack2_msg(locator::host_id from, utils::chunked_vector
}
}
gms::gossip_digest_ack2 ack2_msg(std::move(delta_ep_state_map));
-logger.debug("Calling do_send_ack2_msg to node {}, ack_msg_digest={}, ack2_msg={}", from, ack_msg_digest, ack2_msg);
+auto ack2_msg_str = fmt::format("{}", ack2_msg);
+logger.debug("Calling do_send_ack2_msg to node {}, ack_msg_digest={}, ack2_msg={}", from, ack_msg_digest, ack2_msg_str);
co_await ser::gossip_rpc_verbs::send_gossip_digest_ack2(&_messaging, from, std::move(ack2_msg));
-logger.debug("finished do_send_ack2_msg to node {}, ack_msg_digest={}, ack2_msg={}", from, ack_msg_digest, ack2_msg);
+logger.debug("finished do_send_ack2_msg to node {}, ack_msg_digest={}, ack2_msg={}", from, ack_msg_digest, ack2_msg_str);
}
// Depends on
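Review note on the `do_send_ack2_msg` hunk above: the new code formats `ack2_msg` into a string before the message is passed to the send call with `std::move`, and the second log line reuses that string instead of the moved-from object. A minimal sketch of the same idea in plain C++ follows. `message`, `describe`, `send` and `send_with_logging` are hypothetical names used only for illustration.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical message type standing in for the gossip payload.
struct message {
    std::vector<int> digests;
};

// Renders a message for logging.
std::string describe(const message& m) {
    return "message{digests=" + std::to_string(m.digests.size()) + "}";
}

// Consumes the message by value, as a send call taking ownership would.
std::size_t send(message m) {
    return m.digests.size();
}

// Returns the "before" and "after" log lines. The text is captured once,
// before the move, so the second line never reads the moved-from object.
std::pair<std::string, std::string> send_with_logging(message msg) {
    const std::string text = describe(msg);
    std::string before = "sending " + text;
    send(std::move(msg));  // msg is in a valid but unspecified state from here on
    std::string after = "finished " + text;
    return {before, after};
}
```

Capturing the text once also means the message is rendered a single time, at the price of building the string even when the log level would discard it.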
@@ -964,8 +965,7 @@ future<> gossiper::failure_detector_loop_for_node(locator::host_id host_id, gene
diff = now - last;
if (!failed) {
last = now;
-}
-if (diff > max_duration) {
+} else if (diff > max_duration) {
logger.info("failure_detector_loop: Mark node {}/{} as DOWN", host_id, node);
co_await container().invoke_on(0, [host_id] (gms::gossiper& g) {
return g.convict(host_id);
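Review note on the `failure_detector_loop_for_node` hunk above: as the diff reads, the `diff > max_duration` check moved into the `else` branch, so a node is marked DOWN only when the current probe failed and the time since the last success exceeds the limit. A successful probe that arrives after a long gap now just refreshes `last`. The sketch below models only that branch as a pure function. `detector_state` and `should_convict` are hypothetical names, and the real loop does more than this.

```cpp
#include <cassert>
#include <chrono>

using clock_type = std::chrono::steady_clock;

// Hypothetical per-node state: the time of the last successful probe.
struct detector_state {
    clock_type::time_point last;
};

// Mirrors the branch structure after the change:
//   success -> remember the time, never convict
//   failure -> convict only if the last success is older than max_duration
bool should_convict(detector_state& st, clock_type::time_point now,
                    bool failed, clock_type::duration max_duration) {
    assert(max_duration > clock_type::duration::zero());
    const auto diff = now - st.last;
    if (!failed) {
        st.last = now;
        return false;
    }
    return diff > max_duration;
}
```

Under the previous structure the same check ran after the `if`, so a late but successful probe could still trip it because `diff` was computed before `last` was updated.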


@@ -53,6 +53,7 @@ set(idl_headers
group0.idl.hh
hinted_handoff.idl.hh
sstables.idl.hh
+sstables_loader.idl.hh
storage_proxy.idl.hh
storage_service.idl.hh
strong_consistency/state_machine.idl.hh


@@ -0,0 +1,12 @@
/*
* Copyright 2026-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
class restore_result {
};
verb [[]] restore_tablet (raft::server_id dst_id, locator::global_tablet_id gid) -> restore_result;


@@ -72,6 +72,7 @@ struct raft_topology_cmd_result {
success
};
service::raft_topology_cmd_result::command_status status;
+sstring error_message [[version 2026.2]];
};
struct raft_snapshot {


@@ -5,6 +5,8 @@ target_sources(index
PRIVATE
secondary_index.cc
secondary_index_manager.cc
+fulltext_index.cc
+index_option_utils.cc
vector_index.cc)
target_include_directories(index
PUBLIC

index/fulltext_index.cc

@@ -0,0 +1,96 @@
/*
* Copyright 2026-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#include "cql3/statements/index_target.hh"
#include "cql3/util.hh"
#include "exceptions/exceptions.hh"
#include "schema/schema.hh"
#include "index/fulltext_index.hh"
#include "index/index_option_utils.hh"
#include "index/secondary_index_manager.hh"
#include "utils/UUID_gen.hh"
#include <seastar/core/sstring.hh>
#include <boost/algorithm/string.hpp>
namespace secondary_index {
// Supported text analyzers for fulltext indexing.
// This list corresponds to analyzers expected to be provided
// by the backend search engine (Tantivy).
static const std::vector<sstring> analyzer_values = {
"standard", "english", "german", "french", "spanish", "italian", "portuguese", "russian", "chinese", "japanese", "korean", "simple", "whitespace"};
const static std::unordered_map<sstring, std::function<void(std::string_view, const sstring&, const sstring&)>> fulltext_index_options = {
// 'analyzer' specifies the built-in text analyzer to use for tokenization.
{"analyzer", std::bind_front(util::validate_enumerated_option, analyzer_values)},
// 'positions' controls whether token positions are stored in the index.
// Required for phrase queries. Set to false to save space.
{"positions", std::bind_front(util::validate_enumerated_option, util::boolean_values)},
};
bool fulltext_index::view_should_exist() const {
return false;
}
std::optional<cql3::description> fulltext_index::describe(const index_metadata& im, const schema& base_schema) const {
auto target = im.options().at(cql3::statements::index_target::target_option_name);
auto target_column = cql3::statements::index_target::column_name_from_target_string(target);
return describe_with_target(im, base_schema, cql3::util::maybe_quote(target_column));
}
void fulltext_index::check_target(const schema& schema, const std::vector<::shared_ptr<cql3::statements::index_target>>& targets) const {
using cql3::statements::index_target;
if (targets.size() != 1) {
throw exceptions::invalid_request_exception("Fulltext index must have exactly one target column");
}
auto& target = targets[0];
if (!std::holds_alternative<index_target::single_column>(target->value)) {
throw exceptions::invalid_request_exception("Fulltext index target must be a single column");
}
auto& column = std::get<index_target::single_column>(target->value);
auto c_name = column->to_string();
auto const* c_def = schema.get_column_definition(column->name());
if (c_def == nullptr) {
throw exceptions::invalid_request_exception(format("Column {} not found in schema", c_name));
}
auto kind = c_def->type->get_kind();
if (kind != abstract_type::kind::utf8 && kind != abstract_type::kind::ascii) {
throw exceptions::invalid_request_exception(
format("Fulltext index is only supported on text, varchar, or ascii columns, but column {} has an incompatible type", c_name));
}
}
void fulltext_index::check_index_options(const cql3::statements::index_specific_prop_defs& properties) const {
for (auto option : properties.get_raw_options()) {
auto it = fulltext_index_options.find(option.first);
if (it == fulltext_index_options.end()) {
throw exceptions::invalid_request_exception(format("Unsupported option {} for fulltext index", option.first));
}
it->second(index_type_name(), option.first, option.second);
}
}
void fulltext_index::validate(const schema& schema, const cql3::statements::index_specific_prop_defs& properties,
const std::vector<::shared_ptr<cql3::statements::index_target>>& targets, const gms::feature_service&, const data_dictionary::database&) const {
check_target(schema, targets);
check_index_options(properties);
}
utils::UUID fulltext_index::index_version(const schema& schema) {
return utils::UUID_gen::get_time_UUID();
}
std::unique_ptr<secondary_index::custom_index> fulltext_index_factory() {
return std::make_unique<fulltext_index>();
}
} // namespace secondary_index
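Review note on `check_index_options` above: the function looks each option up in a table of validators, rejects names that are not in the table, and otherwise delegates to the validator for that name. The sketch below shows the same table-driven shape using only the standard library. `one_of`, `option_table`, `check_options` and `accepts` are hypothetical helpers, the analyzer list is shortened, and the error texts are illustrative. It does not reproduce `util::validate_enumerated_option`, whose definition is not part of this diff.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <map>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

using validator = std::function<void(const std::string& option, const std::string& value)>;

// Builds a validator that accepts only values from a fixed list.
validator one_of(std::vector<std::string> allowed) {
    assert(!allowed.empty());
    return [allowed = std::move(allowed)](const std::string& option, const std::string& value) {
        if (std::find(allowed.begin(), allowed.end(), value) == allowed.end()) {
            throw std::invalid_argument("Invalid value " + value + " for option " + option);
        }
    };
}

// Option name -> validator, in the spirit of fulltext_index_options.
const std::unordered_map<std::string, validator>& option_table() {
    static const std::unordered_map<std::string, validator> table = {
        {"analyzer", one_of({"standard", "english", "simple", "whitespace"})},
        {"positions", one_of({"true", "false"})},
    };
    return table;
}

// Throws on the first unknown option or invalid value.
void check_options(const std::map<std::string, std::string>& options) {
    for (const auto& [name, value] : options) {
        const auto it = option_table().find(name);
        if (it == option_table().end()) {
            throw std::invalid_argument("Unsupported option " + name);
        }
        it->second(name, value);
    }
}

// Convenience wrapper that reports validity as a bool.
bool accepts(const std::map<std::string, std::string>& options) {
    try {
        check_options(options);
        return true;
    } catch (const std::invalid_argument&) {
        return false;
    }
}
```

Keeping the validators in one table means a new option needs a single new entry, and unknown names are rejected by default rather than silently ignored.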

index/fulltext_index.hh

@@ -0,0 +1,43 @@
/*
* Copyright 2026-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
#include "schema/schema.hh"
#include "data_dictionary/data_dictionary.hh"
#include "cql3/statements/index_target.hh"
#include "index/secondary_index_manager.hh"
#include <vector>
namespace secondary_index {
class fulltext_index : public custom_index {
public:
std::string_view index_type_name() const override {
return "fulltext";
}
fulltext_index() = default;
~fulltext_index() override = default;
std::optional<cql3::description> describe(const index_metadata& im, const schema& base_schema) const override;
bool view_should_exist() const override;
void validate(const schema& schema, const cql3::statements::index_specific_prop_defs& properties,
const std::vector<::shared_ptr<cql3::statements::index_target>>& targets, const gms::feature_service& fs,
const data_dictionary::database& db) const override;
utils::UUID index_version(const schema& schema) override;
private:
void check_target(const schema& schema, const std::vector<::shared_ptr<cql3::statements::index_target>>& targets) const;
void check_index_options(const cql3::statements::index_specific_prop_defs& properties) const;
};
std::unique_ptr<secondary_index::custom_index> fulltext_index_factory();
} // namespace secondary_index

Some files were not shown because too many files have changed in this diff