Compare commits


128 Commits

Author SHA1 Message Date
Alex
a0303bfd41 test/auth_cluster: simulate v1 state in self-heal test
When skip_service_levels_v2_initialization is used, write an explicit
v1 service level version marker while skipping v2 initialization. This
lets the restart test exercise self-healing from v1 to v2.
2026-05-13 16:00:02 +03:00
Alex
12dfd9b487 qos: self-heal stale service levels version on startup
Add self_heal_service_levels_version() and use it during startup when
the node is already on raft topology but service levels are still marked
as v1.

In that stale state, migrate service levels to v2 through group0 instead
of failing startup.
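A minimal sketch of the startup check described above. All names here (`topology_on_raft`, `version`, `migrate_to_v2`) are illustrative, not the actual Scylla symbols:

```python
def self_heal_service_levels_version(topology_on_raft, version, migrate_to_v2):
    """Return the effective service levels version after the startup check.

    If the node is already on raft topology but the stored marker still
    says v1 (the stale state left by the old migration bug), migrate
    through group0 instead of failing startup.
    """
    if topology_on_raft and version == 1:
        migrate_to_v2()  # writes the v2 state through group0
        return 2
    return version
```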
2026-05-13 16:00:02 +03:00
Alex
ac0a19aab8 qos: reintroduce service levels v2 migration self-heal
migrate_to_v2() was removed after gossip-based service level migration
support was dropped, since upgraded nodes were expected to already use
service levels v2.

However, clusters affected by the old migration bug may reach raft topology
while system.scylla_local still has a stale service level version. Restore
the migration helper so startup can self-heal those nodes by writing the v2
state through group0.
2026-05-13 10:16:02 +03:00
Yaniv Michael Kaul
5d6f160129 test: update get_scylla_2025_1_executable() to use 2025.1.12
Update the hardcoded 2025.1.0 binary URL to the latest 2025.1.12
release for upgrade tests.

The 2025.1.12 binary now supports and enforces the
rf_rack_valid_keyspaces option which the test harness enables by
default. Since test_sstable_compression_dictionaries_upgrade creates
a 2-node cluster in a single rack with RF=2, it violates the
constraint. Disable the option explicitly for this test.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#29714
2026-05-12 23:20:55 +02:00
Wojciech Mitros
f3cf20803b test: run test_mv_admission_control_exception on one shard
In the test we perform 2 consecutive writes where the first write
is supposed to increase the view update backlog above the mv
admission control threshold and the second one is expected to be
rejected because of that.

On each node/shard we have 2 types of view update backlogs:
 1. for deciding whether we should admit writes
 2. for propagating the backlog information to other nodes/shards.

For the second write to be rejected, it must be performed on a node
and shard which updated its backlog of type 1.

The type 2 view update backlog is immediately increased on the
base table replica. For this backlog to be registered as a backlog
of type 1, it needs to be either carried by gossip (happening once
every second) or by attaching it to a replica write response. We
don't want to increase the runtime of tests unnecessarily, so we don't
wait and we rely on the second mechanism. The response to the first
base table write (the one causing increase in the backlog) carries
the increased backlog to the coordinator of this write. So for the
second write to observe the increased backlog, it needs to be coordinated
on the same node+shard as the first write.

We make sure that both writes are coordinated on the same node+shard by
using prepared statements combined with setting the host in `run_async`.
Both writes target the same partition and with prepared statements we
route them directly to the correct shard.

That was the idea, at least. In practice, for the driver to learn the
correct shard, it first needs to learn the token->shard mapping from
the server. For vnodes it can predict the shard by calculating the token
of the affected partition, but for tablets it has had no opportunity to
learn the tablet->shard mapping, so the first write may route to any shard.
Additionally, we aren't guaranteed that the driver established connections
to all shards on all nodes at the point of any write. So if a connection
finishes establishing between the two writes, this may also cause us to
coordinate these 2 writes on different shards, leading to a missed view
backlog growth and not-rejected second write.

We fix this in this patch by running the test using one shard on each node.
This way, as long as we perform both writes on the same node, they'll also
be coordinated on the same shard. This also makes the prepared statement and
BoundStatement unnecessary — we can use SimpleStatement with
FallthroughRetryPolicy directly.
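The interaction between the two backlog types can be sketched with a toy model (the class and method names below are illustrative, not Scylla code):

```python
class Shard:
    """Toy model: each shard's admission backlog (type 1) is only updated
    from replica write responses that this shard itself coordinates."""

    def __init__(self, threshold):
        self.threshold = threshold
        self.admission_backlog = 0  # type 1: decides admit/reject

    def coordinate_write(self, replica_backlog_after_write):
        """Coordinate one write; the replica's propagated backlog (type 2)
        piggybacks on the response and becomes this shard's type 1."""
        if self.admission_backlog > self.threshold:
            return "rejected"  # mv admission control kicks in
        self.admission_backlog = replica_backlog_after_write
        return "admitted"
```

Only the shard that coordinated the first write sees the increased backlog in time, which is why both writes must land on the same node+shard.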

Fixes: SCYLLADB-1901

Closes scylladb/scylladb#29862
2026-05-12 17:34:19 +02:00
Piotr Dulikowski
129f193116 Merge 'strong_consistency: implement basic coordinator metrics' from Michał Jadwiszczak
Add per-shard metrics for strong consistency coordinator operations (latency, timeouts, bounces, status unknown) under the `"strong_consistency_coordinator"` category. These are analogous to the eventual consistency metrics in `storage_proxy_stats`, enabling direct performance comparison between the two consistency modes.

The metrics are simplified compared to `storage_proxy_stats` — no breakdown by table, tablet, scheduling group, or DC, only per-shard.

Fixes SCYLLADB-1343

Strong consistency is still in experimental phase, no need to backport.

Closes scylladb/scylladb#29318

* github.com:scylladb/scylladb:
  test/strong_consistency: verify metrics
  strong_consistency: wire up metrics to operations
  strong_consistency: add stats struct and metrics registration
2026-05-12 16:15:51 +02:00
Botond Dénes
e95eb21a16 Merge 'Tablet-aware restore' from Pavel Emelyanov
The mechanics of the restore are as follows:

- A /storage_service/tablets/restore API is called with (keyspace, table, endpoint, bucket, manifests) parameters
  - First, it populates the system_distributed.snapshot_sstables table with the data read from the manifests
  - Then it emplaces a bunch of tablet transitions (of a new "restore" kind), one for each tablet
- The topology coordinator handles the "restore" transition by calling a new RESTORE_TABLET RPC against all the current tablet replicas
- Each replica handles the RPC verb by
  - Reading the snapshot_sstables table
  - Filtering the read sstable infos against current node and tablet being handled
  - Downloading and attaching the filtered sstables
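The flow above can be sketched as follows; `restore_flow`, `replicas_of`, and `download` are hypothetical stand-ins, not the actual API:

```python
def restore_flow(manifests, tablets, replicas_of, download):
    """Sketch: populate the sstable registry from the manifests, then run
    one 'restore' step per tablet, where every current replica filters
    the registry down to its own sstables and downloads them."""
    # step 1: populate system_distributed.snapshot_sstables
    registry = [sst for manifest in manifests for sst in manifest]
    # steps 2-3: one restore transition per tablet; the coordinator calls
    # each replica (RESTORE_TABLET), which filters and downloads
    for tablet in tablets:
        for node in replicas_of(tablet):
            mine = [s for s in registry
                    if s["tablet"] == tablet and s["node"] == node]
            download(node, mine)
```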

This PR includes system_distributed.snapshot_sstables table from @robertbindar and preparation work from @kreuzerkrieg that extracts raw sstables downloading and attaching from existing generic sstables loading code.

This is the first step towards SCYLLADB-197 and lacks many things. In particular:
- the API only works for a single-DC cluster
- the caller needs to "lock" tablet boundaries with min/max tablet count
- not abortable
- no progress tracking
- sub-optimal (re-kicking the API on restore will re-download everything again)
- not re-attachable (if the API node dies, restoration proceeds, but the caller cannot "wait" for it to complete via another node)
- nodes download sstables in the maintenance/streaming sched group (should be moved to maintenance/backup)

Other follow-up items:
- have an actual swagger object specification for `backup_location`

Closes #28436
Closes #28657
Closes #28773

Closes scylladb/scylladb#28763

* github.com:scylladb/scylladb:
  docs: Update topology_over_raft.md with `restore` transition kind
  test: Add test for backup vs migration race
  test: Restore resilience test
  sstables_loader: Fail tablet-restore task if not all sstables were downloaded
  sstables_loader: mark sstables as downloaded after attaching
  sstables_loader: return shared_sstable from attach_sstable
  db: add update_sstable_download_status method
  db: add downloaded column to snapshot_sstables
  db: extract snapshot_sstables TTL into class constant
  test: Add a test for tablet-aware restore
  tablets: Implement tablet-aware cluster-wide restore
  messaging: Add RESTORE_TABLET RPC verb
  sstables_loader: Add method to download and attach sstables for a tablet
  tablets: Add restore_config to tablet_transition_info
  sstables_loader: Add restore_tablets task skeleton
  test: Add rest_client helper to kick newly introduced API endpoint
  api: Add /storage_service/tablets/restore endpoint skeleton
  sstables_loader: Add keyspace and table arguments to manifest loading helper
  sstables_loader_helpers: just reformat the code
  sstables_loader_helpers: generalize argument and variable names
  sstables_loader_helpers: generalize get_sstables_for_tablet
  sstables_loader_helpers: add token getters for tablet filtering
  sstables_loader_helpers: remove underscores from struct members
  sstables_loader: move download_sstable and get_sstables_for_tablet
  sstables_loader: extract single-tablet SST filtering
  sstables_loader: make download_sstable static
  sstables_loader: fix formatting of the new `download_sstable` function
  sstables_loader: extract single SST download into a function
  sstables_loader: add shard_id to minimal_sst_info
  sstables_loader: add function for parsing backup manifests
  split utility functions for creating test data from database_test
  export make_storage_options_config from lib/test_services
  rjson: Add helpers for conversions to dht::token and sstable_id
  Add system_distributed_keyspace.snapshot_sstables
  add get_system_distributed_keyspace to cql_test_env
  code: Add system_distributed_keyspace dependency to sstables_loader
  storage_service: Export handle_raft_rpc() helper
  storage_service: Export do_tablet_operation()
  storage_service: Split transit_tablet() into two
  tablets: Add braces around tablet_transition_kind::repair switch
2026-05-12 16:24:13 +03:00
Yaniv Michael Kaul
c359a09189 test: add UDF/UDA keyspace isolation and UDT tests
Port 3 tests from scylla-dtest user_functions_test.py:
- test_udf_with_udt: UDF taking frozen UDT arg, verifies DROP TYPE blocked
- test_udf_with_udt_keyspace_isolation: cross-keyspace UDT references rejected
- test_aggregate_with_udt_keyspace_isolation: cross-keyspace UDT in UDA rejected

All tests use Lua (Scylla's supported UDF language).
Reproduces CASSANDRA-9409.

Closes scylladb/scylladb#1928

Closes scylladb/scylladb#29843
2026-05-12 14:57:14 +03:00
Yaniv Michael Kaul
f55a55fbf3 docker: fix coredump collection when host uses pipe-based core_pattern
The container image inherits kernel.core_pattern from the host.  When
the host pipes core dumps to a handler (e.g. Ubuntu's apport), that
handler does not exist or work correctly inside the container, so core
dumps are silently lost.

Override any pipe-based core_pattern with a file-based pattern that
writes directly to /var/lib/scylla/coredump/.  The override is attempted
both from the entrypoint (scyllasetup.coredumpSetup) and from
scylla-server.sh when running as root; it succeeds only when the
container has write access to /proc/sys/kernel/core_pattern and is
silently skipped otherwise.
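The override logic can be sketched like this, with `read_pattern`/`write_pattern` standing in for reading and writing /proc/sys/kernel/core_pattern, and the exact file-based format being illustrative:

```python
def fix_core_pattern(read_pattern, write_pattern):
    """If the inherited core_pattern pipes to a handler ('|...'), replace
    it with a file-based pattern; silently skip when
    /proc/sys/kernel/core_pattern is not writable inside the container."""
    pattern = read_pattern()
    if not pattern.startswith("|"):
        return pattern  # already file-based, leave it alone
    new_pattern = "/var/lib/scylla/coredump/core.%e.%p.%t"  # illustrative format
    try:
        write_pattern(new_pattern)
        return new_pattern
    except PermissionError:
        return pattern  # no write access: silently skipped
```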

Fixes: SCYLLADB-1366

Closes scylladb/scylladb#29337
2026-05-12 14:16:22 +03:00
Piotr Smaron
1018710e38 test/cqlpy: un-xfail oversized indexed value build test
Issue #8627 is fixed, so test_too_large_indexed_value_build now passes and should run normally instead of XPASSing under strict xfail.

Fixes: SCYLLADB-1938

Closes scylladb/scylladb#29853
2026-05-12 11:40:53 +02:00
Avi Kivity
ddb1181103 Merge 'load_balance: fix drain with forced capacity-based balancing' from Ferenc Szili
When `force_capacity_based_balancing` is enabled and a node is being drained/excluded, the tablet allocator incorrectly aborts balancing due to incomplete tablet stats - even though capacity-based balancing doesn't depend on tablet sizes.

The tablet allocator normally waits for complete load stats before balancing. An exception exists for drained+excluded nodes (they're unreachable and won't return stats). However, when forced capacity-based balancing is active, this exception was not being applied, causing the balancer to reject the drain plan.

Adjust the condition in `tablet_allocator.cc` so that the "ignore missing data for drained nodes" logic applies regardless of whether capacity-based balancing is forced.

Added a Boost unit test that forces capacity-based balancing and verifies a drained/excluded node gets its tablets migrated even when tablet size stats are missing.
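The adjusted condition can be illustrated as a pair of boolean predicates (these are simplified stand-ins, not the actual `tablet_allocator.cc` code):

```python
def stats_missing_blocks_balancing_old(node, forced):
    # buggy: the drained+excluded exemption only applied when
    # capacity-based balancing was NOT forced
    return not (not forced and node["drained"] and node["excluded"])

def stats_missing_blocks_balancing_fixed(node, forced):
    # fixed: the exemption applies regardless of the forced flag --
    # capacity-based balancing doesn't depend on tablet sizes anyway
    return not (node["drained"] and node["excluded"])
```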

This bug was introduced in 2026.1, so this needs to be backported to 2026.1 and 2026.2

Fixes: SCYLLADB-1803

Closes scylladb/scylladb#29791

* github.com:scylladb/scylladb:
  test: boost: add drain test for forced capacity-based balancing
  service: allow draining with forced capacity-based balancing
2026-05-12 12:38:25 +03:00
Andrzej Jackowski
89261bf759 test: wait for TTL scheduling sanity metric
The test samples sl:default runtime before and after setup writes to
prove that it measures the scheduling group used by regular CQL writes.
The metric is exported in milliseconds, so a single 200-row batch may
not be visible immediately, or may be too small in some environments.

Keep the original 200-row table size, but wait up to 30 seconds for the
metric to advance. If it does not, retry the same writes before TTL is
enabled. The retries update the same keys, so the expiration part of the
test still waits for exactly the original number of rows.
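The wait-and-retry described above can be sketched as follows; the function and parameter names are illustrative, and the inner loop stands in for the 30-second polling window:

```python
def wait_metric_advance(sample, rewrite, baseline, attempts=3, polls=60):
    """Poll until the runtime metric advances past the baseline; if it
    doesn't within the window, re-issue the same writes (same keys, so
    the expiration count stays unchanged) and poll again."""
    for _ in range(attempts):
        for _ in range(polls):  # stands in for "wait up to 30 seconds"
            if sample() > baseline:
                return True
        rewrite()  # retry the same 200-row batch
    return False
```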

In a local 100-run with N=200 rows, the observed delta of
`ms_statement_before - ms_statement_before_write` was: min=4.0,
max=16.0, mean=8.13, and median=8.0. Therefore, it seems possible that
in a rare corner case the delta drops all the way to 0.

Fixes SCYLLADB-1869

Closes scylladb/scylladb#29797
2026-05-12 12:38:25 +03:00
Avi Kivity
6fca064ac8 Merge 'alternator: a couple of small cleanups suggested by copilot' from Nadav Har'El
The first patch improves the input validation of the CONTAINS operator. I believe this is not a critical fix, because RapidJSON already has an exception-throwing RAPIDJSON_ASSERT() that checks for unexpected JSON structure (like something we expect to be a list not actually being a list), but it's cleaner to do these checks explicitly.

The second patch just removes an unnecessary call to format() on a constant string.

Closes scylladb/scylladb#28506

* github.com:scylladb/scylladb:
  alternator: remove unneeded call to format()
  alternator: improve CONTAINS operator's validity checking
2026-05-12 12:38:25 +03:00
Botond Dénes
8d6f031a4a schema: fix DESCRIBE showing NullCompactionStrategy when compaction is disabled
When a table's compaction is disabled via 'enabled': 'false', the DESCRIBE
output incorrectly showed NullCompactionStrategy instead of the actual strategy.
This happened because schema_properties() called compaction_strategy(), which
returns compaction_strategy_type::null when compaction is disabled. Fix it by
using configured_compaction_strategy(), which always returns the real strategy
type - consistent with how schema_tables.cc serializes it to disk.
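The distinction between the two accessors can be illustrated with a toy model (illustrative names, not the actual `schema` class):

```python
class TableSchema:
    """Toy model: the effective strategy is 'null' while compaction is
    disabled, but DESCRIBE must print the configured one (what
    schema_tables serializes to disk)."""

    def __init__(self, configured, enabled=True):
        self._configured = configured
        self._enabled = enabled

    def compaction_strategy(self):
        # effective strategy: what the compaction manager actually runs
        return self._configured if self._enabled else "NullCompactionStrategy"

    def configured_compaction_strategy(self):
        # what DESCRIBE should show after the fix
        return self._configured
```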

Fixes SCYLLADB-1353

Closes scylladb/scylladb#29804
2026-05-12 12:38:25 +03:00
Piotr Dulikowski
7c2b1ea0b5 Merge 'view_building: fix tombstone_warn_threshold warnings' from Michał Jadwiszczak
`system.view_building_tasks` is a single-partition Raft group0 table (pk = `"view_building"`, ck = timeuuid). When `clean_finished_tasks()` deletes hundreds of finished tasks, the physical rows remain in SSTables until compaction. Any subsequent read of the partition counts every column of every tombstoned row as a dead cell, triggering `tombstone_warn_threshold` warnings in large clusters.

Two-part fix:

**1. Range tombstones instead of row tombstones (commits 2–3)**

Instead of one row tombstone per finished task, find the minimum alive task UUID (`min_alive_uuid`) and emit a single range tombstone `[before_all, min_alive_uuid)` covering all tasks below that boundary. This reduces the tombstone count significantly and also benefits future compaction.

**2. Bounded scan with `min_task_id` (commits 4–6)**

Even with range tombstones, physical rows remain until compaction and still count as dead cells during reads. The only way to avoid them is to not read them at all.

   - Add a `min_task_id timeuuid` static column to `system.view_building_tasks`.
   - On every GC, write `min_task_id = min_alive_uuid` atomically with the range tombstone (same Raft batch).
   - On reload, read `min_task_id` first using a **static-only partition slice** (empty `_row_ranges` + `always_return_static_content`): the SSTable reader stops immediately after the static row before processing any clustering tombstones — zero dead cells counted.
   - Use `AND id >= min_task_id` as a lower bound for the main task scan, skipping all tombstoned rows.

The static-only read and the bounded scan are gated on the `VIEW_BUILDING_TASKS_MIN_TASK_ID` cluster feature so mixed-version clusters fall back to the full scan.
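Both parts of the fix can be sketched together; timeuuids are modeled as plain ints for ordering, and all names are illustrative:

```python
def gc_finished_tasks(tasks):
    """Compute min_alive_uuid, emit one range tombstone
    [before_all, min_alive_uuid), and write min_task_id atomically with it
    (same Raft batch). tasks maps task id -> finished flag."""
    alive = [tid for tid, done in tasks.items() if not done]
    min_alive = min(alive) if alive else max(tasks, default=0) + 1
    range_tombstone = (None, min_alive)  # None stands for before_all
    return range_tombstone, min_alive    # min_alive doubles as min_task_id

def reload(tasks, min_task_id):
    # bounded scan: AND id >= min_task_id skips all tombstoned rows
    return {tid: done for tid, done in tasks.items() if tid >= min_task_id}
```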

The issue is not critical, so the fix shouldn't be backported.

Fixes SCYLLADB-657

Closes scylladb/scylladb#28929

* github.com:scylladb/scylladb:
  test/cluster/test_view_building_coordinator: add reproducer for tombstone threshold warning
  docs: document tombstone avoidance in view_building_tasks
  view_building: add `task_uuid_generator` to `view_building_task_mutation_builder`
  view_building: introduce `task_uuid_generator`
  view_building: store `min_alive_uuid` in view building state
  view_building: set min_task_id when GC-ing finished tasks
  view_building: add min_task_id support to view_building_task_mutation_builder
  view_building: add min_task_id static column and bounded scan to system_keyspace
  view_building: use range tombstone when GC-ing finished tasks
  view_building: add range tombstone support to view_building_task_mutation_builder
  view_building: introduce VIEW_BUILDING_TASKS_MIN_TASK_ID cluster feature
2026-05-12 12:38:25 +03:00
Avi Kivity
cf50f0191a encryption: fix deprecated input_stream/output_stream usage in KMIP connection
Seastar deprecated default-constructing input_stream and output_stream
(they are useless in that state), and also deprecated move-assigning
them after the fact.

Fix by wrapping both fields in std::optional, and using emplace() to
construct them in-place once the connected socket is available.

It would be nicer to make connect() a static method that returns
a connection, but that's a larger change.

Closes scylladb/scylladb#29627
2026-05-12 12:38:25 +03:00
Pavel Emelyanov
1c0f8ab66e Merge 'sstables: introduce --abort-on-malformed-sstable-error' from Botond Dénes
When a malformed sstable error occurs, it is usually caused by actual sstable corruption — a cosmic ray, a bad disk write, etc. However, it can also be caused by memory corruption, where a data structure in memory happens to be read as sstable data. In the latter case, having a coredump of the process at the moment of the error is invaluable for post-mortem debugging, since the exception throwing/catching machinery destroys the stack frames that would point to the corruption site.

This patch series introduces `--abort-on-malformed-sstable-error`, a new command-line option (with `LiveUpdate` support) that, when set, causes the server to call `std::abort()` instead of throwing an exception whenever any sstable parse error is detected. This covers all code paths:

- Direct `throw malformed_sstable_exception(...)` sites (migrated to `throw_malformed_sstable_exception()`)
- Direct `throw bufsize_mismatch_exception(...)` sites (migrated to `throw_bufsize_mismatch_exception()`)
- `parse_assert()` failures (via `on_parse_error()`)
- BTI parse errors (via `on_bti_parse_error()`)

The implementation places the flag and helper functions in `sstables/sstables.cc`, next to the existing `on_parse_error()` / `on_bti_parse_error()` infrastructure.

The flag defaults to `false`, preserving current behaviour. It is intended to be enabled temporarily when investigating suspected memory corruption.
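The throw-helper pattern can be sketched in a few lines; in Scylla the flag is a LiveUpdate config option, modeled here as a plain parameter, and the names are illustrative:

```python
import os

class MalformedSstableError(Exception):
    pass

def throw_malformed_sstable_exception(msg, abort_on_error=False, abort=os.abort):
    """Central throw helper: every former direct throw site funnels here,
    so one flag flips all code paths from throwing to aborting."""
    if abort_on_error:
        abort()  # leaves a coredump with the corruption site's frames intact
    raise MalformedSstableError(msg)
```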

**Commit breakdown:**
1. Infrastructure: flag, getter/setter, and throw helpers in `sstables/sstables.cc`; config option wired up in `main.cc`
2. `on_parse_error()` and `on_bti_parse_error()` check the new flag
3. All ~50 `throw malformed_sstable_exception(...)` sites migrated
4. Both `throw bufsize_mismatch_exception(...)` sites migrated

Refs: SCYLLADB-1087
Backport: new feature, no backport

Closes scylladb/scylladb#29324

* github.com:scylladb/scylladb:
  sstables: migrate all bufsize_mismatch_exception throw sites to throw_bufsize_mismatch_exception()
  sstables: migrate all malformed_sstable_exception throw sites to throw_malformed_sstable_exception()
  sstables: make on_parse_error() and on_bti_parse_error() respect --abort-on-malformed-sstable-error
  sstables: disable abort-on-malformed-sstable-error in tests that corrupt sstables on purpose
  sstables: introduce --abort-on-malformed-sstable-error infrastructure
  sstables: refactor parse_path() to return std::expected<> instead of throwing
2026-05-12 12:38:25 +03:00
Pavel Emelyanov
150345cc52 Merge 'test: per-bucket isolation for S3/GCS object storage tests' from Ernest Zaslavsky
This series adds per-test bucket isolation to all S3 and GCS object storage tests. Previously, every test shared a single pre-created bucket, which meant tests could interfere with each other through leftover objects and could not run concurrently across multiple `test.py` processes without risking collisions.

New `create_bucket`, `delete_bucket`, and `delete_bucket_with_objects` methods on `s3::client`, following the existing `make_request` pattern. `create_bucket` handles the `BUCKET_ALREADY_OWNED_BY_YOU` error gracefully.

A new `s3_test_fixture` RAII class for C++ Boost tests that creates a uniquely-named bucket on construction (derived from the Boost test name and pid) and tears down everything — objects, bucket, client — on destruction. All S3 tests in `s3_test.cc` are migrated to use it, removing manual `deferred_delete_object` and `deferred_close` boilerplate. The minio server policy is broadened to allow dynamic bucket creation/deletion.

A `client::make` overload that accepts a custom `retry_strategy`, used in tests with a fast 1ms retry delay instead of exponential backoff, significantly reducing test runtime for transient errors during bucket lifecycle operations.

Python-side (`test/cluster/object_store`): each pytest fixture (`object_storage`, `s3_storage`, `s3_server`) now creates a unique bucket per test function via `create_test_bucket()` and destroys it on teardown. Bucket names are sanitized from the pytest node name with a short UUID suffix for uniqueness.

Object storage helpers (`S3Server`, `MinioWrapper`, `GSFront`, `GSServerImpl`, factory functions, CQL helpers, `s3_server` fixture) are extracted from `test/cluster/object_store/conftest.py` into a shared `test/pylib/object_storage.py` module, eliminating duplication across test suites. The conftest becomes a thin re-export wrapper. Old class names are preserved as aliases for backward compatibility.
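The bucket-name sanitization can be sketched as below; the function name and the exact truncation policy are assumptions, but the constraints (lowercase letters, digits, hyphens, 63-character limit) are the standard S3 bucket-naming rules:

```python
import re
import uuid

def create_test_bucket_name(pytest_node_name):
    """Sanitize the pytest node name to S3 bucket-name rules and append a
    short UUID suffix so concurrent test.py processes never collide."""
    base = re.sub(r"[^a-z0-9-]", "-", pytest_node_name.lower()).strip("-")
    suffix = uuid.uuid4().hex[:8]
    return f"{base[:54]}-{suffix}"  # 54 + 1 + 8 = 63 chars max
```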

| Test Name                                                    | new test specific retry strategy execution time (ms) | original execution time (ms) |   Δ (ms) | Speedup |
|--------------------------------------------------------------|----------------:|-------------:|---------:|--------:|
| test_client_upload_file_multi_part_with_remainder_proxy      |          19,261 |       61,395 | −42,134  | **3.2×** |
| test_client_upload_file_multi_part_without_remainder_proxy   |          16,901 |       53,688 | −36,787  | **3.2×** |
| test_client_upload_file_single_part_proxy                    |           3,478 |        6,789 |  −3,311  | **2.0×** |
| test_client_multipart_copy_upload_proxy                      |           1,303 |        1,619 |    −316  | 1.2×    |
| test_client_put_get_object_proxy                             |             150 |          365 |    −215  | **2.4×** |
| test_client_readable_file_stream_proxy                       |             125 |          327 |    −202  | **2.6×** |
| test_small_object_copy_proxy                                 |             205 |          389 |    −184  | 1.9×    |
| test_client_put_get_tagging_proxy                            |             181 |          350 |    −169  | 1.9×    |
| test_client_multipart_upload_proxy                           |           1,252 |        1,416 |    −164  | 1.1×    |
| test_client_list_objects_proxy                               |             729 |          881 |    −152  | 1.2×    |
| test_chunked_download_data_source_with_delays_proxy          |             830 |          960 |    −130  | 1.2×    |
| test_client_readable_file_proxy                              |             148 |          279 |    −131  | 1.9×    |
| test_client_upload_file_multi_part_with_remainder_minio      |           3,358 |        3,170 |    +188  | 0.9×    |
| test_client_upload_file_multi_part_without_remainder_minio   |           3,131 |        2,929 |    +202  | 0.9×    |
| test_client_upload_file_single_part_minio                    |             519 |          421 |     +98  | 0.8×    |
| test_download_data_source_proxy                              |             180 |          237 |     −57  | 1.3×    |
| test_client_list_objects_incomplete_proxy                     |             590 |          641 |     −51  | 1.1×    |
| test_large_object_copy_proxy                                 |             952 |          991 |     −39  | 1.0×    |
| test_client_multipart_upload_fallback_proxy                  |             148 |          185 |     −37  | 1.3×    |
| test_client_multipart_copy_upload_minio                      |             641 |          674 |     −33  | 1.1×    |

No backport needed — this is a test infrastructure improvement with no production code impact beyond the new `s3::client` methods.

Closes scylladb/scylladb#29508

* github.com:scylladb/scylladb:
  test: extract object storage helpers to test/pylib/object_storage.py
  test: add per-test bucket isolation to object_store fixtures
  s3: add client::make overload with custom retry strategy
  test: add s3_test_fixture and migrate tests to per-bucket isolation
  s3: add create_bucket and delete_bucket to client
2026-05-12 12:38:24 +03:00
Dimitrios Symonidis
94bc0245f9 sstables, utils/s3: reuse caller-provided file in s3_storage::make_source
s3_storage::make_source previously ignored its file f parameter and
constructed a fresh s3::client::readable_file per call. The new
file's _stats cache was empty, so the first dma_read_bulk issued a
HEAD via maybe_update_stats just to learn the object size before
the ranged GET -- one ~50 ms RTT per uncached read.

The file f passed in by the two callers (sstable::data_stream for
Data.db reads and index_reader::make_context for Index.db reads)
already wraps the sstable's _data_file or _index_file. Those file
objects had their stats populated at sstable open time by
update_info_for_opened_data, and they were wrapped with the
configured file_io_extensions when opened via open_component. Reusing
them is exactly what filesystem_storage::make_source does (one-line
make_file_data_source over f), so the s3 path simply matches it.

readable_file::size() is also updated to route through
maybe_update_stats(), so a .size() call populates the _stats cache
the same way .stat() does -- preventing a redundant HEAD on the
first subsequent read of components opened with .size() (Index,
Partitions, Rows in update_info_for_opened_data).
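A toy model of the `_stats` caching behavior described above (illustrative names; a callable stands in for the HEAD request):

```python
class ReadableFile:
    """Toy model of the _stats cache: once the object size is known (from
    one HEAD, or from sstable-open bookkeeping), both size() and ranged
    reads reuse it instead of issuing another ~50 ms HEAD round trip."""

    def __init__(self, head):
        self._head = head   # callable standing in for the HEAD request
        self._stats = None  # cached object size

    def _maybe_update_stats(self):
        if self._stats is None:
            self._stats = self._head()
        return self._stats

    def size(self):
        # now routes through the cache, priming it for later reads
        return self._maybe_update_stats()

    def read_range(self, offset, length):
        # clamp to the object size before issuing the ranged GET
        end = min(offset + length, self._maybe_update_stats())
        return (offset, end)  # stand-in for the ranged GET
```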

Closes scylladb/scylladb#29766

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 12:38:24 +03:00
Pavel Emelyanov
896de77b99 docs: Update topology_over_raft.md with restore transition kind
Add some text about how the new transition works. It doesn't include a
full feature description; it just concentrates on the new transition and
the way it interacts with the rest of the topology coordinator machinery.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:24 +03:00
Pavel Emelyanov
19820910f8 test: Add test for backup vs migration race
The test starts a regular backup+restore on a smaller cluster, but prior
to it spawns a tablet migration from one node to another and locks it in
the middle with the help of the block_tablet_streaming injection (even
though the tablets have no data and there's nothing to stream, the injection
is located early enough to work).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:24 +03:00
Pavel Emelyanov
3bcefa42c5 test: Restore resilience test
The test checks that losing one of the nodes from the cluster while a
restore is in progress is handled. In particular:

- losing the API node makes the task-waiting API throw (apparently)
- losing a coordinator or replica node makes the API call fail, because
  some tablets should fail to get restored. If the coordinator is lost,
  it triggers coordinator re-election, and the new coordinator still notices
  that a tablet that was replicated to the "old" coordinator failed to get
  restored and fails the restore anyway

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:24 +03:00
Pavel Emelyanov
69b8f76a32 sstables_loader: Fail tablet-restore task if not all sstables were downloaded
When storage_service::restore_tablets() resolves, it only means that
tablet transitions are done, including restore transitions, but not
necessarily that they succeeded. So before resolving the restoration
task with success, we need to check whether all sstables were downloaded
and, if not, resolve the task with an exception.

Test included. It uses fault injection to abort downloading of a single
sstable early, then checks that the error was properly propagated back
to the task-waiting API.
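The completion check can be sketched as follows, with `finish_restore_task` and the entry layout being illustrative stand-ins for the real task machinery:

```python
def finish_restore_task(entries):
    """After restore_tablets() resolves (transitions done, not necessarily
    successful), verify every registered sstable was downloaded; raise so
    the task-waiting API observes the failure."""
    missing = [e["name"] for e in entries if not e["downloaded"]]
    if missing:
        raise RuntimeError(
            f"restore failed: {len(missing)} sstable(s) not downloaded")
```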

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:24 +03:00
Ernest Zaslavsky
bdc5976bcd sstables_loader: mark sstables as downloaded after attaching
After each SSTable is successfully attached to the local table in
download_tablet_sstables(), update its downloaded status in
system_distributed.snapshot_sstables to true. This enables tracking
restore progress by counting how many SSTables have been downloaded.
2026-05-12 10:40:24 +03:00
Ernest Zaslavsky
0d8de9becd sstables_loader: return shared_sstable from attach_sstable
Change attach_sstable() return type from future<> to
future<sstables::shared_sstable>, returning the SSTable that was
attached. This will be used to extract the SSTable identifier and
first token for updating the download status.
2026-05-12 10:40:24 +03:00
Ernest Zaslavsky
7eb921a142 db: add update_sstable_download_status method
Add a method to update the downloaded status of a specific SSTable
entry in system_distributed.snapshot_sstables. This will be used
by the tablet restore process to mark SSTables as downloaded after
they have been successfully attached to the local table.
2026-05-12 10:40:23 +03:00
Ernest Zaslavsky
83ec7e22b9 db: add downloaded column to snapshot_sstables
Add a 'downloaded' boolean column to the snapshot_sstables table
schema and the corresponding field to the snapshot_sstable_entry
struct. Update insert_snapshot_sstable() and get_snapshot_sstables()
to write and read this column.
This column will be used to track which SSTables have been
successfully downloaded during a tablet restore operation.

Co-authored-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:23 +03:00
Ernest Zaslavsky
61c627a7c0 db: extract snapshot_sstables TTL into class constant
Move the TTL value used for snapshot_sstables rows from a local
variable in insert_snapshot_sstable() to a class-level constant
SNAPSHOT_SSTABLES_TTL_SECONDS, making it reusable by other methods.
2026-05-12 10:40:23 +03:00
Pavel Emelyanov
4137211cf4 test: Add a test for tablet-aware restore
The test is derived from the test_restore_with_streaming_scopes() one, with
the exception that it doesn't check for streaming directions, doesn't
check mutations right after creation, and doesn't loop over scoped
sub-tests, because there's no scope concept here.

Also, it verifies just two topologies, which seems to be enough. The scopes
test has many topologies because of the nature of the scoped restore;
with cluster-wide restore such flexibility is not required.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:23 +03:00
Pavel Emelyanov
17384d42e3 tablets: Implement tablet-aware cluster-wide restore
This patch adds:

- Changes to the sstables_loader::restore_tablets() method: it populates
  the system_distributed_keyspace.snapshot_sstables table with the
  information read from the manifest

- An implementation of the tablet_restore_task_impl::run() method: it
  emplaces a bunch of tablet migrations of the "restore" kind

- Topology coordinator handling of tablet_transition_stage::restore: when
  seen, the coordinator calls the RESTORE_TABLET RPC against all tablet
  replicas

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:23 +03:00
Pavel Emelyanov
39ae59da9c messaging: Add RESTORE_TABLET RPC verb
The topology coordinator will need to call this verb against existing
tablet replicas to ask them to restore tablet sstables. Here's the RPC verb
to do it.

It returns an empty restore_result to make it "synchronous" -- the
co_await send_restore_tablets() won't resolve until the client call
finishes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:23 +03:00
Pavel Emelyanov
8514b73f4b sstables_loader: Add method to download and attach sstables for a tablet
Extracts the data from the snapshot_sstables table, filters only the
sstables belonging to the current node and the tablet in question, then
starts downloading the matched sstables.

Extracted from Ernest's PR #28701 and piggy-backs on the refactoring from
another of Ernest's PRs, #28773. Will be used by the next patches.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:23 +03:00
Pavel Emelyanov
cf21471391 tablets: Add restore_config to tablet_transition_info
When doing cluster-wide restore via the topology coordinator, the
coordinator will need to serve a new tablet transition kind -- the
restore one. For that, it will need to know where to perform the restore
from -- the endpoint and bucket pair. The only place this data can come
from is the tablet transition itself, so add a "restore_config" member
carrying it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:23 +03:00
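As an aside, the shape the commit above describes -- a transition descriptor where only "restore" transitions carry a source -- can be sketched roughly like this. This is an illustrative sketch, not the actual ScyllaDB definitions; the type names and members here are hypothetical stand-ins:

```cpp
#include <cassert>
#include <optional>
#include <string>

// Hypothetical sketch: where a restore transition finds its source.
struct restore_config {
    std::string endpoint; // object-storage endpoint to download from
    std::string bucket;   // bucket holding the backed-up sstables
};

// Sketch of a transition descriptor: only "restore" transitions carry a
// restore_config, so the member is naturally optional.
struct tablet_transition_info {
    std::string kind; // e.g. "migration", "repair", "restore"
    std::optional<restore_config> restore; // set only for kind == "restore"
};

inline bool has_restore_source(const tablet_transition_info& t) {
    return t.kind == "restore" && t.restore.has_value();
}
```
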
Pavel Emelyanov
2eaa9035df sstables_loader: Add restore_tablets task skeleton
The new cluster-wide tablet restore API is going to be asynchronous,
just like the existing node-local one. The task_manager tasks will be
used for that.

This patch adds a skeleton for tablets-restore task with empty run
method. Next patches will populate it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:23 +03:00
Pavel Emelyanov
dcd490666b test: Add rest_client helper to kick newly introduced API endpoint
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:23 +03:00
Ernest Zaslavsky
5f235e105a api: Add /storage_service/tablets/restore endpoint skeleton
Withdrawn from #28701. The endpoint implementation from the PR is going
to be reworked, but the swagger description and set/unset placeholders
are very useful.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Ernest Zaslavsky <ernest.zaslavsky@scylladb.com>
2026-05-12 10:40:23 +03:00
Pavel Emelyanov
d280987f2c sstables_loader: Add keyspace and table arguments to manifest loading helper
When restoring a backup into a keyspace under a different name than the
one it had during backup, the snapshot_sstables table must be populated
with the _new_ keyspace name, not the one taken from the manifest. The
same is true for the table name.

This patch makes it possible to override the keyspace/table loaded from
the manifest file with the provided values. In the future it would also
be good to check that, if those values are not provided by the user, the
values read from different manifest files are the same.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:23 +03:00
Ernest Zaslavsky
e0f4813c2f sstables_loader_helpers: just reformat the code
Reformat get_sstables_for_tablet to wrap an extremely long line
2026-05-12 10:40:22 +03:00
Ernest Zaslavsky
19554466f6 sstables_loader_helpers: generalize argument and variable names
Rename arguments and local variables in get_sstables_for_tablet to avoid
references to SSTable-specific terminology. This makes the function more
generic and better suited for reuse with different range types.
2026-05-12 10:40:22 +03:00
Ernest Zaslavsky
2e37f9dc90 sstables_loader_helpers: generalize get_sstables_for_tablet
Generalize get_sstables_for_tablet by templating the return type so it
produces vectors matching the input range's value type. This makes the
function more flexible and prepares it for reuse in tablet-aware
restore.
2026-05-12 10:40:22 +03:00
Ernest Zaslavsky
17b415ccde sstables_loader_helpers: add token getters for tablet filtering
Add getters for the first and last tokens in get_sstables_for_tablet to
make the function more generic and suitable for future use in the
tablet-aware restore code.
2026-05-12 10:40:22 +03:00
Ernest Zaslavsky
1150f7cf24 sstables_loader_helpers: remove underscores from struct members
Remove underscores from minimal_sst_info struct members to comply with
our coding guidelines.
2026-05-12 10:40:22 +03:00
Ernest Zaslavsky
aa00048753 sstables_loader: move download_sstable and get_sstables_for_tablet
Move the download_sstable and get_sstables_for_tablet static functions
from sstables_loader into a new file to make them reusable by the
tablet-aware restore code.
2026-05-12 10:40:22 +03:00
Ernest Zaslavsky
991576ed73 sstables_loader: extract single-tablet SST filtering
Extract single-tablet range filtering into a new
get_sstables_for_tablet function, taken from the existing
get_sstables_for_tablets. This will later be reused in the
tablet-aware restore code.
2026-05-12 10:40:22 +03:00
Ernest Zaslavsky
b0f6cbb2a4 sstables_loader: make download_sstable static
Make the download_sstable function static to prepare it for extraction
as a helper function that will later be reused in tablet-aware restore.
2026-05-12 10:40:22 +03:00
Ernest Zaslavsky
60dd7de4b8 sstables_loader: fix formatting of the new download_sstable function
Just fix formatting of the new `download_sstable` function
2026-05-12 10:40:22 +03:00
Ernest Zaslavsky
9efc658bdd sstables_loader: extract single SST download into a function
Extract the logic for downloading a single SST into a dedicated
function and reuse it in download_fully_contained_sstables. This
supports upcoming changes that consolidate common code.
2026-05-12 10:40:22 +03:00
Ernest Zaslavsky
fd2043cad8 sstables_loader: add shard_id to minimal_sst_info
Add a shard_id member to the minimal_sst_info struct as part of the
tablet-aware restore refactoring. This will support upcoming changes
that extract common code.
2026-05-12 10:40:22 +03:00
Robert Bindar
c97232bb7b sstables_loader: add function for parsing backup manifests
This change adds functionality for parsing backup manifests
and populating system_distributed.snapshot_sstables with
the content of the manifests.
This change is useful for tablet-aware restore. The function
introduced here will be called by the coordinator node
when restore starts to populate the snapshot_sstables table
with the data that workers need to execute the restore process.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Co-authored-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:40:22 +03:00
Robert Bindar
f0e8d6c9dd split utility functions for creating test data from database_test
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-05-12 10:40:21 +03:00
Robert Bindar
b52e40e512 export make_storage_options_config from lib/test_services
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-05-12 10:40:21 +03:00
Robert Bindar
9c3abbb8f5 rjson: Add helpers for conversions to dht::token and sstable_id
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-05-12 10:40:21 +03:00
Robert Bindar
2f19d84ad7 Add system_distributed_keyspace.snapshot_sstables
This patch adds the snapshot_sstables table with the following
schema:
```cql
CREATE TABLE system_distributed.snapshot_sstables (
    snapshot_name text,
    keyspace text, table text,
    datacenter text, rack text,
    id uuid,
    first_token bigint, last_token bigint,
    toc_name text, prefix text)
  PRIMARY KEY ((snapshot_name, keyspace, table, datacenter, rack), first_token, id);
```
The table will be populated by the coordinator node during the restore
phase (and later on during the backup phase to accommodate live-restore).
The content of this table is meant to be consumed by the restore worker
nodes, which will use it to filter sstables and download them file by file.

Fixes SCYLLADB-263

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-05-12 10:40:21 +03:00
Robert Bindar
31e9f04714 add get_system_distributed_keyspace to cql_test_env
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
2026-05-12 10:17:40 +03:00
Pavel Emelyanov
90ff7c5de3 code: Add system_distributed_keyspace dependency to sstables_loader
The loader will need to populate and read data from
system_distributed.snapshot_sstables table added recently, so this
dependency is truly needed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:17:40 +03:00
Pavel Emelyanov
2c60d8f897 storage_service: Export handle_raft_rpc() helper
Just like do_tablet_operation, this one will be used by the
sstables_loader restore-tablet RPC

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:17:40 +03:00
Pavel Emelyanov
1c0e04316b storage_service: Export do_tablet_operation()
Next patches will introduce an RPC handler to restore a tablet on
replica. The handler will be registered by sstables_loader, and it will
have to call that helper from storage_service which thus needs to be
moved to public scope.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:17:40 +03:00
Pavel Emelyanov
e5f04b0927 storage_service: Split transit_tablet() into two
The goal of the split is to have a try_transit_tablet() that

- doesn't throw if the tablet is in transition, but reports it back
- doesn't wait for the submitted transition to finish

The user will be tablet-aware restore: it will call this new trying
helper in parallel, then wait for all transitions to finish.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:17:39 +03:00
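The "try in parallel, then wait for all" pattern the commit above describes can be sketched in plain standard C++ as below. This is an illustrative stand-in (using std::async rather than Seastar futures, and a bool in place of the real transition state), not the actual try_transit_tablet implementation:

```cpp
#include <atomic>
#include <cassert>
#include <future>
#include <vector>

// Sketch of a non-throwing, non-waiting submission: 'accepted' reports
// whether work was started (false if the tablet is already in transition),
// and 'done' lets the caller wait later, after all submissions are fired.
struct submission {
    bool accepted;          // false: already in transition, skipped
    std::future<void> done; // valid only when accepted
};

// Hypothetical try_start(): instead of throwing when the tablet is busy,
// report it back; instead of waiting, hand the caller a future.
inline submission try_start(bool already_in_transition, std::atomic<int>& counter) {
    if (already_in_transition) {
        return {false, {}};
    }
    return {true, std::async(std::launch::async, [&counter] { ++counter; })};
}
```

The caller fires all submissions first, then does a single wait pass over the accepted ones -- the split the commit is after.
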
Pavel Emelyanov
dd51acf014 tablets: Add braces around tablet_transition_kind::repair switch
This is just to reduce the churn in the next patch

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-05-12 10:17:39 +03:00
Botond Dénes
866afe4c1e Merge ' db: add large data metrics for rows, cells, and collections' from Taras Veretilnyk
- Add `large_rows_exceeding_threshold`, `large_cell_exceeding_threshold`, and `large_collection_exceeding_threshold` metrics to complement the existing `large_partition_exceeding_threshold`
- Add unit tests verifying stats counters increment correctly during SSTable writes

Backport is not needed

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1095

Closes scylladb/scylladb#29722

* github.com:scylladb/scylladb:
  test/boost: add tests for large data stats counters
  db: add large data metrics for rows, cells, and collections
2026-05-12 10:04:53 +03:00
Pavel Emelyanov
30f1075544 utils: Replace local memory sink/source with seastar equivalents
Replace the local buffer_data_sink_impl and buffer_data_source_impl
classes in create_memory_sink() and create_memory_source() with
seastar::util::memory_data_sink and seastar::util::memory_data_source
respectively, which are now available upstream.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29616
2026-05-12 08:47:43 +03:00
Taras Veretilnyk
47b4fa920d test/boost: add tests for large data stats counters
Add test_large_data_stats_large_rows, test_large_data_stats_large_cells,
and test_large_data_stats_large_collections to verify that the
large_data_handler stats counters are correctly incremented during
SSTable writes and that unrelated counters remain at zero.
2026-05-11 23:42:14 +02:00
Taras Veretilnyk
881776b441 db: add large data metrics for rows, cells, and collections
Previously only large_partition_exceeding_threshold was exposed as a
metric. Add three new counters to large_data_handler::stats and register
corresponding Prometheus metrics:
- large_rows_exceeding_threshold
- large_cell_exceeding_threshold
- large_collection_exceeding_threshold

The counters are incremented in maybe_record_large_rows() and
maybe_record_large_cells() following the same pattern used by the
existing partition metric.
2026-05-11 23:11:17 +02:00
Anna Stuchlik
1f7d20f701 doc: label Migration from Vnodes to Tablets as experimental
The procedure to migrate a vnodes-based keyspace to tablets-based keyspace
has been labeled as experimental.

Fixes SCYLLADB-1932

Closes scylladb/scylladb#29834
2026-05-11 17:07:39 +03:00
Yaniv Michael Kaul
377bbeb076 docs: fix invalid UUID characters in examples
Replace UUIDs containing non-hexadecimal characters (like 'g', 'n', 'y')
with valid UUIDs in documentation examples.

Fixes #26797

Closes scylladb/scylladb#29674
2026-05-11 17:05:30 +03:00
Calle Wilund
2cc1a2c406 storage_service: Disable snapshots after raft decommission
Fixes: SCYLLADB-1693

In case we abort a decommission operation, the snapshot/backup
mechanism needs to remain open.

This change moves it to after raft_decommission.

In the case of a cluster snapshot, our node's ownership (or not) of
tables will be serialized by raft anyway, so it should remain
consistent. In that case we at worst coordinate from a node in
"leave" status.

In the case of a local snapshot, ownership matters less; only the
sstables on disk do, and those should not change.

In the case of a backup, it operates on a snapshot, whose state is
not affected.

Adds an injection point for testing.

v2:
- Added injection point to ensure test can abort decommission

Closes scylladb/scylladb#29667
2026-05-11 17:04:09 +03:00
Anna Stuchlik
4c01556f79 doc: mark Vector Search in Alternator as Cloud-only
This commit adds the information missing from the Alternator docs
that Vector Search is only available in ScyllaDB Cloud.

Fixes https://github.com/scylladb/scylladb/issues/29661

Closes scylladb/scylladb#29664
2026-05-11 17:03:20 +03:00
Avi Kivity
f5ffbd3c3e cql3: restrictions: reindent statement_restrictions.cc
6165124fcc has left statement_restrictions.cc scarred and
deformed. Restore it to standard 4-space indentation. This patch
contains only whitespace changes.

Closes scylladb/scylladb#29598
2026-05-11 17:02:14 +03:00
Yaniv Michael Kaul
3cba27d25f topology: propagate error messages through raft_topology_cmd_result
When a topology command (e.g., rebuild) fails on a target node, the
exception message was being swallowed at multiple levels:

1. raft_topology_cmd_handler caught exceptions and returned a bare
   fail status with no error details.
2. exec_direct_command_helper saw the fail status and threw a generic
   "failed status returned from {id}" message.
3. The rebuilding handler caught that and stored a hardcoded
   "streaming failed" message.

This meant users only saw "rebuild failed: streaming failed" instead
of the actionable error from the safety check (e.g., "it is unsafe
to use source_dc=dc2 to rebuild keyspace=...").

Fix by:
- Adding an error_message field to raft_topology_cmd_result (with
  [[version 2026.2]] for wire compatibility).
- Populating error_message with the exception text in the handler's
  catch blocks.
- Including error_message in the exception thrown by
  exec_direct_command_helper.
- Passing the actual error through to rtbuilder.done() instead of
  the hardcoded "streaming failed".

A follow-up test is in https://github.com/scylladb/scylladb/pull/29363

Fixes: SCYLLADB-1404

Closes scylladb/scylladb#29362
2026-05-11 17:01:15 +03:00
Yaniv Michael Kaul
cf9cde664c .github/workflows/call_sync_milestone_to_jira.yml: add missing workflow permissions
Add explicit empty permissions block (permissions: {}) since this
workflow only syncs milestones to Jira using its own secrets and needs
no GITHUB_TOKEN permissions. Fixes code scanning alert #171.

Closes scylladb/scylladb#29184
2026-05-11 17:00:10 +03:00
Raphael S. Carvalho
20fe1e6f68 replica: Improve diagnostics when tablet split fails due to non-empty split-unready groups
When finalizing a tablet split, all data must have been moved into
split-ready compaction groups before the storage groups can be remapped
to the new tablet count. If split-unready groups still hold data at that
point, handle_tablet_split_completion() calls on_internal_error(), which
previously only reported the tablet and table IDs — giving no insight
into why the split-unready groups were not empty.

Add fmt::formatter specializations for compaction_group and storage_group
so the full state of the offending storage_group is included in the error
message. The storage_group formatter emits:

  main=<cg>, merging=[<cg>...], split_ready=[<cg>...]

Each compaction_group formatter emits:

  [sstables=[<sstable_desc>...], memtable_empty=<bool>, sstable_add_gate=<count>]

where sstable_desc includes filename, origin, identifier and originating
host, memtable_empty reflects whether all memtables have been flushed,
and sstable_add_gate count reveals whether an in-flight sstable add is
holding data in the group.

Supporting changes:

- compaction_group: add memtable_empty() const noexcept (delegates to
  memtable_list::empty()) and a const overload of sstable_add_gate()
  so both are accessible from a const compaction_group reference inside
  the formatter.
- Promote sstable_desc from a local lambda in compaction_group_for_sstable
  to a static free function so it is reusable by the formatter.

Refs https://scylladb.atlassian.net/browse/SCYLLADB-1019.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29178
2026-05-11 16:59:05 +03:00
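The diagnostic shape the commit above documents -- `[sstables=[...], memtable_empty=<bool>, sstable_add_gate=<count>]` per compaction group -- can be illustrated with a plain string-builder. This sketch deliberately avoids the real fmt::formatter specializations and uses hypothetical stand-in types:

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical, flattened stand-in for a compaction group's state.
struct compaction_group_desc {
    std::size_t sstables;  // number of sstables still held by the group
    bool memtable_empty;   // have all memtables been flushed?
    int sstable_add_gate;  // in-flight sstable adds holding data
};

// Render one group in the documented shape.
inline std::string describe(const compaction_group_desc& cg) {
    return "[sstables=" + std::to_string(cg.sstables) +
           ", memtable_empty=" + (cg.memtable_empty ? "true" : "false") +
           ", sstable_add_gate=" + std::to_string(cg.sstable_add_gate) + "]";
}

// Render a list of groups, e.g. the merging=[...] or split_ready=[...] parts.
inline std::string describe_list(const std::vector<compaction_group_desc>& cgs) {
    std::string out = "[";
    for (std::size_t i = 0; i < cgs.size(); ++i) {
        if (i) {
            out += ", ";
        }
        out += describe(cgs[i]);
    }
    return out + "]";
}
```
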
Yaniv Michael Kaul
3674deea54 scylla-gdb: display ms-format sstable summary from partitions db footer
For ms-format (trie-based) sstables, the traditional summary structure
is not populated. Instead, read equivalent metadata from the
_partitions_db_footer field: first_key, last_key, partition_count,
and trie_root_position.

This is a follow-up to the crash fix for SCYLLADB-1180, replacing the
informational-only message with actual useful output.

Refs: SCYLLADB-1180

Closes scylladb/scylladb#29164
2026-05-11 16:58:22 +03:00
Calle Wilund
db1b92c185 service::load_balancer: Add metrics for repair and rebuild count
Fixes #21115

Adds cluster counter for repairs, and dc counter for rebuilds

Closes scylladb/scylladb#28985
2026-05-11 16:57:46 +03:00
Piotr Smaron
71542206bc cql: return InvalidRequest for oversized partition/clustering keys
When a partition key or clustering key value exceeds the 64 KiB limit
(65535 bytes serialized), Scylla used to raise a generic
std::runtime_error "Key size too large: N > M" from the low-level
compound-key serializer. That error surfaced to clients as a CQL
server error (code 0x0000, "NoHostAvailable"-looking), which is both
ugly and incompatible with Cassandra - Cassandra returns a clean
InvalidRequest with the message "Key length of N is longer than
maximum of M".

Fix this at the single chokepoint: compound_type::serialize_value in
keys/compound.hh. The serializer is on every path that materializes a
key - INSERT/UPDATE/DELETE/BATCH build mutations through it, and
SELECT builds partition and clustering ranges through it - so a single
throw replacement produces a clean InvalidRequest consistently across
all paths and all key shapes (single, compound PK, composite CK).

The previous approach on this PR branch patched three call sites in
cql3/restrictions/statement_restrictions.cc, which only covered
SELECT, duplicated the check, and placed it mid-restrictions code
(flagged in review). Dropping those changes in favour of the
root-cause fix here.

Un-xfail the tests this fixes:
- test/cqlpy/test_key_length.py: test_insert_65k_pk, test_insert_65k_ck,
  test_where_65k_pk, test_where_65k_ck, test_insert_65k_ck_composite,
  test_insert_total_compound_pk_err, test_insert_total_composite_ck_err.
- test/cqlpy/cassandra_tests/.../insert_test.py: testPKInsertWithValueOver64K,
  testCKInsertWithValueOver64K.
- test/cqlpy/cassandra_tests/.../select_test.py: testPKQueryWithValueOver64K.

test_insert_65k_pk_compound stays xfail: its oversized value gets
rejected by the Python driver's CQL wire-protocol encoder (see
CASSANDRA-19270) before reaching the server, so the fix can't apply.
Updated its reason. testCKQueryWithValueOver64K stays xfail with an
updated reason: Cassandra silently returns empty for an oversized
clustering key in WHERE, while Scylla now throws InvalidRequest - a
deliberate choice mirroring the partition-key case, documented in
the discussion on #10366.

Add three tight-boundary tests (addressing review feedback on the
previous revision) that pin MAX+1 behaviour for SELECT and INSERT of
both partition and clustering keys.

Update test/cluster/dtest/limits_test.py to match the new message
("Key length of \\d+ is longer than maximum of 65535").

fixes #10366
fixes #12247

Co-authored-by: Alexander Turetskiy <someone.tur@gmail.com>

Closes scylladb/scylladb#23433
2026-05-11 16:56:35 +03:00
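The chokepoint check the commit above describes can be sketched as below. The real fix lives in compound_type::serialize_value and throws Scylla's invalid_request exception; here std::invalid_argument stands in for it, and the function is a hypothetical simplification:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>

// Keys are length-prefixed with 16 bits on the wire, hence the limit.
constexpr std::size_t max_key_length = 65535;

// Sketch of the single-chokepoint validation: every path that
// materializes a key goes through serialization, so one check here
// covers INSERT/UPDATE/DELETE/BATCH and SELECT range construction alike.
inline void check_key_length(std::size_t serialized_size) {
    if (serialized_size > max_key_length) {
        // Cassandra-compatible message shape.
        throw std::invalid_argument(
            "Key length of " + std::to_string(serialized_size) +
            " is longer than maximum of " + std::to_string(max_key_length));
    }
}
```
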
Piotr Smaron
959f67b345 cql: verify tuples length in multi-column IN restriction
When a multi-column IN restriction contains tuples with a different
number of elements than the number of restricted columns (e.g.
`(b, c, d) IN ((1, 2), (2, 1, 4))`), Scylla would either produce an
inconsistent error message or, for over-sized tuples, an internal
type-mismatch error referencing the list literal representation.

Validate each tuple's arity against the number of restricted columns
while building the IN restriction and raise a clear
"Expected N elements in value tuple, but got M" error in both the
under- and over-sized cases.

Fixes #13241

Co-authored-by: Alexander Turetskiy <someone.tur@gmail.com>

Closes scylladb/scylladb#18407
2026-05-11 16:55:09 +03:00
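The arity validation the commit above adds can be sketched like this. It is an illustrative stand-in (ints in place of CQL values, std::invalid_argument in place of the CQL invalid-request error), not the actual restriction-building code:

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Sketch: every tuple in a multi-column IN list must have exactly as many
// elements as there are restricted columns, catching both under- and
// over-sized tuples, e.g. (b, c, d) IN ((1, 2), (2, 1, 4)).
inline void check_in_tuple_arity(std::size_t restricted_columns,
                                 const std::vector<std::vector<int>>& tuples) {
    for (const auto& t : tuples) {
        if (t.size() != restricted_columns) {
            throw std::invalid_argument(
                "Expected " + std::to_string(restricted_columns) +
                " elements in value tuple, but got " + std::to_string(t.size()));
        }
    }
}
```
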
Anna Stuchlik
a7b7019f90 doc: update the node size limit
This commit increases the node size limit from 256 to 4096 CPUs
based on be1f566488

Fixes SCYLLADB-1676

Closes scylladb/scylladb#29602
2026-05-11 16:38:53 +03:00
Nadav Har'El
f1b2b9bd52 Merge 'Register fulltext_index custom index type' from Dawid Pawlik
This PR adds the `fulltext_index` custom index class, laying the groundwork for full-text search in ScyllaDB. It focuses on the CQL-facing layer - schema validation, option parsing, and metadata - without implementing the search backend itself.

Users can now write:

```cql
CREATE CUSTOM INDEX ON t(content) USING 'fulltext_index'
WITH OPTIONS = {'analyzer': 'english', 'positions': 'false'};
```

The implementation follows the same custom index pattern established by vector search: a `custom_index` subclass registered in the factory map, with no backing materialized view. This keeps the door open for a CDC-based indexing pipeline similar to the one vector search uses.

As part of this work, the option validation helpers (`validate_enumerated_option`, `validate_positive_option`, `validate_factor_option`) were extracted from `vector_index.cc` into a shared header so both index types can reuse them. The `custom_index` base class also gained a virtual `index_type_name()` method, giving each subclass a self-describing name for error messages without hardcoding strings in shared code.

The PR is split into three commits:

1. Extract shared validation utilities and add `index_type_name()` to `custom_index`
2. Implement `fulltext_index` with column type and option validation
3. Integration tests covering creation, validation, describe, and metadata

Fixes: SCYLLADB-1517
Fixes: SCYLLADB-1510
References: SCYLLADB-1516

Closes scylladb/scylladb#29658

* github.com:scylladb/scylladb:
  test/cqlpy: add integration tests for `fulltext_index`
  index: unify custom index description
  index: add `fulltext_index` custom index implementation
  index: extract option validation helpers
2026-05-11 16:16:58 +03:00
Nadav Har'El
fcfad51284 Merge 'cql3/selection: require EXECUTE on UDA REDUCEFUNC at SELECT time' from Marcin Maliszkiewicz
selection::used_functions() pushed the UDA, its SFUNC and its FINALFUNC,
but never the REDUCEFUNC. The reducefunc is invoked by the distributed
aggregation path in service::mapreduce_service, so a user could cause it
to run server-side without holding EXECUTE on it as long as the query
took the mapreduce path.

Also push agg.state_reduction_function so select_statement::check_access
requires EXECUTE on it too.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1756
Backport: no, it's a minor fix and UDFs are experimental feature in Scylla

Closes scylladb/scylladb#29717

* github.com:scylladb/scylladb:
  test/cqlpy: add test for EXECUTE permission on UDA sub-functions
  cql3/selection: require EXECUTE on UDA REDUCEFUNC at SELECT time
2026-05-11 16:14:38 +03:00
Botond Dénes
cf37f541a0 Merge ' sstables_loader: ensure upload directory is empty when load_and_stream returns' from Taras Veretilnyk
After `load_and_stream` (e.g. via `nodetool refresh --load-and-stream`)
returns success, source sstable files in the `upload/` directory may
still be on disk. `mark_for_deletion()` only sets an in-memory flag; the
actual file deletion runs lazily when the last `shared_sstable`
reference drops.

This leaves a window between API success and physical deletion during
which a follow-up scan of the upload directory can detect sstables that
will be deleted soon. That can cause a failure, because an sstable may
already have been wiped while it is being processed.

The fix:
Force the unlink to complete before `stream()` returns, so the upload
directory is in a consistent state by the time the API reports success.
For tablet streaming, partially-contained sstables participate in
multiple per-tablet batches; eagerly unlinking after each batch would
break the next batch that still needs to read the file. A
`defer_unlinking` flag on the streamer postpones the explicit unlink
until after all batches complete (called once at the end of
`tablet_sstable_streamer::stream()`). Vnode streaming unlinks eagerly
at the end of `stream_sstable_mutations`.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1647

Backport is required, as it is a bug fix that was introduced in 517a4dc4df.

Closes scylladb/scylladb#29599

* github.com:scylladb/scylladb:
  sstables_loader: synchronously unlink streamed sstables before returning
  sstables: make sstable::unlink() idempotent
2026-05-11 14:43:46 +03:00
Asias He
0204372156 repair: Reject repair requests where start and end tokens are equal
When a user calls the repair API with identical startToken and endToken
values, the code creates a wrapping interval (T, T]. This causes
unwrap() to split it into (-inf, T] and (T, +inf), covering the entire
token ring and triggering a full repair.

Reject such requests early with an error message matching
Cassandra's behavior: "Start and end tokens must be different."

Fixes: https://scylladb.atlassian.net/browse/CUSTOMER-358

Closes scylladb/scylladb#29821
2026-05-11 14:08:20 +03:00
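The early rejection the commit above adds can be sketched as a one-line guard. This is a hypothetical simplification (plain longs and std::invalid_argument standing in for tokens and the API error), showing why the check matters: an equal pair would otherwise form the wrapping interval (T, T], which unwraps to the whole ring:

```cpp
#include <cassert>
#include <stdexcept>

// Sketch: reject (T, T] before it can be unwrapped into
// (-inf, T] and (T, +inf), i.e. an accidental full-ring repair.
inline void validate_repair_token_range(long start_token, long end_token) {
    if (start_token == end_token) {
        throw std::invalid_argument("Start and end tokens must be different.");
    }
}
```
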
Botond Dénes
ad7ac62835 Merge ' Add a node_owner column (locator::host_id) to system.sstables and make it part of the partition key' from Dimitrios Symonidis
Add a node_owner column (locator::host_id) to system.sstables and make it part of the partition key, so the primary key becomes PRIMARY KEY ((table_id, node_owner), generation).

This is the first step toward moving the sstables registry into system_distributed: once distributed, each node's startup scan must read only the rows it owns, which requires the owning node to be part of the partition key. Partitioning by (table_id, node_owner) turns that scan into a single-partition read of exactly the local node's rows.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1562
No need to backport this, keyspace over object storage is experimental feature

Closes scylladb/scylladb#29659

* github.com:scylladb/scylladb:
  db, sstables: add node_owner to sstables registry primary key
  db, sstables: rename sstables registry column owner to table_id
2026-05-11 14:08:19 +03:00
Botond Dénes
2edfb91070 sstables: migrate all bufsize_mismatch_exception throw sites to throw_bufsize_mismatch_exception()
Replace the two remaining direct 'throw bufsize_mismatch_exception(...)'
call sites with the new throw_bufsize_mismatch_exception() helper, which
routes through throw_malformed_sstable_exception() and thus also respects
the --abort-on-malformed-sstable-error flag.

Affected files:
- sstables/sstables.cc (1 site, in check_buf_size())
- sstables/m_format_read_helpers.cc (1 site, in check_buf_size())
2026-05-11 11:58:14 +03:00
Botond Dénes
d65c1523c2 sstables: migrate all malformed_sstable_exception throw sites to throw_malformed_sstable_exception()
Replace all direct 'throw malformed_sstable_exception(...)' call sites
with the new throw_malformed_sstable_exception() helper, which respects
the --abort-on-malformed-sstable-error flag.
2026-05-11 11:58:14 +03:00
Botond Dénes
84c27658d9 sstables: make on_parse_error() and on_bti_parse_error() respect --abort-on-malformed-sstable-error
Both functions now check abort_on_malformed_sstable_error() first. If
set, they log the error and call std::abort() directly, generating a
coredump. Otherwise they fall through to the existing on_internal_error()
path, which is in turn controlled by --abort-on-internal-error.
2026-05-11 11:58:14 +03:00
Botond Dénes
4ebcc002d6 sstables: disable abort-on-malformed-sstable-error in tests that corrupt sstables on purpose
Add scoped_no_abort_on_malformed_sstable_error RAII guard (modeled after
seastar::testing::scoped_no_abort_on_internal_error) and use it in all
tests that intentionally corrupt sstables and expect
malformed_sstable_exception to be thrown rather than the process aborting.
2026-05-11 11:58:14 +03:00
Botond Dénes
f6dc2cb5f8 sstables: introduce --abort-on-malformed-sstable-error infrastructure
Add the --abort-on-malformed-sstable-error command-line option and the
supporting infrastructure. When set, any malformed sstable error will
abort the process and generate a coredump instead of throwing an
exception. This is useful for debugging memory corruption that may
manifest as apparent sstable corruption.

The implementation introduces:
- throw_malformed_sstable_exception() and throw_bufsize_mismatch_exception()
  helper functions in sstables/sstables.cc, which check the new flag and
  either abort (with logging) or throw the appropriate exception.
- set_abort_on_malformed_sstable_error() / abort_on_malformed_sstable_error()
  to control the per-process atomic flag.
- abort_on_malformed_sstable_error config option (LiveUpdate, default false)
  wired up in main.cc alongside abort_on_internal_error.

Call-site migration will follow in subsequent commits.
2026-05-11 11:58:14 +03:00
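The flag-plus-helper infrastructure the commit above introduces can be sketched as below. This is an illustrative stand-in: std::runtime_error replaces malformed_sstable_exception, and the real code logs before aborting and wires the flag to a LiveUpdate config option:

```cpp
#include <atomic>
#include <cassert>
#include <cstdlib>
#include <stdexcept>
#include <string>

// Per-process flag controlling malformed-sstable error handling.
inline std::atomic<bool>& abort_on_malformed_flag() {
    static std::atomic<bool> flag{false};
    return flag;
}

inline void set_abort_on_malformed_sstable_error(bool v) {
    abort_on_malformed_flag().store(v);
}

// Sketch of the throw helper: when the flag is set, abort to generate a
// coredump at the point of apparent corruption; otherwise throw as before.
[[noreturn]] inline void throw_malformed_sstable_exception(const std::string& msg) {
    if (abort_on_malformed_flag().load()) {
        // The real implementation logs the error before aborting.
        std::abort();
    }
    throw std::runtime_error(msg);
}
```

Routing every throw site through one helper is what makes the later call-site migration commits mechanical.
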
Botond Dénes
c3daa6379c sstables: refactor parse_path() to return std::expected<> instead of throwing
make_entry_descriptor() and the two overloads of parse_path() used to signal
parse failures by throwing malformed_sstable_exception, which made parse_path()
expensive to use as a probe (e.g. to classify directory entries).

Change make_entry_descriptor() and both parse_path() overloads to return
std::expected<T, sstring>, where the sstring carries the error message on
failure, eliminating the exception overhead at probe call sites.

Call sites that previously caught malformed_sstable_exception to treat the
path as a non-SSTable file (utils/directories.cc, db/snapshot/backup_task.cc,
tools/scylla-sstable.cc) now check the expected result directly.

Call sites where a parse failure is a genuine error (sstable_directory.cc,
sstables.cc, tools/schema_loader.cc, tools/scylla-sstable.cc) re-throw
explicitly as malformed_sstable_exception using the error string, preserving
the existing error propagation behaviour.
2026-05-11 11:58:14 +03:00
Marcin Maliszkiewicz
fa9d15d31a test/cqlpy: add test for EXECUTE permission on UDA sub-functions
Verify that SELECT of a UDA requires EXECUTE on its SFUNC, FINALFUNC,
and REDUCEFUNC individually.  If any one permission is missing, the
query must be rejected at planning time (even on an empty table).

The test is parameterized over the three sub-functions and uses
Lua on Scylla or Java on Cassandra, so it runs on both backends.
The REDUCEFUNC case is skipped on Cassandra since REDUCEFUNC is a
Scylla extension.

Refs SCYLLADB-1756
2026-05-11 10:23:39 +02:00
copilot-swe-agent[bot]
9e7d67612c docs: fix typo in materialized views docs - "columns are" instead of "is"
The MV Select Statement description was missing the word "columns" and
used incorrect verb agreement, making the sentence grammatically broken
and ambiguous.

docs/cql/mv.rst: "which of the base table is included" →
"which of the base table columns are included"

Fixes #29662
Closes #29663

Co-authored-by: annastuchlik <37244380+annastuchlik@users.noreply.github.com>
2026-05-11 11:15:25 +03:00
Botond Dénes
eae15f4fdd Merge 'Share timeout_config between services' from Pavel Emelyanov
The timeout_config (more exactly, updatable_timeout_config) is used by alternator/controller and transport/controller. Both create a local copy of that object by constructing one out of db::config. Some options from this config are also needed by storage_proxy, but since it doesn't have access to any timeout_config, it uses db::config by getting it from the database.

This PR introduces a top-level sharded<updateable_timeout_config>, initializes it from db::config values, and makes existing users plus storage_proxy use it where required. Motivation: remove more replica::database::get_config() users. A side effect: timeout_config is no longer duplicated by the transport and alternator controllers.

Components' dependencies cleanup, not backporting.

Closes scylladb/scylladb#29636

* github.com:scylladb/scylladb:
  storage_proxy: Use shared updateable_timeout_config for CAS contention timeout
  alternator: Use shared updateable_timeout_config by reference
  cql_transport: Use shared updateable_timeout_config by reference
  storage_proxy: Use shared updateable_timeout_config by reference
  main: Introduce sharded<updateable_timeout_config>
  storage_proxy: Keep own updateable_timeout_config
2026-05-11 11:12:01 +03:00
Nadav Har'El
2501a22b10 alternator: remove unneeded call to format()
Removed a silly call to format() on a constant string without parameters.
2026-05-10 20:34:36 +03:00
Nadav Har'El
b3a62dc9d2 alternator: improve CONTAINS operator's validity checking
Copilot, which reviewed the implementation of the CONTAINS operator,
complained that in some places we assume without checking that the
user-provided parameter to CONTAINS has the expected structure.

Not doing all the checks explicitly is actually not terrible in
RapidJSON, because its methods like MemberBegin() always validate the
type before trying to follow a pointer, throwing an exception if
the JSON value doesn't have the right type. But it's still cleaner
to do these checks explicitly, and throw a clean SerializationError
instead of some internal server error. So this is what this patch does.

If the malformed object doesn't come from the query but rather comes
from the data, we just silently return false. This is our usual
convention - we don't expect malformed data in our database, but if
we do have some (see issue #8070) we shouldn't tell the user that
there was an error in his completely valid query.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-10 20:34:36 +03:00
Marcin Maliszkiewicz
fb55bef0ac cql3/selection: require EXECUTE on UDA REDUCEFUNC at SELECT time
selection::used_functions() pushed the UDA, its SFUNC and its FINALFUNC,
but never the REDUCEFUNC. The reducefunc is invoked by the distributed
aggregation path in service::mapreduce_service, so a user could cause it
to run server-side without holding EXECUTE on it as long as the query
took the mapreduce path.

Also push agg.state_reduction_function so select_statement::check_access
requires EXECUTE on it too.

Fixes SCYLLADB-1756
2026-05-08 16:37:52 +02:00
Dawid Pawlik
b6d5ff344b test/cqlpy: add integration tests for fulltext_index
Add `test_fulltext_index.py` covering the `fulltext_index` custom index:
- Creation on text, varchar, and ascii columns
- Rejection of non-text types (int, blob, vector)
- Validation of analyzer and positions options
- Rejection of unsupported option keys
- Case-insensitive class name lookup
- DESCRIBE INDEX output with and without options
- No backing materialized view in `system_schema.views`
- IF NOT EXISTS idempotent behavior
- Metadata correctness in `system_schema.indexes`
2026-05-08 11:30:08 +02:00
Dawid Pawlik
2076164af9 index: unify custom index description
Move common description logic into a protected helper
`describe_with_target` on `custom_index`, so subclasses can delegate
to it when implementing the `describe()` virtual method.
2026-05-08 11:30:08 +02:00
Dawid Pawlik
fcd15b5cd4 index: add fulltext_index custom index implementation
Introduce `fulltext_index`, a new `custom_index` subclass
for full-text search (FTS).

The index validates that the target column is a text type
(text, varchar, or ascii) and supports two WITH OPTIONS keys:
- 'analyzer': one of standard, english, german, french, spanish,
  italian, portuguese, russian, chinese, japanese, korean, simple,
  whitespace
- 'positions': boolean controlling whether term positions are stored

`view_should_exist()` returns false — no backing materialized view is
created, matching the CDC-backed pattern used by `vector_index`.

Fixes: SCYLLADB-1517
2026-05-08 11:30:08 +02:00
Dawid Pawlik
a396129e5c index: extract option validation helpers
Move `validate_enumerated_option`, `validate_positive_option`,
and `validate_factor_option` into shared index option utilities
under the `secondary_index::util` namespace. These functions were
previously defined as file-local statics in `vector_index.cc` with
hardcoded index names in error messages.

The shared versions take `index_type_name` as a parameter, allowing
each `custom_index` subclass to pass its own name via the virtual
`index_type_name()` method at the call site. The options maps use
`std::bind_front` to bind config params (supported values, limits),
leaving `index_name` as the first unbound argument passed by
`check_index_options()`.

Add `index_type_name()` as a pure virtual method on `custom_index`.
Move the shared utility implementations into `index_option_utils.cc`
and update `vector_index.cc` to use them.
2026-05-08 11:28:39 +02:00
Ferenc Szili
f7bc8f5fa7 test: boost: add drain test for forced capacity-based balancing
Add a Boost unit test that forces capacity-based balancing through
configuration and verifies that a drained and excluded node will be
drained of its tablets when tablet size stats are missing.

The test covers the regression where the allocator rejected the plan due
to incomplete tablet stats, even though forced capacity-based balancing
does not depend on tablet sizes.
2026-05-07 13:56:36 +02:00
Ferenc Szili
906d2b817e service: allow draining with forced capacity-based balancing
When force_capacity_based_balancing is enabled, the tablet allocator
balances by node and shard capacity rather than by tablet sizes.

When the data needed for load balancing is incomplete, the balancer
fails and waits until load_stats is available and correct for all the
nodes. An exception to this is when a node is being drained and
excluded: it is unreachable, and will not return. In this case
the balancer has to do its best and ignore the missing data.

This patch fixes a bug where forcing capacity based balancing made the
balancer not ignore missing data in these cases, and instead abort the
balancing.
2026-05-07 13:44:53 +02:00
Taras Veretilnyk
784127c40b sstables_loader: synchronously unlink streamed sstables before returning
mark_for_deletion() only set an in-memory flag; the actual file
deletion ran lazily when the last shared_sstable reference dropped,
leaving a window in which a follow-up scan of the upload directory
(e.g. a second 'nodetool refresh --load-and-stream') could observe a
partially-deleted sstable and fail with malformed_sstable_exception.

Force the unlink to complete before stream() returns. For tablet
streaming, partially-contained sstables span multiple per-tablet
batches, so a defer_unlinking flag postpones the unlink until after
all sstables are streamed; for vnode streaming, fully-contained
sstables are streamed only once and can be removed right after being streamed.

Added a FIXME on object_storage_base::wipe and strengthened the doc on
storage::wipe to make the never-fails contract explicit.
2026-04-28 14:52:28 +02:00
Dimitrios Symonidis
c40842f60a db, sstables: add node_owner to sstables registry primary key
Add a node_owner column (locator::host_id) to system.sstables and
make it part of the partition key, so the primary key becomes
  PRIMARY KEY ((table_id, node_owner), generation).

This is the first step toward moving the sstables registry into
system_distributed: once distributed, each node's startup scan
must read only the rows it owns, which requires the owning node
to be part of the partition key. Partitioning by (table_id,
node_owner) turns that scan into a single-partition read of
exactly the local node's rows.

The new column is populated via sstables_manager::get_local_host_id().
No backward compatibility is preserved; the feature is experimental
and gated by keyspace-storage-options.
2026-04-24 16:41:09 +02:00
Dimitrios Symonidis
ce78c5113e db, sstables: rename sstables registry column owner to table_id
The partition-key column in system.sstables named 'owner' actually
holds a table_id. Rename the CQL column and the matching C++
parameter and member names so the identifier describes what it
stores. No behavior change.

This prepares the schema for an upcoming node_owner partition-key
column (the local host id), which needs a free name.
2026-04-24 16:24:07 +02:00
Pavel Emelyanov
71b9704464 storage_proxy: Use shared updateable_timeout_config for CAS contention timeout
The cas_contention_timeout_in_ms option is already exposed via the
shared updateable_timeout_config as cas_timeout_in_ms. Read it from
there instead of going through db::config, dropping another use of
database as a db::config proxy.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-24 16:24:32 +03:00
Pavel Emelyanov
33cd3b5d68 alternator: Use shared updateable_timeout_config by reference
Pass sharded<updateable_timeout_config>& into alternator::controller
and through to alternator::server, which now stores a reference
instead of constructing its own updateable_timeout_config from
proxy.data_dictionary().get_config(). This removes the last
creator of a per-owner updateable_timeout_config copy and completes
the consolidation onto the single sharded<updateable_timeout_config>
instance built in main.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-24 15:29:39 +03:00
Pavel Emelyanov
1a045d0cdd cql_transport: Use shared updateable_timeout_config by reference
Pass sharded<updateable_timeout_config>& into cql_transport::controller,
which feeds the shard-local instance as a reference into
cql_server_config::timeout_config. This drops the per-shard local
updateable_timeout_config constructed from db::config inside the
controller's sharded_parameter lambda, replacing it with a reference
into the shared sharded instance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-24 15:21:31 +03:00
Pavel Emelyanov
aa99c1fd6e storage_proxy: Use shared updateable_timeout_config by reference
Drop storage_proxy's own updateable_timeout_config member built from
db::config and take a reference to the shared sharded instance
introduced by the previous patch. Both main and cql_test_env pass
std::ref(timeout_cfg) into storage_proxy::start so each shard's
storage_proxy references its shard-local updateable_timeout_config.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-24 15:07:21 +03:00
Pavel Emelyanov
7b7295fde0 main: Introduce sharded<updateable_timeout_config>
Build a single sharded updateable_timeout_config from db::config in
both main and cql_test_env, sitting next to sharded<cql_config>.
Subsequent patches migrate storage_proxy, the CQL transport controller
and alternator server from their per-owner updateable_timeout_config
copies to references into this shared instance.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-24 15:03:35 +03:00
Pavel Emelyanov
7ca8a863d9 storage_proxy: Keep own updateable_timeout_config
Storage_proxy was reading read_request_timeout_in_ms and
write_request_timeout_in_ms directly from db::config via
database::get_config() at four call sites. Give storage_proxy its own
updateable_timeout_config member (built from db::config the same way
cql transport controller and alternator server do) and use its
read_timeout_in_ms / write_timeout_in_ms observers instead.

Storage_proxy no longer needs database::get_config() for coordinator
timeout values. A later refactor may turn these per-owner copies into
references to a single shared updateable_timeout_config.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-24 14:27:09 +03:00
Michał Jadwiszczak
2b29962583 test/strong_consistency: verify metrics
This patch adds simple asserts to an existing `test_basic_write_read`
to verify that strong consistency metrics are correctly collected.
2026-04-22 10:06:49 +02:00
Michał Jadwiszczak
7352b37048 test/cluster/test_view_building_coordinator: add reproducer for tombstone threshold warning 2026-04-22 09:10:14 +02:00
Michał Jadwiszczak
396d4b17a0 docs: document tombstone avoidance in view_building_tasks 2026-04-22 09:10:14 +02:00
Michał Jadwiszczak
1162fd315e view_building: add task_uuid_generator to view_building_task_mutation_builder
Following the previous commit, use the generator in the view building task mutation builder.
2026-04-22 09:10:14 +02:00
Michał Jadwiszczak
b64f2d2e90 view_building: introduce task_uuid_generator
With the new `min_alive_uuid` saved in the group0 table,
we need to make sure that all new tasks are created with time uuid
greater than the value saved in `min_alive_uuid`.

This patch introduces the `task_uuid_generator` which ensures that
when we are generating multiple tasks in one group0 command, each task
will have an unique time uuid and each time uuid will be greater than
`min_alive_uuid`.
2026-04-22 09:10:14 +02:00
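The invariant the commit above describes — every id is unique and strictly above `min_alive_uuid`, even for many tasks minted in one group0 command — can be sketched with `uint64_t` standing in for real timeuuids (a hypothetical simplification; real timeuuid ordering is more involved):

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch: hands out strictly increasing ids, all greater
// than the min_alive_uuid boundary stored in the group0 table.
class task_uuid_generator {
    uint64_t _last;
public:
    task_uuid_generator(uint64_t min_alive_uuid, uint64_t now)
        : _last(std::max(min_alive_uuid, now)) {}

    // Each call returns a fresh id, so multiple tasks created in the
    // same group0 command never collide and never fall below the boundary.
    uint64_t next() { return ++_last; }
};
```

Starting from the max of the boundary and the current time covers both the "clock behind the boundary" and the normal case with one expression.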
Michał Jadwiszczak
e5a6ed72b9 view_building: store min_alive_uuid in view building state
Because we now limit the range we read from the view building
tasks table, we need to make sure that new tasks are created with a
larger uuid than the `min_alive_uuid`.

In order to do that, we need to be able to see the current `min_alive_uuid`
while creating new tasks.
2026-04-22 09:10:14 +02:00
Michał Jadwiszczak
8d0943ce35 view_building: set min_task_id when GC-ing finished tasks
When VIEW_BUILDING_TASKS_MIN_TASK_ID feature is active, write min_task_id
alongside the range tombstone in the same Raft batch. min_task_id is set
to min_alive_uuid so subsequent get_view_building_tasks() scans start
exactly at the first alive row, skipping all tombstoned rows.

When all tasks are deleted, min_task_id is set to a freshly generated UUID
to ensure future tasks (which will have larger timeuuids) are not skipped.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-22 09:10:14 +02:00
Michał Jadwiszczak
b689de0414 view_building: add min_task_id support to view_building_task_mutation_builder
Add set_min_task_id(id) which writes the min_task_id static cell to the main
"view_building" partition. The static cell is written as part of the same
mutation as the range tombstone, keeping everything in one Raft batch.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-22 09:10:14 +02:00
Michał Jadwiszczak
8670111cd4 view_building: add min_task_id static column and bounded scan to system_keyspace
Add a min_task_id timeuuid static column to system.view_building_tasks.

When VIEW_BUILDING_TASKS_MIN_TASK_ID feature is active, get_view_building_tasks()
reads min_task_id first using a static-only partition slice (empty _row_ranges +
always_return_static_content). This makes the SSTable reader stop immediately
after the static row before processing any clustering tombstones, so the read
never triggers tombstone_warn_threshold warnings.

min_task_id is then used as AND id >= ? lower bound for the main task scan,
skipping all tombstoned rows below the boundary.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-22 09:10:14 +02:00
Michał Jadwiszczak
8f741b462b view_building: use range tombstone when GC-ing finished tasks
Instead of issuing one row tombstone per finished task, collect all tasks
to delete, find the smallest timeuuid among alive tasks (min_alive_uuid),
then emit a single range tombstone [before_all, min_alive_uuid) covering
all tasks below that boundary. Tasks above the boundary (rare: finished
task interleaved with alive tasks) still get individual row tombstones.

When no alive tasks remain, del_all_tasks() covers the entire partition
with a single range tombstone.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-22 09:10:13 +02:00
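The GC planning logic described in the commit above can be sketched as a pure function, with `uint64_t` standing in for timeuuids (hypothetical names and types; the real code emits mutations via the mutation builder rather than a plan struct):

```cpp
#include <cstdint>
#include <optional>
#include <set>
#include <vector>

// Hypothetical sketch of the bulk-deletion plan.
struct gc_plan {
    std::optional<uint64_t> range_end;  // range tombstone [before_all, range_end)
    std::vector<uint64_t> row_deletes;  // finished tasks above the boundary
    bool delete_all = false;            // no alive task remains
};

gc_plan plan_gc(const std::set<uint64_t>& finished, const std::set<uint64_t>& alive) {
    gc_plan p;
    if (alive.empty()) {
        p.delete_all = true;  // one range tombstone over the whole partition
        return p;
    }
    uint64_t min_alive = *alive.begin();
    p.range_end = min_alive;  // covers every finished task below the boundary
    for (uint64_t id : finished) {
        if (id > min_alive) {  // rare: finished task interleaved with alive ones
            p.row_deletes.push_back(id);
        }
    }
    return p;
}
```

In the common case (tasks finish roughly in timeuuid order) `row_deletes` stays empty, so the whole cleanup is a single range tombstone instead of one row tombstone per task.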
Michał Jadwiszczak
91697d597c view_building: add range tombstone support to view_building_task_mutation_builder
Add del_tasks_before(id) which emits a range tombstone [before_all, id)
and del_all_tasks() which covers the entire clustering range. These will
be used by the coordinator to delete finished tasks in bulk instead of
issuing one row tombstone per task.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-22 09:10:13 +02:00
Michał Jadwiszczak
e0942bb45a view_building: introduce VIEW_BUILDING_TASKS_MIN_TASK_ID cluster feature
This feature will be used to gate the use of min_task_id static column
in system.view_building_tasks, which will be added in a subsequent commit.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-22 09:10:12 +02:00
Michał Jadwiszczak
f77c258c8e strong_consistency: wire up metrics to operations
Track write and read latency using latency_counter in
coordinator::mutate() and coordinator::query().

Count commit_status_unknown errors in coordinator::mutate().

Count node and shard bounces in redirect_statement(), passing the
coordinator's stats from both modification_statement and
select_statement.
2026-04-22 08:59:59 +02:00
Michał Jadwiszczak
55293c34f8 strong_consistency: add stats struct and metrics registration
Introduce per-shard metrics infrastructure for strong consistency
operations under the "strong_consistency_coordinator" metrics category.

The stats struct contains latency histograms/summaries for reads and
writes (using timed_rate_moving_average_summary_and_histogram, same as
storage_proxy uses for eventual consistency), and uint64_t counters for
write_status_unknown, node bounces, and shard bounces.

Metrics are registered in the coordinator constructor but are not yet
wired to actual operations — all counters remain at zero.
2026-04-22 08:58:38 +02:00
Taras Veretilnyk
7cdf215999 sstables: make sstable::unlink() idempotent
Avoid duplicate work when unlink() is called more than once on the
same sstable. This happens when a caller invokes unlink() explicitly
on an sstable that is also marked for deletion: the destructor's
close_files() path would otherwise call unlink() again, re-firing
_on_delete, double-counting _stats.on_delete() and double-invoking
_manager.on_unlink().
2026-04-21 22:41:02 +02:00
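The idempotency guard the commit above describes amounts to a once-only gate on `unlink()`. A minimal sketch (hypothetical members; the real method also removes files, fires `_on_delete`, and notifies the manager):

```cpp
// Hypothetical sketch of the once-only gate on sstable::unlink().
class sstable {
    bool _unlinked = false;
    int _delete_count = 0;  // stands in for _stats.on_delete() / _manager.on_unlink()
public:
    void unlink() {
        if (_unlinked) {
            return;  // second call (e.g. destructor's close_files() path) is a no-op
        }
        _unlinked = true;
        ++_delete_count;
    }
    int delete_count() const { return _delete_count; }
};
```

With the gate in place, an explicit `unlink()` followed by the destructor path counts the deletion exactly once instead of twice.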
Ernest Zaslavsky
9faaf1f09c test: extract object storage helpers to test/pylib/object_storage.py
Move S3/GCS server classes (S3Server, MinioWrapper, GSFront, GSServer),
factory functions (create_s3_server, create_gs_server), CQL helpers
(format_tuples, keyspace_options), bucket naming (_make_bucket_name),
and the s3_server fixture from test/cluster/object_store/conftest.py
into a shared module at test/pylib/object_storage.py.
The conftest.py is now a thin wrapper that re-exports symbols and
defines only the fixtures specific to the object_store suite
(object_storage, s3_storage).  All external importers are updated.
Old class names (S3_Server, GSServer) are kept as aliases for
backward compatibility.
2026-04-21 19:08:57 +03:00
Ernest Zaslavsky
e9724f52a9 test: add per-test bucket isolation to object_store fixtures
Create a unique S3/GCS bucket for each test function using the pytest
test name (from request.node.name), sanitized into a valid bucket name.
This ensures tests do not share state through a common bucket and makes
bucket names meaningful for debugging (e.g. test-basic-s3-a1b2c3d4).
Each fixture now calls create_test_bucket() on setup and
destroy_test_bucket() on teardown.
2026-04-21 19:08:57 +03:00
Ernest Zaslavsky
8e02e99c36 s3: add client::make overload with custom retry strategy
Add a client::make overload that accepts a custom retry strategy,
allowing callers to override the default exponential backoff.
Use this in s3_test.cc with a test_retry_strategy that sleeps only
1ms between retries instead of exponential backoff, significantly
reducing test runtime for tests that encounter transient errors
during bucket creation/deletion.
2026-04-21 19:08:57 +03:00
Ernest Zaslavsky
e175088db5 test: add s3_test_fixture and migrate tests to per-bucket isolation
Add s3_test_fixture, an RAII class that creates a unique S3 bucket
on construction and tears down everything (delete all objects, delete
bucket, close client) on destruction. Bucket names are derived from
the Boost test name, pid, and a counter to guarantee uniqueness
across concurrent test processes. Names are sanitized to comply with
S3 bucket naming rules (lowercase, hyphens, 3-63 chars).
Migrate all S3 tests that create objects to use the fixture, removing
manual bucket name construction, deferred_delete_object cleanup, and
per-test deferred_close calls. The fixture owns the client lifecycle.
Tests with special semaphore requirements (broken semaphore for
fallback test, small semaphore for abort test, 1MiB for memory
test) create the fixture with a separate normal-sized semaphore and
use their own constrained client for the test operation.
The upload_file tests are converted from SEASTAR_TEST_CASE
(coroutine) to SEASTAR_THREAD_TEST_CASE since the fixture requires
thread context for .get() calls.
Broaden the minio policy to allow the test user to create and delete
arbitrary buckets (s3:CreateBucket, s3:DeleteBucket, s3:ListAllMyBuckets
on arn:aws:s3:::*), and operate on objects in any bucket.
2026-04-21 19:08:57 +03:00
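The bucket-name sanitization mentioned in the commit above (lowercase, hyphens, 3-63 characters) can be sketched as a small pure function. This is a hypothetical simplification; the real fixture also appends the pid and a counter for uniqueness:

```cpp
#include <cctype>
#include <string>

// Hypothetical sketch: force a candidate name into S3 bucket naming rules,
// i.e. lowercase letters, digits, and hyphens, 3-63 characters long.
std::string sanitize_bucket_name(std::string name) {
    for (char& c : name) {
        unsigned char u = static_cast<unsigned char>(c);
        c = std::isalnum(u) ? static_cast<char>(std::tolower(u)) : '-';
    }
    if (name.size() > 63) {
        name.resize(63);  // truncate over-long names
    }
    while (name.size() < 3) {
        name += 'x';      // pad too-short names
    }
    return name;
}
```

Deriving the name from the Boost test name this way keeps buckets meaningful for debugging while staying within S3's rules.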
Ernest Zaslavsky
cc0b9791c7 s3: add create_bucket and delete_bucket to client
Add create_bucket (PUT /<bucket>) and delete_bucket (DELETE /<bucket>)
methods to s3::client, following the same make_request pattern used by
existing object operations.
These will be used by the test infrastructure to create per-test
isolated buckets.
2026-04-21 19:08:57 +03:00
179 changed files with 5050 additions and 1377 deletions

View File

@@ -4,6 +4,8 @@ on:
milestone:
types: [created, closed]
permissions: {}
jobs:
sync-milestone-to-jira:
uses: scylladb/github-automation/.github/workflows/main_sync_milestone_to_jira_release.yml@main

View File

@@ -299,6 +299,7 @@ target_sources(scylla-main
serializer.cc
service/direct_failure_detector/failure_detector.cc
sstables_loader.cc
sstables_loader_helpers.cc
table_helper.cc
tasks/task_handler.cc
tasks/task_manager.cc

View File

@@ -247,6 +247,18 @@ bool check_CONTAINS(const rjson::value* v1, const rjson::value& v2, bool v1_from
if (!v1) {
return false;
}
if (!v1->IsObject() || v1->MemberCount() != 1) {
if (v1_from_query) {
throw api_error::serialization("CONTAINS operator encountered malformed AttributeValue");
}
return false;
}
if (!v2.IsObject() || v2.MemberCount() != 1) {
if (v2_from_query) {
throw api_error::serialization("CONTAINS operator encountered malformed AttributeValue");
}
return false;
}
const auto& kv1 = *v1->MemberBegin();
const auto& kv2 = *v2.MemberBegin();
if (kv1.name == "S" && kv2.name == "S") {
@@ -265,9 +277,17 @@ bool check_CONTAINS(const rjson::value* v1, const rjson::value& v2, bool v1_from
}
}
} else if (kv1.name == "L") {
if (!kv1.value.IsArray()) {
if (v1_from_query) {
throw api_error::serialization("CONTAINS operator received a malformed list");
}
return false;
}
for (auto i = kv1.value.Begin(); i != kv1.value.End(); ++i) {
if (!i->IsObject() || i->MemberCount() != 1) {
clogger.error("check_CONTAINS received a list whose element is malformed");
if (v1_from_query) {
throw api_error::serialization("CONTAINS operator received a list whose element is malformed");
}
return false;
}
const auto& el = *i->MemberBegin();

View File

@@ -38,6 +38,7 @@ controller::controller(
sharded<auth::service>& auth_service,
sharded<qos::service_level_controller>& sl_controller,
sharded<vector_search::vector_store_client>& vsc,
sharded<updateable_timeout_config>& timeout_config,
const db::config& config,
seastar::scheduling_group sg)
: protocol_server(sg)
@@ -52,6 +53,7 @@ controller::controller(
, _auth_service(auth_service)
, _sl_controller(sl_controller)
, _vsc(vsc)
, _timeout_config(timeout_config)
, _config(config)
{
}
@@ -99,7 +101,7 @@ future<> controller::start_server() {
_executor.start(std::ref(_gossiper), std::ref(_proxy), std::ref(_ss), std::ref(_mm), std::ref(_sys_dist_ks), std::ref(_sys_ks),
sharded_parameter(get_cdc_metadata, std::ref(_cdc_gen_svc)), std::ref(_vsc), _ssg.value(),
sharded_parameter(get_timeout_in_ms, std::ref(_config))).get();
_server.start(std::ref(_executor), std::ref(_proxy), std::ref(_gossiper), std::ref(_auth_service), std::ref(_sl_controller)).get();
_server.start(std::ref(_executor), std::ref(_proxy), std::ref(_gossiper), std::ref(_auth_service), std::ref(_sl_controller), std::ref(_timeout_config)).get();
// Note: from this point on, if start_server() throws for any reason,
// it must first call stop_server() to stop the executor and server
// services we just started - or Scylla will cause an assertion

View File

@@ -48,6 +48,8 @@ namespace vector_search {
class vector_store_client;
}
class updateable_timeout_config;
namespace alternator {
// This is the official DynamoDB API version.
@@ -72,6 +74,7 @@ class controller : public protocol_server {
sharded<auth::service>& _auth_service;
sharded<qos::service_level_controller>& _sl_controller;
sharded<vector_search::vector_store_client>& _vsc;
sharded<updateable_timeout_config>& _timeout_config;
const db::config& _config;
std::vector<socket_address> _listen_addresses;
@@ -92,6 +95,7 @@ public:
sharded<auth::service>& auth_service,
sharded<qos::service_level_controller>& sl_controller,
sharded<vector_search::vector_store_client>& vsc,
sharded<updateable_timeout_config>& timeout_config,
const db::config& config,
seastar::scheduling_group sg);

View File

@@ -485,7 +485,7 @@ std::optional<bytes> unwrap_bytes(const rjson::value& value, bool from_query) {
return rjson::base64_decode(value);
} catch (...) {
if (from_query) {
throw api_error::serialization(format("Invalid base64 data"));
throw api_error::serialization("Invalid base64 data");
}
return std::nullopt;
}

View File

@@ -835,7 +835,7 @@ void server::set_routes(routes& r) {
//FIXME: A way to immediately invalidate the cache should be considered,
// e.g. when the system table which stores the keys is changed.
// For now, this propagation may take up to 1 minute.
server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gossiper, auth::service& auth_service, qos::service_level_controller& sl_controller)
server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gossiper, auth::service& auth_service, qos::service_level_controller& sl_controller, updateable_timeout_config& timeout_config)
: _http_server("http-alternator")
, _https_server("https-alternator")
, _executor(exec)
@@ -847,7 +847,7 @@ server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gos
, _max_users_query_size_in_trace_output(1024)
, _enabled_servers{}
, _pending_requests("alternator::server::pending_requests")
, _timeout_config(_proxy.data_dictionary().get_config())
, _timeout_config(timeout_config)
, _callbacks{
{"CreateTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.create_table(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);

View File

@@ -16,6 +16,7 @@
#include <seastar/net/tls.hh>
#include <optional>
#include "alternator/auth.hh"
#include "timeout_config.hh"
#include "service/qos/service_level_controller.hh"
#include "utils/small_vector.hh"
#include "utils/updateable_value.hh"
@@ -53,8 +54,8 @@ class server : public peering_sharded_service<server> {
named_gate _pending_requests;
// In some places we will need a CQL updateable_timeout_config object even
// though it isn't really relevant for Alternator which defines its own
// timeouts separately. We can create this object only once.
updateable_timeout_config _timeout_config;
// timeouts separately.
updateable_timeout_config& _timeout_config;
client_options_cache_type _connection_options_keys_and_values;
alternator_callbacks_map _callbacks;
@@ -98,7 +99,7 @@ class server : public peering_sharded_service<server> {
utils::scoped_item_list<ongoing_request> _ongoing_requests;
public:
server(executor& executor, service::storage_proxy& proxy, gms::gossiper& gossiper, auth::service& service, qos::service_level_controller& sl_controller);
server(executor& executor, service::storage_proxy& proxy, gms::gossiper& gossiper, auth::service& service, qos::service_level_controller& sl_controller, updateable_timeout_config& timeout_config);
future<> init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port,
std::optional<uint16_t> port_proxy_protocol, std::optional<uint16_t> https_port_proxy_protocol,

View File

@@ -974,6 +974,54 @@
}
]
},
{
"path":"/storage_service/tablets/restore",
"operations":[
{
"method":"POST",
"summary":"Starts copying SSTables from a designated bucket in object storage to a specified keyspace",
"type":"string",
"nickname":"tablet_aware_restore",
"produces":[
"application/json"
],
"parameters":[
{
"name":"keyspace",
"description":"Name of a keyspace to copy SSTables to",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"table",
"description":"Name of a table to copy SSTables to",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"snapshot",
"description":"Name of the snapshot to restore from",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"backup_location",
"description":"JSON array of backup location objects. Each object must contain: 'datacenter' (string), 'endpoint' (string), 'bucket' (string), and 'manifests' (array of strings). Currently, the array must contain exactly one entry.",
"required":true,
"allowMultiple":false,
"type":"array",
"paramType":"body"
}
]
}
]
},
{
"path":"/storage_service/keyspace_compaction/{keyspace}",
"operations":[

View File

@@ -527,11 +527,56 @@ void set_sstables_loader(http_context& ctx, routes& r, sharded<sstables_loader>&
co_return json::json_return_type(fmt::to_string(task_id));
});
ss::tablet_aware_restore.set(r, [&ctx, &sst_loader](std::unique_ptr<http::request> req) -> future<json_return_type> {
std::string keyspace = req->get_query_param("keyspace");
std::string table = req->get_query_param("table");
std::string snapshot = req->get_query_param("snapshot");
rjson::chunked_content content = co_await util::read_entire_stream(*req->content_stream);
rjson::value parsed = rjson::parse(std::move(content));
if (!parsed.IsArray()) {
throw httpd::bad_param_exception("backup locations (in body) must be a JSON array");
}
const auto& locations = parsed.GetArray();
if (locations.Size() != 1) {
throw httpd::bad_param_exception("backup locations array (in body) must contain exactly one entry");
}
const auto& location = locations[0];
if (!location.IsObject()) {
throw httpd::bad_param_exception("backup location (in body) must be a JSON object");
}
auto endpoint = rjson::to_string_view(location["endpoint"]);
auto bucket = rjson::to_string_view(location["bucket"]);
auto dc = rjson::to_string_view(location["datacenter"]);
if (!location.HasMember("manifests") || !location["manifests"].IsArray()) {
throw httpd::bad_param_exception("backup location entry must have 'manifests' array");
}
auto manifests = location["manifests"].GetArray() |
std::views::transform([] (const auto& m) { return sstring(rjson::to_string_view(m)); }) |
std::ranges::to<utils::chunked_vector<sstring>>();
if (manifests.empty()) {
throw httpd::bad_param_exception("backup location 'manifests' array must not be empty");
}
apilog.info("Tablet restore for {}:{} called. Parameters: snapshot={} datacenter={} endpoint={} bucket={} manifests_count={}",
keyspace, table, snapshot, dc, endpoint, bucket, manifests.size());
auto table_id = validate_table(ctx.db.local(), keyspace, table);
auto task_id = co_await sst_loader.local().restore_tablets(table_id, keyspace, table, snapshot, sstring(endpoint), sstring(bucket), std::move(manifests));
co_return json::json_return_type(fmt::to_string(task_id));
});
}
void unset_sstables_loader(http_context& ctx, routes& r) {
ss::load_new_ss_tables.unset(r);
ss::start_restore.unset(r);
ss::tablet_aware_restore.unset(r);
}
void set_view_builder(http_context& ctx, routes& r, sharded<db::view::view_builder>& vb, sharded<gms::gossiper>& g) {

View File

@@ -564,6 +564,7 @@ scylla_tests = set([
'test/boost/crc_test',
'test/boost/dict_trainer_test',
'test/boost/dirty_memory_manager_test',
'test/boost/tablet_aware_restore_test',
'test/boost/double_decker_test',
'test/boost/duration_test',
'test/boost/dynamic_bitset_test',
@@ -1172,6 +1173,8 @@ scylla_core = (['message/messaging_service.cc',
'index/secondary_index_manager.cc',
'index/secondary_index.cc',
'index/vector_index.cc',
'index/fulltext_index.cc',
'index/index_option_utils.cc',
'utils/UUID_gen.cc',
'utils/i_filter.cc',
'utils/bloom_filter.cc',
@@ -1334,6 +1337,7 @@ scylla_core = (['message/messaging_service.cc',
'ent/ldap/ldap_connection.cc',
'reader_concurrency_semaphore.cc',
'sstables_loader.cc',
'sstables_loader_helpers.cc',
'utils/utf8.cc',
'utils/ascii.cc',
'utils/like_matcher.cc',
@@ -1473,6 +1477,7 @@ idls = ['idl/gossip_digest.idl.hh',
'idl/frozen_mutation.idl.hh',
'idl/reconcilable_result.idl.hh',
'idl/streaming.idl.hh',
'idl/sstables_loader.idl.hh',
'idl/paging_state.idl.hh',
'idl/frozen_schema.idl.hh',
'idl/repair.idl.hh',

View File

@@ -13,6 +13,7 @@
#include "cql3/prepare_context.hh"
#include "cql3/expr/expr-utils.hh"
#include "types/list.hh"
#include "types/tuple.hh"
#include <iterator>
#include <ranges>
@@ -116,6 +117,34 @@ void validate_token_relation(const std::vector<const column_definition*> column_
}
}
void validate_tuples_size(const expression& rhs, size_t valid_size) {
auto coll = as_if<collection_constructor>(&rhs);
if (!coll) {
// Pre-prepare, the IN list arrives as a collection_constructor.
// After prepare it would be a constant of list type whose elements
// are serialized; arity validation has already happened earlier in
// that case, so nothing to do here.
return;
}
for (const auto& expr : coll->elements) {
size_t expr_size = 0;
if (auto tuple = as_if<tuple_constructor>(&expr)) {
expr_size = tuple->elements.size();
} else {
auto the_const = as_if<constant>(&expr);
if (the_const && the_const->type->without_reversed().is_tuple()) {
const tuple_type_impl* const_tuple = dynamic_cast<const tuple_type_impl*>(&the_const->type->without_reversed());
expr_size = const_tuple->size();
} else {
continue; // not a tuple; perhaps we need to set expr_size to 1 here when #12554 is fixed
}
}
if (expr_size != valid_size) {
throw exceptions::invalid_request_exception(format("Expected {} elements in value tuple, but got {}: {}", valid_size, expr_size, expr));
}
}
}
void preliminary_binop_vaidation_checks(const binary_operator& binop) {
if (binop.op == oper_t::NEQ) {
throw exceptions::invalid_request_exception(format("Unsupported \"!=\" relation: {:user}", binop));
@@ -142,6 +171,10 @@ void preliminary_binop_vaidation_checks(const binary_operator& binop) {
throw exceptions::invalid_request_exception("LIKE cannot be used for Multi-column relations");
}
if (binop.op == oper_t::IN) {
validate_tuples_size(binop.rhs, lhs_tup->elements.size());
}
if (auto rhs_tup = as_if<tuple_constructor>(&binop.rhs)) {
if (lhs_tup->elements.size() != rhs_tup->elements.size()) {
throw exceptions::invalid_request_exception(

View File

@@ -343,102 +343,102 @@ to_predicates(
auto cdef = col.col;
auto type = &cdef->type->without_reversed();
if (oper.op == oper_t::IS_NOT) {
return to_vector(predicate{
.solve_for = nullptr,
.filter = oper,
.on = on_column{col.col},
.is_not_null_single_column = is_null_constant(oper.rhs),
.op = oper.op,
});
}
if (is_compare(oper.op)) {
auto solve = [oper] (const query_options& options) {
managed_bytes_opt val = evaluate(oper.rhs, options).to_managed_bytes_opt();
if (!val) {
return empty_value_set; // All NULL comparisons fail; no column values match.
}
return oper.op == oper_t::EQ ? value_set(value_list{*val})
: to_range(oper.op, std::move(*val));
};
return to_vector(predicate{
.solve_for = std::move(solve),
.filter = oper,
.on = on_column{col.col},
.is_singleton = (oper.op == oper_t::EQ),
.equality = (oper.op == oper_t::EQ),
.is_slice = expr::is_slice(oper.op),
.is_upper_bound = (oper.op == oper_t::LT || oper.op == oper_t::LTE),
.is_lower_bound = (oper.op == oper_t::GT || oper.op == oper_t::GTE),
.order = oper.order,
.op = oper.op,
});
} else if (oper.op == oper_t::IN) {
auto solve = [oper, type, cdef] (const query_options& options) {
return get_IN_values(oper.rhs, options, type->as_less_comparator(), cdef->name_as_text());
};
return to_vector(predicate{
.solve_for = std::move(solve),
.filter = oper,
.on = on_column{col.col},
.is_singleton = false,
.is_in = true,
.order = oper.order,
.op = oper.op,
});
} else if (oper.op == oper_t::CONTAINS || oper.op == oper_t::CONTAINS_KEY) {
auto solve = [oper] (const query_options& options) {
managed_bytes_opt val = evaluate(oper.rhs, options).to_managed_bytes_opt();
if (!val) {
return empty_value_set; // All NULL comparisons fail; no column values match.
}
return value_set(value_list{*val});
};
return to_vector(predicate{
.solve_for = std::move(solve),
.filter = oper,
.on = on_column{col.col},
.is_singleton = false,
.order = oper.order,
.op = oper.op,
});
}
return cannot_solve_on_column(oper, col.col);
},
[&] (const subscript& s) -> std::vector<predicate> {
const column_value& col = get_subscripted_column(s);
if (oper.op == oper_t::EQ) {
auto solve = [s, oper] (const query_options& options) {
managed_bytes_opt sval = evaluate(s.sub, options).to_managed_bytes_opt();
if (!sval) {
return empty_value_set; // NULL can't be a map key
}
managed_bytes_opt rval = evaluate(oper.rhs, options).to_managed_bytes_opt();
if (!rval) {
return empty_value_set; // All NULL comparisons fail; no column values match.
}
managed_bytes_opt elements[] = {sval, rval};
managed_bytes val = tuple_type_impl::build_value_fragmented(elements);
return value_set(value_list{val});
};
return to_vector(predicate{
.solve_for = std::move(solve),
.filter = oper,
.on = on_column{col.col},
.is_singleton = true,
.equality = true,
.order = oper.order,
.op = oper.op,
.is_subscript = true,
});
}
return cannot_solve_on_column(oper, col.col);
},
[&] (const tuple_constructor& tuple) -> std::vector<predicate> {
auto columns = tuple.elements
| std::views::transform([] (const expression& e) { return as<column_value>(e).col; })
| std::ranges::to<std::vector>();
for (unsigned i = 0; i < columns.size(); ++i) {
if (!columns[i]->is_clustering_key() || columns[i]->position() != i) {
on_internal_error(rlogger, "to_predicates: multi-column relation not on a clustering key prefix");
@@ -481,42 +481,42 @@ to_predicates(
if (!(oper.op == oper_t::EQ || is_slice(oper.op))) {
return cannot_solve(oper);
}
auto solve = [oper] (const query_options& options) -> value_set {
auto val = evaluate(oper.rhs, options).to_managed_bytes_opt();
if (!val) {
return empty_value_set; // All NULL comparisons fail; no token values match.
}
if (oper.op == oper_t::EQ) {
return value_list{*val};
} else if (oper.op == oper_t::GT) {
return interval<managed_bytes>::make_starting_with(interval_bound(std::move(*val), exclusive));
} else if (oper.op == oper_t::GTE) {
return interval<managed_bytes>::make_starting_with(interval_bound(std::move(*val), inclusive));
}
static const managed_bytes MININT = managed_bytes(serialized(std::numeric_limits<int64_t>::min())),
MAXINT = managed_bytes(serialized(std::numeric_limits<int64_t>::max()));
// Undocumented feature: when the user types `token(...) < MININT`, we interpret
// that as MAXINT for some reason.
const auto adjusted_val = (*val == MININT) ? MAXINT : *val;
if (oper.op == oper_t::LT) {
return interval<managed_bytes>::make_ending_with(interval_bound(std::move(adjusted_val), exclusive));
} else if (oper.op == oper_t::LTE) {
return interval<managed_bytes>::make_ending_with(interval_bound(std::move(adjusted_val), inclusive));
}
throw std::logic_error(format("get_token_interval unexpected operator {}", oper.op));
};
return to_vector(predicate{
.solve_for = std::move(solve),
.filter = oper,
.on = on_partition_key_token{table_schema_opt},
.is_singleton = (oper.op == oper_t::EQ),
.equality = (oper.op == oper_t::EQ),
.is_slice = expr::is_slice(oper.op),
.is_upper_bound = (oper.op == oper_t::LT || oper.op == oper_t::LTE),
.is_lower_bound = (oper.op == oper_t::GT || oper.op == oper_t::GTE),
.order = oper.order,
.op = oper.op,
});
},
[&] (const binary_operator&) -> std::vector<predicate> {
return cannot_solve(oper);
@@ -555,7 +555,7 @@ to_predicates(
return cannot_solve(oper);
},
}, oper.lhs);
},
[] (const column_value& cv) -> std::vector<predicate> {
return cannot_solve(cv);
},
@@ -806,26 +806,26 @@ bool is_empty_restriction(const expression& e) {
static
std::function<bytes_opt (const query_options&)>
build_value_for_fn(const column_definition& cdef, const expression& e, const schema& s) {
auto ac = to_predicate_on_column(e, &cdef, &s);
return [ac] (const query_options& options) -> bytes_opt {
value_set possible_vals = solve(ac, options);
return std::visit(overloaded_functor {
[&](const value_list& val_list) -> bytes_opt {
if (val_list.empty()) {
return std::nullopt;
}
if (val_list.size() != 1) {
on_internal_error(expr_logger, format("expr::value_for - multiple possible values for column: {}", ac.filter));
}
return to_bytes(val_list.front());
},
[&](const interval<managed_bytes>&) -> bytes_opt {
on_internal_error(expr_logger, format("expr::value_for - possible values are a range: {}", ac.filter));
}
}, possible_vals);
};
}
bool contains_multi_column_restriction(const expression& e) {
@@ -1337,11 +1337,11 @@ statement_restrictions::ck_restrictions_need_filtering() const {
}
return has_partition_key_unrestricted_components()
|| clustering_key_restrictions_need_filtering()
// If token restrictions are present in an indexed query, then all other restrictions need to be filtered.
// A single token restriction can have multiple matching partition key values.
// Because of this we can't create a clustering prefix with more than the token restriction.
|| (_uses_secondary_indexing && has_token_restrictions());
}
bool
@@ -1705,28 +1705,28 @@ dht::partition_range_vector statement_restrictions::get_partition_key_ranges(con
get_partition_key_ranges_fn_t
statement_restrictions::build_partition_key_ranges_fn() const {
return std::visit(overloaded_functor{
[&] (const no_partition_range_restrictions&) -> get_partition_key_ranges_fn_t {
return [] (const query_options& options) -> dht::partition_range_vector{
return {dht::partition_range::make_open_ended_both_sides()};
};
},
[&] (const token_range_restrictions& r) -> get_partition_key_ranges_fn_t {
return [&] (const query_options& options) -> dht::partition_range_vector {
return partition_ranges_from_token(r.token_restrictions, options, *_schema);
};
},
[&] (const single_column_partition_range_restrictions& r) -> get_partition_key_ranges_fn_t {
if (_partition_range_is_simple) {
return [&] (const query_options& options) {
// Special case to avoid extra allocations required for a Cartesian product.
return partition_ranges_from_EQs(r.per_column_restrictions, options, *_schema);
};
} else {
return [&] (const query_options& options) {
return partition_ranges_from_singles(r.per_column_restrictions, options, *_schema);
};
}
}}, _partition_range_restrictions);
}
namespace {
@@ -1970,28 +1970,28 @@ build_get_multi_column_clustering_bounds_fn(
}
});
}
return [schema, range_builders, all_natural, all_reverse] (const query_options& options) -> std::vector<query::clustering_range> {
multi_column_range_accumulator acc;
for (auto& builder : range_builders) {
builder(acc, options);
}
auto bounds = std::move(acc.ranges);
if (!all_natural && !all_reverse) {
std::vector<query::clustering_range> bounds_in_clustering_order;
for (const auto& b : bounds) {
const auto eqv = get_equivalent_ranges(b, *schema);
bounds_in_clustering_order.insert(bounds_in_clustering_order.end(), eqv.cbegin(), eqv.cend());
}
return bounds_in_clustering_order;
}
if (all_reverse) {
for (auto& crange : bounds) {
crange = query::clustering_range(crange.end(), crange.start());
}
}
return bounds;
};
}
/// Reverses the range if the type is reversed. Why don't we have interval::reverse()??
@@ -2288,17 +2288,17 @@ build_range_from_raw_bounds_fn(
std::vector<std::function<query::clustering_range (const query_options&)>> range_builders;
for (const auto& e : exprs | std::views::transform(&predicate::filter)) {
if (auto b = find_clustering_order(e)) {
range_builders.emplace_back([bb = *b, &schema] (const query_options& options) {
auto* b = &bb;
cql3::raw_value tup_val = expr::evaluate(b->rhs, options);
if (tup_val.is_null()) {
on_internal_error(rlogger, format("range_from_raw_bounds: unexpected atom {}", *b));
}
const auto r = to_range(
b->op, clustering_key_prefix::from_optional_exploded(schema, expr::get_tuple_elements(tup_val, *type_of(b->rhs))));
return r;
});
}
}
return [range_builders] (const query_options& options) -> std::vector<query::clustering_range> {
@@ -2322,9 +2322,9 @@ build_range_from_raw_bounds_fn(
get_clustering_bounds_fn_t
statement_restrictions::build_get_clustering_bounds_fn() const {
if (_clustering_prefix_restrictions.empty()) {
return [&] (const query_options& options) -> std::vector<query::clustering_range> {
return {query::clustering_range::make_open_ended_both_sides()};
};
}
if (_clustering_prefix_restrictions[0].is_multi_column) {
bool all_natural = true, all_reverse = true; ///< Whether column types are reversed or natural.
@@ -2342,14 +2342,14 @@ statement_restrictions::build_get_clustering_bounds_fn() const {
}
}
}
return build_get_multi_column_clustering_bounds_fn(_schema, _clustering_prefix_restrictions,
all_natural, all_reverse);
} else {
return [&] (const query_options& options) -> std::vector<query::clustering_range> {
return get_single_column_clustering_bounds(options, *_schema, _clustering_prefix_restrictions);
};
}
}
}
std::vector<query::clustering_range> statement_restrictions::get_clustering_bounds(const query_options& options) const {
return _get_clustering_bounds_fn(options);
@@ -2475,11 +2475,11 @@ void statement_restrictions::prepare_indexed_global(const schema& idx_tbl_schema
_idx_tbl_ck_prefix->reserve(_idx_tbl_ck_prefix->size() + idx_tbl_schema.clustering_key_size());
auto *single_column_partition_key_restrictions = std::get_if<single_column_partition_range_restrictions>(&_partition_range_restrictions);
if (single_column_partition_key_restrictions) {
for (const auto& e : single_column_partition_key_restrictions->per_column_restrictions) {
const auto col = require_on_single_column(e);
const auto pos = _schema->position(*col) + 1;
(*_idx_tbl_ck_prefix)[pos] = replace_column_def(e, &idx_tbl_schema.clustering_column_at(pos));
}
}
if (std::ranges::any_of(*_idx_tbl_ck_prefix | std::views::drop(1) | std::views::transform(&predicate::filter), is_empty_restriction)) {
@@ -2621,10 +2621,10 @@ statement_restrictions::build_get_global_index_clustering_ranges_fn() const {
return {};
}
return [&] (const query_options& options) {
// Multi column restrictions are not added to _idx_tbl_ck_prefix, they are handled later by filtering.
return get_single_column_clustering_bounds(options, *_view_schema, *_idx_tbl_ck_prefix);
};
}
std::vector<query::clustering_range> statement_restrictions::get_global_index_clustering_ranges(
@@ -2643,14 +2643,14 @@ statement_restrictions::build_get_global_index_token_clustering_ranges_fn() cons
// In old indexes the token column was of type blob.
// This causes problems with sorting and must be handled separately.
if (token_column.type != long_type) {
return [&] (const query_options& options) {
return get_index_v1_token_range_clustering_bounds(options, token_column, _idx_tbl_ck_prefix->at(0));
};
}
return [&] (const query_options& options) {
return get_single_column_clustering_bounds(options, *_view_schema, *_idx_tbl_ck_prefix);
};
}
std::vector<query::clustering_range> statement_restrictions::get_global_index_token_clustering_ranges(
@@ -2664,10 +2664,10 @@ statement_restrictions::build_get_local_index_clustering_ranges_fn() const {
return {};
}
return [&] (const query_options& options) {
// Multi column restrictions are not added to _idx_tbl_ck_prefix, they are handled later by filtering.
return get_single_column_clustering_bounds(options, *_view_schema, *_idx_tbl_ck_prefix);
};
}
std::vector<query::clustering_range> statement_restrictions::get_local_index_clustering_ranges(

View File

@@ -351,6 +351,9 @@ public:
if (agg.state_to_result_function) {
ret.push_back(agg.state_to_result_function);
}
if (agg.state_reduction_function) {
ret.push_back(agg.state_reduction_function);
}
}
}
return false;

View File

@@ -71,7 +71,7 @@ future<shared_ptr<result_message>> modification_statement::execute_without_check
using namespace service::strong_consistency;
if (const auto* redirect = get_if<need_redirect>(&mutate_result)) {
bool is_write = true;
co_return co_await redirect_statement(qp, options, redirect->target, timeout, is_write);
co_return co_await redirect_statement(qp, options, redirect->target, timeout, is_write, coordinator.get().get_stats());
}
utils::get_local_injector().inject("sc_modification_statement_timeout", [&] {
throw exceptions::mutation_write_timeout_exception{"", "", options.get_consistency(), 0, 0, db::write_type::SIMPLE};

View File

@@ -47,7 +47,7 @@ future<::shared_ptr<result_message>> select_statement::do_execute(query_processo
using namespace service::strong_consistency;
if (const auto* redirect = get_if<need_redirect>(&query_result)) {
bool is_write = false;
co_return co_await redirect_statement(qp, options, redirect->target, timeout, is_write);
co_return co_await redirect_statement(qp, options, redirect->target, timeout, is_write, coordinator.get().get_stats());
}
co_return co_await process_results(get<lw_shared_ptr<query::result>>(std::move(query_result)),

View File

@@ -12,19 +12,23 @@
#include "cql3/query_processor.hh"
#include "replica/database.hh"
#include "locator/tablet_replication_strategy.hh"
#include "service/strong_consistency/coordinator.hh"
namespace cql3::statements::strong_consistency {
future<::shared_ptr<cql_transport::messages::result_message>> redirect_statement(query_processor& qp,
const query_options& options,
const locator::tablet_replica& target,
db::timeout_clock::time_point timeout,
bool is_write)
bool is_write,
service::strong_consistency::stats& stats)
{
auto&& func_values_cache = const_cast<cql3::query_options&>(options).take_cached_pk_function_calls();
const auto my_host_id = qp.db().real_database().get_token_metadata().get_topology().my_host_id();
if (target.host != my_host_id) {
++(is_write ? stats.write_node_bounces : stats.read_node_bounces);
co_return qp.bounce_to_node(target, std::move(func_values_cache), timeout, is_write);
}
++(is_write ? stats.write_shard_bounces : stats.read_shard_bounces);
co_return qp.bounce_to_shard(target.shard, std::move(func_values_cache));
}

View File

@@ -11,6 +11,8 @@
#include "cql3/cql_statement.hh"
#include "locator/tablets.hh"
namespace service::strong_consistency { struct stats; }
namespace cql3::statements::strong_consistency {
future<::shared_ptr<cql_transport::messages::result_message>> redirect_statement(
@@ -18,7 +20,8 @@ future<::shared_ptr<cql_transport::messages::result_message>> redirect_statement
const query_options& options,
const locator::tablet_replica& target,
db::timeout_clock::time_point timeout,
bool is_write);
bool is_write,
service::strong_consistency::stats& stats);
bool is_strongly_consistent(data_dictionary::database db, std::string_view ks_name);

View File

@@ -1429,6 +1429,13 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, enable_shard_aware_drivers(this, "enable_shard_aware_drivers", value_status::Used, true, "Enable native transport drivers to use connection-per-shard for better performance.")
, enable_ipv6_dns_lookup(this, "enable_ipv6_dns_lookup", value_status::Used, false, "Use IPv6 address resolution")
, abort_on_internal_error(this, "abort_on_internal_error", liveness::LiveUpdate, value_status::Used, false, "Abort the server instead of throwing exception when internal invariants are violated.")
, abort_on_malformed_sstable_error(this, "abort_on_malformed_sstable_error", liveness::LiveUpdate, value_status::Used,
#if defined(DEBUG) || defined(DEVEL)
true,
#else
false,
#endif
"Abort the server and generate a coredump instead of throwing an exception when any sstable parse error is detected (malformed_sstable_exception, bufsize_mismatch_exception, parse_assert() failures, or BTI parse errors). Intended for debugging memory corruption that may manifest as sstable corruption. Defaults to true in debug and dev builds.")
, max_partition_key_restrictions_per_query(this, "max_partition_key_restrictions_per_query", liveness::LiveUpdate, value_status::Used, 100,
"Maximum number of distinct partition keys restrictions per query. This limit places a bound on the size of IN tuples, "
"especially when multiple partition key columns have IN restrictions. Increasing this value can result in server instability.")

View File

@@ -456,6 +456,7 @@ public:
named_value<bool> enable_shard_aware_drivers;
named_value<bool> enable_ipv6_dns_lookup;
named_value<bool> abort_on_internal_error;
named_value<bool> abort_on_malformed_sstable_error;
named_value<uint32_t> max_partition_key_restrictions_per_query;
named_value<uint32_t> max_clustering_key_restrictions_per_query;
named_value<uint64_t> max_memory_for_unlimited_query_soft_limit;


@@ -29,6 +29,9 @@ class large_data_handler {
public:
struct stats {
int64_t partitions_bigger_than_threshold = 0; // number of large partition updates exceeding threshold_bytes
int64_t rows_bigger_than_threshold = 0; // number of large row updates exceeding row_threshold_bytes
int64_t cells_bigger_than_threshold = 0; // number of large cell updates exceeding cell_threshold_bytes
int64_t collections_bigger_than_threshold = 0; // number of large collection updates exceeding collection_elements_count_threshold
};
private:
@@ -82,6 +85,7 @@ public:
const clustering_key_prefix* clustering_key, uint64_t row_size) {
SCYLLA_ASSERT(running());
if (row_size > _row_threshold_bytes) [[unlikely]] {
++_stats.rows_bigger_than_threshold;
return with_sem([&sst, &partition_key, clustering_key, row_size, this] {
return record_large_rows(sst, partition_key, clustering_key, row_size);
}).then([] {
@@ -102,6 +106,8 @@ public:
const clustering_key_prefix* clustering_key, const column_definition& cdef, uint64_t cell_size, uint64_t collection_elements) {
SCYLLA_ASSERT(running());
above_threshold_result above_threshold{.size = cell_size > _cell_threshold_bytes, .elements = collection_elements > _collection_elements_count_threshold};
_stats.cells_bigger_than_threshold += above_threshold.size;
_stats.collections_bigger_than_threshold += above_threshold.elements;
if (above_threshold.size || above_threshold.elements) [[unlikely]] {
return with_sem([&sst, &partition_key, clustering_key, &cdef, cell_size, collection_elements, this] {
return record_large_cells(sst, partition_key, clustering_key, cdef, cell_size, collection_elements);
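The hunk above adds a per-category counter increment on each threshold check, before the slower recording path runs. A minimal stand-alone sketch of that counting pattern (names are hypothetical, not the real `large_data_handler` API):

```cpp
#include <cassert>
#include <cstdint>

// Sketch of the counter pattern from the diff: every threshold check bumps a
// per-category statistic before the (slower) recording path runs, so large-data
// events stay countable even when detailed recording is deferred.
struct large_data_stats {
    int64_t rows_bigger_than_threshold = 0;
    int64_t cells_bigger_than_threshold = 0;
    int64_t collections_bigger_than_threshold = 0;
};

struct large_data_checker {
    uint64_t row_threshold_bytes;
    uint64_t cell_threshold_bytes;
    uint64_t collection_elements_threshold;
    large_data_stats stats;

    bool check_row(uint64_t row_size) {
        if (row_size > row_threshold_bytes) {  // [[unlikely]] in the real code
            ++stats.rows_bigger_than_threshold;
            return true;                       // caller records the large row
        }
        return false;
    }

    // Both conditions are evaluated and counted independently, mirroring the
    // above_threshold_result{size, elements} pair in the diff.
    bool check_cell(uint64_t cell_size, uint64_t collection_elements) {
        bool size = cell_size > cell_threshold_bytes;
        bool elements = collection_elements > collection_elements_threshold;
        stats.cells_bigger_than_threshold += size;
        stats.collections_bigger_than_threshold += elements;
        return size || elements;
    }
};
```

Counting unconditionally at the check site (rather than inside the recording coroutine) means the statistics cannot drift from the number of detected events.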


@@ -17,7 +17,6 @@
#include "db/snapshot-ctl.hh"
#include "db/snapshot/backup_task.hh"
#include "schema/schema_fwd.hh"
#include "sstables/exceptions.hh"
#include "sstables/sstables.hh"
#include "sstables/sstable_directory.hh"
#include "sstables/sstables_manager.hh"
@@ -164,22 +163,23 @@ future<> backup_task_impl::process_snapshot_dir() {
auto file_path = _snapshot_dir / name;
auto st = co_await file_stat(directory, name);
total += st.size;
try {
auto desc = sstables::parse_path(file_path, "", "");
const auto& gen = desc.generation;
_sstable_comps[gen].emplace_back(name);
_sstables_in_snapshot.insert(desc.generation);
++num_sstable_comps;
// When the SSTable is only linked-to by the snapshot directory,
// it has already been deleted from the table's base directory, and
// should therefore be uploaded earlier to free up its capacity.
if (desc.component == sstables::component_type::Data && st.number_of_links == 1) {
snap_log.debug("backup_task: SSTable with generation {} is already deleted from the table", gen);
_deleted_sstables.push_back(gen);
}
} catch (const sstables::malformed_sstable_exception&) {
auto result = sstables::parse_path(file_path, "", "");
if (!result) {
_files.emplace_back(name);
continue;
}
auto desc = std::move(*result);
const auto& gen = desc.generation;
_sstable_comps[gen].emplace_back(name);
_sstables_in_snapshot.insert(desc.generation);
++num_sstable_comps;
// When the SSTable is only linked-to by the snapshot directory,
// it has already been deleted from the table's base directory, and
// should therefore be uploaded earlier to free up its capacity.
if (desc.component == sstables::component_type::Data && st.number_of_links == 1) {
snap_log.debug("backup_task: SSTable with generation {} is already deleted from the table", gen);
_deleted_sstables.push_back(gen);
}
}
_total_progress.total = total;
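The change above replaces a try/catch around `sstables::parse_path()` with a check on an empty result: a non-sstable file in the snapshot directory is now an expected outcome rather than an exception. A condensed sketch of that control-flow shape, with a deliberately trivial stand-in parser (all names hypothetical):

```cpp
#include <cassert>
#include <optional>
#include <string>

// Stand-in for sstables::entry_descriptor; only the generation matters here.
struct component_desc { std::string generation; };

// Stand-in parser: returns nullopt for files that are not sstable components
// (e.g. a snapshot manifest), instead of throwing malformed_sstable_exception.
std::optional<component_desc> parse_component(const std::string& name) {
    auto dash = name.find('-');
    if (dash == std::string::npos) {
        return std::nullopt;            // plain file: not an sstable component
    }
    return component_desc{name.substr(0, dash)};
}

// Caller mirrors process_snapshot_dir(): plain files go to a flat list,
// components are grouped by generation.
bool classify(const std::string& name, std::string& generation_out) {
    auto result = parse_component(name);
    if (!result) {
        return false;                   // previously: the catch(...) path
    }
    generation_out = result->generation;
    return true;
}
```

Returning an optional keeps the expected "plain file" case off the exception path, which is both cheaper and easier to follow than catch-and-continue.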


@@ -96,6 +96,52 @@ schema_ptr cdc_timestamps() {
static const sstring CDC_TIMESTAMPS_KEY = "timestamps";
schema_ptr service_levels() {
static thread_local auto schema = [] {
auto id = generate_legacy_id(system_distributed_keyspace::NAME, system_distributed_keyspace::SERVICE_LEVELS);
return schema_builder(system_distributed_keyspace::NAME, system_distributed_keyspace::SERVICE_LEVELS, std::make_optional(id))
.with_column("service_level", utf8_type, column_kind::partition_key)
.with_column("timeout", duration_type)
.with_column("workload_type", utf8_type)
.with_column("shares", int32_type)
.with_hash_version()
.build();
}();
return schema;
}
schema_ptr snapshot_sstables() {
static thread_local auto schema = [] {
auto id = generate_legacy_id(system_distributed_keyspace::NAME, system_distributed_keyspace::SNAPSHOT_SSTABLES);
return schema_builder(system_distributed_keyspace::NAME, system_distributed_keyspace::SNAPSHOT_SSTABLES, std::make_optional(id))
// Name of the snapshot
.with_column("snapshot_name", utf8_type, column_kind::partition_key)
// Keyspace where the snapshot was taken
.with_column("keyspace", utf8_type, column_kind::partition_key)
// Table within the keyspace
.with_column("table", utf8_type, column_kind::partition_key)
// Datacenter where this SSTable is located
.with_column("datacenter", utf8_type, column_kind::partition_key)
// Rack where this SSTable is located
.with_column("rack", utf8_type, column_kind::partition_key)
// First token in the token range covered by this SSTable
.with_column("first_token", long_type, column_kind::clustering_key)
// Unique identifier for the SSTable (UUID)
.with_column("sstable_id", uuid_type, column_kind::clustering_key)
// Last token in the token range covered by this SSTable
.with_column("last_token", long_type)
// TOC filename of the SSTable
.with_column("toc_name", utf8_type)
// Prefix path in object storage where the SSTable was backed up
.with_column("prefix", utf8_type)
// Flag if the SSTable was downloaded already
.with_column("downloaded", boolean_type)
.with_hash_version()
.build();
}();
return schema;
}
// This is the set of tables which this node ensures to exist in the cluster.
// It does that by announcing the creation of these schemas on initialization
// of the `system_distributed_keyspace` service (see `start()`), unless it first
@@ -111,11 +157,13 @@ static std::vector<schema_ptr> ensured_tables() {
view_build_status(),
cdc_desc(),
cdc_timestamps(),
service_levels(),
snapshot_sstables(),
};
}
std::vector<schema_ptr> system_distributed_keyspace::all_distributed_tables() {
return {view_build_status(), cdc_desc(), cdc_timestamps()};
return {view_build_status(), cdc_desc(), cdc_timestamps(), service_levels(), snapshot_sstables()};
}
system_distributed_keyspace::system_distributed_keyspace(cql3::query_processor& qp, service::migration_manager& mm, service::storage_proxy& sp)
@@ -400,4 +448,83 @@ system_distributed_keyspace::cdc_current_generation_timestamp(context ctx) {
co_return timestamp_cql->one().get_as<db_clock::time_point>("time");
}
future<> system_distributed_keyspace::insert_snapshot_sstable(sstring snapshot_name, sstring ks, sstring table, sstring dc, sstring rack, sstables::sstable_id sstable_id, dht::token first_token, dht::token last_token, sstring toc_name, sstring prefix, db::consistency_level cl) {
// Not inserting the downloaded column so that re-populating on restore
// retry doesn't overwrite downloaded=true set by a previous attempt
static const sstring query = format("INSERT INTO {}.{} (snapshot_name, \"keyspace\", \"table\", datacenter, rack, first_token, sstable_id, last_token, toc_name, prefix) VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?) USING TTL {}", NAME, SNAPSHOT_SSTABLES, SNAPSHOT_SSTABLES_TTL_SECONDS);
return _qp.execute_internal(
query,
cl,
internal_distributed_query_state(),
{ std::move(snapshot_name), std::move(ks), std::move(table), std::move(dc), std::move(rack),
dht::token::to_int64(first_token), sstable_id.uuid(), dht::token::to_int64(last_token), std::move(toc_name), std::move(prefix) },
cql3::query_processor::cache_internal::yes).discard_result();
}
future<utils::chunked_vector<snapshot_sstable_entry>>
system_distributed_keyspace::get_snapshot_sstables(sstring snapshot_name, sstring ks, sstring table, sstring dc, sstring rack, db::consistency_level cl, std::optional<dht::token> start_token, std::optional<dht::token> end_token) const {
utils::chunked_vector<snapshot_sstable_entry> sstables;
static const sstring base_query = format("SELECT toc_name, prefix, sstable_id, first_token, last_token, downloaded FROM {}.{}"
" WHERE snapshot_name = ? AND \"keyspace\" = ? AND \"table\" = ? AND datacenter = ? AND rack = ?", NAME, SNAPSHOT_SSTABLES);
auto read_row = [&] (const cql3::untyped_result_set_row& row) {
sstables.emplace_back(sstables::sstable_id(row.get_as<utils::UUID>("sstable_id")), dht::token::from_int64(row.get_as<int64_t>("first_token")), dht::token::from_int64(row.get_as<int64_t>("last_token")), row.get_as<sstring>("toc_name"), row.get_as<sstring>("prefix"), is_downloaded(row.get_or<bool>("downloaded", false)));
return make_ready_future<stop_iteration>(stop_iteration::no);
};
if (start_token && end_token) {
co_await _qp.query_internal(
base_query + " AND first_token >= ? AND first_token <= ?",
cl,
{ snapshot_name, ks, table, dc, rack, dht::token::to_int64(*start_token), dht::token::to_int64(*end_token) },
1000,
read_row);
} else if (start_token) {
co_await _qp.query_internal(
base_query + " AND first_token >= ?",
cl,
{ snapshot_name, ks, table, dc, rack, dht::token::to_int64(*start_token) },
1000,
read_row);
} else if (end_token) {
co_await _qp.query_internal(
base_query + " AND first_token <= ?",
cl,
{ snapshot_name, ks, table, dc, rack, dht::token::to_int64(*end_token) },
1000,
read_row);
} else {
co_await _qp.query_internal(
base_query,
cl,
{ snapshot_name, ks, table, dc, rack },
1000,
read_row);
}
co_return sstables;
}
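`get_snapshot_sstables()` above spells out four branches because each bound changes both the query text and the bind list. The query-text half of that logic can be sketched as one small builder (a condensed illustration, not the code's actual structure):

```cpp
#include <cassert>
#include <optional>
#include <string>

// Sketch: the base SELECT gains a token-range predicate only for each bound
// that is actually present, matching the four branches in the diff.
std::string build_query(std::string base, std::optional<long> start, std::optional<long> end) {
    if (start) { base += " AND first_token >= ?"; }
    if (end)   { base += " AND first_token <= ?"; }
    return base;
}
```

The real code keeps the branches separate because the positional bind values must line up with the placeholders in each variant.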
future<> system_distributed_keyspace::update_sstable_download_status(sstring snapshot_name,
sstring ks,
sstring table,
sstring dc,
sstring rack,
sstables::sstable_id sstable_id,
dht::token start_token,
is_downloaded downloaded) const {
static const sstring update_query = format("UPDATE {}.{} USING TTL {} SET downloaded = ? WHERE snapshot_name = ? AND \"keyspace\" = ? AND \"table\" = ? AND "
"datacenter = ? AND rack = ? AND first_token = ? AND sstable_id = ?",
NAME,
SNAPSHOT_SSTABLES,
SNAPSHOT_SSTABLES_TTL_SECONDS);
co_await _qp.execute_internal(update_query,
consistency_level::ONE,
internal_distributed_query_state(),
{downloaded == is_downloaded::yes ? true : false, snapshot_name, ks, table, dc, rack, dht::token::to_int64(start_token), sstable_id.uuid()},
cql3::query_processor::cache_internal::no);
}
} // namespace db


@@ -9,11 +9,17 @@
#pragma once
#include "schema/schema_fwd.hh"
#include "utils/chunked_vector.hh"
#include "db/consistency_level_type.hh"
#include "locator/host_id.hh"
#include "dht/token.hh"
#include "sstables/types.hh"
#include <seastar/core/future.hh>
#include <seastar/core/sstring.hh>
#include <seastar/util/bool_class.hh>
#include <optional>
#include <unordered_map>
namespace cql3 {
@@ -30,13 +36,26 @@ namespace service {
class migration_manager;
}
namespace db {
using is_downloaded = bool_class<class is_downloaded_tag>;
struct snapshot_sstable_entry {
sstables::sstable_id sstable_id;
dht::token first_token;
dht::token last_token;
sstring toc_name;
sstring prefix;
is_downloaded downloaded{is_downloaded::no};
};
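`is_downloaded` above is a `seastar::bool_class`: a strongly typed boolean whose tag type keeps it from being confused with other flags at call sites. A minimal stand-in for the idiom (the real template lives in `<seastar/util/bool_class.hh>`):

```cpp
#include <cassert>

// Minimal sketch of seastar::bool_class: a bool wrapped in a tag-specific
// type, so two different flags cannot be swapped in an argument list.
template <typename Tag>
class bool_class {
    bool _value = false;
public:
    static const bool_class yes;
    static const bool_class no;
    constexpr explicit bool_class(bool v) : _value(v) {}
    constexpr explicit operator bool() const { return _value; }
    friend constexpr bool operator==(bool_class a, bool_class b) {
        return a._value == b._value;
    }
};
template <typename Tag> const bool_class<Tag> bool_class<Tag>::yes{true};
template <typename Tag> const bool_class<Tag> bool_class<Tag>::no{false};

// Same declaration shape as in the diff.
using is_downloaded = bool_class<class is_downloaded_tag>;
```

The explicit conversion forces call sites to write `is_downloaded::yes` rather than a bare `true`, which is why the diff spells out `downloaded == is_downloaded::yes ? true : false` when binding the CQL parameter.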
class system_distributed_keyspace {
public:
static constexpr auto NAME = "system_distributed";
static constexpr auto VIEW_BUILD_STATUS = "view_build_status";
static constexpr auto SERVICE_LEVELS = "service_levels";
/* This table is used by CDC clients to learn about available CDC streams. */
static constexpr auto CDC_DESC_V2 = "cdc_streams_descriptions_v2";
@@ -49,6 +68,12 @@ public:
* in the old table also appear in the new table, if necessary. */
static constexpr auto CDC_DESC_V1 = "cdc_streams_descriptions";
/* This table is used by the backup and restore code to store per-sstable metadata.
* The data the coordinator node puts in this table comes from the snapshot manifests. */
static constexpr auto SNAPSHOT_SSTABLES = "snapshot_sstables";
static constexpr uint64_t SNAPSHOT_SSTABLES_TTL_SECONDS = std::chrono::seconds(std::chrono::days(3)).count();
/* Information required to modify/query some system_distributed tables, passed from the caller. */
struct context {
/* How many different token owners (endpoints) are there in the token ring? */
@@ -87,6 +112,26 @@ public:
// NOTE: currently used only by alternator
future<db_clock::time_point> cdc_current_generation_timestamp(context);
/* Inserts a single SSTable entry for a given snapshot, keyspace, table, datacenter,
* and rack. The row is written with the specified TTL (in seconds). Uses consistency
* level `EACH_QUORUM` by default. */
future<> insert_snapshot_sstable(sstring snapshot_name, sstring ks, sstring table, sstring dc, sstring rack, sstables::sstable_id sstable_id, dht::token first_token, dht::token last_token, sstring toc_name, sstring prefix, db::consistency_level cl = db::consistency_level::EACH_QUORUM);
/* Retrieves all SSTable entries for a given snapshot, keyspace, table, datacenter, and rack.
* If `start_token` and `end_token` are provided, only entries whose `first_token` is in the range [`start_token`, `end_token`] will be returned.
* Returns a vector of `snapshot_sstable_entry` structs containing `sstable_id`, `first_token`, `last_token`,
* `toc_name`, and `prefix`. Uses consistency level `LOCAL_QUORUM` by default. */
future<utils::chunked_vector<snapshot_sstable_entry>> get_snapshot_sstables(sstring snapshot_name, sstring ks, sstring table, sstring dc, sstring rack, db::consistency_level cl = db::consistency_level::LOCAL_QUORUM, std::optional<dht::token> start_token = std::nullopt, std::optional<dht::token> end_token = std::nullopt) const;
future<> update_sstable_download_status(sstring snapshot_name,
sstring ks,
sstring table,
sstring dc,
sstring rack,
sstables::sstable_id sstable_id,
dht::token start_token,
is_downloaded downloaded) const;
private:
future<> create_tables(std::vector<schema_ptr> tables);
};


@@ -1146,7 +1146,8 @@ schema_ptr system_keyspace::sstables_registry() {
static thread_local auto schema = [] {
auto id = generate_legacy_id(NAME, SSTABLES_REGISTRY);
return schema_builder(NAME, SSTABLES_REGISTRY, id)
.with_column("owner", uuid_type, column_kind::partition_key)
.with_column("table_id", uuid_type, column_kind::partition_key)
.with_column("node_owner", uuid_type, column_kind::partition_key)
.with_column("generation", timeuuid_type, column_kind::clustering_key)
.with_column("status", utf8_type)
.with_column("state", utf8_type)
@@ -1309,6 +1310,7 @@ schema_ptr system_keyspace::view_building_tasks() {
return schema_builder(NAME, VIEW_BUILDING_TASKS, std::make_optional(id))
.with_column("key", utf8_type, column_kind::partition_key)
.with_column("id", timeuuid_type, column_kind::clustering_key)
.with_column("min_task_id", timeuuid_type, column_kind::static_column)
.with_column("type", utf8_type)
.with_column("aborted", boolean_type)
.with_column("base_id", uuid_type)
@@ -2749,12 +2751,36 @@ future<mutation> system_keyspace::make_remove_view_build_status_on_host_mutation
static constexpr auto VIEW_BUILDING_KEY = "view_building";
future<db::view::building_tasks> system_keyspace::get_view_building_tasks() {
static const sstring query = format("SELECT id, type, aborted, base_id, view_id, last_token, host_id, shard FROM {}.{} WHERE key = '{}'", NAME, VIEW_BUILDING_TASKS, VIEW_BUILDING_KEY);
future<std::pair<db::view::building_tasks, std::optional<utils::UUID>>> system_keyspace::get_view_building_tasks() {
using namespace db::view;
// When the VIEW_BUILDING_TASKS_MIN_TASK_ID feature is active, read the static
// column min_task_id first and use it as a lower bound for the clustering row
// scan. This skips tombstoned rows below the boundary, avoiding dead-cell
// warnings from the tombstone_warn_threshold check.
std::optional<utils::UUID> min_task_id;
if (_db.features().view_building_tasks_min_task_id) {
auto schema = view_building_tasks();
auto pk = partition_key::from_single_value(*schema, data_value(VIEW_BUILDING_KEY).serialize_nonnull());
auto dk = dht::decorate_key(*schema, pk);
auto col_id = schema->get_column_definition("min_task_id")->id;
query::partition_slice slice(
query::clustering_row_ranges{},
{col_id},
{},
query::partition_slice::option_set::of<query::partition_slice::option::always_return_static_content>());
auto cmd = query::read_command(schema->id(), schema->version(), slice,
_db.get_query_max_result_size(), query::tombstone_limit::max);
auto [qr, _cache_temp] = co_await _db.query(schema, cmd, query::result_options::only_result(),
{dht::partition_range::make_singular(dk)}, nullptr, db::no_timeout);
auto rs = query::result_set::from_raw_result(schema, slice, *qr);
if (!rs.empty()) {
min_task_id = rs.row(0).get<utils::UUID>("min_task_id");
}
}
building_tasks tasks;
co_await _qp.query_internal(query, [&] (const cql3::untyped_result_set_row& row) -> future<stop_iteration> {
auto process_row = [&] (const cql3::untyped_result_set_row& row) -> future<stop_iteration> {
auto id = row.get_as<utils::UUID>("id");
auto type = task_type_from_string(row.get_as<sstring>("type"));
auto aborted = row.get_as<bool>("aborted");
@@ -2779,8 +2805,18 @@ future<db::view::building_tasks> system_keyspace::get_view_building_tasks() {
break;
}
co_return stop_iteration::no;
});
co_return tasks;
};
if (min_task_id) {
static const sstring bounded_query = format("SELECT id, type, aborted, base_id, view_id, last_token, host_id, shard FROM {}.{} WHERE key = '{}' AND id >= ?",
NAME, VIEW_BUILDING_TASKS, VIEW_BUILDING_KEY);
co_await _qp.query_internal(bounded_query, db::consistency_level::LOCAL_ONE, {*min_task_id}, 1000, std::move(process_row));
} else {
static const sstring full_query = format("SELECT id, type, aborted, base_id, view_id, last_token, host_id, shard FROM {}.{} WHERE key = '{}'",
NAME, VIEW_BUILDING_TASKS, VIEW_BUILDING_KEY);
co_await _qp.query_internal(full_query, std::move(process_row));
}
co_return std::pair{std::move(tasks), std::move(min_task_id)};
}
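The `min_task_id` bound read above turns a full-partition scan into one that starts at the boundary, so the rows covered by the range tombstone below it are never visited. The effect can be sketched with an ordered map standing in for the clustering rows (plain ints stand in for timeuuids):

```cpp
#include <cassert>
#include <map>
#include <optional>

// Sketch of the scan-bound optimisation: with a min_task_id boundary the row
// scan models "WHERE id >= ?", so tombstoned rows below the boundary are
// skipped instead of being read as dead cells.
int count_scanned(const std::map<int, bool>& tasks, std::optional<int> min_task_id) {
    int scanned = 0;
    // std::map::lower_bound is the in-memory analogue of the CQL lower bound.
    auto it = min_task_id ? tasks.lower_bound(*min_task_id) : tasks.begin();
    for (; it != tasks.end(); ++it) {
        ++scanned;  // each visited row costs a (possibly dead) cell read
    }
    return scanned;
}
```

That skipped work is exactly what previously tripped the `tombstone_warn_threshold` check mentioned in the comment.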
future<mutation> system_keyspace::make_view_building_task_mutation(api::timestamp_type ts, const db::view::view_building_task& task) {
@@ -3473,37 +3509,37 @@ system_keyspace::read_cdc_generation_opt(utils::UUID id) {
co_return cdc::topology_description{std::move(entries)};
}
future<> system_keyspace::sstables_registry_create_entry(table_id owner, sstring status, sstables::sstable_state state, sstables::entry_descriptor desc) {
static const auto req = format("INSERT INTO system.{} (owner, generation, status, state, version, format) VALUES (?, ?, ?, ?, ?, ?)", SSTABLES_REGISTRY);
slogger.trace("Inserting {}.{} into {}", owner, desc.generation, SSTABLES_REGISTRY);
co_await execute_cql(req, owner.id, desc.generation, status, sstables::state_to_dir(state), fmt::to_string(desc.version), fmt::to_string(desc.format)).discard_result();
future<> system_keyspace::sstables_registry_create_entry(table_id tid, locator::host_id node_owner, sstring status, sstables::sstable_state state, sstables::entry_descriptor desc) {
static const auto req = format("INSERT INTO system.{} (table_id, node_owner, generation, status, state, version, format) VALUES (?, ?, ?, ?, ?, ?, ?)", SSTABLES_REGISTRY);
slogger.trace("Inserting {}.{}.{} into {}", tid, node_owner, desc.generation, SSTABLES_REGISTRY);
co_await execute_cql(req, tid.id, node_owner.uuid(), desc.generation, status, sstables::state_to_dir(state), fmt::to_string(desc.version), fmt::to_string(desc.format)).discard_result();
}
future<> system_keyspace::sstables_registry_update_entry_status(table_id owner, sstables::generation_type gen, sstring status) {
static const auto req = format("UPDATE system.{} SET status = ? WHERE owner = ? AND generation = ?", SSTABLES_REGISTRY);
slogger.trace("Updating {}.{} -> status={} in {}", owner, gen, status, SSTABLES_REGISTRY);
co_await execute_cql(req, status, owner.id, gen).discard_result();
future<> system_keyspace::sstables_registry_update_entry_status(table_id tid, locator::host_id node_owner, sstables::generation_type gen, sstring status) {
static const auto req = format("UPDATE system.{} SET status = ? WHERE table_id = ? AND node_owner = ? AND generation = ?", SSTABLES_REGISTRY);
slogger.trace("Updating {}.{}.{} -> status={} in {}", tid, node_owner, gen, status, SSTABLES_REGISTRY);
co_await execute_cql(req, status, tid.id, node_owner.uuid(), gen).discard_result();
}
future<> system_keyspace::sstables_registry_update_entry_state(table_id owner, sstables::generation_type gen, sstables::sstable_state state) {
static const auto req = format("UPDATE system.{} SET state = ? WHERE owner = ? AND generation = ?", SSTABLES_REGISTRY);
future<> system_keyspace::sstables_registry_update_entry_state(table_id tid, locator::host_id node_owner, sstables::generation_type gen, sstables::sstable_state state) {
static const auto req = format("UPDATE system.{} SET state = ? WHERE table_id = ? AND node_owner = ? AND generation = ?", SSTABLES_REGISTRY);
auto new_state = sstables::state_to_dir(state);
slogger.trace("Updating {}.{} -> state={} in {}", owner, gen, new_state, SSTABLES_REGISTRY);
co_await execute_cql(req, new_state, owner.id, gen).discard_result();
slogger.trace("Updating {}.{}.{} -> state={} in {}", tid, node_owner, gen, new_state, SSTABLES_REGISTRY);
co_await execute_cql(req, new_state, tid.id, node_owner.uuid(), gen).discard_result();
}
future<> system_keyspace::sstables_registry_delete_entry(table_id owner, sstables::generation_type gen) {
static const auto req = format("DELETE FROM system.{} WHERE owner = ? AND generation = ?", SSTABLES_REGISTRY);
slogger.trace("Removing {}.{} from {}", owner, gen, SSTABLES_REGISTRY);
co_await execute_cql(req, owner.id, gen).discard_result();
future<> system_keyspace::sstables_registry_delete_entry(table_id tid, locator::host_id node_owner, sstables::generation_type gen) {
static const auto req = format("DELETE FROM system.{} WHERE table_id = ? AND node_owner = ? AND generation = ?", SSTABLES_REGISTRY);
slogger.trace("Removing {}.{}.{} from {}", tid, node_owner, gen, SSTABLES_REGISTRY);
co_await execute_cql(req, tid.id, node_owner.uuid(), gen).discard_result();
}
future<> system_keyspace::sstables_registry_list(table_id owner, sstable_registry_entry_consumer consumer) {
static const auto req = format("SELECT status, state, generation, version, format FROM system.{} WHERE owner = ?", SSTABLES_REGISTRY);
slogger.trace("Listing {} entries from {}", owner, SSTABLES_REGISTRY);
future<> system_keyspace::sstables_registry_list(table_id tid, locator::host_id node_owner, sstable_registry_entry_consumer consumer) {
static const auto req = format("SELECT status, state, generation, version, format FROM system.{} WHERE table_id = ? AND node_owner = ?", SSTABLES_REGISTRY);
slogger.trace("Listing {}.{} entries from {}", tid, node_owner, SSTABLES_REGISTRY);
co_await _qp.query_internal(req, db::consistency_level::ONE, { owner.id }, 1000, [ consumer = std::move(consumer) ] (const cql3::untyped_result_set::row& row) -> future<stop_iteration> {
co_await _qp.query_internal(req, db::consistency_level::ONE, { tid.id, node_owner.uuid() }, 1000, [ consumer = std::move(consumer) ] (const cql3::untyped_result_set::row& row) -> future<stop_iteration> {
auto status = row.get_as<sstring>("status");
auto state = sstables::state_from_dir(row.get_as<sstring>("state"));
auto gen = sstables::generation_type(row.get_as<utils::UUID>("generation"));


@@ -572,7 +572,7 @@ public:
future<mutation> make_remove_view_build_status_on_host_mutation(api::timestamp_type ts, system_keyspace_view_name view_name, locator::host_id host_id);
// system.view_building_tasks
future<db::view::building_tasks> get_view_building_tasks();
future<std::pair<db::view::building_tasks, std::optional<utils::UUID>>> get_view_building_tasks();
future<mutation> make_view_building_task_mutation(api::timestamp_type ts, const db::view::view_building_task& task);
future<mutation> make_remove_view_building_task_mutation(api::timestamp_type ts, utils::UUID id);
@@ -671,12 +671,12 @@ public:
future<mutation> make_view_builder_version_mutation(api::timestamp_type ts, view_builder_version_t version);
future<view_builder_version_t> get_view_builder_version();
future<> sstables_registry_create_entry(table_id owner, sstring status, sstables::sstable_state state, sstables::entry_descriptor desc);
future<> sstables_registry_update_entry_status(table_id owner, sstables::generation_type gen, sstring status);
future<> sstables_registry_update_entry_state(table_id owner, sstables::generation_type gen, sstables::sstable_state state);
future<> sstables_registry_delete_entry(table_id owner, sstables::generation_type gen);
future<> sstables_registry_create_entry(table_id tid, locator::host_id node_owner, sstring status, sstables::sstable_state state, sstables::entry_descriptor desc);
future<> sstables_registry_update_entry_status(table_id tid, locator::host_id node_owner, sstables::generation_type gen, sstring status);
future<> sstables_registry_update_entry_state(table_id tid, locator::host_id node_owner, sstables::generation_type gen, sstables::sstable_state state);
future<> sstables_registry_delete_entry(table_id tid, locator::host_id node_owner, sstables::generation_type gen);
using sstable_registry_entry_consumer = sstables::sstables_registry::entry_consumer;
future<> sstables_registry_list(table_id owner, sstable_registry_entry_consumer consumer);
future<> sstables_registry_list(table_id tid, locator::host_id node_owner, sstable_registry_entry_consumer consumer);
future<std::optional<sstring>> load_group0_upgrade_state();
future<> save_group0_upgrade_state(sstring);


@@ -15,24 +15,24 @@ class system_keyspace_sstables_registry : public sstables::sstables_registry {
public:
system_keyspace_sstables_registry(system_keyspace& keyspace) : _keyspace(keyspace.shared_from_this()) {}
virtual seastar::future<> create_entry(table_id owner, sstring status, sstables::sstable_state state, sstables::entry_descriptor desc) override {
return _keyspace->sstables_registry_create_entry(owner, status, state, desc);
virtual seastar::future<> create_entry(table_id tid, locator::host_id node_owner, sstring status, sstables::sstable_state state, sstables::entry_descriptor desc) override {
return _keyspace->sstables_registry_create_entry(tid, node_owner, status, state, desc);
}
virtual seastar::future<> update_entry_status(table_id owner, sstables::generation_type gen, sstring status) override {
return _keyspace->sstables_registry_update_entry_status(owner, gen, status);
virtual seastar::future<> update_entry_status(table_id tid, locator::host_id node_owner, sstables::generation_type gen, sstring status) override {
return _keyspace->sstables_registry_update_entry_status(tid, node_owner, gen, status);
}
virtual seastar::future<> update_entry_state(table_id owner, sstables::generation_type gen, sstables::sstable_state state) override {
return _keyspace->sstables_registry_update_entry_state(owner, gen, state);
virtual seastar::future<> update_entry_state(table_id tid, locator::host_id node_owner, sstables::generation_type gen, sstables::sstable_state state) override {
return _keyspace->sstables_registry_update_entry_state(tid, node_owner, gen, state);
}
virtual seastar::future<> delete_entry(table_id owner, sstables::generation_type gen) override {
return _keyspace->sstables_registry_delete_entry(owner, gen);
virtual seastar::future<> delete_entry(table_id tid, locator::host_id node_owner, sstables::generation_type gen) override {
return _keyspace->sstables_registry_delete_entry(tid, node_owner, gen);
}
virtual seastar::future<> sstables_registry_list(table_id owner, entry_consumer consumer) override {
return _keyspace->sstables_registry_list(owner, std::move(consumer));
virtual seastar::future<> sstables_registry_list(table_id tid, locator::host_id node_owner, entry_consumer consumer) override {
return _keyspace->sstables_registry_list(tid, node_owner, std::move(consumer));
}
};


@@ -11,6 +11,7 @@
#include <exception>
#include <ranges>
#include <seastar/core/abort_source.hh>
#include <seastar/coroutine/maybe_yield.hh>
#include <seastar/coroutine/parallel_for_each.hh>
#include <seastar/core/on_internal_error.hh>
#include "db/view/view_building_coordinator.hh"
@@ -179,7 +180,10 @@ future<> view_building_coordinator::clean_finished_tasks() {
co_return;
}
view_building_task_mutation_builder builder(guard.write_timestamp());
view_building_task_mutation_builder builder(guard.write_timestamp(), _vb_sm.building_state.make_task_uuid_generator(guard.write_timestamp()));
// Collect tasks eligible for deletion: must still be in state and not aborted.
std::vector<utils::UUID> tasks_to_delete;
for (auto& [replica, tasks]: _finished_tasks) {
for (auto& task_id: tasks) {
// The task might be aborted in the meantime. In this case we cannot remove it because we need it to create a new task.
@@ -189,15 +193,65 @@ future<> view_building_coordinator::clean_finished_tasks() {
// If yes, we can just remove it instead of aborting it.
auto task_opt = _vb_sm.building_state.get_task(*_vb_sm.building_state.currently_processed_base_table, replica, task_id);
if (task_opt && !task_opt->get().aborted) {
builder.del_task(task_id);
vbc_logger.debug("Removing finished task with ID: {}", task_id);
tasks_to_delete.push_back(task_id);
}
}
}
co_await commit_mutations(std::move(guard), {builder.build()}, "remove finished view building tasks");
for (auto& [_, tasks_set]: _finished_tasks) {
tasks_set.clear();
if (!tasks_to_delete.empty()) {
// Find the minimum UUID (by timeuuid ordering) among tasks that are NOT being
// deleted — i.e., alive tasks that must remain in the table.
// Everything strictly below this boundary is safe to cover with one range tombstone.
const std::unordered_set<utils::UUID> to_delete_set(tasks_to_delete.begin(), tasks_to_delete.end());
std::optional<utils::UUID> min_alive_uuid;
for (auto& [base_id, base_tasks] : _vb_sm.building_state.tasks_state) {
for (auto& [replica, rep_tasks] : base_tasks) {
auto check = [&](const utils::UUID& id) {
if (!to_delete_set.contains(id)
&& (!min_alive_uuid || timeuuid_tri_compare(id, *min_alive_uuid) < 0)) {
min_alive_uuid = id;
}
};
for (auto& [id, task] : rep_tasks.staging_tasks) {
check(id);
}
for (auto& [view_id, task_m] : rep_tasks.view_tasks) {
for (auto& [id, task] : task_m) {
check(id);
}
}
co_await coroutine::maybe_yield();
}
}
if (min_alive_uuid) {
vbc_logger.debug("Removing finished tasks before ID: {} using range tombstone", *min_alive_uuid);
builder.del_tasks_before(*min_alive_uuid);
for (auto& task_id : tasks_to_delete) {
// Tasks below min_alive_uuid are already covered by the range tombstone.
if (timeuuid_tri_compare(task_id, *min_alive_uuid) < 0) {
continue;
}
vbc_logger.debug("Removing finished task with ID: {}", task_id);
builder.del_task(task_id);
}
} else {
// No alive tasks remain — one range tombstone covers everything.
vbc_logger.debug("No alive tasks remain, removing all finished tasks using range tombstone");
builder.del_all_tasks();
}
if (_db.features().view_building_tasks_min_task_id) {
// If min_alive_uuid == std::nullopt, set min_task_id to a fresh UUID,
// so future scans start past all the just-deleted rows (new tasks created
// later will have larger UUIDs).
builder.set_min_task_id(min_alive_uuid ? *min_alive_uuid : utils::UUID_gen::get_time_UUID());
}
co_await commit_mutations(std::move(guard), {builder.build()}, "remove finished view building tasks");
for (auto& [_, tasks_set]: _finished_tasks) {
tasks_set.clear();
}
}
}
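The deletion strategy above replaces one cell tombstone per finished task with a single range tombstone below the smallest surviving id, plus point deletes only at or above that boundary. A stand-alone sketch of that planning step, with ints standing in for timeuuids and names hypothetical:

```cpp
#include <cassert>
#include <optional>
#include <set>
#include <vector>

// Output of the planning step: either one tombstone over everything, or a
// range tombstone below the boundary plus individual deletes above it.
struct deletion_plan {
    bool delete_all = false;                   // no alive tasks remain
    std::optional<int> range_tombstone_below;  // delete ids strictly below this
    std::vector<int> point_deletes;            // ids deleted one by one
};

deletion_plan plan_deletes(const std::set<int>& all_tasks, const std::set<int>& to_delete) {
    deletion_plan plan;
    for (int id : all_tasks) {                 // ordered: first survivor is minimal
        if (!to_delete.count(id)) {
            plan.range_tombstone_below = id;   // smallest id that must remain
            break;
        }
    }
    if (!plan.range_tombstone_below) {
        plan.delete_all = true;                // nothing survives: one tombstone suffices
        return plan;
    }
    for (int id : to_delete) {
        if (id >= *plan.range_tombstone_below) {
            plan.point_deletes.push_back(id);  // above the boundary: delete individually
        }
        // ids below the boundary are already covered by the range tombstone
    }
    return plan;
}
```

Combined with writing the boundary into the `min_task_id` static column, the coordinator both shrinks the mutation and tells future readers where live data begins.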
@@ -533,7 +587,7 @@ void view_building_coordinator::generate_tablet_migration_updates(utils::chunked
}
auto last_token = tmap.get_last_token(gid.tablet);
view_building_task_mutation_builder builder(guard.write_timestamp());
view_building_task_mutation_builder builder(guard.write_timestamp(), _vb_sm.building_state.make_task_uuid_generator(guard.write_timestamp()));
auto create_task_copy_on_pending_replica = [&] (const view_building_task& task) {
auto new_id = builder.new_id();
@@ -601,7 +655,7 @@ void view_building_coordinator::generate_tablet_resize_updates(utils::chunked_ve
return;
}
bool is_split = old_tmap.tablet_count() < new_tmap.tablet_count();
view_building_task_mutation_builder builder(guard.write_timestamp());
view_building_task_mutation_builder builder(guard.write_timestamp(), _vb_sm.building_state.make_task_uuid_generator(guard.write_timestamp()));
auto create_task_copy = [&] (const view_building_task& task, dht::token last_token) -> utils::UUID {
auto new_id = builder.new_id();
@@ -671,7 +725,7 @@ void view_building_coordinator::abort_tasks(utils::chunked_vector<canonical_muta
}
vbc_logger.debug("Generating abort mutations for tasks for table {}", table_id);
view_building_task_mutation_builder builder(guard.write_timestamp());
view_building_task_mutation_builder builder(guard.write_timestamp(), _vb_sm.building_state.make_task_uuid_generator(guard.write_timestamp()));
auto abort_task_map = [&] (const task_map& task_map) {
for (auto& [id, _]: task_map) {
vbc_logger.debug("Aborting task {}", id);
@@ -700,7 +754,7 @@ void abort_view_building_tasks(const view_building_state_machine& vb_sm,
}
vbc_logger.debug("Generating abort mutations for tasks for table {} on replica {} and last token {}", table_id, replica, last_token);
view_building_task_mutation_builder builder(write_timestamp);
view_building_task_mutation_builder builder(write_timestamp, vb_sm.building_state.make_task_uuid_generator(write_timestamp));
auto abort_task_map = [&] (const task_map& task_map) {
for (auto& [id, task]: task_map) {
if (task.last_token == last_token) {
@@ -742,7 +796,7 @@ void view_building_coordinator::rollback_aborted_tasks(utils::chunked_vector<can
return;
}
view_building_task_mutation_builder builder(guard.write_timestamp());
view_building_task_mutation_builder builder(guard.write_timestamp(), _vb_sm.building_state.make_task_uuid_generator(guard.write_timestamp()));
auto& base_tasks = _vb_sm.building_state.tasks_state.at(table_id);
for (auto& [_, replica_tasks]: base_tasks) {
for (auto& [_, building_task_map]: replica_tasks.view_tasks) {
@@ -759,7 +813,7 @@ void view_building_coordinator::rollback_aborted_tasks(utils::chunked_vector<can
return;
}
view_building_task_mutation_builder builder(guard.write_timestamp());
view_building_task_mutation_builder builder(guard.write_timestamp(), _vb_sm.building_state.make_task_uuid_generator(guard.write_timestamp()));
auto& replica_tasks = _vb_sm.building_state.tasks_state.at(table_id).at(replica);
for (auto& [_, building_task_map]: replica_tasks.view_tasks) {
rollback_task_map(builder, building_task_map);

View File

@@ -8,6 +8,7 @@
*/
#include "db/view/view_building_state.hh"
#include "utils/UUID_gen.hh"
namespace db {
@@ -22,9 +23,10 @@ view_building_task::view_building_task(utils::UUID id, task_type type, bool abor
, replica(replica)
, last_token(last_token) {}
view_building_state::view_building_state(building_tasks tasks_state, std::optional<table_id> processed_base_table)
view_building_state::view_building_state(building_tasks tasks_state, std::optional<table_id> processed_base_table, std::optional<utils::UUID> min_alive_uuid)
: tasks_state(std::move(tasks_state))
, currently_processed_base_table(std::move(processed_base_table)) {}
, currently_processed_base_table(std::move(processed_base_table))
, min_alive_uuid(std::move(min_alive_uuid)) {}
views_state::views_state(std::map<table_id, std::vector<table_id>> views_per_base, view_build_status_map status_map)
: views_per_base(std::move(views_per_base))
@@ -127,6 +129,24 @@ std::map<dht::token, std::vector<view_building_task>> view_building_state::colle
return tasks;
}
task_uuid_generator::task_uuid_generator(api::timestamp_type base_ts)
: _next_ts(base_ts) {}
utils::UUID task_uuid_generator::operator()() {
return utils::UUID_gen::get_random_time_UUID_from_micros(
std::chrono::microseconds{_next_ts++});
}
task_uuid_generator view_building_state::make_task_uuid_generator(api::timestamp_type ts) const {
if (min_alive_uuid) {
auto lower_bound = utils::UUID_gen::micros_timestamp(*min_alive_uuid);
if (ts <= lower_bound) {
ts = lower_bound + 1;
}
}
return task_uuid_generator{ts};
}
}
}
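
The intent of `task_uuid_generator` and `make_task_uuid_generator` can be modeled outside C++. Below is a minimal Python sketch (illustrative only, not Scylla code): each call yields a version-1 timeuuid whose microsecond timestamp is exactly one greater than the previous call's, and the starting timestamp is bumped strictly past any `min_alive_uuid` lower bound.

```python
import random
import uuid

class TaskUuidGenerator:
    """Model of task_uuid_generator: each call returns a time-based UUID whose
    microsecond timestamp is one greater than the previous call's."""
    def __init__(self, base_ts_micros):
        self.next_ts = base_ts_micros

    def __call__(self):
        ts = self.next_ts
        self.next_ts += 1
        # Pack the microsecond timestamp into the v1 timeuuid time fields
        # (100-ns units since the Gregorian epoch), randomizing the rest.
        ns100 = ts * 10 + 0x01B21DD213814000
        time_low = ns100 & 0xFFFFFFFF
        time_mid = (ns100 >> 32) & 0xFFFF
        time_hi = ((ns100 >> 48) & 0x0FFF) | 0x1000   # version 1
        rand = random.getrandbits(62)
        clock_seq = (rand & 0x3FFF) | 0x8000          # RFC 4122 variant
        node = (rand >> 14) & 0xFFFFFFFFFFFF
        return uuid.UUID(fields=(time_low, time_mid, time_hi,
                                 clock_seq >> 8, clock_seq & 0xFF, node))

def make_generator(write_ts, min_alive_uuid_ts=None):
    # Mirrors make_task_uuid_generator(): if a lower bound exists, start
    # strictly above it so every new ID sorts after min_alive_uuid.
    if min_alive_uuid_ts is not None and write_ts <= min_alive_uuid_ts:
        write_ts = min_alive_uuid_ts + 1
    return TaskUuidGenerator(write_ts)
```

Because timeuuid ordering in Scylla is primarily by timestamp, strictly increasing timestamps are what guarantee the new IDs land above the `min_task_id` boundary.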

View File

@@ -14,6 +14,7 @@
#include "db/view/view_build_status.hh"
#include "locator/host_id.hh"
#include "locator/tablets.hh"
#include "mutation/timestamp.hh"
#include "utils/UUID.hh"
#include <fmt/base.h>
#include "schema/schema_fwd.hh"
@@ -64,6 +65,16 @@ struct replica_tasks {
using base_table_tasks = std::map<locator::tablet_replica, replica_tasks>;
using building_tasks = std::map<table_id, base_table_tasks>;
// Generates unique timeuuids with strictly increasing microsecond timestamps.
// Each call to operator() returns a new timeuuid whose timestamp is one
// microsecond greater than the previous one.
class task_uuid_generator {
api::timestamp_type _next_ts;
public:
explicit task_uuid_generator(api::timestamp_type base_ts);
utils::UUID operator()();
};
// Represents cluster-wide view building state (only for tablet-based views).
// The state stores all unfinished view building tasks for all tablet-based views
// and table_id of currently processed base table by view building coordinator.
@@ -73,14 +84,22 @@ using building_tasks = std::map<table_id, base_table_tasks>;
struct view_building_state {
building_tasks tasks_state;
std::optional<table_id> currently_processed_base_table;
std::optional<utils::UUID> min_alive_uuid;
view_building_state(building_tasks tasks_state, std::optional<table_id> processed_base_table);
view_building_state(building_tasks tasks_state, std::optional<table_id> processed_base_table, std::optional<utils::UUID> min_alive_uuid);
view_building_state() = default;
std::optional<std::reference_wrapper<const view_building_task>> get_task(table_id base_id, locator::tablet_replica replica, utils::UUID id) const;
std::vector<std::reference_wrapper<const view_building_task>> get_tasks_for_host(table_id base_id, locator::host_id host) const;
std::map<dht::token, std::vector<view_building_task>> collect_tasks_by_last_token(table_id base_table_id) const;
std::map<dht::token, std::vector<view_building_task>> collect_tasks_by_last_token(table_id base_table_id, const locator::tablet_replica& replica) const;
// Creates a generator that produces unique timeuuids suitable for view
// building task IDs. The generated uuids have strictly increasing
// microsecond timestamps starting from write_timestamp. If min_alive_uuid
// is set, all generated uuids are guaranteed to be greater than
// *min_alive_uuid in timeuuid order.
task_uuid_generator make_task_uuid_generator(api::timestamp_type write_timestamp) const;
};
// Represents global state of tablet-based views.

View File

@@ -14,7 +14,7 @@ namespace db {
namespace view {
utils::UUID view_building_task_mutation_builder::new_id() {
return utils::UUID_gen::get_time_UUID();
return _uuid_gen();
}
clustering_key view_building_task_mutation_builder::get_ck(utils::UUID id) {
@@ -52,6 +52,30 @@ view_building_task_mutation_builder& view_building_task_mutation_builder::del_ta
return *this;
}
view_building_task_mutation_builder& view_building_task_mutation_builder::del_tasks_before(utils::UUID id) {
auto ck = get_ck(id);
range_tombstone rt(
position_in_partition::before_all_clustered_rows(),
position_in_partition_view(ck, bound_weight::before_all_prefixed),
tombstone{_ts, gc_clock::now()});
_m.partition().apply_row_tombstone(*_s, std::move(rt));
return *this;
}
view_building_task_mutation_builder& view_building_task_mutation_builder::del_all_tasks() {
range_tombstone rt(
position_in_partition::before_all_clustered_rows(),
position_in_partition::after_all_clustered_rows(),
tombstone{_ts, gc_clock::now()});
_m.partition().apply_row_tombstone(*_s, std::move(rt));
return *this;
}
view_building_task_mutation_builder& view_building_task_mutation_builder::set_min_task_id(utils::UUID id) {
_m.set_static_cell("min_task_id", data_value(id), _ts);
return *this;
}
}
}

View File

@@ -8,6 +8,7 @@
#pragma once
#include "db/view/view_building_state.hh"
#include "mutation/mutation.hh"
#include "db/system_keyspace.hh"
#include "mutation/timestamp.hh"
@@ -19,17 +20,19 @@ namespace view {
// Factory for mutations to `system.view_building_tasks` table.
class view_building_task_mutation_builder {
api::timestamp_type _ts;
task_uuid_generator _uuid_gen;
schema_ptr _s;
mutation _m;
public:
view_building_task_mutation_builder(api::timestamp_type ts)
view_building_task_mutation_builder(api::timestamp_type ts, task_uuid_generator uuid_gen)
: _ts(ts)
, _uuid_gen(std::move(uuid_gen))
, _s(db::system_keyspace::view_building_tasks())
, _m(_s, partition_key::from_single_value(*_s, data_value("view_building").serialize_nonnull()))
{ }
static utils::UUID new_id();
utils::UUID new_id();
view_building_task_mutation_builder& set_type(utils::UUID id, db::view::view_building_task::task_type type);
view_building_task_mutation_builder& set_aborted(utils::UUID id, bool aborted);
@@ -38,6 +41,12 @@ public:
view_building_task_mutation_builder& set_last_token(utils::UUID id, dht::token last_token);
view_building_task_mutation_builder& set_replica(utils::UUID id, const locator::tablet_replica& replica);
view_building_task_mutation_builder& del_task(utils::UUID id);
// Deletes all tasks with clustering key < id using a range tombstone.
view_building_task_mutation_builder& del_tasks_before(utils::UUID id);
// Deletes all tasks using a range tombstone covering the entire clustering range.
view_building_task_mutation_builder& del_all_tasks();
// Sets the static column min_task_id to `id`.
view_building_task_mutation_builder& set_min_task_id(utils::UUID id);
mutation build() {
return std::move(_m);

View File

@@ -275,11 +275,12 @@ future<> view_building_worker::create_staging_sstable_tasks() {
utils::chunked_vector<canonical_mutation> cmuts;
auto guard = co_await _group0.client().start_operation(_as);
auto uuid_gen = _vb_state_machine.building_state.make_task_uuid_generator(guard.write_timestamp());
auto my_host_id = _db.get_token_metadata().get_topology().my_host_id();
for (auto& [table_id, sst_infos]: _sstables_to_register) {
for (auto& sst_info: sst_infos) {
view_building_task task {
utils::UUID_gen::get_time_UUID(), view_building_task::task_type::process_staging, false,
uuid_gen(), view_building_task::task_type::process_staging, false,
table_id, ::table_id{}, {my_host_id, sst_info.shard}, sst_info.last_token
};
auto mut = co_await _sys_ks.make_view_building_task_mutation(guard.write_timestamp(), task);

View File

@@ -9,6 +9,22 @@ for f in "$etcdir"/scylla.d/*.conf; do
done
if is_privileged; then
# Override pipe-based core_pattern that may not work inside a container
# (e.g. Ubuntu host's apport). File-based patterns resolve inside the
# container's mount namespace, so coredumps land in the right place.
# Derive workdir from scylla.yaml, matching the Python entrypoint logic.
_workdir=$(python3 -c "import yaml; cfg=yaml.safe_load(open('/etc/scylla/scylla.yaml')); print(cfg.get('workdir') or '/var/lib/scylla')" 2>/dev/null || echo "/var/lib/scylla")
_coredump_dir="${_workdir}/coredump"
core_pattern=$(cat /proc/sys/kernel/core_pattern 2>/dev/null || true)
if [[ "$core_pattern" == "|"* ]]; then
if ! mkdir -p "$_coredump_dir" 2>/dev/null; then
echo "WARNING: could not create coredump directory $_coredump_dir" >&2
elif echo "${_coredump_dir}/core.%e.%p.%t" > /proc/sys/kernel/core_pattern 2>/dev/null; then
echo "kernel.core_pattern overridden to file-based pattern: ${_coredump_dir}/core.%e.%p.%t" >&2
else
echo "WARNING: pipe-based core_pattern detected but could not override. Coredumps may be lost." >&2
fi
fi
"$scriptsdir"/scylla_prepare
fi
execsudo /usr/bin/env SCYLLA_HOME=$SCYLLA_HOME SCYLLA_CONF=$SCYLLA_CONF "$bindir"/scylla $SCYLLA_ARGS $SEASTAR_IO $DEV_MODE $CPUSET $SCYLLA_DOCKER_ARGS

View File

@@ -24,6 +24,7 @@ try:
setup.developerMode()
setup.cpuSet()
setup.io()
setup.coredumpSetup()
setup.cqlshrc()
setup.write_rackdc_properties()
setup.arguments()

View File

@@ -3,6 +3,7 @@ import logging
import yaml
import os
import socket
import errno
def is_bind_mount(path):
# Check if the file or its parent is a mount point (bind mount or otherwise)
@@ -47,6 +48,7 @@ class ScyllaSetup:
self._dc = arguments.dc
self._rack = arguments.rack
self._blocked_reactor_notify_ms = arguments.blocked_reactor_notify_ms
self._coredump_dir = None
def _run(self, *args, **kwargs):
logging.info('running: {}'.format(args))
@@ -132,6 +134,70 @@ class ScyllaSetup:
f.write(f"dc={dc}\n")
f.write(f"rack={rack}\n")
CORE_PATTERN_PATH = '/proc/sys/kernel/core_pattern'
def _get_coredump_dir(self):
"""Return the coredump directory, deriving it from scylla.yaml workdir if needed."""
if self._coredump_dir is not None:
return self._coredump_dir
conf_dir = "/etc/scylla"
try:
with open(os.path.join(conf_dir, "scylla.yaml")) as f:
cfg = yaml.safe_load(f) or {}
except Exception:
cfg = {}
workdir = cfg.get('workdir') or '/var/lib/scylla'
self._coredump_dir = os.path.join(workdir, 'coredump')
return self._coredump_dir
def coredumpSetup(self):
"""Configure coredump handling for containers.
The host's kernel.core_pattern may pipe core dumps to a handler
(e.g. Ubuntu's apport) that does not exist or work correctly
inside the container. This method tries to switch to a file-based
core_pattern so that coredumps are written directly to disk.
Writing to /proc/sys/kernel/core_pattern requires privileges
(root with CAP_SYS_ADMIN). When the container lacks permission
a warning is logged with guidance for the operator.
"""
coredump_dir = self._get_coredump_dir()
try:
os.makedirs(coredump_dir, exist_ok=True)
except OSError as e:
logging.warning('Could not create coredump directory %s: %s',
coredump_dir, e)
return
try:
with open(self.CORE_PATTERN_PATH) as f:
current = f.read().strip()
except Exception as e:
logging.debug('Could not read %s: %s', self.CORE_PATTERN_PATH, e)
return
if not current.startswith('|'):
return
desired = f'{coredump_dir}/core.%e.%p.%t'
try:
with open(self.CORE_PATTERN_PATH, 'w') as f:
f.write(desired + '\n')
logging.info('kernel.core_pattern set to %s', desired)
except OSError as e:
if e.errno in (errno.EACCES, errno.EPERM, errno.EROFS):
logging.warning(
'kernel.core_pattern pipes to a program that may not work '
'inside the container, and we lack permission to override it. '
'To fix this, either run with --privileged or set on the host: '
'sysctl -w kernel.core_pattern="%s"', desired)
else:
logging.debug('Unexpected OSError setting core_pattern: %s', e)
except Exception as e:
logging.debug('Unexpected error in coredumpSetup: %s', e)
def arguments(self):
args = []
if self._memory is not None:

View File

@@ -1,5 +1,11 @@
# Alternator Vector Search
```{admonition} Availability
:class: important
The Vector Search feature is only available in [ScyllaDB Cloud](https://cloud.docs.scylladb.com/) - a fully managed DBaaS running ScyllaDB.
```
## Introduction
Alternator vector search is a ScyllaDB extension to the DynamoDB-compatible

View File

@@ -71,7 +71,7 @@ used. If it is used, the statement will be a no-op if the materialized view alre
MV Select Statement
...................
The select statement of a materialized view creation defines which of the base table is included in the view. That
The select statement of a materialized view creation defines which of the base table columns are included in the view. That
statement is limited in a number of ways:
- The :ref:`selection <selection-clause>` is limited to those that only select columns of the base table. In other

View File

@@ -167,6 +167,11 @@ All tables in a keyspace are uploaded, the destination object names will look li
or
`gs://bucket/some/prefix/to/store/data/.../sstable`
# System tables
There are a few system tables that the object-storage-related code needs to touch in order to operate.
* [system_distributed.snapshot_sstables](docs/dev/snapshot_sstables.md) - Used during restore by worker nodes to get the list of SSTables that need to be downloaded from object storage and restored locally.
* [system.sstables](docs/dev/system_keyspace.md#systemsstables) - Used to keep track of SSTables on object storage when a keyspace is created with object storage storage_options.
# Manipulating S3 data
This section intends to give an overview of where, when and how we store data in S3 and provide a quick set of commands

View File

@@ -0,0 +1,52 @@
# system\_distributed.snapshot\_sstables
## Purpose
This table is used during tablet-aware restore to exchange per-SSTable metadata between
the coordinator and worker nodes. When the restore process starts, the coordinator node
populates this table with information about each SSTable extracted from the snapshot
manifests. Worker nodes then read from this table to determine which SSTables need to
be downloaded from object storage and restored locally.
Rows are inserted with a TTL so that stale restore metadata is automatically cleaned up.
## Schema
~~~
CREATE TABLE system_distributed.snapshot_sstables (
snapshot_name text,
"keyspace" text,
"table" text,
datacenter text,
rack text,
first_token bigint,
sstable_id uuid,
last_token bigint,
toc_name text,
prefix text,
PRIMARY KEY ((snapshot_name, "keyspace", "table", datacenter, rack), first_token, sstable_id)
)
~~~
Column descriptions:
| Column | Type | Description |
|--------|------|-------------|
| `snapshot_name` | text (partition key) | Name of the snapshot |
| `keyspace` | text (partition key) | Keyspace the snapshot was taken from |
| `table` | text (partition key) | Table within the keyspace |
| `datacenter` | text (partition key) | Datacenter where the SSTable is located |
| `rack` | text (partition key) | Rack where the SSTable is located |
| `first_token` | bigint (clustering key) | First token in the token range covered by this SSTable |
| `sstable_id` | uuid (clustering key) | Unique identifier for the SSTable |
| `last_token` | bigint | Last token in the token range covered by this SSTable |
| `toc_name` | text | TOC filename of the SSTable (e.g. `me-3gdq_0bki_2cvk01yl83nj0tp5gh-big-TOC.txt`) |
| `prefix` | text | Prefix path in object storage where the SSTable was backed up |
## APIs
The following C++ APIs are provided in `db::system_distributed_keyspace`:
- `insert_snapshot_sstable`
- `get_snapshot_sstables`
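
As an illustration of how a worker node consumes this table, a read restricted to one partition might look like the following (hypothetical snapshot, keyspace, and location values; note the quoted reserved identifiers):

~~~
SELECT sstable_id, toc_name, prefix, first_token, last_token
FROM system_distributed.snapshot_sstables
WHERE snapshot_name = 'snap1'
  AND "keyspace" = 'ks' AND "table" = 't'
  AND datacenter = 'dc1' AND rack = 'rack1';
~~~

The full partition key must be specified, so each worker only reads the SSTable metadata for its own datacenter and rack.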

View File

@@ -274,6 +274,8 @@ globally driven by the topology change coordinator and serialized per-tablet. Tr
- repair - tablet replicas are repaired
- restore - tablet replicas download SSTables from object storage during cluster-wide backup restore
Each tablet has its own state machine for keeping state of transition stored in group0 which is part of the tablet state. It involves
these properties of a tablet:
@@ -390,6 +392,9 @@ stateDiagram-v2
The repair tablet transition kind is different. It transits only to the repair and end_repair stage because no token ownership is changed.
The restore tablet transition kind is also simple. It uses a single `restore` stage and does not change token
ownership. See the [Tablet-aware restore](#tablet-aware-restore) section below for details.
The behavioral difference between "migration" and "intranode_migration" transitions is in the way "streaming" stage
is performed. In case of intra-node migration, streaming is done by fast duplication of data by creating hard links to
sstable files on the destination shard. Original sstable files on the source shard will be removed by the standard "cleanup" stage.
@@ -984,3 +989,18 @@ Losing a committed entry can be observed by external systems. For example, the l
schema version in the cluster can go back in time from the driver's perspective. This
is outside the scope of the recovery procedure, though, and it shouldn't cause
problems in practice.
# Tablet restore transition
The `restore` tablet transition kind is used by the tablet-aware restore to download SSTables
from object storage. The transition contains `restore_config` with snapshot name, endpoint and
bucket.
Like `repair`, the `restore` transition does not change token ownership — replicas remain intact.
The topology coordinator processes a tablet in this stage by calling the `RESTORE_TABLET` RPC on
all tablet replicas. Each replica then downloads and attaches the SSTables that are contained in
the tablet's token range. Whether the operation succeeds or fails, the transition is cleared;
any failure to download SSTables is propagated back to the user by the API handler itself.
Restore transitions are serialized per-tablet like any other transition (invariant [INV-TABL-2]),
so they do not run concurrently with migrations or repairs on the same tablet.

View File

@@ -106,6 +106,7 @@ The most important table is `system.view_building_tasks`, which stores all unfin
CREATE TABLE system.view_building_tasks (
key text,
id timeuuid,
min_task_id timeuuid STATIC, -- lower bound for task scans; see "Tombstone avoidance" below
type text,
aborted boolean,
base_id uuid,
@@ -117,6 +118,26 @@ CREATE TABLE system.view_building_tasks (
)
```
### Tombstone avoidance
`system.view_building_tasks` keeps all of its rows in a single partition. When `finished_task_gc_fiber()` removes
finished tasks in batches, the deleted rows remain as tombstones in SSTables until compaction,
causing `tombstone_warn_threshold` warnings on subsequent reloads in large clusters.
Two mechanisms address this:
**Range tombstone on GC.** Instead of one row tombstone per deleted task, the coordinator emits
a single range tombstone `[before_all, min_alive_uuid)` where `min_alive_uuid` is the smallest
timeuuid among surviving tasks. Tasks above the boundary (rare) still get individual row tombstones.
When all tasks are deleted, a single full-partition range tombstone is used.
**Bounded scan on reload.** Physical rows remain until compaction and are still counted as dead cells.
After each GC batch, `min_task_id = min_alive_uuid` is written atomically as a static cell (same Raft
batch as the range tombstone). On reload, `min_task_id` is read using a **static-only partition slice**
(empty `_row_ranges` + `always_return_static_content`) — this makes the SSTable reader stop immediately
after the static row, before any clustering tombstones, so zero dead cells are counted. The value is
then used as `AND id >= min_task_id` to skip all tombstoned rows in the main scan.
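
The GC decision described above can be modeled with plain comparable IDs (a simplified Python sketch, not the coordinator code; real task IDs are timeuuids, modeled here as integers):

```python
def plan_gc(alive_ids, finished_ids):
    """Return (range_tombstone_bound, individual_deletes, new_min_task_id).

    alive_ids / finished_ids: comparable task IDs (timeuuids in Scylla,
    plain ints here). One range tombstone covers everything below the
    smallest alive ID; finished tasks at or above it (rare) still get
    individual row tombstones."""
    if not alive_ids:
        # No alive tasks: one full-partition range tombstone, and
        # min_task_id jumps to a fresh ID so future scans skip all
        # just-deleted rows.
        return ('ALL', [], 'fresh-id')
    min_alive = min(alive_ids)
    individual = [t for t in finished_ids if t >= min_alive]
    return (min_alive, individual, min_alive)
```

For example, `plan_gc([10, 20], [1, 2, 15])` emits one range tombstone below 10, one row tombstone for 15, and sets `min_task_id` to 10, so the next reload scans only `id >= 10`.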
The view building coordinator stores the currently processed base table in `system.scylla_local`
under the `view_building_processing_base` key.
The entry is managed by group0.

View File

@@ -45,7 +45,7 @@ Example:
.. code-block:: console
nodetool removenode 675ed9f4-6564-6dbd-can8-43fddce952gy
nodetool removenode 675ed9f4-6564-6dbd-ca08-43fddce952de
To only mark the node as permanently down without doing actual removal, use :doc:`nodetool excludenode </operating-scylla/nodetool-commands/excludenode>`:
@@ -79,6 +79,6 @@ Example:
.. code-block:: console
nodetool removenode --ignore-dead-nodes 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c,125ed9f4-7777-1dbn-mac8-43fddce9123e 675ed9f4-6564-6dbd-can8-43fddce952gy
nodetool removenode --ignore-dead-nodes 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c,125ed9f4-7777-1db0-aac8-43fddce9123e 675ed9f4-6564-6dbd-ca08-43fddce952de
.. include:: nodetool-index.rst

View File

@@ -74,7 +74,7 @@ Procedure
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.1.201 112.82 KB 256 32.7% 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c B1
UN 192.168.1.202 91.11 KB 256 32.9% 125ed9f4-7777-1dbn-mac8-43fddce9123e B1
UJ 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-can8-43fddce952gy B1
UJ 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-ca08-43fddce952de B1
Nodes in the cluster finished streaming data to the new node:
@@ -86,7 +86,7 @@ Procedure
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.1.201 112.82 KB 256 32.7% 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c B1
UN 192.168.1.202 91.11 KB 256 32.9% 125ed9f4-7777-1dbn-mac8-43fddce9123e B1
UN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-can8-43fddce952gy B1
UN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-ca08-43fddce952de B1
#. When the new node status is Up Normal (UN), run the :doc:`nodetool cleanup </operating-scylla/nodetool-commands/cleanup>` command on all nodes in the cluster except for the new node that has just been added. Cleanup removes keys that were streamed to the newly added node and are no longer owned by the node.

View File

@@ -192,7 +192,7 @@ Adding new nodes
-- Address Load Tokens Owns Host ID Rack
UN 192.168.1.10 500 MB 256 33.3% 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c RACK0
UN 192.168.1.11 500 MB 256 33.3% 125ed9f4-7777-1dbn-mac8-43fddce9123e RACK1
UN 192.168.1.12 500 MB 256 33.3% 675ed9f4-6564-6dbd-can8-43fddce952gy RACK2
UN 192.168.1.12 500 MB 256 33.3% 675ed9f4-6564-6dbd-ca08-43fddce952de RACK2
UJ 192.168.2.10 250 MB 256 ? a1b2c3d4-5678-90ab-cdef-112233445566 RACK0
**Example output after bootstrap completes:**
@@ -205,7 +205,7 @@ Adding new nodes
-- Address Load Tokens Owns Host ID Rack
UN 192.168.1.10 400 MB 256 25.0% 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c RACK0
UN 192.168.1.11 400 MB 256 25.0% 125ed9f4-7777-1dbn-mac8-43fddce9123e RACK1
UN 192.168.1.12 400 MB 256 25.0% 675ed9f4-6564-6dbd-can8-43fddce952gy RACK2
UN 192.168.1.12 400 MB 256 25.0% 675ed9f4-6564-6dbd-ca08-43fddce952de RACK2
UN 192.168.2.10 400 MB 256 25.0% a1b2c3d4-5678-90ab-cdef-112233445566 RACK0
#. For tablets-enabled clusters, wait for tablet load balancing to complete.

View File

@@ -163,5 +163,5 @@ This example shows how to install and configure a three-node cluster using Gossi
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.1.201 112.82 KB 256 32.7% 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c 43
UN 192.168.1.202 91.11 KB 256 32.9% 125ed9f4-7777-1dbn-mac8-43fddce9123e 44
UN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-can8-43fddce952gy 45
UN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-ca08-43fddce952de 45

View File

@@ -19,7 +19,7 @@ Prerequisites
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.1.201 112.82 KB 256 32.7% 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c B1
UN 192.168.1.202 91.11 KB 256 32.9% 125ed9f4-7777-1dbn-lac8-23fddce9123e B1
UN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-can8-43fddce952gy B1
UN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-ca08-43fddce952de B1
Datacenter: ASIA-DC
Status=Up/Down
@@ -165,7 +165,7 @@ Procedure
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.1.201 112.82 KB 256 32.7% 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c B1
UN 192.168.1.202 91.11 KB 256 32.9% 125ed9f4-7777-1dbn-mac8-43fddce9123e B1
UN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-can8-43fddce952gy B1
UN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-ca08-43fddce952de B1
Datacenter: EUROPE-DC
Status=Up/Down

View File

@@ -18,7 +18,7 @@ Removing a Running Node
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.1.201 112.82 KB 256 32.7% 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c B1
UN 192.168.1.202 91.11 KB 256 32.9% 125ed9f4-7777-1dbn-mac8-43fddce9123e B1
UN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-can8-43fddce952gy B1
UN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-ca08-43fddce952de B1
#. If the node status is **Up Normal (UN)**, run the :doc:`nodetool decommission </operating-scylla/nodetool-commands/decommission>` command
to remove the node you are connected to. Using ``nodetool decommission`` is the recommended method for cluster scale-down operations. It prevents data loss
@@ -75,7 +75,7 @@ command providing the Host ID of the node you are removing. See :doc:`nodetool r
.. code-block:: console
nodetool removenode 675ed9f4-6564-6dbd-can8-43fddce952gy
nodetool removenode 675ed9f4-6564-6dbd-ca08-43fddce952de
The ``nodetool removenode`` command notifies other nodes that the token range it owns needs to be moved and
the nodes should redistribute the data using streaming. Using the command does not guarantee the consistency of the rebalanced data if

View File

@@ -23,7 +23,7 @@ Prerequisites
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.1.201 112.82 KB 256 32.7% 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c B1
DN 192.168.1.202 91.11 KB 256 32.9% 125ed9f4-7777-1dbn-mac8-43fddce9123e B1
DN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-can8-43fddce952gy B1
DN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-ca08-43fddce952de B1
Login to one of the nodes in the cluster with (UN) status, collect the following info from the node:

View File

@@ -29,7 +29,7 @@ Down (DN), and the node can be replaced.
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.1.201 112.82 KB 256 32.7% 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c B1
UN 192.168.1.202 91.11 KB 256 32.9% 125ed9f4-7777-1dbn-mac8-43fddce9123e B1
DN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-can8-43fddce952gy B1
DN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-ca08-43fddce952de B1
Remove the Data
==================
@@ -72,7 +72,7 @@ Procedure
For example (using the Host ID of the failed node from above):
``replace_node_first_boot: 675ed9f4-6564-6dbd-can8-43fddce952gy``
``replace_node_first_boot: 675ed9f4-6564-6dbd-ca08-43fddce952de``
#. Start the new node.
@@ -90,7 +90,7 @@ Procedure
-- Address Load Tokens Owns (effective) Host ID Rack
UN 192.168.1.201 112.82 KB 256 32.7% 8d5ed9f4-7764-4dbd-bad8-43fddce94b7c B1
UN 192.168.1.202 91.11 KB 256 32.9% 125ed9f4-7777-1dbn-mac8-43fddce9123e B1
DN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-can8-43fddce952gy B1
DN 192.168.1.203 124.42 KB 256 32.6% 675ed9f4-6564-6dbd-ca08-43fddce952de B1
``192.168.1.203`` is the dead node.
@@ -121,7 +121,7 @@ Procedure
/192.168.1.203
generation:1553759866
heartbeat:2147483647
HOST_ID:675ed9f4-6564-6dbd-can8-43fddce952gy
HOST_ID:675ed9f4-6564-6dbd-ca08-43fddce952de
STATUS:shutdown,true
RELEASE_VERSION:3.0.8
X3:3
@@ -178,7 +178,7 @@ In this case, the node's data will be cleaned after restart. To remedy this, you
.. code-block:: none
echo 'replace_node_first_boot: 675ed9f4-6564-6dbd-can8-43fddce952gy' | sudo tee --append /etc/scylla/scylla.yaml
echo 'replace_node_first_boot: 675ed9f4-6564-6dbd-ca08-43fddce952de' | sudo tee --append /etc/scylla/scylla.yaml
#. Run the following command to re-setup RAID

View File

@@ -1,5 +1,5 @@
Migrate a Keyspace from Vnodes to Tablets
==========================================
Migrate a Keyspace from Vnodes to Tablets :label-caution:`Experimental`
=========================================================================
This procedure describes how to migrate an existing keyspace from vnodes
to tablets. Tablets are designed to be the long-term replacement for vnodes,
@@ -8,6 +8,9 @@ balancing, automatic cleanups, and improved streaming performance. Migrating to
tablets is strongly recommended. See :doc:`Data Distribution with Tablets </architecture/tablets/>`
for details.
This feature is experimental and will change in future releases, including
the removal of current limitations.
.. note::
The migration is an online operation. This means that the keyspace remains

View File

@@ -16,7 +16,7 @@ Cluster and Node Limits
* - Nodes per cluster
- Low hundreds
* - Node size
- 256 vcpu
- 4096 CPUs
See :ref:`Hardware Requirements <system-requirements-hardware>` for storage
and memory requirements and limits.

View File

@@ -289,8 +289,8 @@ private:
sstring _host;
host_options& _options;
output_stream<char> _output;
input_stream<char> _input;
std::optional<output_stream<char>> _output;
std::optional<input_stream<char>> _input;
seastar::connected_socket _socket;
std::optional<temporary_buffer<char>> _in_buffer;
std::optional<future<>> _pending;
@@ -347,8 +347,8 @@ future<> kmip_host::impl::connection::connect() {
// #998 Set keepalive to try avoiding connection going stale in between commands.
s.set_keepalive_parameters(net::tcp_keepalive_params{60s, 60s, 10});
s.set_keepalive(true);
_input = s.input();
_output = s.output();
_input.emplace(s.input());
_output.emplace(s.output());
});
});
});
@@ -367,9 +367,9 @@ int kmip_host::impl::connection::send(void* data, unsigned int len, unsigned int
}
kmip_log.trace("{}: Sending {} bytes", *this, len);
auto f = _output.write(reinterpret_cast<char *>(data), len).then([this] {
auto f = _output->write(reinterpret_cast<char *>(data), len).then([this] {
kmip_log.trace("{}: send done. flushing...", *this);
return _output.flush();
return _output->flush();
});
// if the call failed already, we still want to
// drop back to "wait_for_io()", because we cannot throw
@@ -405,7 +405,7 @@ int kmip_host::impl::connection::recv(void* data, unsigned int len, unsigned int
}
kmip_log.trace("{}: issue read", *this);
auto f = _input.read().then([this](temporary_buffer<char> buf) {
auto f = _input->read().then([this](temporary_buffer<char> buf) {
kmip_log.trace("{}: got {} bytes", *this, buf.size());
_in_buffer = std::move(buf);
});
@@ -462,8 +462,8 @@ void kmip_host::impl::connection::attach(KMIP_CMD* cmd) {
}
future<> kmip_host::impl::connection::close() {
return _output.close().finally([this] {
return _input.close();
return _output->close().finally([this] {
return _input->close();
});
}

View File

@@ -182,6 +182,7 @@ public:
gms::feature arbitrary_tablet_boundaries { *this, "ARBITRARY_TABLET_BOUNDARIES"sv };
gms::feature large_data_virtual_tables { *this, "LARGE_DATA_VIRTUAL_TABLES"sv };
gms::feature keyspace_multi_rf_change { *this, "KEYSPACE_MULTI_RF_CHANGE"sv };
gms::feature view_building_tasks_min_task_id { *this, "VIEW_BUILDING_TASKS_MIN_TASK_ID"sv };
public:
const std::unordered_map<sstring, std::reference_wrapper<feature>>& registered_features() const;

View File

@@ -53,6 +53,7 @@ set(idl_headers
group0.idl.hh
hinted_handoff.idl.hh
sstables.idl.hh
sstables_loader.idl.hh
storage_proxy.idl.hh
storage_service.idl.hh
strong_consistency/state_machine.idl.hh

View File

@@ -0,0 +1,12 @@
/*
* Copyright 2026-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
class restore_result {
};
verb [[]] restore_tablet (raft::server_id dst_id, locator::global_tablet_id gid) -> restore_result;

View File

@@ -72,6 +72,7 @@ struct raft_topology_cmd_result {
success
};
service::raft_topology_cmd_result::command_status status;
sstring error_message [[version 2026.2]];
};
struct raft_snapshot {

View File

@@ -5,6 +5,8 @@ target_sources(index
PRIVATE
secondary_index.cc
secondary_index_manager.cc
fulltext_index.cc
index_option_utils.cc
vector_index.cc)
target_include_directories(index
PUBLIC

index/fulltext_index.cc
View File

@@ -0,0 +1,96 @@
/*
* Copyright 2026-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#include "cql3/statements/index_target.hh"
#include "cql3/util.hh"
#include "exceptions/exceptions.hh"
#include "schema/schema.hh"
#include "index/fulltext_index.hh"
#include "index/index_option_utils.hh"
#include "index/secondary_index_manager.hh"
#include "utils/UUID_gen.hh"
#include <seastar/core/sstring.hh>
#include <boost/algorithm/string.hpp>
namespace secondary_index {
// Supported text analyzers for fulltext indexing.
// This list corresponds to analyzers expected to be provided
// by the backend search engine (Tantivy).
static const std::vector<sstring> analyzer_values = {
"standard", "english", "german", "french", "spanish", "italian", "portuguese", "russian", "chinese", "japanese", "korean", "simple", "whitespace"};
const static std::unordered_map<sstring, std::function<void(std::string_view, const sstring&, const sstring&)>> fulltext_index_options = {
// 'analyzer' specifies the built-in text analyzer to use for tokenization.
{"analyzer", std::bind_front(util::validate_enumerated_option, analyzer_values)},
// 'positions' controls whether token positions are stored in the index.
// Required for phrase queries. Set to false to save space.
{"positions", std::bind_front(util::validate_enumerated_option, util::boolean_values)},
};
bool fulltext_index::view_should_exist() const {
return false;
}
std::optional<cql3::description> fulltext_index::describe(const index_metadata& im, const schema& base_schema) const {
auto target = im.options().at(cql3::statements::index_target::target_option_name);
auto target_column = cql3::statements::index_target::column_name_from_target_string(target);
return describe_with_target(im, base_schema, cql3::util::maybe_quote(target_column));
}
void fulltext_index::check_target(const schema& schema, const std::vector<::shared_ptr<cql3::statements::index_target>>& targets) const {
using cql3::statements::index_target;
if (targets.size() != 1) {
throw exceptions::invalid_request_exception("Fulltext index must have exactly one target column");
}
auto& target = targets[0];
if (!std::holds_alternative<index_target::single_column>(target->value)) {
throw exceptions::invalid_request_exception("Fulltext index target must be a single column");
}
auto& column = std::get<index_target::single_column>(target->value);
auto c_name = column->to_string();
auto const* c_def = schema.get_column_definition(column->name());
if (c_def == nullptr) {
throw exceptions::invalid_request_exception(format("Column {} not found in schema", c_name));
}
auto kind = c_def->type->get_kind();
if (kind != abstract_type::kind::utf8 && kind != abstract_type::kind::ascii) {
throw exceptions::invalid_request_exception(
format("Fulltext index is only supported on text, varchar, or ascii columns, but column {} has an incompatible type", c_name));
}
}
void fulltext_index::check_index_options(const cql3::statements::index_specific_prop_defs& properties) const {
for (auto option : properties.get_raw_options()) {
auto it = fulltext_index_options.find(option.first);
if (it == fulltext_index_options.end()) {
throw exceptions::invalid_request_exception(format("Unsupported option {} for fulltext index", option.first));
}
it->second(index_type_name(), option.first, option.second);
}
}
void fulltext_index::validate(const schema& schema, const cql3::statements::index_specific_prop_defs& properties,
const std::vector<::shared_ptr<cql3::statements::index_target>>& targets, const gms::feature_service&, const data_dictionary::database&) const {
check_target(schema, targets);
check_index_options(properties);
}
utils::UUID fulltext_index::index_version(const schema& schema) {
return utils::UUID_gen::get_time_UUID();
}
std::unique_ptr<secondary_index::custom_index> fulltext_index_factory() {
return std::make_unique<fulltext_index>();
}
} // namespace secondary_index

index/fulltext_index.hh
View File

@@ -0,0 +1,43 @@
/*
* Copyright 2026-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
#include "schema/schema.hh"
#include "data_dictionary/data_dictionary.hh"
#include "cql3/statements/index_target.hh"
#include "index/secondary_index_manager.hh"
#include <vector>
namespace secondary_index {
class fulltext_index : public custom_index {
public:
std::string_view index_type_name() const override {
return "fulltext";
}
fulltext_index() = default;
~fulltext_index() override = default;
std::optional<cql3::description> describe(const index_metadata& im, const schema& base_schema) const override;
bool view_should_exist() const override;
void validate(const schema& schema, const cql3::statements::index_specific_prop_defs& properties,
const std::vector<::shared_ptr<cql3::statements::index_target>>& targets, const gms::feature_service& fs,
const data_dictionary::database& db) const override;
utils::UUID index_version(const schema& schema) override;
private:
void check_target(const schema& schema, const std::vector<::shared_ptr<cql3::statements::index_target>>& targets) const;
void check_index_options(const cql3::statements::index_specific_prop_defs& properties) const;
};
std::unique_ptr<secondary_index::custom_index> fulltext_index_factory();
} // namespace secondary_index

View File

@@ -0,0 +1,70 @@
/*
* Copyright 2026-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#include "index/index_option_utils.hh"
#include "exceptions/exceptions.hh"
#include <boost/algorithm/string.hpp>
#include <fmt/ranges.h>
#include <seastar/core/format.hh>
namespace secondary_index::util {
void validate_enumerated_option(
const std::vector<sstring>& supported_values, std::string_view index_type_name, const sstring& value_name, const sstring& value) {
bool is_valid = std::any_of(supported_values.begin(), supported_values.end(), [&](const std::string& v) {
return boost::iequals(value, v);
});
if (!is_valid) {
throw exceptions::invalid_request_exception(seastar::format("Invalid value in option '{}' for {} index: '{}'."
" Supported are case-insensitive: {}",
value_name, index_type_name, value, fmt::join(supported_values, ", ")));
}
}
void validate_positive_option(int max, std::string_view index_type_name, const sstring& value_name, const sstring& value) {
int num_value;
size_t len;
try {
num_value = std::stoi(value, &len);
} catch (...) {
throw exceptions::invalid_request_exception(
seastar::format("Invalid value in option '{}' for {} index: '{}' is not an integer", value_name, index_type_name, value));
}
if (len != value.size()) {
throw exceptions::invalid_request_exception(
seastar::format("Invalid value in option '{}' for {} index: '{}' is not an integer", value_name, index_type_name, value));
}
if (num_value <= 0 || num_value > max) {
throw exceptions::invalid_request_exception(
seastar::format("Invalid value in option '{}' for {} index: '{}' is out of valid range [1 - {}]", value_name, index_type_name, value, max));
}
}
void validate_factor_option(float min, float max, std::string_view index_type_name, const sstring& value_name, const sstring& value) {
float num_value;
size_t len;
try {
num_value = std::stof(value, &len);
} catch (...) {
throw exceptions::invalid_request_exception(
seastar::format("Invalid value in option '{}' for {} index: '{}' is not a float", value_name, index_type_name, value));
}
if (len != value.size()) {
throw exceptions::invalid_request_exception(
seastar::format("Invalid value in option '{}' for {} index: '{}' is not a float", value_name, index_type_name, value));
}
if (!(num_value >= min && num_value <= max)) {
throw exceptions::invalid_request_exception(seastar::format(
"Invalid value in option '{}' for {} index: '{}' is out of valid range [{} - {}]", value_name, index_type_name, value, min, max));
}
}
} // namespace secondary_index::util
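The numeric validators above rely on the consumed-length output of `std::stoi`/`std::stof` so that trailing garbage (`"42x"`, `"4.2"` for an integer option) is rejected rather than silently truncated. A minimal standalone sketch of that pattern, using hypothetical names and plain `std::string` instead of the Scylla types:

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical standalone version of the strict-integer check used by
// validate_positive_option(): the whole string must parse as an integer,
// and the value must fall in [1, max]. Throws std::invalid_argument
// on any violation.
static int parse_positive_option(const std::string& value, int max) {
    int num_value;
    std::size_t len = 0;
    try {
        num_value = std::stoi(value, &len);
    } catch (...) {
        throw std::invalid_argument("'" + value + "' is not an integer");
    }
    if (len != value.size()) {  // rejects "42x", "4.2", trailing spaces
        throw std::invalid_argument("'" + value + "' is not an integer");
    }
    if (num_value <= 0 || num_value > max) {
        throw std::invalid_argument("'" + value + "' is out of range");
    }
    return num_value;
}
```

The `len != value.size()` check is the important part: `std::stoi` alone happily accepts `"42x"` and returns 42.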

View File

@@ -0,0 +1,26 @@
/*
* Copyright 2026-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
#include <string_view>
#include <vector>
#include <seastar/core/sstring.hh>
namespace secondary_index::util {
inline const std::vector<seastar::sstring> boolean_values = {"false", "true"};
void validate_enumerated_option(const std::vector<seastar::sstring>& supported_values, std::string_view index_type_name, const seastar::sstring& value_name,
const seastar::sstring& value);
void validate_positive_option(int max, std::string_view index_type_name, const seastar::sstring& value_name, const seastar::sstring& value);
void validate_factor_option(float min, float max, std::string_view index_type_name, const seastar::sstring& value_name, const seastar::sstring& value);
} // namespace secondary_index::util

View File

@@ -9,17 +9,21 @@
*/
#include <functional>
#include <map>
#include <optional>
#include <ranges>
#include <seastar/core/shared_ptr.hh>
#include <string_view>
#include <unordered_map>
#include <unordered_set>
#include "index/secondary_index_manager.hh"
#include "index/secondary_index.hh"
#include "index/fulltext_index.hh"
#include "index/vector_index.hh"
#include "cql3/expr/expression.hh"
#include "cql3/util.hh"
#include "index/target_parser.hh"
#include "schema/schema.hh"
#include "utils/histogram_metrics_helper.hh"
@@ -211,6 +215,7 @@ std::optional<std::function<std::unique_ptr<custom_index>()>> secondary_index_ma
std::transform(lower_class_name.begin(), lower_class_name.end(), lower_class_name.begin(), ::tolower);
const static std::unordered_map<std::string_view, std::function<std::unique_ptr<custom_index>()>> classes = {
{"fulltext_index", fulltext_index_factory},
{"vector_index", vector_index_factory},
};
@@ -233,6 +238,49 @@ std::optional<std::unique_ptr<custom_index>> secondary_index_manager::get_custom
return (*custom_class_factory)();
}
std::optional<cql3::description> custom_index::describe_with_target(
const index_metadata& im,
const schema& base_schema,
const sstring& target_cql) const {
static const std::unordered_set<sstring> system_options = {
cql3::statements::index_target::target_option_name,
db::index::secondary_index::custom_class_option_name,
db::index::secondary_index::index_version_option_name,
};
fragmented_ostringstream os;
os << "CREATE CUSTOM INDEX " << cql3::util::maybe_quote(im.name()) << " ON "
<< cql3::util::maybe_quote(base_schema.ks_name()) << "."
<< cql3::util::maybe_quote(base_schema.cf_name()) << "(" << target_cql << ")"
<< " USING '" << index_type_name() << "_index'";
std::map<sstring, sstring> user_options;
for (const auto& [key, value] : im.options()) {
if (!system_options.contains(key)) {
user_options.emplace(key, value);
}
}
if (!user_options.empty()) {
os << " WITH OPTIONS = {";
bool first = true;
for (const auto& [key, value] : user_options) {
if (!first) {
os << ", ";
}
os << "'" << key << "': '" << value << "'";
first = false;
}
os << "}";
}
return cql3::description{
.keyspace = base_schema.ks_name(),
.type = "index",
.name = im.name(),
.create_statement = std::move(os).to_managed_string(),
};
}
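The shared `describe_with_target()` helper above drops the system-managed option keys and renders the remaining user options as a CQL map literal. A std-only sketch of just that rendering step (hypothetical function and key names, no Scylla types):

```cpp
#include <map>
#include <set>
#include <sstream>
#include <string>

// Hypothetical std-only sketch of the option rendering in
// describe_with_target(): system options (target, class name, index
// version) are filtered out; remaining options are emitted as a
// CQL-style map literal. std::map keeps the output deterministically
// sorted by key, matching the ordered user_options map in the real code.
static std::string render_index_options(const std::map<std::string, std::string>& all_options) {
    static const std::set<std::string> system_options = {"target", "class_name", "index_version"};
    std::map<std::string, std::string> user_options;
    for (const auto& [key, value] : all_options) {
        if (!system_options.count(key)) {
            user_options.emplace(key, value);
        }
    }
    if (user_options.empty()) {
        return "";
    }
    std::ostringstream os;
    os << " WITH OPTIONS = {";
    bool first = true;
    for (const auto& [key, value] : user_options) {
        if (!first) {
            os << ", ";
        }
        os << "'" << key << "': '" << value << "'";
        first = false;
    }
    os << "}";
    return os.str();
}
```

With `{"target": ..., "analyzer": "english"}` this yields `" WITH OPTIONS = {'analyzer': 'english'}"`, and an empty string when only system options are present, so the `WITH OPTIONS` clause is omitted entirely.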
stats::stats(const sstring& ks_name, const sstring& index_name) {
metrics.add_group("index",
{seastar::metrics::make_histogram("query_latencies", seastar::metrics::description("Index query latencies"), {idx(index_name), ks(ks_name)},

View File

@@ -100,6 +100,7 @@ public:
class custom_index {
public:
virtual ~custom_index() = default;
virtual std::string_view index_type_name() const = 0;
/// Returns a custom description of the index, or std::nullopt if the default index description logic should be used instead.
virtual std::optional<cql3::description> describe(const index_metadata& im, const schema& base_schema) const = 0;
virtual bool view_should_exist() const = 0;
@@ -107,6 +108,12 @@ public:
const std::vector<::shared_ptr<cql3::statements::index_target>> &targets, const gms::feature_service& fs,
const data_dictionary::database& db) const = 0;
virtual utils::UUID index_version(const schema& schema) = 0;
protected:
std::optional<cql3::description> describe_with_target(
const index_metadata& im,
const schema& base_schema,
const sstring& target_cql) const;
};
struct stats {

View File

@@ -14,66 +14,19 @@
#include "exceptions/exceptions.hh"
#include "schema/schema.hh"
#include "index/vector_index.hh"
#include "index/index_option_utils.hh"
#include "index/secondary_index.hh"
#include "index/secondary_index_manager.hh"
#include "index/target_parser.hh"
#include "types/concrete_types.hh"
#include "utils/UUID_gen.hh"
#include "types/types.hh"
#include "utils/managed_string.hh"
#include <ranges>
#include <seastar/core/sstring.hh>
#include <boost/algorithm/string.hpp>
namespace secondary_index {
static void validate_positive_option(int max, const sstring& value_name, const sstring& value) {
int num_value;
size_t len;
try {
num_value = std::stoi(value, &len);
} catch (...) {
throw exceptions::invalid_request_exception(format("Invalid value in option '{}' for vector index: '{}' is not an integer", value_name, value));
}
if (len != value.size()) {
throw exceptions::invalid_request_exception(format("Invalid value in option '{}' for vector index: '{}' is not an integer", value_name, value));
}
if (num_value <= 0 || num_value > max) {
throw exceptions::invalid_request_exception(format("Invalid value in option '{}' for vector index: '{}' is out of valid range [1 - {}]", value_name, value, max));
}
}
static void validate_factor_option(float min, float max, const sstring& value_name, const sstring& value) {
float num_value;
size_t len;
try {
num_value = std::stof(value, &len);
} catch (...) {
throw exceptions::invalid_request_exception(format("Invalid value in option '{}' for vector index: '{}' is not a float", value_name, value));
}
if (len != value.size()) {
throw exceptions::invalid_request_exception(format("Invalid value in option '{}' for vector index: '{}' is not a float", value_name, value));
}
if (!(num_value >= min && num_value <= max)) {
throw exceptions::invalid_request_exception(format("Invalid value in option '{}' for vector index: '{}' is out of valid range [{} - {}]", value_name, value, min, max));
}
}
static void validate_enumerated_option(const std::vector<sstring>& supported_values, const sstring& value_name, const sstring& value) {
bool is_valid = std::any_of(supported_values.begin(), supported_values.end(),
[&](const std::string& func) { return boost::iequals(value, func); });
if (!is_valid) {
throw exceptions::invalid_request_exception(
seastar::format("Invalid value in option '{}' for vector index: '{}'. Supported are case-insensitive: {}",
value_name,
value,
fmt::join(supported_values, ", ")));
}
}
static const std::vector<sstring> similarity_function_values = {
"cosine", "euclidean", "dot_product"
};
@@ -82,33 +35,29 @@ static const std::vector<sstring> quantization_values = {
"f32", "f16", "bf16", "i8", "b1"
};
static const std::vector<sstring> boolean_values = {
"false", "true"
};
const static std::unordered_map<sstring, std::function<void(const sstring&, const sstring&)>> vector_index_options = {
const static std::unordered_map<sstring, std::function<void(std::string_view, const sstring&, const sstring&)>> vector_index_options = {
// `similarity_function` defines method of calculating similarity between vectors
// Used internally by vector store during both indexing and querying
// CQL implements corresponding functions in cql3/functions/similarity_functions.hh
{"similarity_function", std::bind_front(validate_enumerated_option, similarity_function_values)},
{"similarity_function", std::bind_front(util::validate_enumerated_option, similarity_function_values)},
// 'maximum_node_connections', 'construction_beam_width', 'search_beam_width' define HNSW index parameters
// Used internally by vector store.
{"maximum_node_connections", std::bind_front(validate_positive_option, 512)},
{"construction_beam_width", std::bind_front(validate_positive_option, 4096)},
{"search_beam_width", std::bind_front(validate_positive_option, 4096)},
{"maximum_node_connections", std::bind_front(util::validate_positive_option, 512)},
{"construction_beam_width", std::bind_front(util::validate_positive_option, 4096)},
{"search_beam_width", std::bind_front(util::validate_positive_option, 4096)},
// 'quantization' enables compression of vectors in vector store (not in base table!)
// Used internally by vector store. Scylla only checks it to enable rescoring.
{"quantization", std::bind_front(validate_enumerated_option, quantization_values)},
{"quantization", std::bind_front(util::validate_enumerated_option, quantization_values)},
// 'oversampling' defines factor by which number of candidates retrieved from vector store is multiplied.
// It can improve accuracy of ANN queries, especially for quantized vectors when combined with rescoring.
// Used by Scylla during query processing to increase query limit sent to vector store.
{"oversampling", std::bind_front(validate_factor_option, 1.0f, 100.0f)},
{"oversampling", std::bind_front(util::validate_factor_option, 1.0f, 100.0f)},
// 'rescoring' enables recalculating of similarity scores of candidates retrieved from vector store when quantization is used.
{"rescoring", std::bind_front(validate_enumerated_option, boolean_values)},
{"rescoring", std::bind_front(util::validate_enumerated_option, util::boolean_values)},
// 'source_model' is a Cassandra SAI option specifying the embedding model name.
// Used by Cassandra libraries (e.g., CassIO) to tag indexes with the model that produced the vectors.
// Accepted for compatibility but not used by ScyllaDB.
{"source_model", [](const sstring&, const sstring&) { /* accepted for Cassandra compatibility */ }},
{"source_model", [](std::string_view, const sstring&, const sstring&) { /* accepted for Cassandra compatibility */ }},
};
static constexpr auto TC_TARGET_KEY = "tc";
@@ -255,43 +204,8 @@ bool vector_index::view_should_exist() const {
}
std::optional<cql3::description> vector_index::describe(const index_metadata& im, const schema& base_schema) const {
static const std::unordered_set<sstring> system_options = {
cql3::statements::index_target::target_option_name,
db::index::secondary_index::custom_class_option_name,
db::index::secondary_index::index_version_option_name,
};
fragmented_ostringstream os;
os << "CREATE CUSTOM INDEX " << cql3::util::maybe_quote(im.name()) << " ON " << cql3::util::maybe_quote(base_schema.ks_name()) << "."
<< cql3::util::maybe_quote(base_schema.cf_name()) << "(" << targets_to_cql(im.options().at(cql3::statements::index_target::target_option_name)) << ")"
<< " USING 'vector_index'";
// Collect user-provided options (excluding system keys like target, class_name, index_version).
std::map<sstring, sstring> user_options;
for (const auto& [key, value] : im.options()) {
if (!system_options.contains(key)) {
user_options.emplace(key, value);
}
}
if (!user_options.empty()) {
os << " WITH OPTIONS = {";
bool first = true;
for (const auto& [key, value] : user_options) {
if (!first) {
os << ", ";
}
os << "'" << key << "': '" << value << "'";
first = false;
}
os << "}";
}
return cql3::description{
.keyspace = base_schema.ks_name(),
.type = "index",
.name = im.name(),
.create_statement = std::move(os).to_managed_string(),
};
return describe_with_target(im, base_schema,
targets_to_cql(im.options().at(cql3::statements::index_target::target_option_name)));
}
void vector_index::check_target(const schema& schema, const std::vector<::shared_ptr<cql3::statements::index_target>>& targets) const {
@@ -429,7 +343,7 @@ void vector_index::check_index_options(const cql3::statements::index_specific_pr
if (it == vector_index_options.end()) {
throw exceptions::invalid_request_exception(format("Unsupported option {} for vector index", option.first));
}
it->second(option.first, option.second);
it->second(index_type_name(), option.first, option.second);
}
}

View File

@@ -20,6 +20,8 @@ namespace secondary_index {
class vector_index: public custom_index {
public:
std::string_view index_type_name() const override { return "vector"; }
// The minimal TTL for the CDC used by Vector Search.
// Required to ensure that the data is not deleted until the vector index is fully built.
static constexpr int VS_TTL_SECONDS = 86400; // 24 hours

View File

@@ -15,6 +15,7 @@
#include <ranges>
#include "utils/assert.hh"
#include "utils/serialization.hh"
#include "exceptions/exceptions.hh"
#include <seastar/util/backtrace.hh>
enum class allow_prefixes { no, yes };
@@ -103,7 +104,12 @@ public:
static managed_bytes serialize_value(RangeOfSerializedComponents&& values) {
auto size = serialized_size(values);
if (size > std::numeric_limits<size_type>::max()) {
throw std::runtime_error(format("Key size too large: {:d} > {:d}", size, std::numeric_limits<size_type>::max()));
// Matches Cassandra's wording so CQL-level compatibility tests
// (and client-visible error messages) line up.
// Issues #10366 (SELECT) and #12247 (INSERT) both require a
// clean InvalidRequest here rather than a generic server error.
throw exceptions::invalid_request_exception(format("Key length of {:d} is longer than maximum of {:d}",
size, std::numeric_limits<size_type>::max()));
}
managed_bytes b(managed_bytes::initialized_later(), size);
serialize_value(values, managed_bytes_mutable_view(b));

View File

@@ -90,6 +90,8 @@ write_replica_set_selector get_selector_for_writes(tablet_transition_stage stage
return write_replica_set_selector::previous;
case tablet_transition_stage::end_migration:
return write_replica_set_selector::next;
case tablet_transition_stage::restore:
return write_replica_set_selector::previous;
}
on_internal_error(tablet_logger, format("Invalid tablet transition stage: {}", static_cast<int>(stage)));
}
@@ -123,6 +125,8 @@ read_replica_set_selector get_selector_for_reads(tablet_transition_stage stage)
return read_replica_set_selector::previous;
case tablet_transition_stage::end_migration:
return read_replica_set_selector::next;
case tablet_transition_stage::restore:
return read_replica_set_selector::previous;
}
on_internal_error(tablet_logger, format("Invalid tablet transition stage: {}", static_cast<int>(stage)));
}
@@ -131,12 +135,14 @@ tablet_transition_info::tablet_transition_info(tablet_transition_stage stage,
tablet_transition_kind transition,
tablet_replica_set next,
std::optional<tablet_replica> pending_replica,
service::session_id session_id)
service::session_id session_id,
std::optional<locator::restore_config> restore_cfg)
: stage(stage)
, transition(transition)
, next(std::move(next))
, pending_replica(std::move(pending_replica))
, session_id(session_id)
, restore_cfg(std::move(restore_cfg))
, writes(get_selector_for_writes(stage))
, reads(get_selector_for_reads(stage))
{ }
@@ -186,12 +192,20 @@ tablet_migration_streaming_info get_migration_streaming_info(const locator::topo
return result;
}
case tablet_transition_kind::repair:
case tablet_transition_kind::repair: {
auto s = std::unordered_set<tablet_replica>(tinfo.replicas.begin(), tinfo.replicas.end());
result.stream_weight = locator::tablet_migration_stream_weight_repair;
result.read_from = s;
result.written_to = std::move(s);
return result;
}
case tablet_transition_kind::restore: {
auto s = std::unordered_set<tablet_replica>(tinfo.replicas.begin(), tinfo.replicas.end());
result.stream_weight = locator::tablet_migration_stream_weight_restore;
result.read_from = s;
result.written_to = std::move(s);
return result;
}
}
on_internal_error(tablet_logger, format("Invalid tablet transition kind: {}", static_cast<int>(trinfo.transition)));
}
@@ -847,6 +861,7 @@ static const std::unordered_map<tablet_transition_stage, sstring> tablet_transit
{tablet_transition_stage::cleanup_target, "cleanup_target"},
{tablet_transition_stage::revert_migration, "revert_migration"},
{tablet_transition_stage::end_migration, "end_migration"},
{tablet_transition_stage::restore, "restore"},
};
static const std::unordered_map<sstring, tablet_transition_stage> tablet_transition_stage_from_name = std::invoke([] {
@@ -880,6 +895,7 @@ static const std::unordered_map<tablet_transition_kind, sstring> tablet_transiti
{tablet_transition_kind::rebuild, "rebuild"},
{tablet_transition_kind::rebuild_v2, "rebuild_v2"},
{tablet_transition_kind::repair, "repair"},
{tablet_transition_kind::restore, "restore"},
};
static const std::unordered_map<sstring, tablet_transition_kind> tablet_transition_kind_from_name = std::invoke([] {
@@ -1126,6 +1142,8 @@ std::optional<uint64_t> load_stats::get_tablet_size_in_transition(host_id host,
}
case tablet_transition_kind::intranode_migration:
[[fallthrough]];
case tablet_transition_kind::restore:
[[fallthrough]];
case tablet_transition_kind::repair:
break;
}

View File

@@ -268,6 +268,13 @@ struct tablet_task_info {
static std::unordered_set<sstring> deserialize_repair_dcs_filter(sstring filter);
};
struct restore_config {
sstring snapshot_name;
sstring endpoint;
sstring bucket;
bool operator==(const restore_config&) const = default;
};
/// Stores information about a single tablet.
struct tablet_info {
tablet_replica_set replicas;
@@ -323,6 +330,7 @@ enum class tablet_transition_stage {
end_migration,
repair,
end_repair,
restore,
};
enum class tablet_transition_kind {
@@ -345,6 +353,9 @@ enum class tablet_transition_kind {
// Repair the tablet replicas
repair,
// Download sstables for tablet
restore,
};
tablet_transition_kind choose_rebuild_transition_kind(const gms::feature_service& features);
@@ -368,6 +379,7 @@ struct tablet_transition_info {
tablet_replica_set next;
std::optional<tablet_replica> pending_replica; // Optimization (next - tablet_info::replicas)
service::session_id session_id;
std::optional<locator::restore_config> restore_cfg;
write_replica_set_selector writes;
read_replica_set_selector reads;
@@ -375,7 +387,8 @@ struct tablet_transition_info {
tablet_transition_kind kind,
tablet_replica_set next,
std::optional<tablet_replica> pending_replica,
service::session_id session_id = {});
service::session_id session_id = {},
std::optional<locator::restore_config> rcfg = std::nullopt);
bool operator==(const tablet_transition_info&) const = default;
};
@@ -406,6 +419,7 @@ tablet_transition_info migration_to_transition_info(const tablet_info&, const ta
/// Describes streaming required for a given tablet transition.
constexpr int tablet_migration_stream_weight_default = 1;
constexpr int tablet_migration_stream_weight_repair = 2;
constexpr int tablet_migration_stream_weight_restore = 2;
struct tablet_migration_streaming_info {
std::unordered_set<tablet_replica> read_from;
std::unordered_set<tablet_replica> written_to;

main.cc
View File

@@ -30,6 +30,7 @@
#include "utils/build_id.hh"
#include "utils/only_on_shard0.hh"
#include "supervisor.hh"
#include "timeout_config.hh"
#include "replica/database.hh"
#include <seastar/core/reactor.hh>
#include <seastar/core/app-template.hh>
@@ -67,6 +68,7 @@
#include "vector_search/vector_store_client.hh"
#include <cstdio>
#include <seastar/core/file.hh>
#include <stdexcept>
#include <unistd.h>
#include <sys/time.h>
#include <sys/resource.h>
@@ -86,6 +88,7 @@
#include "service/cache_hitrate_calculator.hh"
#include "compaction/compaction_manager.hh"
#include "sstables/sstables.hh"
#include "sstables/exceptions.hh"
#include "gms/feature_service.hh"
#include "replica/distributed_loader.hh"
#include "sstables_loader.hh"
@@ -221,6 +224,33 @@ read_config(bpo::variables_map& opts, db::config& cfg) {
}
}
static void
self_heal_service_levels_version(db::system_keyspace& sys_ks, cql3::query_processor& qp, service::raft_group0_client& group0_client, abort_source& as) {
static constexpr unsigned max_attempts = 10;
for (unsigned attempt = 1; attempt <= max_attempts; ++attempt) {
try {
auto guard = group0_client.start_operation(as).get();
auto service_levels_version = sys_ks.get_service_levels_version().get();
service::release_guard(std::move(guard));
if (service_levels_version && *service_levels_version == 2) {
startlog.info("Service levels version marker was already self-healed to v2.");
return;
}
auto nodes_count = qp.db().real_database().get_token_metadata().get_normal_token_owners().size();
qos::service_level_controller::migrate_to_v2(nodes_count, sys_ks, qp, group0_client, as).get();
group0_client.send_group0_read_barrier_to_live_members().get();
startlog.info("Self-healed service levels version marker to v2.");
return;
} catch (...) {
if (attempt == max_attempts) {
std::throw_with_nested(std::runtime_error(format("Failed to self-heal service levels version marker after {} attempts", max_attempts)));
}
startlog.info("Concurrent group0 operation while self-healing service levels version marker, retrying ({}/{}).", attempt, max_attempts);
}
}
}
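`self_heal_service_levels_version()` above wraps a group0 operation in a bounded retry: transient failures (e.g. losing the guard to a concurrent group0 operation) are retried, and only the final failure is surfaced, with the original exception nested inside. A generic std-only sketch of that pattern, stripped of the seastar/group0 specifics:

```cpp
#include <functional>
#include <stdexcept>
#include <string>

// Generic sketch of the bounded-retry shape used by
// self_heal_service_levels_version(): retry an operation that may fail
// transiently, and only propagate an error (with the last failure nested
// via std::throw_with_nested) once all attempts are exhausted.
static void retry_bounded(const std::function<void()>& op, unsigned max_attempts) {
    for (unsigned attempt = 1; attempt <= max_attempts; ++attempt) {
        try {
            op();
            return;  // success: stop retrying
        } catch (...) {
            if (attempt == max_attempts) {
                std::throw_with_nested(std::runtime_error(
                    "operation failed after " + std::to_string(max_attempts) + " attempts"));
            }
            // otherwise: log and retry, as the real code does with startlog.info()
        }
    }
}
```

The real function additionally re-checks the version marker under the guard on every attempt, so a concurrent self-heal by another node turns the retry into a no-op.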
#ifdef SCYLLA_ENABLE_ERROR_INJECTION
static future<>
enable_initial_error_injections(const db::config& cfg) {
@@ -1042,6 +1072,11 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
});
set_abort_on_internal_error(cfg->abort_on_internal_error());
auto abort_on_malformed_sstable_error_observer = cfg->abort_on_malformed_sstable_error.observe([] (bool val) {
sstables::set_abort_on_malformed_sstable_error(val);
});
sstables::set_abort_on_malformed_sstable_error(cfg->abort_on_malformed_sstable_error());
checkpoint(stop_signal, "creating snitch");
debug::the_snitch = &snitch;
snitch_config snitch_cfg;
@@ -1368,6 +1403,11 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
spcfg.hints_write_smp_service_group = create_smp_service_group(storage_proxy_smp_service_group_config).get();
spcfg.write_ack_smp_service_group = create_smp_service_group(storage_proxy_smp_service_group_config).get();
static db::view::node_update_backlog node_backlog(smp::count, 10ms, cfg->view_flow_control_delay_limit_in_ms);
static sharded<updateable_timeout_config> timeout_cfg;
timeout_cfg.start(std::ref(*cfg)).get();
auto stop_timeout_cfg = defer_verbose_shutdown("updateable timeout config", [] { timeout_cfg.stop().get(); });
scheduling_group_key_config storage_proxy_stats_cfg =
make_scheduling_group_key_config<service::storage_proxy_stats::stats>();
storage_proxy_stats_cfg.constructor = [plain_constructor = storage_proxy_stats_cfg.constructor] (void* ptr) {
@@ -1381,7 +1421,8 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
};
proxy.start(std::ref(db), spcfg, std::ref(node_backlog),
scheduling_group_key_create(storage_proxy_stats_cfg).get(),
std::ref(feature_service), std::ref(token_metadata), std::ref(erm_factory)).get();
std::ref(feature_service), std::ref(token_metadata), std::ref(erm_factory),
std::ref(timeout_cfg)).get();
// #293 - do not stop anything
// engine().at_exit([&proxy] { return proxy.stop(); });
@@ -1512,6 +1553,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
sys_ks.local().build_bootstrap_info().get();
bool should_self_heal_service_levels_version = false;
if (sys_ks.local().bootstrap_complete()) {
// Check as early as possible if the cluster is fully upgraded to use Raft, since if it's not, then this node cannot be started with the current version.
if (sys_ks.local().load_group0_upgrade_state().get() != "use_post_raft_procedures") {
@@ -1519,7 +1561,8 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
" a node of a cluster that is not using Raft yet. This is no longer supported. Please first complete the upgrade of the cluster to use Raft");
}
if (sys_ks.local().load_topology_upgrade_state().get() != "done") {
const bool raft_topology_done = sys_ks.local().load_topology_upgrade_state().get() == "done";
if (!raft_topology_done) {
throw std::runtime_error(
"Cannot start - cluster is not yet upgraded to use raft topology and this version does not support legacy topology operations. "
"If you are trying to upgrade the node then first upgrade the cluster to use raft topology.");
@@ -1530,10 +1573,14 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
"Cannot start - cluster is not yet upgraded to use auth v2 and this version does not support legacy auth. "
"If you are trying to upgrade the node then first upgrade the cluster to use auth v2.");
}
if (sys_ks.local().get_service_levels_version().get() != 2) {
throw std::runtime_error(
"Cannot start - cluster is not yet upgraded to use service levels v2 and this version does not support legacy service levels. "
"If you are trying to upgrade the node then first upgrade the cluster to use service levels v2.");
auto service_levels_version = sys_ks.local().get_service_levels_version().get();
if (raft_topology_done && (!service_levels_version || *service_levels_version != 2)
&& !utils::get_local_injector().enter("skip_service_levels_v2_initialization")) {
should_self_heal_service_levels_version = true;
startlog.warn(
"Cluster is using raft topology but service levels are still marked as version {}. "
"Startup will continue and the service levels version marker will be self-healed after group0 starts.",
service_levels_version ? format("{}", *service_levels_version) : "unset");
}
if (sys_ks.local().get_view_builder_version().get() != db::system_keyspace::view_builder_version_t::v2) {
throw std::runtime_error(
@@ -2186,7 +2233,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
auth::make_maintenance_socket_role_manager_factory(qp, group0_client, mm, auth_cache),
maintenance_socket_enabled::yes, std::ref(auth_cache)).get();
cql_maintenance_server_ctl.emplace(maintenance_auth_service, mm_notifier, gossiper, qp, service_memory_limiter, sl_controller, lifecycle_notifier, messaging, *cfg, maintenance_cql_sg_stats_key, maintenance_socket_enabled::yes, dbcfg.statement_scheduling_group);
cql_maintenance_server_ctl.emplace(maintenance_auth_service, mm_notifier, gossiper, qp, service_memory_limiter, sl_controller, lifecycle_notifier, messaging, timeout_cfg, *cfg, maintenance_cql_sg_stats_key, maintenance_socket_enabled::yes, dbcfg.statement_scheduling_group);
start_auth_service(maintenance_auth_service, stop_maintenance_auth_service, "maintenance auth service");
}
@@ -2255,7 +2302,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
});
checkpoint(stop_signal, "starting sstables loader");
sst_loader.start(std::ref(db), std::ref(ss), std::ref(messaging), std::ref(view_builder), std::ref(view_building_worker), std::ref(task_manager), std::ref(sstm), dbcfg.streaming_scheduling_group).get();
sst_loader.start(std::ref(db), std::ref(ss), std::ref(messaging), std::ref(view_builder), std::ref(view_building_worker), std::ref(task_manager), std::ref(sstm), std::ref(sys_dist_ks), dbcfg.streaming_scheduling_group).get();
auto stop_sst_loader = defer_verbose_shutdown("sstables loader", [&sst_loader] {
sst_loader.stop().get();
});
@@ -2355,6 +2402,11 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
}).get();
stop_signal.ready(false);
if (should_self_heal_service_levels_version) {
checkpoint(stop_signal, "self-healing service levels version");
self_heal_service_levels_version(sys_ks.local(), qp.local(), group0_client, stop_signal.as_local_abort_source());
}
// At this point, `locator::topology` should be stable, i.e. we should have complete information
// about the layout of the cluster (= list of nodes along with the racks/DCs).
startlog.info("Verifying that all of the keyspaces are RF-rack-valid");
@@ -2618,11 +2670,11 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
// after drain stops them in stop_transport()
// Register controllers after drain_on_shutdown() below, so that even on start
// failure drain is called and stops controllers
cql_transport::controller cql_server_ctl(auth_service, mm_notifier, gossiper, qp, service_memory_limiter, sl_controller, lifecycle_notifier, messaging, *cfg, cql_sg_stats_key, maintenance_socket_enabled::no, dbcfg.statement_scheduling_group);
cql_transport::controller cql_server_ctl(auth_service, mm_notifier, gossiper, qp, service_memory_limiter, sl_controller, lifecycle_notifier, messaging, timeout_cfg, *cfg, cql_sg_stats_key, maintenance_socket_enabled::no, dbcfg.statement_scheduling_group);
api::set_server_service_levels(ctx, cql_server_ctl, qp).get();
alternator::controller alternator_ctl(gossiper, proxy, ss, mm, sys_dist_ks, sys_ks, cdc_generation_service, service_memory_limiter, auth_service, sl_controller, vector_store_client, *cfg, dbcfg.statement_scheduling_group);
alternator::controller alternator_ctl(gossiper, proxy, ss, mm, sys_dist_ks, sys_ks, cdc_generation_service, service_memory_limiter, auth_service, sl_controller, vector_store_client, timeout_cfg, *cfg, dbcfg.statement_scheduling_group);
// Register at_exit last, so that storage_service::drain_on_shutdown will be called first
auto do_drain = defer_verbose_shutdown("local storage", [&ss] {

View File

@@ -24,6 +24,7 @@
#include "service/storage_service.hh"
#include "service/qos/service_level_controller.hh"
#include "streaming/prepare_message.hh"
#include "sstables_loader.hh"
#include "gms/gossip_digest_syn.hh"
#include "gms/gossip_digest_ack.hh"
#include "gms/gossip_digest_ack2.hh"
@@ -139,6 +140,7 @@
#include "idl/tasks.dist.impl.hh"
#include "idl/forward_cql.dist.impl.hh"
#include "gms/feature_service.hh"
#include "idl/sstables_loader.dist.impl.hh"
namespace netw {
@@ -734,6 +736,7 @@ static constexpr unsigned do_get_rpc_client_idx(messaging_verb verb) {
case messaging_verb::TABLE_LOAD_STATS:
case messaging_verb::WORK_ON_VIEW_BUILDING_TASKS:
case messaging_verb::SNAPSHOT_WITH_TABLETS:
case messaging_verb::RESTORE_TABLET:
return 1;
case messaging_verb::CLIENT_ID:
case messaging_verb::MUTATION:

View File

@@ -214,7 +214,8 @@ enum class messaging_verb : int32_t {
RAFT_READ_BARRIER = 85,
FORWARD_CQL_EXECUTE = 86,
FORWARD_CQL_PREPARE = 87,
LAST = 88,
RESTORE_TABLET = 88,
LAST = 89,
};
} // namespace netw

View File

@@ -1279,6 +1279,9 @@ future<int> repair_service::do_repair_start(gms::gossip_address_map& addr_map, s
}
if (!options.start_token.empty() || !options.end_token.empty()) {
if (!options.start_token.empty() && !options.end_token.empty() && options.start_token == options.end_token) {
throw std::invalid_argument("Start and end tokens must be different.");
}
// Intersect the list of local ranges with the given token range,
// dropping ranges with no intersection.
std::optional<::wrapping_interval<dht::token>::bound> tok_start;

View File

@@ -206,6 +206,7 @@ public:
lw_shared_ptr<memtable_list>& memtables() noexcept;
size_t memtable_count() const noexcept;
bool memtable_empty() const noexcept;
// Returns minimum timestamp from memtable list
api::timestamp_type min_memtable_timestamp() const;
// Returns maximum timestamp from memtable list
@@ -289,6 +290,9 @@ public:
seastar::named_gate& sstable_add_gate() noexcept {
return _sstable_add_gate;
}
const seastar::named_gate& sstable_add_gate() const noexcept {
return _sstable_add_gate;
}
compaction::compaction_manager& get_compaction_manager() noexcept;
const compaction::compaction_manager& get_compaction_manager() const noexcept;
@@ -526,3 +530,13 @@ public:
};
}
template <> struct fmt::formatter<replica::compaction_group> {
constexpr auto parse(format_parse_context& ctx) { return ctx.begin(); }
auto format(const replica::compaction_group&, fmt::format_context& ctx) const -> decltype(ctx.out());
};
template <> struct fmt::formatter<replica::storage_group> {
constexpr auto parse(format_parse_context& ctx) { return ctx.begin(); }
auto format(const replica::storage_group&, fmt::format_context& ctx) const -> decltype(ctx.out());
};

View File

@@ -726,6 +726,18 @@ database::setup_metrics() {
sm::description("Number of large partitions exceeding compaction_large_partition_warning_threshold_mb. "
"Large partitions have performance impact and should be avoided, check the documentation for details.")),
sm::make_counter("large_rows_exceeding_threshold", [this] { return _large_data_handler->stats().rows_bigger_than_threshold; },
sm::description("Number of large rows exceeding compaction_large_row_warning_threshold_mb. "
"Large rows have performance impact and should be avoided, check the documentation for details.")),
sm::make_counter("large_cell_exceeding_threshold", [this] { return _large_data_handler->stats().cells_bigger_than_threshold; },
sm::description("Number of large cells exceeding compaction_large_cell_warning_threshold_mb. "
"Large cells have performance impact and should be avoided, check the documentation for details.")),
sm::make_counter("large_collection_exceeding_threshold", [this] { return _large_data_handler->stats().collections_bigger_than_threshold; },
sm::description("Number of large collections exceeding compaction_collection_elements_count_warning_threshold. "
"Large collections have performance impact and should be avoided, check the documentation for details.")),
sm::make_total_operations("total_view_updates_pushed_local", _cf_stats.total_view_updates_pushed_local,
sm::description("Total number of view updates generated for tables and applied locally."))(basic_level),

View File

@@ -1413,6 +1413,15 @@ compaction_group& table::compaction_group_for_key(partition_key_view key, const
return _sg_manager->compaction_group_for_key(key, s);
}
static sstring sstable_desc(const sstables::shared_sstable& sst) {
auto& identifier_opt = sst->sstable_identifier();
auto& originating_host_id_opt = sst->get_stats_metadata().originating_host_id;
return format("{} (originated from {} with id {} on host {})",
sst->get_filename(), sst->get_origin(),
identifier_opt ? identifier_opt->to_sstring() : "unknown",
originating_host_id_opt ? originating_host_id_opt->to_sstring() : "unknown");
}
compaction_group& tablet_storage_group_manager::compaction_group_for_token_range(sstring desc, dht::token first_token, dht::token last_token) const {
auto first_id = storage_group_of(first_token);
auto last_id = storage_group_of(last_token);
@@ -1446,15 +1455,6 @@ compaction_group& tablet_storage_group_manager::compaction_group_for_sstable(con
auto first_token = sst->get_first_decorated_key().token();
auto last_token = sst->get_last_decorated_key().token();
auto sstable_desc = [] (const sstables::shared_sstable& sst) {
auto& identifier_opt = sst->sstable_identifier();
auto& originating_host_id_opt = sst->get_stats_metadata().originating_host_id;
return format("{} (originated from {} with id {} on host {})",
sst->get_filename(), sst->get_origin(),
identifier_opt ? identifier_opt->to_sstring() : "unknown",
originating_host_id_opt ? originating_host_id_opt->to_sstring() : "unknown");
};
return compaction_group_for_token_range(sstable_desc(sst), first_token, last_token);
}
@@ -3313,7 +3313,8 @@ bool has_size_on_leaving (locator::tablet_transition_stage stage) {
case locator::tablet_transition_stage::revert_migration: [[fallthrough]];
case locator::tablet_transition_stage::rebuild_repair: [[fallthrough]];
case locator::tablet_transition_stage::repair: [[fallthrough]];
case locator::tablet_transition_stage::end_repair:
case locator::tablet_transition_stage::end_repair: [[fallthrough]];
case locator::tablet_transition_stage::restore:
return true;
case locator::tablet_transition_stage::cleanup: [[fallthrough]];
case locator::tablet_transition_stage::end_migration:
@@ -3336,7 +3337,8 @@ bool has_size_on_pending (locator::tablet_transition_stage stage) {
case locator::tablet_transition_stage::cleanup: [[fallthrough]];
case locator::tablet_transition_stage::end_migration: [[fallthrough]];
case locator::tablet_transition_stage::repair: [[fallthrough]];
case locator::tablet_transition_stage::end_repair:
case locator::tablet_transition_stage::end_repair: [[fallthrough]];
case locator::tablet_transition_stage::restore:
return true;
}
}
@@ -3445,8 +3447,8 @@ void tablet_storage_group_manager::handle_tablet_split_completion(const locator:
for (auto& [id, sg] : _storage_groups) {
if (!sg->split_unready_groups_are_empty()) {
on_internal_error(tlogger, format("Found that storage of group {} for table {} wasn't split correctly, " \
"therefore groups cannot be remapped with the new tablet count.",
id, table_id));
"therefore groups cannot be remapped with the new tablet count.\nDiagnostics: {}",
id, table_id, *sg));
}
// Remove old empty groups, they're unused, but they need to be deregistered properly
// FIXME: indent.
@@ -4527,6 +4529,10 @@ size_t compaction_group::memtable_count() const noexcept {
return _memtables->size();
}
bool compaction_group::memtable_empty() const noexcept {
return _memtables->empty();
}
size_t storage_group::memtable_count() const {
size_t count = 0;
for_each_compaction_group([&count] (const compaction_group_ptr& cg) {
@@ -5765,3 +5771,43 @@ tombstone_gc_state table::get_tombstone_gc_state() const {
}
} // namespace replica
auto fmt::formatter<replica::compaction_group>::format(const replica::compaction_group& cg, fmt::format_context& ctx) const -> decltype(ctx.out()) {
auto out = ctx.out();
out = fmt::format_to(out, "[sstables=[");
bool first = true;
for (const auto& sst : cg.all_sstables()) {
if (!first) {
out = fmt::format_to(out, ", ");
}
out = fmt::format_to(out, "{}", replica::sstable_desc(sst));
first = false;
}
return fmt::format_to(out, "], memtable_empty={}, sstable_add_gate={}]",
cg.memtable_empty(),
cg.sstable_add_gate().get_count());
}
auto fmt::formatter<replica::storage_group>::format(const replica::storage_group& sg, fmt::format_context& ctx) const -> decltype(ctx.out()) {
auto out = ctx.out();
out = fmt::format_to(out, "main={}", *sg.main_compaction_group());
out = fmt::format_to(out, ", merging=[");
bool first = true;
for (const auto& cg : sg.merging_groups()) {
if (!first) {
out = fmt::format_to(out, ", ");
}
out = fmt::format_to(out, "{}", *cg);
first = false;
}
out = fmt::format_to(out, "], split_ready=[");
first = true;
for (const auto& cg : sg.split_ready_compaction_groups()) {
if (!first) {
out = fmt::format_to(out, ", ");
}
out = fmt::format_to(out, "{}", *cg);
first = false;
}
return fmt::format_to(out, "]");
}
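Both formatters above emit comma-separated element lists using the same first-element flag idiom: print `", "` before every element except the first. A standalone sketch of that idiom without the fmt dependency (the helper name is hypothetical):

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical standalone version of the comma-join idiom used by the
// formatters above: a `first` flag suppresses the separator before the
// first element only.
std::string comma_join(const std::vector<std::string>& items) {
    std::string out = "[";
    bool first = true;
    for (const auto& item : items) {
        if (!first) {
            out += ", ";
        }
        out += item;
        first = false;
    }
    return out + "]";
}
```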

View File

@@ -50,6 +50,9 @@ public:
tablet_mutation_builder& set_resize_task_info(locator::tablet_task_info info, const gms::feature_service& features);
tablet_mutation_builder& del_resize_task_info(const gms::feature_service& features);
tablet_mutation_builder& set_base_table(table_id base_table);
tablet_mutation_builder& set_restore_config(dht::token last_token, locator::restore_config rcfg);
tablet_mutation_builder& del_restore_config(dht::token last_token);
mutation build() {
return std::move(_m);

View File

@@ -40,6 +40,9 @@ static thread_local auto tablet_task_info_type = user_type_impl::get_instance(
static thread_local auto replica_type = tuple_type_impl::get_instance({uuid_type, int32_type});
static thread_local auto replica_set_type = list_type_impl::get_instance(replica_type, false);
static thread_local auto tablet_info_type = tuple_type_impl::get_instance({long_type, long_type, replica_set_type});
static thread_local auto restore_config_type = user_type_impl::get_instance(
"system", "restore_config", {"snapshot_name", "endpoint", "bucket"},
{utf8_type, utf8_type, utf8_type}, false);
data_type get_replica_set_type() {
return replica_set_type;
@@ -52,6 +55,7 @@ data_type get_tablet_info_type() {
void tablet_add_repair_scheduler_user_types(const sstring& ks, replica::database& db) {
db.find_keyspace(ks).add_user_type(repair_scheduler_config_type);
db.find_keyspace(ks).add_user_type(tablet_task_info_type);
db.find_keyspace(ks).add_user_type(restore_config_type);
}
static bool strongly_consistent_tables_enabled = false;
@@ -87,7 +91,8 @@ schema_ptr make_tablets_schema() {
.with_column("repair_incremental_mode", utf8_type)
.with_column("migration_task_info", tablet_task_info_type)
.with_column("resize_task_info", tablet_task_info_type, column_kind::static_column)
.with_column("base_table", uuid_type, column_kind::static_column);
.with_column("base_table", uuid_type, column_kind::static_column)
.with_column("restore_config", restore_config_type);
if (strongly_consistent_tables_enabled) {
builder
@@ -221,6 +226,15 @@ data_value tablet_task_info_to_data_value(const locator::tablet_task_info& info)
return result;
};
data_value restore_config_to_data_value(const locator::restore_config& cfg) {
data_value result = make_user_value(restore_config_type, {
data_value(cfg.snapshot_name),
data_value(cfg.endpoint),
data_value(cfg.bucket),
});
return result;
};
data_value repair_scheduler_config_to_data_value(const locator::repair_scheduler_config& config) {
data_value result = make_user_value(repair_scheduler_config_type, {
data_value(config.auto_repair_enabled),
@@ -444,6 +458,12 @@ tablet_mutation_builder::set_repair_task_info(dht::token last_token, locator::ta
return *this;
}
tablet_mutation_builder&
tablet_mutation_builder::set_restore_config(dht::token last_token, locator::restore_config rcfg) {
_m.set_clustered_cell(get_ck(last_token), "restore_config", restore_config_to_data_value(rcfg), _ts);
return *this;
}
tablet_mutation_builder&
tablet_mutation_builder::del_repair_task_info(dht::token last_token, const gms::feature_service& features) {
auto col = _s->get_column_definition("repair_task_info");
@@ -455,6 +475,13 @@ tablet_mutation_builder::del_repair_task_info(dht::token last_token, const gms::
return *this;
}
tablet_mutation_builder&
tablet_mutation_builder::del_restore_config(dht::token last_token) {
auto col = _s->get_column_definition("restore_config");
_m.set_clustered_cell(get_ck(last_token), *col, atomic_cell::make_dead(_ts, gc_clock::now()));
return *this;
}
tablet_mutation_builder&
tablet_mutation_builder::set_migration_task_info(dht::token last_token, locator::tablet_task_info migration_task_info, const gms::feature_service& features) {
if (features.tablet_migration_virtual_task) {
@@ -545,6 +572,22 @@ locator::tablet_task_info deserialize_tablet_task_info(cql3::untyped_result_set_
tablet_task_info_type->deserialize_value(raw_value));
}
locator::restore_config restore_config_from_cell(const data_value& v) {
std::vector<data_value> dv = value_cast<user_type_impl::native_type>(v);
auto result = locator::restore_config{
value_cast<sstring>(dv[0]),
value_cast<sstring>(dv[1]),
value_cast<sstring>(dv[2]),
};
return result;
}
static
locator::restore_config deserialize_restore_config(cql3::untyped_result_set_row::view_type raw_value) {
return restore_config_from_cell(
restore_config_type->deserialize_value(raw_value));
}
locator::repair_scheduler_config repair_scheduler_config_from_cell(const data_value& v) {
std::vector<data_value> dv = value_cast<user_type_impl::native_type>(v);
auto result = locator::repair_scheduler_config{
@@ -746,6 +789,11 @@ tablet_id process_one_row(replica::database* db, table_id table, tablet_map& map
}
}
std::optional<locator::restore_config> restore_cfg;
if (row.has("restore_config")) {
restore_cfg = deserialize_restore_config(row.get_view("restore_config"));
}
locator::tablet_task_info migration_task_info;
if (row.has("migration_task_info")) {
migration_task_info = deserialize_tablet_task_info(row.get_view("migration_task_info"));
@@ -769,7 +817,7 @@ tablet_id process_one_row(replica::database* db, table_id table, tablet_map& map
session_id = service::session_id(row.get_as<utils::UUID>("session"));
}
map.set_tablet_transition_info(tid, tablet_transition_info{stage, transition,
std::move(new_tablet_replicas), pending_replica, session_id});
std::move(new_tablet_replicas), pending_replica, session_id, std::move(restore_cfg)});
}
tablet_logger.debug("Set sstables_repaired_at={} table={} tablet={}", sstables_repaired_at, table, tid);

View File

@@ -1227,7 +1227,7 @@ fragmented_ostringstream& schema::schema_properties(const schema_describe_helper
map_as_cql_param(os, caching_options().to_map());
os << "}";
os << "\n AND comment = " << cql3::util::single_quote(comment());
os << "\n AND compaction = {'class': '" << compaction::compaction_strategy::name(compaction_strategy()) << "'";
os << "\n AND compaction = {'class': '" << compaction::compaction_strategy::name(configured_compaction_strategy()) << "'";
map_as_cql_param(os, compaction_strategy_options(), false) << "}";
os << "\n AND compression = {";
map_as_cql_param(os, get_compressor_params().get_options());

View File

@@ -5677,6 +5677,21 @@ class scylla_sstable_summary(gdb.Command):
position: 0}
Keys are printed in hexadecimal notation.
For ms-format (trie-based) sstables, displays data from the
partitions db footer instead:
(gdb) scylla sstable-summary $sst
sstable uses ms format (trie-based index).
first_key: 63617373616e647261
last_key: 63617373616e647261
partition_count: 42
trie_root_position: 12345
If the partitions db footer has not been lazily loaded yet (e.g. the
sstable was opened but never read from), the command will report:
sstable uses ms format but partitions db footer is not loaded.
"""
def __init__(self):
gdb.Command.__init__(self, 'scylla sstable-summary', gdb.COMMAND_USER, gdb.COMPLETE_NONE, True)
@@ -5698,7 +5713,16 @@ class scylla_sstable_summary(gdb.Command):
sst = arg
ms_version = int(gdb.parse_and_eval('sstables::sstable_version_types::ms'))
if int(sst['_version']) >= ms_version:
gdb.write("sstable uses ms format (trie-based index); summary is not populated.\n")
footer_opt = std_optional(sst['_partitions_db_footer'])
if not footer_opt:
gdb.write("sstable uses ms format but partitions db footer is not loaded.\n")
return
footer = footer_opt.get()
gdb.write("sstable uses ms format (trie-based index).\n")
gdb.write("first_key: {}\n".format(sstring(footer['first_key']['_bytes'])))
gdb.write("last_key: {}\n".format(sstring(footer['last_key']['_bytes'])))
gdb.write("partition_count: {}\n".format(footer['partition_count']))
gdb.write("trie_root_position: {}\n".format(footer['trie_root_position']))
return
summary = seastar_lw_shared_ptr(sst['_components']['_value']).get().dereference()['summary']

View File

@@ -793,16 +793,18 @@ static future<> add_view_building_tasks_mutations(storage_proxy& sp, view_ptr vi
auto& db = sp.local_db();
auto& sys_ks = sp.system_keyspace();
auto& vb_sm = sp.view_building_state_machine();
auto base_id = view->view_info()->base_id();
auto& base_cf = db.find_column_family(base_id);
auto erm = base_cf.get_effective_replication_map();
auto& tablet_map = erm->get_token_metadata().tablets().get_tablet_map(base_id);
auto uuid_gen = vb_sm.building_state.make_task_uuid_generator(ts);
co_await tablet_map.for_each_tablet([&] (auto tid, const auto& tablet_info) -> future<> {
auto last_token = tablet_map.get_last_token(tid);
for (auto& replica: tablet_info.replicas) {
auto id = utils::UUID_gen::get_time_UUID();
auto id = uuid_gen();
view_building_task task {
id, view_building_task::task_type::build_range, false,
base_id, view->id(), replica, last_token

View File

@@ -26,6 +26,7 @@
#include <seastar/coroutine/maybe_yield.hh>
#include "service/qos/raft_service_level_distributed_data_accessor.hh"
#include "service_level_controller.hh"
#include "db/system_distributed_keyspace.hh"
#include "cql3/query_processor.hh"
#include "service/storage_service.hh"
#include "service/topology_state_machine.hh"
@@ -739,6 +740,80 @@ future<> service_level_controller::do_add_service_level(sstring name, service_le
return make_ready_future();
}
future<> service_level_controller::migrate_to_v2(size_t nodes_count, db::system_keyspace& sys_ks, cql3::query_processor& qp, service::raft_group0_client& group0_client, abort_source& as) {
//TODO:
//For now we trust the administrator not to make changes to service levels during the migration.
//Ideally, during the migration we should install a migration data accessor (on all nodes, on all shards) that allows reads but forbids writes.
using namespace std::chrono_literals;
auto schema = qp.db().find_schema(db::system_distributed_keyspace::NAME, db::system_distributed_keyspace::SERVICE_LEVELS);
const auto t = 5min;
const timeout_config tc{t, t, t, t, t, t, t};
service::client_state cs(::service::client_state::internal_tag{}, tc);
service::query_state qs(cs, empty_service_permit());
// The `system_distributed` keyspace has RF=3 and we need to scan it with CL=ALL.
// To support migration on a cluster with 1 or 2 nodes, lower the CL accordingly.
auto cl = db::consistency_level::ALL;
if (nodes_count == 1) {
cl = db::consistency_level::ONE;
} else if (nodes_count == 2) {
cl = db::consistency_level::TWO;
}
auto rows = co_await qp.execute_internal(
format("SELECT * FROM {}.{}", db::system_distributed_keyspace::NAME, db::system_distributed_keyspace::SERVICE_LEVELS),
cl,
qs,
{},
cql3::query_processor::cache_internal::no);
auto col_names = schema->all_columns() | std::views::transform([] (const auto& col) {return col.name_as_cql_string(); }) | std::ranges::to<std::vector<sstring>>();
auto col_names_str = fmt::to_string(fmt::join(col_names, ", "));
sstring val_binders_str = "?";
for (size_t i = 1; i < col_names.size(); ++i) {
val_binders_str += ", ?";
}
auto guard = co_await group0_client.start_operation(as);
utils::chunked_vector<mutation> migration_muts;
for (const auto& row: *rows) {
std::vector<data_value_or_unset> values;
for (const auto& col: schema->all_columns()) {
if (row.has(col.name_as_text())) {
values.push_back(col.type->deserialize(row.get_blob_unfragmented(col.name_as_text())));
} else {
values.push_back(unset_value{});
}
}
auto muts = co_await qp.get_mutations_internal(
seastar::format("INSERT INTO {}.{} ({}) VALUES ({})",
db::system_keyspace::NAME,
db::system_keyspace::SERVICE_LEVELS_V2,
col_names_str,
val_binders_str),
qos_query_state(),
guard.write_timestamp(),
std::move(values));
if (muts.size() != 1) {
on_internal_error(sl_logger, format("expecting single insert mutation, got {}", muts.size()));
}
migration_muts.push_back(std::move(muts[0]));
}
auto status_mut = co_await sys_ks.make_service_levels_version_mutation(2, guard.write_timestamp());
migration_muts.push_back(std::move(status_mut));
service::write_mutations change {
.mutations{migration_muts.begin(), migration_muts.end()},
};
auto group0_cmd = group0_client.prepare_command(change, guard, "migrate service levels to v2");
co_await group0_client.add_entry(std::move(group0_cmd), std::move(guard), as);
}
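The consistency-level choice at the top of `migrate_to_v2()` can be isolated into a tiny decision function. This is a hypothetical sketch (the enum and function names are illustrative, not Scylla's): a full scan of an RF=3 keyspace normally needs CL=ALL to see every write, but 1- and 2-node clusters cannot reach three replicas, so they get ONE and TWO respectively.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical miniature of the CL selection in migrate_to_v2():
// system_distributed has RF=3, so reading every row reliably requires
// contacting as many replicas as the cluster can actually provide.
enum class consistency { one, two, all };

consistency migration_read_cl(size_t nodes_count) {
    if (nodes_count == 1) {
        return consistency::one;
    } else if (nodes_count == 2) {
        return consistency::two;
    }
    return consistency::all;
}
```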
future<> service_level_controller::do_remove_service_level(sstring name, bool remove_static) {
auto service_level_it = _service_levels_db.find(name);
if (service_level_it != _service_levels_db.end()) {

View File

@@ -396,6 +396,11 @@ public:
return _sl_data_accessor->commit_mutations(std::move(mc), _global_controller_db->group0_aborter);
}
/**
* Migrate data from `system_distributed.service_levels` to `system.service_levels_v2`.
*/
static future<> migrate_to_v2(size_t nodes_count, db::system_keyspace& sys_ks, cql3::query_processor& qp, service::raft_group0_client& group0_client, abort_source& as);
private:
/**
* Adds a service level configuration if it doesn't exist, and updates

View File

@@ -590,7 +590,7 @@ private:
storage_proxy::clock_type::time_point timeout;
if (!t) {
auto timeout_in_ms = _sp._db.local().get_config().write_request_timeout_in_ms();
auto timeout_in_ms = _sp._timeout_config.write_timeout_in_ms();
timeout = clock_type::now() + std::chrono::milliseconds(timeout_in_ms);
} else {
timeout = *t;
@@ -3321,7 +3321,8 @@ storage_proxy::~storage_proxy() {
}
storage_proxy::storage_proxy(sharded<replica::database>& db, storage_proxy::config cfg, db::view::node_update_backlog& max_view_update_backlog,
scheduling_group_key stats_key, gms::feature_service& feat, const locator::shared_token_metadata& stm, locator::effective_replication_map_factory& erm_factory)
scheduling_group_key stats_key, gms::feature_service& feat, const locator::shared_token_metadata& stm, locator::effective_replication_map_factory& erm_factory,
updateable_timeout_config& timeout_config)
: _db(db)
, _shared_token_metadata(stm)
, _erm_factory(erm_factory)
@@ -3341,6 +3342,7 @@ storage_proxy::storage_proxy(sharded<replica::database>& db, storage_proxy::conf
, _background_write_throttle_threahsold(cfg.available_memory / 10)
, _mutate_stage{"storage_proxy_mutate", &storage_proxy::do_mutate}
, _max_view_update_backlog(max_view_update_backlog)
, _timeout_config(timeout_config)
, _cancellable_write_handlers_list(std::make_unique<cancellable_write_handlers_list>())
{
namespace sm = seastar::metrics;
@@ -3970,7 +3972,7 @@ future<result<>> storage_proxy::mutate_begin(unique_response_handler_vector ids,
// frozen_mutation copy, or manage handler live time differently.
hint_to_dead_endpoints(response_id, cl);
auto timeout = timeout_opt.value_or(clock_type::now() + std::chrono::milliseconds(_db.local().get_config().write_request_timeout_in_ms()));
auto timeout = timeout_opt.value_or(clock_type::now() + std::chrono::milliseconds(_timeout_config.write_timeout_in_ms()));
// call before send_to_live_endpoints() for the same reason as above
auto f = response_wait(response_id, timeout);
send_to_live_endpoints(protected_response.release(), timeout); // response is now running and it will either complete or timeout
@@ -5942,7 +5944,7 @@ public:
// occur within write_timeout of a write, as these are the cases where repair is most
// beneficial.
if (is_datacenter_local(exec->_cl) && exec->_cmd->read_timestamp >= 0 && digest_resolver->last_modified() >= 0) {
auto write_timeout = exec->_proxy->_db.local().get_config().write_request_timeout_in_ms() * 1000;
auto write_timeout = exec->_proxy->_timeout_config.write_timeout_in_ms() * 1000;
auto delta = int64_t(digest_resolver->last_modified()) - int64_t(exec->_cmd->read_timestamp);
if (std::abs(delta) <= write_timeout) {
exec->_proxy->get_stats().global_read_repairs_canceled_due_to_concurrent_write++;
@@ -6066,7 +6068,7 @@ public:
});
auto& sr = _schema->speculative_retry();
auto t = (sr.get_type() == speculative_retry::type::PERCENTILE) ?
std::min(_cf->get_coordinator_read_latency_percentile(sr.get_value()), std::chrono::milliseconds(_proxy->get_db().local().get_config().read_request_timeout_in_ms()/2)) :
std::min(_cf->get_coordinator_read_latency_percentile(sr.get_value()), std::chrono::milliseconds(_proxy->_timeout_config.read_timeout_in_ms()/2)) :
std::chrono::milliseconds(unsigned(sr.get_value()));
_speculate_timer.arm(t);
resolver->set_on_disconnect([this] {
@@ -6784,7 +6786,7 @@ storage_proxy::do_query_with_paxos(schema_ptr s,
db::timeout_clock::time_point timeout = query_options.timeout(*this);
// When to give up due to contention
db::timeout_clock::time_point cas_timeout = db::timeout_clock::now() +
std::chrono::milliseconds(_db.local().get_config().cas_contention_timeout_in_ms());
std::chrono::milliseconds(_timeout_config.cas_timeout_in_ms());
struct read_cas_request : public cas_request {
foreign_ptr<lw_shared_ptr<query::result>> res;

View File

@@ -41,6 +41,7 @@
#include "service/storage_service.hh"
#include "service/cas_shard.hh"
#include "service/maintenance_mode.hh"
#include "timeout_config.hh"
#include "service/storage_proxy_fwd.hh"
class reconcilable_result;
@@ -319,6 +320,7 @@ private:
lw_shared_ptr<cdc::operation_result_tracker>,
coordinator_mutate_options> _mutate_stage;
db::view::node_update_backlog& _max_view_update_backlog;
updateable_timeout_config& _timeout_config;
std::unordered_map<locator::host_id, view_update_backlog_timestamped> _view_update_backlogs;
//NOTICE(sarna): This opaque pointer is here just to avoid moving write handler class definitions from .cc to .hh. It's slow path.
@@ -528,7 +530,7 @@ private:
public:
storage_proxy(sharded<replica::database>& db, config cfg, db::view::node_update_backlog& max_view_update_backlog,
scheduling_group_key stats_key, gms::feature_service& feat, const locator::shared_token_metadata& stm,
locator::effective_replication_map_factory& erm_factory);
locator::effective_replication_map_factory& erm_factory, updateable_timeout_config& timeout_config);
~storage_proxy();
const sharded<replica::database>& get_db() const {

View File

@@ -806,7 +806,7 @@ future<> storage_service::view_building_state_load() {
};
auto vb_tasks = co_await _sys_ks.local().get_view_building_tasks();
auto [vb_tasks, min_alive_uuid] = co_await _sys_ks.local().get_view_building_tasks();
auto processing_base_table = co_await _sys_ks.local().get_view_building_processing_base_id();
std::map<table_id, std::vector<table_id>> views_per_base;
@@ -825,7 +825,7 @@ future<> storage_service::view_building_state_load() {
})
| std::ranges::to<db::view::views_state::view_build_status_map>();
db::view::view_building_state building_state {std::move(vb_tasks), std::move(processing_base_table)};
db::view::view_building_state building_state {std::move(vb_tasks), std::move(processing_base_table), std::move(min_alive_uuid)};
db::view::views_state views_state {std::move(views_per_base), std::move(status_map)};
_view_building_state_machine.building_state = std::move(building_state);
@@ -1374,16 +1374,19 @@ future<> storage_service::raft_initialize_discovery_leader(const join_node_reque
auto enable_features_mutation = builder.build();
insert_join_request_mutations.push_back(std::move(enable_features_mutation));
auto sl_status_mutation = co_await _sys_ks.local().make_service_levels_version_mutation(2, write_timestamp);
auto skip_service_levels_v2_initialization = utils::get_local_injector().enter("skip_service_levels_v2_initialization");
auto sl_status_mutation = co_await _sys_ks.local().make_service_levels_version_mutation(skip_service_levels_v2_initialization ? 1 : 2, write_timestamp);
insert_join_request_mutations.emplace_back(std::move(sl_status_mutation));
insert_join_request_mutations.emplace_back(co_await _sys_ks.local().make_auth_version_mutation(write_timestamp, db::system_keyspace::auth_version_t::v2));
insert_join_request_mutations.emplace_back(co_await _sys_ks.local().make_view_builder_version_mutation(write_timestamp, db::system_keyspace::view_builder_version_t::v2));
auto sl_driver_mutations = co_await qos::service_level_controller::get_create_driver_service_level_mutations(_sys_ks.local(), write_timestamp);
for (auto& m : sl_driver_mutations) {
insert_join_request_mutations.emplace_back(m);
if (!skip_service_levels_v2_initialization) {
auto sl_driver_mutations = co_await qos::service_level_controller::get_create_driver_service_level_mutations(_sys_ks.local(), write_timestamp);
for (auto& m : sl_driver_mutations) {
insert_join_request_mutations.emplace_back(m);
}
}
topology_change change{std::move(insert_join_request_mutations)};
@@ -2731,13 +2734,23 @@ future<> storage_service::decommission(sharded<db::snapshot_ctl>& snapshot_ctl)
throw std::runtime_error(::format("Node in {} state; wait for status to become normal or restart", ss._operation_mode));
}
ss.raft_decommission().get();
// SCYLLADB-1693. In case we abort, the snapshot/backup mechanism needs
// to remain open, so move disabling it to after raft_decommission.
// In the case of a cluster snapshot, whether or not our node owns
// tables is serialized by raft anyway, so ownership should remain
// consistent; at worst we coordinate from a node in "leave" status.
// In the case of a local snapshot, ownership matters less; only
// sstables on disk matter, and those should not change.
// In the case of a backup, it operates on a snapshot, whose state
// is not affected.
snapshot_ctl.invoke_on_all([](auto& sctl) {
return sctl.disable_all_operations();
}).get();
slogger.info("DECOMMISSIONING: disabled backup and snapshots");
ss.raft_decommission().get();
ss.stop_transport().get();
slogger.info("DECOMMISSIONING: stopped transport");
@@ -4803,8 +4816,13 @@ future<raft_topology_cmd_result> storage_service::raft_topology_cmd_handler(raft
}
} catch (const raft::request_aborted& e) {
rtlogger.warn("raft_topology_cmd {} failed with: {}", cmd.cmd, e);
result.error_message = e.what();
} catch (const std::exception& e) {
rtlogger.error("raft_topology_cmd {} failed with: {}", cmd.cmd, e);
result.error_message = e.what();
} catch (...) {
rtlogger.error("raft_topology_cmd {} failed with: {}", cmd.cmd, std::current_exception());
result.error_message = "unknown error";
}
rtlogger.info("topology cmd rpc {} completed with status={} index={}",
@@ -5622,6 +5640,71 @@ future<> storage_service::del_tablet_replica(table_id table, dht::token token, l
});
}
future<> storage_service::restore_tablets(table_id table, sstring snap_name, sstring endpoint, sstring bucket) {
auto holder = _async_gate.hold();
if (this_shard_id() != 0) {
// group0 is only set on shard 0.
co_return co_await container().invoke_on(0, [&] (auto& ss) {
return ss.restore_tablets(table, snap_name, endpoint, bucket);
});
}
// Holding tm around transit_tablet() can lead to deadlock if the state machine
// is busy with something that executes a barrier. The barrier will wait for tm
// to die, and transit_tablet() will wait for the barrier to finish.
// Due to that, we first collect tablet boundaries, then prepare and submit
// transition mutations. Since this code is called with equal min:max tokens set
// for the table, the tablet map cannot split or merge, so the static vector of
// tokens maps to the correct tablet boundaries throughout the whole operation.
utils::chunked_vector<std::pair<locator::tablet_id, dht::token>> tablets;
{
const auto tm = get_token_metadata_ptr();
const auto& tmap = tm->tablets().get_tablet_map(table);
co_await tmap.for_each_tablet([&] (locator::tablet_id tid, const locator::tablet_info& info) {
auto last_token = tmap.get_last_token(tid);
tablets.push_back(std::make_pair(tid, last_token));
return make_ready_future<>();
});
}
auto wait_one_transition = [this] (locator::global_tablet_id gid) {
return _topology_state_machine.event.wait([this, gid] {
auto& tmap = get_token_metadata().tablets().get_tablet_map(gid.table);
return !tmap.get_tablet_transition_info(gid.tablet);
});
};
std::vector<future<>> wait;
co_await coroutine::parallel_for_each(tablets, [&] (const auto& tablet) -> future<> {
auto [ tid, last_token ] = tablet;
auto gid = locator::global_tablet_id{table, tid};
while (true) {
auto success = co_await try_transit_tablet(table, last_token, [&] (const locator::tablet_map& tmap, api::timestamp_type write_timestamp) {
utils::chunked_vector<canonical_mutation> updates;
updates.emplace_back(tablet_mutation_builder_for_base_table(write_timestamp, table)
.set_stage(last_token, locator::tablet_transition_stage::restore)
.set_new_replicas(last_token, tmap.get_tablet_info(tid).replicas)
.set_restore_config(last_token, locator::restore_config{ snap_name, endpoint, bucket })
.set_transition(last_token, locator::tablet_transition_kind::restore)
.build());
sstring reason = format("Restoring tablet {}", gid);
return std::make_tuple(std::move(updates), std::move(reason));
});
if (success) {
wait.emplace_back(wait_one_transition(gid));
break;
}
slogger.debug("Tablet is in transition, waiting");
co_await wait_one_transition(gid);
}
});
co_await when_all_succeed(wait.begin(), wait.end()).discard_result();
slogger.info("Restoring {} finished", table);
}
future<locator::load_stats> storage_service::load_stats_for_tablet_based_tables() {
auto holder = _async_gate.hold();
@@ -5704,6 +5787,21 @@ future<locator::load_stats> storage_service::load_stats_for_tablet_based_tables(
}
future<> storage_service::transit_tablet(table_id table, dht::token token, noncopyable_function<std::tuple<utils::chunked_vector<canonical_mutation>, sstring>(const locator::tablet_map&, api::timestamp_type)> prepare_mutations) {
auto success = co_await try_transit_tablet(table, token, std::move(prepare_mutations));
if (!success) {
auto& tmap = get_token_metadata().tablets().get_tablet_map(table);
auto tid = tmap.get_tablet_id(token);
throw std::runtime_error(fmt::format("Tablet {} is in transition", locator::global_tablet_id{table, tid}));
}
// Wait for transition to finish.
co_await _topology_state_machine.event.when([&] {
auto& tmap = get_token_metadata().tablets().get_tablet_map(table);
return !tmap.get_tablet_transition_info(tmap.get_tablet_id(token));
});
}
future<bool> storage_service::try_transit_tablet(table_id table, dht::token token, noncopyable_function<std::tuple<utils::chunked_vector<canonical_mutation>, sstring>(const locator::tablet_map&, api::timestamp_type)> prepare_mutations) {
while (true) {
auto guard = co_await _group0->client().start_operation(_group0_as, raft_timeout{});
bool topology_busy;
@@ -5723,7 +5821,7 @@ future<> storage_service::transit_tablet(table_id table, dht::token token, nonco
auto& tmap = get_token_metadata().tablets().get_tablet_map(table);
auto tid = tmap.get_tablet_id(token);
if (tmap.get_tablet_transition_info(tid)) {
throw std::runtime_error(fmt::format("Tablet {} is in transition", locator::global_tablet_id{table, tid}));
co_return false;
}
auto [ updates, reason ] = prepare_mutations(tmap, guard.write_timestamp());
@@ -5753,11 +5851,7 @@ future<> storage_service::transit_tablet(table_id table, dht::token token, nonco
}
}
// Wait for transition to finish.
co_await _topology_state_machine.event.when([&] {
auto& tmap = get_token_metadata().tablets().get_tablet_map(table);
return !tmap.get_tablet_transition_info(tmap.get_tablet_id(token));
});
co_return true;
}
future<> storage_service::set_tablet_balancing_enabled(bool enabled) {
@@ -6164,6 +6258,15 @@ node_state storage_service::get_node_state(locator::host_id id) {
return p->second.state;
}
void storage_service::check_raft_rpc(raft::server_id dst_id) {
if (!_group0 || !_group0->joined_group0()) {
throw std::runtime_error("The node did not join group 0 yet");
}
if (_group0->load_my_id() != dst_id) {
throw raft_destination_id_not_correct(_group0->load_my_id(), dst_id);
}
}
void storage_service::init_messaging_service() {
ser::node_ops_rpc_verbs::register_node_ops_cmd(&_messaging.local(), [this] (const rpc::client_info& cinfo, node_ops_cmd_request req) {
auto coordinator = cinfo.retrieve_auxiliary<gms::inet_address>("baddr");
@@ -6175,17 +6278,6 @@ void storage_service::init_messaging_service() {
return ss.node_ops_cmd_handler(coordinator, coordinator_host_id, std::move(req));
});
});
auto handle_raft_rpc = [this] (raft::server_id dst_id, auto handler) {
return container().invoke_on(0, [dst_id, handler = std::move(handler)] (auto& ss) mutable {
if (!ss._group0 || !ss._group0->joined_group0()) {
throw std::runtime_error("The node did not join group 0 yet");
}
if (ss._group0->load_my_id() != dst_id) {
throw raft_destination_id_not_correct(ss._group0->load_my_id(), dst_id);
}
return handler(ss);
});
};
ser::streaming_rpc_verbs::register_tablet_stream_files(&_messaging.local(),
[this] (const rpc::client_info& cinfo, streaming::stream_files_request req) -> future<streaming::stream_files_response> {
streaming::stream_files_response resp;
@@ -6197,13 +6289,13 @@ void storage_service::init_messaging_service() {
std::plus<size_t>());
co_return resp;
});
ser::storage_service_rpc_verbs::register_raft_topology_cmd(&_messaging.local(), [handle_raft_rpc] (raft::server_id dst_id, raft::term_t term, uint64_t cmd_index, raft_topology_cmd cmd) {
ser::storage_service_rpc_verbs::register_raft_topology_cmd(&_messaging.local(), [this] (raft::server_id dst_id, raft::term_t term, uint64_t cmd_index, raft_topology_cmd cmd) {
return handle_raft_rpc(dst_id, [cmd = std::move(cmd), term, cmd_index] (auto& ss) {
check_raft_rpc_scheduling_group(ss._db.local(), ss._feature_service, "raft_topology_cmd");
return ss.raft_topology_cmd_handler(term, cmd_index, cmd);
});
});
ser::storage_service_rpc_verbs::register_raft_pull_snapshot(&_messaging.local(), [handle_raft_rpc] (raft::server_id dst_id, raft_snapshot_pull_params params) {
ser::storage_service_rpc_verbs::register_raft_pull_snapshot(&_messaging.local(), [this] (raft::server_id dst_id, raft_snapshot_pull_params params) {
return handle_raft_rpc(dst_id, [params = std::move(params)] (storage_service& ss) -> future<raft_snapshot> {
check_raft_rpc_scheduling_group(ss._db.local(), ss._feature_service, "raft_pull_snapshot");
utils::chunked_vector<canonical_mutation> mutations;
@@ -6298,28 +6390,28 @@ void storage_service::init_messaging_service() {
};
});
});
ser::storage_service_rpc_verbs::register_tablet_stream_data(&_messaging.local(), [handle_raft_rpc] (raft::server_id dst_id, locator::global_tablet_id tablet) {
ser::storage_service_rpc_verbs::register_tablet_stream_data(&_messaging.local(), [this] (raft::server_id dst_id, locator::global_tablet_id tablet) {
return handle_raft_rpc(dst_id, [tablet] (auto& ss) {
return ss.stream_tablet(tablet);
});
});
ser::storage_service_rpc_verbs::register_tablet_repair(&_messaging.local(), [handle_raft_rpc] (raft::server_id dst_id, locator::global_tablet_id tablet, rpc::optional<service::session_id> session_id) {
ser::storage_service_rpc_verbs::register_tablet_repair(&_messaging.local(), [this] (raft::server_id dst_id, locator::global_tablet_id tablet, rpc::optional<service::session_id> session_id) {
return handle_raft_rpc(dst_id, [tablet, session_id = session_id.value_or(service::session_id::create_null_id())] (auto& ss) -> future<service::tablet_operation_repair_result> {
auto res = co_await ss.repair_tablet(tablet, session_id);
co_return res;
});
});
ser::storage_service_rpc_verbs::register_tablet_cleanup(&_messaging.local(), [handle_raft_rpc] (raft::server_id dst_id, locator::global_tablet_id tablet) {
ser::storage_service_rpc_verbs::register_tablet_cleanup(&_messaging.local(), [this] (raft::server_id dst_id, locator::global_tablet_id tablet) {
return handle_raft_rpc(dst_id, [tablet] (auto& ss) {
return ss.cleanup_tablet(tablet);
});
});
ser::storage_service_rpc_verbs::register_table_load_stats(&_messaging.local(), [handle_raft_rpc] (raft::server_id dst_id) {
ser::storage_service_rpc_verbs::register_table_load_stats(&_messaging.local(), [this] (raft::server_id dst_id) {
return handle_raft_rpc(dst_id, [] (auto& ss) mutable {
return ss.load_stats_for_tablet_based_tables();
});
});
ser::storage_service_rpc_verbs::register_table_load_stats_v1(&_messaging.local(), [handle_raft_rpc] (raft::server_id dst_id) {
ser::storage_service_rpc_verbs::register_table_load_stats_v1(&_messaging.local(), [this] (raft::server_id dst_id) {
return handle_raft_rpc(dst_id, [] (auto& ss) mutable {
return ss.load_stats_for_tablet_based_tables().then([] (auto stats) {
return locator::load_stats_v1{ .tables = std::move(stats.tables) };
@@ -6340,7 +6432,7 @@ void storage_service::init_messaging_service() {
ser::storage_service_rpc_verbs::register_sample_sstables(&_messaging.local(), [this] (table_id table, uint64_t chunk_size, uint64_t n_chunks) -> future<utils::chunked_vector<temporary_buffer<char>>> {
return _db.local().sample_data_files(table, chunk_size, n_chunks);
});
ser::join_node_rpc_verbs::register_join_node_request(&_messaging.local(), [handle_raft_rpc] (raft::server_id dst_id, service::join_node_request_params params) {
ser::join_node_rpc_verbs::register_join_node_request(&_messaging.local(), [this] (raft::server_id dst_id, service::join_node_request_params params) {
return handle_raft_rpc(dst_id, [params = std::move(params)] (auto& ss) mutable {
check_raft_rpc_scheduling_group(ss._db.local(), ss._feature_service, "join_node_request");
return ss.join_node_request_handler(std::move(params));
@@ -6356,7 +6448,7 @@ void storage_service::init_messaging_service() {
co_return co_await ss.join_node_response_handler(std::move(params));
});
});
ser::join_node_rpc_verbs::register_join_node_query(&_messaging.local(), [handle_raft_rpc] (raft::server_id dst_id, service::join_node_query_params) {
ser::join_node_rpc_verbs::register_join_node_query(&_messaging.local(), [this] (raft::server_id dst_id, service::join_node_query_params) {
return handle_raft_rpc(dst_id, [] (auto& ss) -> future<join_node_query_result> {
check_raft_rpc_scheduling_group(ss._db.local(), ss._feature_service, "join_node_query");
auto result = join_node_query_result{

View File

@@ -230,9 +230,6 @@ private:
shared_ptr<service::topo::task_manager_module> _global_topology_requests_module;
shared_ptr<service::vnodes_to_tablets::task_manager_module> _vnodes_to_tablets_migration_module;
gms::gossip_address_map& _address_map;
future<service::tablet_operation_result> do_tablet_operation(locator::global_tablet_id tablet,
sstring op_name,
std::function<future<service::tablet_operation_result>(locator::tablet_metadata_guard&)> op);
future<service::tablet_operation_repair_result> repair_tablet(locator::global_tablet_id, service::session_id);
future<> stream_tablet(locator::global_tablet_id);
// Clones storage of leaving tablet into pending one. Done in the context of intra-node migration,
@@ -244,7 +241,20 @@ private:
future<> process_tablet_split_candidate(table_id) noexcept;
void register_tablet_split_candidate(table_id) noexcept;
future<> run_tablet_split_monitor();
void check_raft_rpc(raft::server_id dst);
public:
future<service::tablet_operation_result> do_tablet_operation(locator::global_tablet_id tablet,
sstring op_name,
std::function<future<service::tablet_operation_result>(locator::tablet_metadata_guard&)> op);
template <typename Func>
auto handle_raft_rpc(raft::server_id dst_id, Func&& handler) {
return container().invoke_on(0, [dst_id, handler = std::forward<Func>(handler)] (auto& ss) mutable {
ss.check_raft_rpc(dst_id);
return handler(ss);
});
};
storage_service(abort_source& as, sharded<replica::database>& db,
gms::gossiper& gossiper,
sharded<db::system_keyspace>&,
@@ -951,6 +961,7 @@ private:
future<> _upgrade_to_topology_coordinator_fiber = make_ready_future<>();
future<> transit_tablet(table_id, dht::token, noncopyable_function<std::tuple<utils::chunked_vector<canonical_mutation>, sstring>(const locator::tablet_map& tmap, api::timestamp_type)> prepare_mutations);
future<bool> try_transit_tablet(table_id, dht::token, noncopyable_function<std::tuple<utils::chunked_vector<canonical_mutation>, sstring>(const locator::tablet_map& tmap, api::timestamp_type)> prepare_mutations);
future<service::group0_guard> get_guard_for_tablet_update();
future<bool> exec_tablet_update(service::group0_guard guard, utils::chunked_vector<canonical_mutation> updates, sstring reason);
public:
@@ -960,6 +971,7 @@ public:
future<> move_tablet(table_id, dht::token, locator::tablet_replica src, locator::tablet_replica dst, loosen_constraints force = loosen_constraints::no);
future<> add_tablet_replica(table_id, dht::token, locator::tablet_replica dst, loosen_constraints force = loosen_constraints::no);
future<> del_tablet_replica(table_id, dht::token, locator::tablet_replica dst, loosen_constraints force = loosen_constraints::no);
future<> restore_tablets(table_id, sstring snap_name, sstring endpoint, sstring bucket);
future<> set_tablet_balancing_enabled(bool);
future<> await_topology_quiesced();

View File

@@ -19,10 +19,10 @@
#include "idl/strong_consistency/state_machine.dist.hh"
#include "idl/strong_consistency/state_machine.dist.impl.hh"
#include "gms/gossiper.hh"
#include "utils/histogram_metrics_helper.hh"
namespace service::strong_consistency {
static logging::logger logger("sc_coordinator");
// FIXME: Once the drivers support new error codes corresponding
@@ -49,6 +49,68 @@ struct read_timeout : public exceptions::read_timeout_exception {
{}
};
void stats::register_stats() {
namespace sm = seastar::metrics;
sm::label reason_label("reason");
_metrics.add_group("strong_consistency_coordinator", {
sm::make_summary("write_latency_summary", sm::description("Strong consistency write latency summary"),
[this] { return to_metrics_summary(write.summary()); }).set_skip_when_empty(),
sm::make_histogram("write_latency", sm::description("Strong consistency write latency histogram"),
{}, [this] { return to_metrics_histogram(write.histogram()); })
.aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
sm::make_counter("write_errors", write_errors_timeout,
sm::description("number of strong consistency write requests that failed"),
{reason_label("timeout")})
.set_skip_when_empty(),
sm::make_counter("write_errors", write_errors_status_unknown,
sm::description("number of strong consistency write requests that failed"),
{reason_label("status_unknown")})
.set_skip_when_empty(),
sm::make_counter("write_errors", write_errors_other,
sm::description("number of strong consistency write requests that failed"),
{reason_label("other")})
.set_skip_when_empty(),
sm::make_counter("write_node_bounces", write_node_bounces,
sm::description("number of strong consistency write requests bounced to another node"))
.set_skip_when_empty(),
sm::make_counter("write_shard_bounces", write_shard_bounces,
sm::description("number of strong consistency write requests bounced to another shard"))
.set_skip_when_empty(),
sm::make_summary("read_latency_summary", sm::description("Strong consistency read latency summary"),
[this] { return to_metrics_summary(read.summary()); }).set_skip_when_empty(),
sm::make_histogram("read_latency", sm::description("Strong consistency read latency histogram"),
{}, [this] { return to_metrics_histogram(read.histogram()); })
.aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
sm::make_counter("read_errors", read_errors_timeout,
sm::description("number of strong consistency read requests that failed"),
{reason_label("timeout")})
.set_skip_when_empty(),
sm::make_counter("read_errors", read_errors_other,
sm::description("number of strong consistency read requests that failed"),
{reason_label("other")})
.set_skip_when_empty(),
sm::make_counter("read_node_bounces", read_node_bounces,
sm::description("number of strong consistency read requests bounced to another node"))
.set_skip_when_empty(),
sm::make_counter("read_shard_bounces", read_shard_bounces,
sm::description("number of strong consistency read requests bounced to another shard"))
.set_skip_when_empty(),
});
}
static const locator::tablet_replica* find_replica(const locator::tablet_info& tinfo, locator::host_id id) {
const auto it = std::ranges::find_if(tinfo.replicas,
[&] (const locator::tablet_replica& r) {
@@ -170,6 +232,7 @@ coordinator::coordinator(groups_manager& groups_manager, replica::database& db,
, _db(db)
, _gossiper(gossiper)
{
_stats.register_stats();
}
future<value_or_redirect<>> coordinator::mutate(schema_ptr schema,
@@ -181,6 +244,11 @@ future<value_or_redirect<>> coordinator::mutate(schema_ptr schema,
auto aoe = abort_on_expiry<timeout_clock>(timeout);
[[maybe_unused]] const auto subs = chain_abort_sources(aoe.abort_source(), as);
utils::latency_counter lc;
lc.start();
auto mark_write_latency = defer([this, &lc] { _stats.write.mark(lc.stop().latency()); });
bool commit_status_unknown_ex = false;
try {
auto op_result = co_await create_operation_ctx(*schema, token, aoe.abort_source());
if (const auto* redirect = get_if<need_redirect>(&op_result)) {
@@ -245,7 +313,11 @@ future<value_or_redirect<>> coordinator::mutate(schema_ptr schema,
logger.debug("mutate(): add_entry, got commit_status_unknown {}, table {}.{}, tablet {}, term {}",
ex, schema->ks_name(), schema->cf_name(), op.tablet_id, term);
++_stats.write_errors_status_unknown;
// FIXME: use a dedicated ERROR_CODE instead of SERVER_ERROR
// FIXME: once a dedicated ERROR_CODE is used,
// we can get rid of the boolean flag
commit_status_unknown_ex = true;
throw exceptions::server_exception(
"The outcome of this statement is unknown. It may or may not have been applied. "
"Retrying the statement may be necessary.");
@@ -271,8 +343,12 @@ future<value_or_redirect<>> coordinator::mutate(schema_ptr schema,
|| try_catch<seastar::timed_out_error>(ex) || try_catch<seastar::condition_variable_timed_out>(ex)) {
logger.trace("mutate(): request timed out with error {}, table {}.{}, token {}",
ex, schema->ks_name(), schema->cf_name(), token);
++_stats.write_errors_timeout;
co_return coroutine::return_exception(write_timeout(schema->ks_name(), schema->cf_name()));
} else {
if (!commit_status_unknown_ex) {
++_stats.write_errors_other;
}
logger.trace("mutate(): unknown exception {}, table {}.{}, token {}",
ex, schema->ks_name(), schema->cf_name(), token);
// We know nothing about other errors. Let the CQL server convert them to SERVER_ERROR.
@@ -292,6 +368,10 @@ auto coordinator::query(schema_ptr schema,
auto aoe = abort_on_expiry<timeout_clock>(timeout);
[[maybe_unused]] const auto subs = chain_abort_sources(aoe.abort_source(), as);
utils::latency_counter lc;
lc.start();
auto mark_read_latency = defer([this, &lc] { _stats.read.mark(lc.stop().latency()); });
try {
auto op_result = co_await create_operation_ctx(*schema, ranges[0].start()->value().token(), aoe.abort_source());
if (const auto* redirect = get_if<need_redirect>(&op_result)) {
@@ -323,10 +403,12 @@ auto coordinator::query(schema_ptr schema,
|| try_catch<timed_out_error>(ex)) {
logger.trace("query(): request timed out with error {}, table {}.{}, read cmd {}",
ex, schema->ks_name(), schema->cf_name(), cmd);
++_stats.read_errors_timeout;
co_return coroutine::return_exception(read_timeout(schema->ks_name(), schema->cf_name()));
} else {
logger.trace("query(): unknown exception {}, table {}.{}, read cmd {}",
ex, schema->ks_name(), schema->cf_name(), cmd);
++_stats.read_errors_other;
// We know nothing about other errors. Let the CQL server convert them to SERVER_ERROR.
throw;
}

View File

@@ -10,6 +10,8 @@
#include "mutation/mutation.hh"
#include "query/query-result.hh"
#include "utils/histogram.hh"
#include <seastar/core/metrics.hh>
namespace gms {
@@ -27,6 +29,25 @@ struct need_redirect {
template <typename T = std::monostate>
using value_or_redirect = std::variant<T, need_redirect>;
struct stats {
utils::timed_rate_moving_average_summary_and_histogram write;
uint64_t write_errors_timeout = 0;
uint64_t write_errors_status_unknown = 0;
uint64_t write_errors_other = 0;
uint64_t write_node_bounces = 0;
uint64_t write_shard_bounces = 0;
utils::timed_rate_moving_average_summary_and_histogram read;
uint64_t read_errors_timeout = 0;
uint64_t read_errors_other = 0;
uint64_t read_node_bounces = 0;
uint64_t read_shard_bounces = 0;
seastar::metrics::metric_groups _metrics;
void register_stats();
};
class coordinator : public peering_sharded_service<coordinator> {
public:
using timeout_clock = typename db::timeout_clock;
@@ -35,6 +56,7 @@ private:
groups_manager& _groups_manager;
replica::database& _db;
gms::gossiper& _gossiper;
stats _stats;
struct operation_ctx;
future<value_or_redirect<operation_ctx>> create_operation_ctx(const schema& schema,
@@ -43,6 +65,8 @@ private:
public:
coordinator(groups_manager& groups_manager, replica::database& db, gms::gossiper& gossiper);
stats& get_stats() { return _stats; }
using mutation_gen = noncopyable_function<mutation(api::timestamp_type)>;
future<value_or_redirect<>> mutate(schema_ptr schema,
const dht::token& token,

View File

@@ -57,6 +57,8 @@ void load_balancer_stats_manager::setup_metrics(const dc_name& dc, load_balancer
stats.migrations_skipped)(dc_lb),
sm::make_counter("cross_rack_collocations", sm::description("number of co-locating migrations which move replica across racks"),
stats.cross_rack_collocations)(dc_lb),
sm::make_counter("rebuilds_produced", sm::description("number of rebuilds produced by the load balancer"),
stats.rebuilds_produced)(dc_lb),
});
}
@@ -83,7 +85,9 @@ void load_balancer_stats_manager::setup_metrics(load_balancer_cluster_stats& sta
sm::make_counter("auto_repair_needs_repair_nr", sm::description("number of tablets with auto repair enabled that currently needs repair"),
stats.auto_repair_needs_repair_nr),
sm::make_counter("auto_repair_enabled_nr", sm::description("number of tablets with auto repair enabled"),
stats.auto_repair_enabled_nr)
stats.auto_repair_enabled_nr),
sm::make_counter("repairs_produced", sm::description("number of repairs produced by the load balancer"),
stats.repairs_produced),
});
}
@@ -1010,6 +1014,8 @@ private:
return true;
case tablet_transition_stage::repair:
return true;
case tablet_transition_stage::restore:
return false;
case tablet_transition_stage::end_repair:
return false;
case tablet_transition_stage::write_both_read_new:
@@ -1344,6 +1350,7 @@ public:
auto range = tmap.get_token_range(id);
auto last_token = tmap.get_last_token(id);
plans.push_back(repair_plan{gid, info, range, last_token, diff, is_user_reuqest});
++_stats.for_cluster().repairs_produced;
});
}
@@ -3955,6 +3962,10 @@ public:
_current_stats->migrations_produced++;
mark_as_scheduled(mig);
plan.add(std::move(mig));
if (kind == tablet_transition_kind::rebuild || kind == tablet_transition_kind::rebuild_v2) {
++_current_stats->rebuilds_produced;
}
} else {
// Shards are overloaded with streaming. Do not include the migration in the plan, but
// continue as if it was in the hope that we will find a migration which can be executed without
@@ -4252,10 +4263,10 @@ public:
}
}
// For size based balancing, only excluded nodes are allowed to have incomplete tablet stats
// Only excluded nodes are allowed to have incomplete tablet stats
for (auto& [host, node] : nodes) {
if (!_load_sketch->has_complete_data(host)) {
if (!_force_capacity_based_balancing && node.drained && node.node->is_excluded()) {
if (node.drained && node.node->is_excluded()) {
_load_sketch->ignore_incomplete_data(host);
} else {
lblogger.info("Cannot balance because node {} (or more) has incomplete tablet stats", host);

View File

@@ -48,6 +48,7 @@ struct load_balancer_dc_stats {
uint64_t stop_skip_limit = 0;
uint64_t stop_batch_size = 0;
uint64_t cross_rack_collocations = 0;
uint64_t rebuilds_produced = 0;
load_balancer_dc_stats operator-(const load_balancer_dc_stats& other) const {
return {
@@ -67,6 +68,7 @@ struct load_balancer_dc_stats {
stop_skip_limit - other.stop_skip_limit,
stop_batch_size - other.stop_batch_size,
cross_rack_collocations - other.cross_rack_collocations,
rebuilds_produced - other.rebuilds_produced,
};
}
};
@@ -94,6 +96,8 @@ struct load_balancer_cluster_stats {
uint64_t resizes_finalized = 0;
uint64_t auto_repair_needs_repair_nr = 0;
uint64_t auto_repair_enabled_nr = 0;
uint64_t repairs_produced = 0;
};
using dc_name = sstring;

View File

@@ -63,6 +63,7 @@
#include "utils/stall_free.hh"
#include "utils/to_string.hh"
#include "service/endpoint_lifecycle_subscriber.hh"
#include "sstables_loader.hh"
#include "idl/join_node.dist.hh"
#include "idl/storage_service.dist.hh"
@@ -72,6 +73,7 @@
#include "utils/updateable_value.hh"
#include "repair/repair.hh"
#include "idl/repair.dist.hh"
#include "idl/sstables_loader.dist.hh"
#include "service/topology_coordinator.hh"
@@ -443,8 +445,11 @@ class topology_coordinator : public endpoint_lifecycle_subscriber
co_await ser::storage_service_rpc_verbs::send_raft_topology_cmd(
&_messaging, to_host_id(id), id, _term, cmd_index, cmd);
if (result.status == raft_topology_cmd_result::command_status::fail) {
auto msg = result.error_message.empty()
? ::format("failed status returned from {}", id)
: ::format("failed status returned from {}: {}", id, result.error_message);
co_await coroutine::exception(std::make_exception_ptr(
std::runtime_error(::format("failed status returned from {}", id))));
std::runtime_error(std::move(msg))));
}
};
@@ -1553,6 +1558,7 @@ class topology_coordinator : public endpoint_lifecycle_subscriber
background_action_holder cleanup;
background_action_holder repair;
background_action_holder repair_update_compaction_ctrl;
background_action_holder restore;
std::unordered_map<locator::tablet_transition_stage, background_action_holder> barriers;
// Record the repair_time returned by the repair_tablet rpc call
db_clock::time_point repair_time;
@@ -2325,6 +2331,33 @@ class topology_coordinator : public endpoint_lifecycle_subscriber
}
}
break;
case locator::tablet_transition_stage::restore: {
if (!trinfo.restore_cfg.has_value()) {
on_internal_error(rtlogger, format("Cannot handle restore transition without config for tablet {}", gid));
}
if (action_failed(tablet_state.restore)) {
rtlogger.debug("Clearing restore transition for {} due to error", gid);
updates.emplace_back(get_mutation_builder().del_transition(last_token).del_restore_config(last_token).build());
break;
}
if (advance_in_background(gid, tablet_state.restore, "restore", [this, gid, &tmap] () -> future<> {
auto& tinfo = tmap.get_tablet_info(gid.tablet);
auto replicas = tinfo.replicas;
rtlogger.info("Restoring tablet={} on {}", gid, replicas);
co_await coroutine::parallel_for_each(replicas, [this, gid] (locator::tablet_replica r) -> future<> {
auto dst = raft::server_id(r.host.uuid());
if (!is_excluded(dst)) {
co_await ser::sstables_loader_rpc_verbs::send_restore_tablet(&_messaging, r.host, dst, gid);
rtlogger.debug("Tablet {} restored on {}", gid, r.host);
}
});
})) {
rtlogger.debug("Clearing restore transition for {}", gid);
updates.emplace_back(get_mutation_builder().del_transition(last_token).del_restore_config(last_token).build());
}
}
break;
case locator::tablet_transition_stage::end_repair: {
if (do_barrier()) {
if (tablet_state.session_id.uuid().is_null()) {
@@ -2511,6 +2544,8 @@ class topology_coordinator : public endpoint_lifecycle_subscriber
break;
case locator::tablet_transition_kind::repair:
[[fallthrough]];
case locator::tablet_transition_kind::restore:
[[fallthrough]];
case locator::tablet_transition_kind::intranode_migration:
break;
}
@@ -3811,6 +3846,9 @@ class topology_coordinator : public endpoint_lifecycle_subscriber
on_internal_error(rtlogger, ::format("Leaving node {} doesn't own tokens", node.id));
}
// Breakpoint before leaving; for testing decommission.
co_await utils::get_local_injector().inject("topology_coordinator_before_leave", utils::wait_for_message(std::chrono::minutes(2)));
auto validation_result = validate_removing_node(_db, to_host_id(node.id));
if (std::holds_alternative<node_validation_failure>(validation_result)) {
builder.with_node(node.id)
@@ -3909,10 +3947,15 @@ class topology_coordinator : public endpoint_lifecycle_subscriber
throw;
} catch (seastar::abort_requested_exception&) {
throw;
} catch (const std::exception& e) {
rtlogger.error("send_raft_topology_cmd(stream_ranges) failed with exception"
" (node state is rebuilding): {}", e);
rtbuilder.done(e.what());
retake = true;
} catch (...) {
rtlogger.error("send_raft_topology_cmd(stream_ranges) failed with exception"
" (node state is rebuilding): {}", std::current_exception());
rtbuilder.done("streaming failed");
rtbuilder.done("unknown error");
retake = true;
}
if (retake) {
@@ -4265,7 +4308,7 @@ future<std::optional<group0_guard>> topology_coordinator::maybe_migrate_system_t
// it's in `topology_coordinator::enable_features`, so topology_coordinator will re-run its loop
// and `maybe_migrate_system_tables` will be called.
if (_feature_service.driver_service_level) {
if (_feature_service.driver_service_level && !utils::get_local_injector().enter("skip_service_levels_v2_initialization")) {
const auto sl_driver_created = co_await _sys_ks.get_service_level_driver_created();
if (!sl_driver_created.value_or(false)) {
co_return co_await _sl_controller.migrate_to_driver_service_level(std::move(guard), _sys_ks);


@@ -318,6 +318,9 @@ struct raft_topology_cmd_result {
success
};
command_status status = command_status::fail;
// Carries the error description back to the topology coordinator
// when the command fails.
sstring error_message;
};
// This class is used in RPC's signatures to hold the topology_version of the caller.


@@ -115,7 +115,7 @@ public:
if (buf.size() != chunk_size) {
auto actual_end = _underlying_pos + buf.size();
if (chunk_index + 1 < _checksum.checksums.size()) {
throw malformed_sstable_exception(seastar::format("Checksummed reader hit premature end-of-file at file offset {}: expected {} chunks of size {} but data file has {}",
throw_malformed_sstable_exception(seastar::format("Checksummed reader hit premature end-of-file at file offset {}: expected {} chunks of size {} but data file has {}",
actual_end, _checksum.checksums.size(), chunk_size, chunk_index + 1));
} else if (actual_end < _file_len) {
// Truncation on last chunk. Update _end_pos so that future
@@ -124,7 +124,7 @@ public:
}
}
if (chunk_index >= _checksum.checksums.size()) {
throw malformed_sstable_exception(seastar::format("Chunk count mismatch between CRC and Data.db: expected {} but data file has more", _checksum.checksums.size()));
throw_malformed_sstable_exception(seastar::format("Chunk count mismatch between CRC and Data.db: expected {} but data file has more", _checksum.checksums.size()));
}
auto expected_checksum = _checksum.checksums[chunk_index];
auto actual_checksum = ChecksumType::checksum(buf.get(), buf.size());
@@ -231,7 +231,7 @@ input_stream<char> make_checksummed_file_m_format_input_stream(
}
void throwing_integrity_error_handler(sstring msg) {
throw sstables::malformed_sstable_exception(msg);
throw_malformed_sstable_exception(msg);
};
}


@@ -158,7 +158,7 @@ void compression::segmented_offsets::state::update_position_trackers(std::size_t
void compression::segmented_offsets::init(uint32_t chunk_size) {
if (chunk_size == 0) {
throw sstables::malformed_sstable_exception("Segmented offsets chunk size is zero.");
throw_malformed_sstable_exception("Segmented offsets chunk size is zero.");
}
_chunk_size = chunk_size;
@@ -373,11 +373,11 @@ public:
throw std::runtime_error(format("compressed reader not aligned to chunk boundary: pos={} offset={}", _pos, addr.offset));
}
if (!addr.chunk_len) {
throw sstables::malformed_sstable_exception(format("compressed chunk_len must be greater than zero, chunk_start={}", addr.chunk_start));
sstables::throw_malformed_sstable_exception(format("compressed chunk_len must be greater than zero, chunk_start={}", addr.chunk_start));
}
auto buf = co_await _input_stream->read_exactly(addr.chunk_len);
if (buf.size() != addr.chunk_len) {
throw sstables::malformed_sstable_exception(format("compressed reader hit premature end-of-file at file offset {}, expected chunk_len={}, actual={}", _underlying_pos, addr.chunk_len, buf.size()));
sstables::throw_malformed_sstable_exception(format("compressed reader hit premature end-of-file at file offset {}, expected chunk_len={}, actual={}", _underlying_pos, addr.chunk_len, buf.size()));
}
auto res_units = co_await _permit.request_memory(_compression_metadata->uncompressed_chunk_length());
// The last 4 bytes of the chunk are the adler32/crc32 checksum
@@ -388,7 +388,7 @@ public:
auto expected_checksum = read_be<uint32_t>(buf.get() + compressed_len);
auto actual_checksum = ChecksumType::checksum(buf.get(), compressed_len);
if (expected_checksum != actual_checksum) {
throw sstables::malformed_sstable_exception(format("compressed chunk of size {} at file offset {} failed checksum, expected={}, actual={}", addr.chunk_len, _underlying_pos, expected_checksum, actual_checksum));
sstables::throw_malformed_sstable_exception(format("compressed chunk of size {} at file offset {} failed checksum, expected={}, actual={}", addr.chunk_len, _underlying_pos, expected_checksum, actual_checksum));
}
if constexpr (check_digest) {
@@ -420,7 +420,7 @@ public:
if (_digests.can_calculate_digest
&& _pos == _compression_metadata->uncompressed_file_length()
&& _digests.expected_digest != _digests.actual_digest) {
throw sstables::malformed_sstable_exception(seastar::format("Digest mismatch: expected={}, actual={}", _digests.expected_digest, _digests.actual_digest));
sstables::throw_malformed_sstable_exception(seastar::format("Digest mismatch: expected={}, actual={}", _digests.expected_digest, _digests.actual_digest));
}
}
co_return make_tracked_temporary_buffer(std::move(out), std::move(res_units));
@@ -511,20 +511,20 @@ public:
auto chunk_len = get_chunk_len(_current_chunk_index);
if (!chunk_len) {
throw sstables::malformed_sstable_exception(format("compressed raw reader chunk_len must be greater than zero, pos={}", _pos));
sstables::throw_malformed_sstable_exception(format("compressed raw reader chunk_len must be greater than zero, pos={}", _pos));
}
auto res_units = co_await _permit.request_memory(chunk_len);
auto buf = co_await _input_stream->read_exactly(chunk_len);
if (buf.size() != chunk_len) {
throw sstables::malformed_sstable_exception(format("compressed raw reader hit premature end-of-file at file offset {}, expected chunk_len={}, actual={}", _pos, chunk_len, buf.size()));
sstables::throw_malformed_sstable_exception(format("compressed raw reader hit premature end-of-file at file offset {}, expected chunk_len={}, actual={}", _pos, chunk_len, buf.size()));
}
auto compressed_len = chunk_len - 4;
auto expected_checksum = read_be<uint32_t>(buf.get() + compressed_len);
auto actual_checksum = crc32_utils::checksum(buf.get(), compressed_len);
if (expected_checksum != actual_checksum) {
throw sstables::malformed_sstable_exception(format("compressed chunk of size {} at file offset {} failed checksum, expected={}, actual={}", chunk_len, _pos, expected_checksum, actual_checksum));
sstables::throw_malformed_sstable_exception(format("compressed chunk of size {} at file offset {} failed checksum, expected={}, actual={}", chunk_len, _pos, expected_checksum, actual_checksum));
}
if constexpr (check_digest) {
@@ -543,7 +543,7 @@ public:
if (_digests.can_calculate_digest
&& _current_chunk_index == _compression_metadata->offsets.size()
&& _digests.expected_digest != _digests.actual_digest) {
throw sstables::malformed_sstable_exception(seastar::format("Digest mismatch: expected={}, actual={}", _digests.expected_digest, _digests.actual_digest));
sstables::throw_malformed_sstable_exception(seastar::format("Digest mismatch: expected={}, actual={}", _digests.expected_digest, _digests.actual_digest));
}
}


@@ -363,7 +363,7 @@ static std::optional<std::vector<std::byte>> dict_from_options(const sstables::c
auto i = std::stoi(k_str.substr(DICTIONARY_OPTION.size()));
parts.emplace(i, v.value);
} catch (const std::exception& e) {
throw sstables::malformed_sstable_exception(fmt::format("Corrupted dictionary option: {}", k_str));
sstables::throw_malformed_sstable_exception(fmt::format("Corrupted dictionary option: {}", k_str));
}
}
auto v_str = sstring(v.value.begin(), v.value.end());
@@ -372,7 +372,7 @@ static std::optional<std::vector<std::byte>> dict_from_options(const sstables::c
int i = 0;
for (const auto& [k, v] : parts) {
if (k != i) {
throw sstables::malformed_sstable_exception(fmt::format("Missing dictionary part: expected {}, got {}", i, k));
sstables::throw_malformed_sstable_exception(fmt::format("Missing dictionary part: expected {}, got {}", i, k));
}
++i;
auto s = std::as_bytes(std::span(v));


@@ -48,4 +48,30 @@ struct bufsize_mismatch_exception : malformed_sstable_exception {
{}
};
// Controls whether malformed sstable errors abort the process (generating a coredump) or throw an
// exception. Aborting is useful when the malformed sstable error is caused by memory corruption
// rather than actual sstable corruption, as it allows post-mortem analysis of the coredump.
// Controlled by the --abort-on-malformed-sstable-error command-line option.
// Returns the previous value of the flag.
bool set_abort_on_malformed_sstable_error(bool value) noexcept;
bool abort_on_malformed_sstable_error() noexcept;
// Use these helpers instead of directly throwing malformed_sstable_exception or
// bufsize_mismatch_exception. They check the abort_on_malformed_sstable_error flag and either
// abort the process (with logging) or throw the appropriate exception.
[[noreturn]] void throw_malformed_sstable_exception(sstring msg);
[[noreturn]] void throw_malformed_sstable_exception(sstring msg, component_name filename);
[[noreturn]] void throw_bufsize_mismatch_exception(size_t size, size_t expected);
// Disables aborting on malformed sstable errors for a scope.
//
// Intended for tests which intentionally corrupt sstables and expect
// malformed_sstable_exception to be thrown rather than the process aborting.
class scoped_no_abort_on_malformed_sstable_error {
bool _prev;
public:
scoped_no_abort_on_malformed_sstable_error() noexcept;
~scoped_no_abort_on_malformed_sstable_error();
};
}


@@ -191,10 +191,10 @@ private:
public:
void verify_end_state() const {
if (this->_remain > 0) {
throw malformed_sstable_exception(fmt::format("index_consume_entry_context (state={}): parsing ended but there is unconsumed data", _state), _sst.index_filename());
throw_malformed_sstable_exception(fmt::format("index_consume_entry_context (state={}): parsing ended but there is unconsumed data", _state), _sst.index_filename());
}
if (_state != state::KEY_SIZE && _state != state::START) {
throw malformed_sstable_exception(fmt::format("index_consume_entry_context (state={}): cannot finish parsing current entry, no more data", _state), _sst.index_filename());
throw_malformed_sstable_exception(fmt::format("index_consume_entry_context (state={}): cannot finish parsing current entry, no more data", _state), _sst.index_filename());
}
}
@@ -544,7 +544,7 @@ private:
bound.current_index_idx = 0;
bound.current_pi_idx = 0;
if (bound.current_list->empty()) {
throw malformed_sstable_exception(format("missing index entry for summary index {} (bound {})", summary_idx, fmt::ptr(&bound)), _sstable->index_filename());
throw_malformed_sstable_exception(format("missing index entry for summary index {} (bound {})", summary_idx, fmt::ptr(&bound)), _sstable->index_filename());
}
bound.data_file_position = bound.current_list->_entries[0].position();
bound.element = indexable_element::partition;


@@ -176,7 +176,7 @@ public:
} else if (clustering.size() == (expected_normal + 1)) {
return true;
}
throw malformed_sstable_exception(format("Found {:d} clustering elements in column name. Was not expecting that!", clustering.size()));
throw_malformed_sstable_exception(format("Found {:d} clustering elements in column name. Was not expecting that!", clustering.size()));
}
static bool check_static(const schema& schema, bytes_view col) {
@@ -210,12 +210,12 @@ public:
if (is_static) {
for (auto& e: clustering) {
if (e.size() != 0) {
throw malformed_sstable_exception("Static row has clustering key information. I didn't expect that!");
throw_malformed_sstable_exception("Static row has clustering key information. I didn't expect that!");
}
}
}
if (is_present && is_static != cdef->is_static()) {
throw malformed_sstable_exception(seastar::format("Mismatch between {} cell and {} column definition",
throw_malformed_sstable_exception(seastar::format("Mismatch between {} cell and {} column definition",
is_static ? "static" : "non-static", cdef->is_static() ? "static" : "non-static"));
}
}
@@ -577,20 +577,20 @@ public:
[] (const collection_type_impl& ctype) -> const abstract_type& { return *ctype.value_comparator(); },
[&] (const user_type_impl& utype) -> const abstract_type& {
if (col.collection_extra_data.size() != sizeof(int16_t)) {
throw malformed_sstable_exception(format("wrong size of field index while reading UDT column: expected {}, got {}",
throw_malformed_sstable_exception(format("wrong size of field index while reading UDT column: expected {}, got {}",
sizeof(int16_t), col.collection_extra_data.size()));
}
auto field_idx = deserialize_field_index(col.collection_extra_data);
if (field_idx >= utype.size()) {
throw malformed_sstable_exception(format("field index too big while reading UDT column: type has {} fields, got {}",
throw_malformed_sstable_exception(format("field index too big while reading UDT column: type has {} fields, got {}",
utype.size(), field_idx));
}
return *utype.type(field_idx);
},
[] (const abstract_type& o) -> const abstract_type& {
throw malformed_sstable_exception(format("attempted to read multi-cell column, but expected type was {}", o.name()));
throw_malformed_sstable_exception(format("attempted to read multi-cell column, but expected type was {}", o.name()));
}
));
auto ac = make_atomic_cell(value_type,
@@ -708,7 +708,7 @@ public:
case composite::eoc::end:
return bound_kind::excl_start;
}
throw malformed_sstable_exception(format("Unexpected start composite marker {:d}", uint16_t(uint8_t(found))));
throw_malformed_sstable_exception(format("Unexpected start composite marker {:d}", uint16_t(uint8_t(found))));
}
static bound_kind end_marker_to_bound_kind(bytes_view component) {
@@ -723,7 +723,7 @@ public:
case composite::eoc::end:
return bound_kind::incl_end;
}
throw malformed_sstable_exception(format("Unexpected end composite marker {:d}", uint16_t(uint8_t(found))));
throw_malformed_sstable_exception(format("Unexpected end composite marker {:d}", uint16_t(uint8_t(found))));
}
// Consume one range tombstone.
@@ -1050,7 +1050,7 @@ private:
} else {
// FIXME: see ColumnSerializer.java:deserializeColumnBody
if ((mask & column_mask::counter_update) != column_mask::none) {
throw malformed_sstable_exception("FIXME COUNTER_UPDATE_MASK");
throw_malformed_sstable_exception("FIXME COUNTER_UPDATE_MASK");
}
_ttl = _expiration = 0;
_deleted = (mask & column_mask::deletion) != column_mask::none;
@@ -1062,7 +1062,7 @@ private:
mp_row_consumer_k_l::proceed ret;
if (_deleted) {
if (_val_fragmented.size_bytes() != 4) {
throw malformed_sstable_exception("deleted cell expects local_deletion_time value");
throw_malformed_sstable_exception("deleted cell expects local_deletion_time value");
}
_val = temporary_buffer<char>(4);
auto v = fragmented_temporary_buffer::view(_val_fragmented);
@@ -1110,7 +1110,7 @@ public:
return;
}
if (_state != state::ROW_START || data_consumer::primitive_consumer::active()) {
throw malformed_sstable_exception("end of input, but not end of row");
throw_malformed_sstable_exception("end of input, but not end of row");
}
}
@@ -1249,7 +1249,7 @@ private:
}
if (!_consumer.is_mutation_end()) {
throw malformed_sstable_exception(format("consumer not at partition boundary, position: {}",
throw_malformed_sstable_exception(format("consumer not at partition boundary, position: {}",
position_in_partition_view::printer(*_schema, _consumer.position())), _sst->get_filename());
}
@@ -1442,7 +1442,7 @@ public:
try {
f.get();
} catch(sstables::malformed_sstable_exception& e) {
throw sstables::malformed_sstable_exception(format("Failed to read partition from SSTable {} due to {}", _sst->get_filename(), e.what()));
throw_malformed_sstable_exception(format("Failed to read partition from SSTable {} due to {}", _sst->get_filename(), e.what()));
}
});
}


@@ -17,7 +17,7 @@ namespace sstables {
static void check_buf_size(temporary_buffer<char>& buf, size_t expected) {
if (buf.size() < expected) {
throw bufsize_mismatch_exception(buf.size(), expected);
throw_bufsize_mismatch_exception(buf.size(), expected);
}
}

Some files were not shown because too many files have changed in this diff.