Compare commits

353 Commits

Author SHA1 Message Date
Yaniv Michael Kaul
9197c9466c test: add cross-partition static column assertion to testStaticColumnsWithSecondaryIndex
Add a second partition (k=1) with a different static value (s=99) and
verify that a secondary index query returns the correct static column
values across partitions. This covers the gap identified in
dtest cql_static_columns_tests.py, allowing its removal.

Refs: SCYLLADB-1922
2026-05-11 18:32:24 +03:00
Botond Dénes
9b2dfab2e5 Merge 'Don't use database.get_config() to fetch calculate_view_update_throttling_delay option' from Pavel Emelyanov
This option is used in two places -- proxy and view-update-generator both need it to calculate the calculate_view_update_throttling_delay() value. This PR moves the option onto view_update_backlog top-level service, makes the calculating helper be method of that class and patches the callers to use it. This eliminates more places that abuse database as db::config accessor.

Code dependencies refactoring, not backporting

Closes scylladb/scylladb#29635

* github.com:scylladb/scylladb:
  view: Turn calculate_view_update_throttling_delay into node_update_backlog member
  view: Place view_flow_control_delay_limit_in_ms on node_update_backlog
  view: Add node_update_backlog reference to view_update_generator
2026-05-11 10:30:24 +03:00
Pavel Emelyanov
f39cbb1ec6 storage_proxy: Move maintenance_mode onto storage_proxy::config
Stop reading maintenance_mode through replica::database's db::config.
Add a properly typed maintenance_mode_enabled field to
storage_proxy::config, populate it in main.cc from cfg->maintenance_mode()
(same as messaging_service::config), and use a cached member in
storage_proxy instead of db.local().get_config().maintenance_mode().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29637
2026-05-11 10:11:20 +03:00
Yaniv Michael Kaul
631f1e1654 compaction: set_skip_when_empty() for validation_errors metric
Add .set_skip_when_empty() to compaction_manager::validation_errors.
This metric only increments when scrubbing encounters out-of-order or
invalid mutation fragments in SSTables, indicating data corruption.
It is almost always zero and creates unnecessary reporting overhead.

AI-Assisted: yes
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>

Closes scylladb/scylladb#29349
2026-05-11 09:12:40 +03:00
Yaniv Michael Kaul
b8a150e22c build: add -ftime-trace support for compilation profiling
Add a --time-trace flag to configure.py and a Scylla_TIME_TRACE CMake
option that enable Clang's -ftime-trace on all C++ compilations. When
enabled, each .o file produces a companion .json trace that can be
analyzed with ClangBuildAnalyzer or loaded in chrome://tracing to
identify slow headers and costly template instantiations.

This is the first step toward data-driven build speed improvements.

Refs #1

Usage:
  configure.py:  ./configure.py --time-trace --mode dev
  CMake:         cmake -DScylla_TIME_TRACE=ON -DCMAKE_BUILD_TYPE=Dev ..
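
A minimal sketch of how the resulting .json traces could be summarized,
assuming they follow the Chrome Trace Event layout (a top-level
"traceEvents" list whose complete events have "ph": "X" and a "dur" in
microseconds). The helper name slowest_events and the sample data are
hypothetical and are not part of this commit:

```python
from collections import Counter

def slowest_events(trace: dict, top: int = 3) -> list[tuple[str, int]]:
    """Sum 'dur' of complete ('X') events, keyed by event name plus detail."""
    totals = Counter()
    for ev in trace.get("traceEvents", []):
        if ev.get("ph") != "X":
            continue  # skip metadata and other non-duration events
        detail = ev.get("args", {}).get("detail", "")
        totals[f'{ev["name"]} {detail}'.strip()] += ev.get("dur", 0)
    return totals.most_common(top)

# Synthetic sample, standing in for json.load() of one trace file.
sample = {"traceEvents": [
    {"ph": "X", "name": "Source", "dur": 1200, "args": {"detail": "a.hh"}},
    {"ph": "X", "name": "Source", "dur": 300, "args": {"detail": "b.hh"}},
    {"ph": "X", "name": "Source", "dur": 800, "args": {"detail": "a.hh"}},
    {"ph": "M", "name": "process_name"},
]}
print(slowest_events(sample))
```

For real analysis, ClangBuildAnalyzer (mentioned above) is the more complete tool; this only illustrates the shape of the data.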

Closes scylladb/scylladb#29462
2026-05-11 08:55:33 +03:00
Dmitry Kropachev
85d0011b3c gitignore: add missing rust build artifacts
rust/**/target and Cargo.lock files under rust/inc/ and
rust/wasmtime_bindings/ were not ignored, nor was
test/resource/wasm/rust/target/.

Closes scylladb/scylladb#28943
2026-05-11 07:06:26 +03:00
Botond Dénes
3f72852d8c Merge 'Fix missing format string placeholders across the codebase (33 bugs across 14 modules)' from Yaniv Kaul
Fix 28 format string bugs plus 5 related format argument bugs across 14 modules
where `{}` placeholders were missing or arguments were wrong, causing arguments to
be silently dropped or misleading output from the `{fmt}` library.

Inspired by https://github.com/scylladb/scylladb/pull/29143 (which fixed a single
instance in `replica/table.cc`), a comprehensive audit of the entire codebase was
performed to find all similar issues.

- **Missing `{}` placeholder** (21 instances): format string simply lacks `{}` for a
  passed argument, e.g. `format("msg for table {}", group_id, table_id)` -- `group_id`
  is silently dropped
- **Spurious comma breaking C++ string literal concatenation** (2 instances): a comma
  after a string literal prevents adjacent-literal concatenation, turning the
  continuation into a format argument instead of part of the format string
- **Printf-style `%s` in fmtlib context** (4 instances): `%s` has no meaning in fmtlib
  and appears as literal text while the argument is silently ignored
- **Extra spurious argument** (1 instance): an extraneous `t.tomb()` argument inserted
  between correct arguments, causing wrong values in the wrong slots

- **Wrong variable in error message** (4 instances in `types/map.hh`): error messages
  for oversized map keys/values reported `map_size` (total entry count) instead of the
  actual `elem.first.size()` or `elem.second.size()` that exceeded the limit
- **Swapped argument order** (1 instance in `data_dictionary/data_dictionary.cc`):
  format string says `"Extraneous options for {type}: {values}"` but the values and
  type arguments were passed in reverse order

| Module | Bugs Fixed | Files |
|--------|:---------:|-------|
| `replica/` | 1 | `table.cc` |
| `service/` | 4 | `raft_group0.cc`, `storage_service.cc` |
| `db/` | 6 | `heat_load_balance.cc`, `commitlog_replayer.cc`, `view_update_generator.cc`, `view_building_worker.cc`, `row_locking.cc` |
| `cql3/` | 2 | `prepare_expr.cc`, `statement_restrictions.cc` |
| `transport/` | 4 | `event_notifier.cc` |
| `sstables/` | 3 | `partition_reversing_data_source.cc`, `reader.cc` |
| `alternator/` | 1 | `conditions.cc` |
| `cdc/` | 1 | `split.cc` |
| `raft/` | 1 | `server.cc` |
| `utils/` | 2 | `gcp/object_storage.cc`, `s3/client.cc` |
| `mutation/` | 1 | `mutation_partition.hh` |
| `ent/` | 2 | `kmip_host.cc`, `kms_host.cc` |
| `types/` | 4 | `map.hh` |
| `data_dictionary/` | 1 | `data_dictionary.cc` |

The `{fmt}` library's compile-time checker validates that each `{}` placeholder
references a valid argument, but does **not** verify the reverse -- that every
argument has a corresponding placeholder. Extra arguments are silently ignored
at both compile time and runtime.
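
As a rough analogue in Python (not the C++ `{fmt}` code touched by this series), `str.format` also ignores surplus positional arguments, and a small audit helper can flag the mismatch. The helper name `surplus_args` is hypothetical, and the sketch assumes only auto-numbered `{}` fields are used:

```python
from string import Formatter

def surplus_args(fmt: str, *args) -> int:
    """Return how many positional args have no auto-numbered {} placeholder."""
    # Formatter.parse yields (literal, field_name, spec, conversion) tuples;
    # field_name is "" for a bare {} and None for trailing literal text.
    placeholders = sum(1 for _, field, _, _ in Formatter().parse(fmt) if field == "")
    return max(0, len(args) - placeholders)

# The second argument is dropped without any error.
message = "msg for table {}".format("group-7", "table-3")
print(message)
print(surplus_args("msg for table {}", "group-7", "table-3"))
```

A check of this kind is one way an audit like the one described above could be automated; the actual audit method used for this PR is not stated here.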

Build verified with `dbuild ninja build/dev/scylla` -- compiles cleanly.

---

**Note:** Commits were amended to fix the author name from "Yaniv Michael Kaul" to "Yaniv Kaul".

Closes scylladb/scylladb#29448

* github.com:scylladb/scylladb:
  data_dictionary: fix swapped arguments in extraneous options error
  types: fix wrong variable in map key/value size error messages
  ent: fix missing format placeholders in encryption error/log messages
  mutation: fix spurious argument in shadowable_tombstone formatter
  utils: fix missing format placeholders in object storage log messages
  raft: fix missing format placeholder in server ostream operator
  cdc: fix missing format placeholder in error message
  alternator: fix missing format placeholder in error message
  sstables: fix missing format placeholders in error messages
  transport: fix printf-style format specifiers in fmtlib log calls
  cql3: fix missing format placeholders in error messages
  db: fix missing format placeholders in log and error messages
  service: fix missing format placeholders in log messages
  replica: fix missing format placeholder in cleanup log message
2026-05-11 07:04:42 +03:00
Yaron Kaikov
5694c93c12 build: add collect-dist target to organize build artifacts
Build artifacts are currently scattered across
build/dist/$mode/redhat/, tools/python3/build/, tools/cqlsh/build/,
etc. with unpredictable names. Add a new 'collect-dist' ninja target
that gathers all distributable artifacts into a well-known structure:

  build/$mode/dist/rpm/       -- all binary RPMs (no SRPMs)
  build/$mode/dist/deb/       -- all .deb packages
  build/$mode/dist/tar/       -- relocatable tarballs (already here)

The collection is done via a reusable 'collect_pkgs' ninja rule defined
directly in configure.py, which knows all the source paths. No external
script is needed.

Fixes: SCYLLADB-75

Closes scylladb/scylladb#29475
2026-05-11 06:54:29 +03:00
Michael Litvak
274024a76b configure.py: update compile_commands.json if stale
configure.py creates compile_commands.json in the root directory as a
symbolic link to the file in one of the build directories. If the file
already exists it does nothing.

However, the link may exist while its target does not. For example, if
the build directory is removed and the project is then built in a
different mode, the file remains as a stale symbolic link.

To address this, when the file exists check also if it's a valid
symbolic link. If not, then recreate it with a valid target.
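
A minimal sketch of that check, assuming a POSIX filesystem. The
function name refresh_link is hypothetical and this is not the actual
configure.py code:

```python
import os
import tempfile

def refresh_link(link_path: str, target: str) -> bool:
    """Point link_path at target if it is missing or dangling.

    Returns True when the link was (re)created.
    """
    # os.path.exists follows symlinks, so it is False for a dangling link;
    # os.path.lexists is True whenever the link entry itself is present.
    if os.path.lexists(link_path):
        if os.path.exists(link_path):
            return False      # present and resolvable: leave untouched
        os.unlink(link_path)  # stale link: remove before recreating
    os.symlink(target, link_path)
    return True

with tempfile.TemporaryDirectory() as d:
    old = os.path.join(d, "old.json")
    new = os.path.join(d, "new.json")
    link = os.path.join(d, "compile_commands.json")
    for path in (old, new):
        open(path, "w").close()
    os.symlink(old, link)
    os.remove(old)                    # the link is now stale
    print(refresh_link(link, new))    # the stale link gets replaced
```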

Closes scylladb/scylladb#29680
2026-05-10 22:17:16 +03:00
Piotr Szymaniak
459c1dc32f test/alternator: stop avoiding tablets in Streams tests
Alternator Streams now supports tablets, so stop skipping the TTL Streams test in tablet mode and stop forcing vnodes in the Streams audit test.

Refs SCYLLADB-463

Closes scylladb/scylladb#29697
2026-05-10 22:13:15 +03:00
Nadav Har'El
df8c9b17b8 Merge 'alternator: Graduate Alternator Streams from experimental' from Piotr Szymaniak
As a final step for https://scylladb.atlassian.net/browse/SCYLLADB-461 we need to graduate Alternator Streams from experimental.
So let's remove `--experimental-features=alternator-streams` and map the obsolete config string to `UNUSED` for backward compatibility. Also, remove the related gating of the feature.
Finally, stop providing the config flag in test configs.

Fixes SCYLLADB-1680
Fixes #16367

The documentation, tracked by https://scylladb.atlassian.net/browse/SCYLLADB-462, still remains to be done.

This PR needs to hit 2026.2, so (only) if it branches before the PR is merged to `master`, we'd need to backport.

Closes scylladb/scylladb#29604

* github.com:scylladb/scylladb:
  test: Stop providing alternator-streams experimental flag
  alternator: Graduate Alternator Streams from experimental
2026-05-10 22:10:03 +03:00
Nadav Har'El
34136d3bc2 Merge 'vector_search: test: migrate CQL tests for vector search from C++/Boost to pytest' from Karol Nowacki
Migrate vector search (ANN ordered select query) CQL tests from C++/Boost suite to pytest.

This migration includes:
- New pytest tests in `test/cqlpy/test_vector_search_with_vector_store_mock.py`
- VectorStoreMock server as pytest fixture to simulate vector store responses

The benefits of this migration are:
- Extended test coverage to verify CQL protocol serialization and driver
- Reduced overall test time (no compilation required for pytest)

Fixes SCYLLADB-695

No backport needed as this is a refactoring.

Closes scylladb/scylladb#29593

* github.com:scylladb/scylladb:
  vector_search: test: migrate paging warnings tests to Python
  vector_search: test: migrate local_vector_index to Python
  vector_search: test: migrate vector_index_with_additional_filtering_column to Python
  vector_search: test: migrate cql_error_contains_http_error_description to Python
  vector_search: test: migrate pk in restriction test to Python
2026-05-10 22:09:17 +03:00
Nadav Har'El
d4aa528834 Merge 'load_balancer: fix tablet allocator dropped table' from Ferenc Szili
- Handle dropped tables gracefully in the tablet load balancer's `get_schema_and_rs()` instead of aborting with `on_internal_error`
- The load balancer operates on a token metadata snapshot but accesses the live schema for table lookups. A DROP TABLE applied by another fiber between coroutine yield points can remove a table from the live schema while it still exists in the snapshot, causing an abort.

`get_schema_and_rs()` now returns `std::optional` and logs a debug-level message instead of aborting when a table is missing. All callers skip dropped tables:
- `make_sizing_plan`: skips to next table
- `make_resize_plan`: skips to next table (merge suppression is moot)
- `check_constraints`: returns `skip_info{}` with empty viable targets
- `get_rs`: returns `nullptr`, checked by `check_constraints`

The call chain is: `make_plan` → `make_internode_plan` → `check_constraints` → `get_rs` → `get_schema_and_rs`. The `make_internode_plan` coroutine has multiple `co_await` yield points (`maybe_yield`, `pick_candidate`) between building the candidate tablet list and checking replication constraints. A DROP TABLE schema mutation applied during any of these yields removes the table from `_db.get_tables_metadata()` while the candidate list still references it.

Added `test_load_balancing_with_dropped_table` which simulates the race by capturing a token metadata snapshot, dropping the table, then calling `balance_tablets` with the stale snapshot.
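
The skip-on-missing pattern can be sketched in Python as follows. This is an illustration only, not the C++ load balancer code: `get_schema`, `plan_tables` and the table ids are hypothetical stand-ins, with `None` playing the role of an empty `std::optional`:

```python
from typing import Optional

def get_schema(live_schema: dict[str, str], table_id: str) -> Optional[str]:
    # Return None for a table that was dropped after the snapshot was taken.
    return live_schema.get(table_id)

def plan_tables(snapshot_tables: list[str], live_schema: dict[str, str]) -> list[str]:
    planned = []
    for table_id in snapshot_tables:
        schema = get_schema(live_schema, table_id)
        if schema is None:
            continue  # dropped concurrently: skip instead of aborting
        planned.append(table_id)
    return planned

snapshot = ["t1", "t2", "t3"]                # tables seen in the stale snapshot
live = {"t1": "schema-1", "t3": "schema-3"}  # t2 was dropped afterwards
print(plan_tables(snapshot, live))
```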

Fixes: SCYLLADB-1664

This fix needs to be backported to versions: 2025.4, 2026.1

Closes scylladb/scylladb#29585

* github.com:scylladb/scylladb:
  test: verify load balancer handles dropped tables gracefully
  tablet_allocator: handle dropped tables gracefully in get_schema_and_rs
2026-05-10 22:07:51 +03:00
Nadav Har'El
63927e07ea Merge 'alternator/streams: keep disabled streams usable and purge on re-enable' from Piotr Szymaniak
When an Alternator stream is disabled, the data should continue to be accessible so that consumers can finish reading. When the stream is later re-enabled, a new StreamArn is produced and only then the old data is purged.

On disable, the existing CDC options (including preimage and postimage) are preserved so that DescribeStream can still report StreamViewType. All stream APIs continue to work on the disabled stream, with all shards reported as closed (EndingSequenceNumber set). No new CDC records are written; existing data expires via TTL after 24 hours.

On re-enable, the old CDC log table is dropped as a separate Raft group0 schema change and a fresh one is created with a new UUID, giving a new StreamArn. This is Alternator-specific — CQL CDC keeps reusing the log table. Re-enabling is the only way to immediately purge old stream data.

Old stream data is removed immediately upon re-enable (a discrepancy with DynamoDB, which keeps it readable for 24 hours through the old StreamArn).

Tests updated to cover the new disable and re-enable behavior.

Fixes #7239
Fixes SCYLLADB-523

Closes scylladb/scylladb#29413

* github.com:scylladb/scylladb:
  alternator/streams: remove dead next_iter in get_records
  test/alternator: fix stream wait timeouts to use wall-clock time
  docs/alternator: document stream disable/re-enable behavior
  alternator/streams: keep disabled streams usable and purge on re-enable
2026-05-10 22:04:35 +03:00
Nadav Har'El
e277f747bd Merge 'Make collection unfreezing more efficient' from Botond Dénes
Introduce `read_from_collection_cell_view()` which reads a `collection_mutation` directly from the IDL representation of a collection (`ser::collection_cell_view`). This cuts down the number of allocations required drastically compared to the current method of:

    IDL -> collection_mutation_description -> collection_mutation

Reduces the number of allocations to unfreeze a collection from O(collection_cell_count) to O(1) (actually, due to buffer fragmentation, it is O(collection_size)).
The new method is used when unfreezing frozen mutations and frozen mutation fragments. This is on the hot path: all writes with collections benefit.

Add a `--collection` flag to `perf-simple-query` to allow measuring the performance improvement of this PR.
With `dbuild -it -- build/release/scylla perf-simple-query --collection=16 -c1 -m2G --default-log-level=error --write`, the number of allocations drops from ~123 to 102, which is a significant reduction.

Refs: https://github.com/scylladb/scylladb/issues/3602 (solves one use-case out of the many listed therein)
Fixes: SCYLLADB-1046
Fixes: SCYLLADB-1077

Backport: this is an optimization so normally not a backport candidate, but we may have to backport to relieve certain customers

Closes scylladb/scylladb#29033

* github.com:scylladb/scylladb:
  test/perf/perf_simple_query: add --collection=N
  test/boost/frozen_mutation_test: add freeze/unfreeze test for large collections
  mutation/mutation_partition_view: use read_from_collection_cell_view() to read collections
  mutation/collection_mutation: introduce read_from_collection_cell_view()
  mutation/atomic_cell: atomic_cell_type: add write*() and *serialized_size()
  mutation/collection_mutation: generalize serialize_collection_mutation
  mutation/mutation_partition_view: avoid copying collection
  mutation/mutation_partition_view: accept collection_mutation in the consume API
  partition_builder: add move variant of accept_*_cell() collection overloads
2026-05-10 20:39:08 +03:00
Yaniv Kaul
a6cf45f9e2 data_dictionary: fix swapped arguments in extraneous options error
The format string says "Extraneous options for {type}: {values}"
but the arguments were passed in the wrong order (values first, type
second), producing misleading error messages like
"Extraneous options for bucket,endpoint: S3" instead of
"Extraneous options for S3: bucket,endpoint".

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:51:20 +03:00
Yaniv Kaul
a13da94308 types: fix wrong variable in map key/value size error messages
Four error messages for oversized map keys/values reported map_size
(the total number of entries) instead of the actual key or value size
that exceeded the limit. The condition checks elem.first.size() or
elem.second.size(), but the error message printed map_size. This
affects both the bytes and managed_bytes serialization overloads.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:51:20 +03:00
Yaniv Kaul
bf1d59ad95 ent: fix missing format placeholders in encryption error/log messages
Fix two format string bugs:

- kmip_host.cc: cmd_in was passed as an argument to a trace log but
  had no {} placeholder, so the command was silently dropped.
- kms_host.cc: the XML node name (what) was passed to the error
  message but had no {} placeholder, so the error never showed which
  XML node was missing.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:51:20 +03:00
Yaniv Kaul
a76774f8f9 mutation: fix spurious argument in shadowable_tombstone formatter
The formatter for shadowable_tombstone had a spurious t.tomb()
argument between the timestamp and deletion_time arguments. This
caused t.tomb() (the whole tombstone) to be formatted into the
deletion_time={} slot, while the actual deletion_time count was
silently dropped. Remove the extra argument.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:51:19 +03:00
Yaniv Kaul
700b0b4c28 utils: fix missing format placeholders in object storage log messages
Fix two format string bugs:

- gcp/object_storage.cc: _session_path was passed but the format
  string had empty parentheses () instead of ({}), so the session
  path was silently dropped from the debug output.
- s3/client.cc: part_number was passed as an argument but had no {}
  placeholder. The upload_id ended up in the etag slot and was
  silently dropped. Add {} for all three values.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:51:19 +03:00
Yaniv Kaul
358f6fba9f raft: fix missing format placeholder in server ostream operator
The FSM state was passed as an argument but the format string had
empty parentheses () instead of ({}), causing the FSM state to be
silently dropped from the output.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:51:19 +03:00
Yaniv Kaul
605455f82d cdc: fix missing format placeholder in error message
The collection type name was passed as an argument but the format
string only had a trailing colon without a {} placeholder, so the
type name was silently dropped from the error message.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:51:19 +03:00
Yaniv Kaul
0c88ff6a40 alternator: fix missing format placeholder in error message
The values count was passed as an argument but had no {} placeholder,
so it was silently dropped. The analogous BETWEEN check on the line
above correctly uses {} -- apply the same pattern here.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:51:19 +03:00
Yaniv Kaul
e29f59347b sstables: fix missing format placeholders in error messages
Fix three format string bugs:

- partition_reversing_data_source.cc: _row_start was passed as an
  argument but had no {} placeholder in the invariant error message.
  Add {} for all three values to show the full diagnostic.
- reader.cc: two "Invalid boundary type" error messages passed the
  type value as an argument but had no {} placeholder, so the actual
  invalid type was never shown.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:51:19 +03:00
Yaniv Kaul
413497c9ce transport: fix printf-style format specifiers in fmtlib log calls
Four logger calls used %s (printf-style) instead of {} (fmtlib-style),
causing __func__ to be silently ignored and the literal text "%s" to
appear in the log output. The same file already uses {} correctly in
the on_create_function and on_create_aggregate handlers.
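
A Python analogue of the symptom (not the C++ logger calls themselves):
in a brace-style formatter "%s" is ordinary text, so the argument never
appears. The handler name used below is hypothetical:

```python
func = "on_some_event"  # hypothetical handler name, stands in for __func__

wrong = "%s called".format(func)  # "%s" is literal text; func is ignored
right = "{} called".format(func)  # the brace placeholder consumes func

print(wrong)
print(right)
```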

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:51:19 +03:00
Yaniv Kaul
cfb568b5b5 cql3: fix missing format placeholders in error messages
Fix two format string bugs where arguments were silently dropped:

- prepare_expr.cc: the bad argument to count() was passed but had no
  {} placeholder, so users never saw what was actually passed.
- statement_restrictions.cc: the unsupported multi-column relation was
  passed but the trailing colon had no {} placeholder.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:51:19 +03:00
Yaniv Kaul
fdebed5746 db: fix missing format placeholders in log and error messages
Fix six format string bugs where arguments were silently dropped:

- heat_load_balance.cc: pp value was passed but had no {} placeholder.
- commitlog_replayer.cc: column_family_id was passed but table= had
  no {} placeholder.
- view_update_generator.cc: _sstables_with_tables.size() was passed
  but had no {} placeholder.
- view_building_worker.cc: exception pointer was passed but the
  trailing colon had no {} placeholder.
- row_locking.cc: partition key and clustering key were passed in
  error messages but had no {} placeholders.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:49:50 +03:00
Yaniv Kaul
4ee81f9b32 service: fix missing format placeholders in log messages
Fix four format string bugs:

- raft_group0.cc: the exception from sleep_and_abort was passed as an
  argument but had no {} placeholder, so it was silently dropped.
- storage_service.cc: loading topology trace was missing a placeholder
  for the cleanup field (9 args but only 8 placeholders).
- storage_service.cc: two join-rejection warnings had a spurious comma
  after the first string literal, breaking C++ string concatenation.
  This caused the continuation string to be treated as a separate
  format argument instead of being part of the format string, and
  params.host_id was silently dropped.
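
Python has the same adjacent-literal concatenation rule, so the
spurious-comma bug can be reproduced in a short sketch. This is an
analogue, not the storage_service.cc code, and the message text is
made up:

```python
host_id = "host-42"

# Intended: adjacent literals concatenate into one format string.
good = ("rejected join request from {}: "
        "node is banned").format(host_id)

# Bug pattern: the stray comma ends the first literal, so the second
# literal becomes the first format argument and host_id is never used.
fmt, *args = ("rejected join request from {}: ",
              "node is banned", host_id)
bad = fmt.format(*args)

print(good)
print(bad)
```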

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:49:50 +03:00
Yaniv Kaul
f75248a734 replica: fix missing format placeholder in cleanup log message
The log message for tablet cleanup invalidation was missing a {}
placeholder for the table name (cf_name), causing it to be silently
dropped from the output. Add {}.{} to show both keyspace and table
name, consistent with the convention used elsewhere in the file.

Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
2026-05-10 17:49:50 +03:00
Piotr Dulikowski
bc482bfdea database: add missing co_await on lock in create_local_system_table
The function database::create_local_system_table calls
get_tables_metadata().hold_write_lock(), but does not co_await the
returned future. Effectively, this code does not guarantee mutual
exclusion because it does not wait for the lock to be acquired and does
not guarantee that the lock is held long enough.

Fix this by adding the co_await that was missing.

Found by manual inspection. This code is not known to have caused any
problems so far, but it's clearly wrong - hence the fix.
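
A loose Python asyncio analogue of the bug class (Seastar futures
behave differently in detail, so this only illustrates the idea):
calling an async lock method without awaiting it does not give mutual
exclusion. The function names are hypothetical:

```python
import asyncio

async def critical(lock: asyncio.Lock, forget_await: bool) -> bool:
    """Return whether the lock is actually held inside the 'critical section'."""
    if forget_await:
        pending = lock.acquire()  # only creates a coroutine; nothing is acquired
        held = lock.locked()
        pending.close()           # discard it explicitly to avoid a warning
        return held
    await lock.acquire()          # the fix: wait until the lock is acquired
    try:
        return lock.locked()
    finally:
        lock.release()

async def main() -> tuple[bool, bool]:
    lock = asyncio.Lock()
    return await critical(lock, True), await critical(lock, False)

buggy_held, fixed_held = asyncio.run(main())
print(buggy_held, fixed_held)
```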

Closes scylladb/scylladb#29806
2026-05-10 15:36:21 +03:00
Avi Kivity
5a887362e3 Merge 'Remove legacy tables creation code' from Gleb Natapov
Drop the code that creates the `service_levels` and `cdc_generation_descriptions_v2` tables, since it is no longer needed. Old clusters will still have these tables because they were created earlier. The series also contains a small improvement around group0 creation.

No backport needed since this removes functionality.

Closes scylladb/scylladb#29482

* github.com:scylladb/scylladb:
  db/system_distributed_keyspace: remove system_distributed_everywhere since it is unused
  db/system_distributed_keyspace: drop CDC_TOPOLOGY_DESCRIPTION and CDC_GENERATIONS_V2
  db/system_distributed_keyspace: remove unused code
  db/system_distributed_keyspace: drop old cdc_generation_descriptions_v2 table
  db/system_distributed_keyspace: drop old service_levels table
  fix indent after the previous patch
  group0: call setup_group0 only when needed
2026-05-10 14:46:21 +03:00
Botond Dénes
67226e6f1b scylla-gdb.py: interval_printer: update for new layout
interval switched from std::optional<> to union + bools for bound
storage in 42d7ae1082.
Update the printer to work with the new layout. Keep the code
backwards compatible, 2025.1 still uses optionals and is still
supported.

Closes scylladb/scylladb#29738
2026-05-10 14:28:24 +03:00
Avi Kivity
ece4e0738f Merge 'docs/cql: fix syntax errors in CQL examples' from Yaniv Kaul
Fix 4 genuine CQL syntax errors in documentation examples, found by automated extraction and execution of doc code blocks against a live ScyllaDB instance.

- **insert.rst**: `USING TTL 86400 IF NOT EXISTS` → `IF NOT EXISTS USING TTL 86400` (wrong clause order produces syntax error)
- **ddl.rst**: Missing opening quote in ALTER KEYSPACE example (`dc2'` → `'dc2'`)
- **ddl.rst**: Hyphenated column names need double-quoting; also fix PRIMARY KEY referencing non-existent `customer_id` instead of `cust_id`
- **types.rst**: UDT `address` contains nested collections, so it must be `frozen<address>` when used as a column type

Built a CQL extractor that parses `.. code-block:: cql` blocks from RST docs, then executed all 194 extracted statements against ScyllaDB 2026.2.0-rc0. These 4 are confirmed syntax/semantic errors in the documentation.
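
The extractor itself is not part of this PR description. A minimal sketch of how such an extraction could work is shown below, assuming only unindented `.. code-block:: cql` directives need to be handled; `extract_cql_blocks` and the sample table `t` are hypothetical:

```python
import textwrap

def extract_cql_blocks(rst: str) -> list[str]:
    """Collect the bodies of top-level `.. code-block:: cql` directives."""
    blocks, current, in_block = [], [], False

    def flush() -> None:
        blocks.append(textwrap.dedent("\n".join(current)).strip())

    for line in rst.splitlines():
        # A non-blank, unindented line ends the indented directive body.
        if in_block and line.strip() and not line.startswith((" ", "\t")):
            flush()
            in_block = False
        if line.rstrip() == ".. code-block:: cql":
            in_block, current = True, []
        elif in_block:
            current.append(line)
    if in_block:
        flush()
    return blocks

doc = """Example
.. code-block:: cql

   INSERT INTO t (k) VALUES (1) IF NOT EXISTS USING TTL 86400;

Some prose.
"""
print(extract_cql_blocks(doc))
```

Each extracted statement could then be executed against a test cluster with any CQL driver; that step is omitted here.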

Closes scylladb/scylladb#29765

* github.com:scylladb/scylladb:
  test/cqlpy: add tests for hyphenated column names
  docs/cql: fix UDT example to use frozen<address>
  docs/cql: fix CREATE TABLE example with hyphenated column names
  docs/cql: fix missing opening quote in ALTER KEYSPACE example
  docs/cql: fix INSERT example clause order (IF NOT EXISTS before USING)
2026-05-10 14:23:30 +03:00
Anna Stuchlik
61d1cbfd20 doc: add the upgrade guide from 2026.1 to 2026.2
This commit adds the upgrade guide, including the updated metrics.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1746

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1765

Closes scylladb/scylladb#29694
2026-05-10 14:20:09 +03:00
Benny Halevy
a797c9f10b table: delete sstables atomically per compaction group during truncate
Prepare for truncate of tables on object storage, where we want to
limit the atomic deletion batches to produce smaller batch mutations.

This is safe since truncate does not really need to delete all sstables
in the table atomically — it is already non-atomic since each node and
each shard deletes its own sstables. The atomic deletion mechanism is
used for convenience.

Previously, discard_sstables collected all sstables from all compaction
groups on the shard into a single vector and issued one atomic delete
for all of them. Change to track removed sstables per compaction group
and issue separate atomic deletes per group using
coroutine::parallel_for_each, allowing concurrent deletion across
groups.
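
A rough asyncio sketch of the new shape, with one "atomic" batch per
group issued concurrently. This is an analogue of the approach, not
the C++ code; all names and the (group, sstable) input format are
hypothetical:

```python
import asyncio
from collections import defaultdict

async def atomic_delete(batch: list[str], done: list[list[str]]) -> None:
    await asyncio.sleep(0)       # stands in for the real deletion I/O
    done.append(sorted(batch))   # record one atomic batch

async def discard_sstables(sstables: list[tuple[str, str]]) -> list[list[str]]:
    # Track removed sstables per compaction group instead of one big vector.
    per_group = defaultdict(list)
    for group, name in sstables:
        per_group[group].append(name)
    done: list[list[str]] = []
    # One delete per group, run concurrently across groups.
    await asyncio.gather(*(atomic_delete(b, done) for b in per_group.values()))
    return sorted(done)

batches = asyncio.run(discard_sstables([("g1", "a"), ("g2", "c"), ("g1", "b")]))
print(batches)
```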

Closes scylladb/scylladb#29789
2026-05-10 14:08:10 +03:00
Botond Dénes
d0813769ec sstables/trie: add preemption points in trie_writer
The BTI partition index trie writer flushes all buffered nodes at the
end of each SSTable via complete_until_depth(0), called from
bti_partition_index_writer_impl::finish(). This is a tight synchronous
loop that writes trie nodes through file_writer::write(), which uses a
buffered output_stream: individual writes that fit in the buffer are
plain memcpy operations returning a ready future, so .get() never
yields. As a result the reactor can stall for several milliseconds on
large SSTables.

The entire call chain runs inside seastar::async() (via
sstable::write_components()), so seastar::thread::maybe_yield() is
safe to call here. Add it at the top of both tight loops:
- complete_until_depth(), which iterates over trie depth
- lay_out_children(), which iterates over child branches per node
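
The effect of a preemption point can be illustrated with a Python
asyncio analogue (not Seastar code): a loop that never suspends starves
every other task, while an occasional explicit yield lets them run.
All names below are hypothetical:

```python
import asyncio

async def write_nodes(n: int, yield_every: int, log: list[str]) -> None:
    for i in range(n):
        log.append(f"node{i}")      # stands in for a buffered, non-blocking write
        if yield_every and (i + 1) % yield_every == 0:
            await asyncio.sleep(0)  # preemption point: let other tasks run

async def heartbeat(log: list[str], stop: asyncio.Event) -> None:
    while not stop.is_set():
        log.append("tick")
        await asyncio.sleep(0)

async def run(yield_every: int) -> list[str]:
    log: list[str] = []
    stop = asyncio.Event()
    hb = asyncio.create_task(heartbeat(log, stop))
    await write_nodes(4, yield_every, log)
    stop.set()
    await hb
    return log

starved = asyncio.run(run(yield_every=0)).count("tick")  # loop never yields
fair = asyncio.run(run(yield_every=2)).count("tick")     # loop yields periodically
print(starved, fair)
```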

Fixes SCYLLADB-1885

Closes scylladb/scylladb#29798
2026-05-10 11:30:59 +03:00
Botond Dénes
8ca0f2dd54 Merge 'raft: do not throw commit_status_unknown from add_entry when possible' from Patryk Jędrzejczak
Previously, when a snapshot load subsumed a committed entry before apply()
was called locally, add_entry would throw commit_status_unknown -- even
though the entry was known to be committed and included in the snapshot.
This was overly pessimistic. Normal state machine implementations
shouldn't care whether an entry was applied via apply() or via a snapshot load.
Unnecessary commit_status_unknown caused flakiness of
test_frequent_snapshotting and unnecessary retries in group0. Raft groups
from strongly consistent tables couldn't hit unnecessary
commit_status_unknown's because they use wait_type::committed and
`enable_forwarding == false`.

Three sites are changed:

1. wait_for_entry (truncation case): the snapshot-term match optimization
   that proved the entry was committed now applies to both wait_type::committed
   and wait_type::applied, not just committed.

2. wait_for_entry (snapshot covers entry): instead of throwing
   commit_status_unknown when the snapshot index >= entry index, return
   successfully. The entry's effects are included in the state machine's
   state via the snapshot.

3. drop_waiters: when called from load_snapshot, pass the snapshot term.
   Waiters whose term matches the snapshot term are resolved successfully
   (set_value) instead of failing with commit_status_unknown, since the
   Log Matching Property guarantees they were committed and included.
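
The rule in point 3 can be sketched as follows. This is a simplified
Python model, not the raft server code: the Waiter type, the result
strings, and the assumption that only waiters at or below the snapshot
index are affected are all hypothetical simplifications:

```python
from dataclasses import dataclass

@dataclass
class Waiter:
    index: int
    term: int
    result: str = "pending"

def drop_waiters(waiters: list[Waiter], snp_idx: int, snp_term: int) -> None:
    for w in waiters:
        if w.index > snp_idx:
            continue  # assumed: entries past the snapshot keep waiting
        # Matching term: the entry is committed and included in the snapshot.
        w.result = "ok" if w.term == snp_term else "commit_status_unknown"

waiters = [Waiter(3, 2), Waiter(4, 1), Waiter(9, 2)]
drop_waiters(waiters, snp_idx=5, snp_term=2)
print([w.result for w in waiters])
```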

This deflakes test_frequent_snapshotting: the test uses aggressive
snapshot settings (snapshot_threshold=1) causing wait_for_entry to
occasionally find the snapshot covering its entry. Previously this
threw commit_status_unknown, failing the test. With this fix,
wait_for_entry returns success. Note that apply() is never actually
skipped in this test -- the leader always applies entries locally
before taking a snapshot.

The nemesis test is updated to handle the new behavior:
call() detects when add_entry succeeded but the output channel was
not written (apply() skipped locally) and returns apply_skipped instead
of hanging. The linearizability checker in basic_generator_test counts
skipped applies separately from failures. basic_generator_test
exercises this path: skipped_applies > 0 occurs in some runs.

Fixes: SCYLLADB-1264

No backport: the changes are quite risky and the test being fixed
fails very rarely.

Closes scylladb/scylladb#29685

* github.com:scylladb/scylladb:
  test/raft: fix duplicate check in connected::operator()
  test/raft: add tests for add_entry snapshot interactions
  raft: do not throw commit_status_unknown from add_entry when possible
  raft: change drop_waiters parameter from index to snapshot descriptor
  raft: server: fix a typo
2026-05-08 16:39:52 +03:00
Patryk Jędrzejczak
4c3a86c515 test/raft: fix duplicate check in connected::operator()
The operator had a copy-paste bug: it checked
disconnected.contains({id1, id2}) twice instead of checking both
directions ({id1, id2} and {id2, id1}).

Reduce the operator to a single directional check: {id1, id2}. It works
for all current callers, and checking both directions correctly would
break the new block_receive() function.
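
The single directional check can be pictured with a small model. This is only a sketch: the class name, the `(sender, receiver)` ordering of the pairs, and the behavior of `block_receive` are assumptions made for illustration, not the test harness's actual code.

```python
class Connectivity:
    """Hypothetical stand-in for the test harness's `connected` helper."""

    def __init__(self):
        # Ordered (sender, receiver) pairs whose link is cut (assumed ordering).
        self.disconnected = set()

    def block_receive(self, receiver, senders):
        # Cut only the sender -> receiver direction; the reverse stays open.
        for sender in senders:
            self.disconnected.add((sender, receiver))

    def __call__(self, id1, id2):
        # Single directional check: only the ordered pair (id1, id2) matters.
        return (id1, id2) not in self.disconnected


links = Connectivity()
links.block_receive(receiver=1, senders=[2])
```

With a check of both directions, blocking 2 -> 1 would also cut 1 -> 2, which is why a one-way block needs the directional form.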
2026-05-08 11:18:02 +02:00
Patryk Jędrzejczak
ccd92c0b6b test/raft: add tests for add_entry snapshot interactions
Add six tests covering add_entry with wait_type::applied and
wait_type::committed for three snapshot scenarios affected in the
previous commit:

1. Snapshot at the entry's index (wait_for_entry, term_for returns
   snapshot term).

2. Snapshot past the entry's index (wait_for_entry, term_for returns
   nullopt).

3. Follower's waiter is resolved via drop_waiters when a snapshot
   is loaded.

Without the fix in the previous commit, 4 of 6 tests fail:
all 3 wait_type::applied tests and the wait_type::committed
drop_waiters test. The remaining two tests pass because the changes
don't affect them.

We don't write tests covering the scenarios when add_entry should
still throw commit_status_unknown (that is when the entry's term
doesn't match the snapshot's term) because:
- these tests would be very complicated,
- a bug that would make these tests fail should also make the
  nemesis tests fail, as there would be an issue with linearizability.
2026-05-08 11:18:02 +02:00
Patryk Jędrzejczak
a7f204ee45 raft: do not throw commit_status_unknown from add_entry when possible
Previously, when a snapshot load subsumed a committed entry before apply()
was called locally, add_entry would throw commit_status_unknown -- even
though the entry was known to be committed and included in the snapshot.
This was overly pessimistic. Normal state machine implementations
shouldn't care whether an entry was applied via apply() or via a snapshot load.
Unnecessary commit_status_unknown caused flakiness of
test_frequent_snapshotting and unnecessary retries in group0. Raft groups
from strongly consistent tables couldn't hit unnecessary
commit_status_unknown's because they use wait_type::committed and
`enable_forwarding == false`.

Three sites are changed:

1. wait_for_entry (truncation case): the snapshot-term match optimization
   that proved the entry was committed now applies to both wait_type::committed
   and wait_type::applied, not just committed.

2. wait_for_entry (snapshot covers entry): instead of throwing
   commit_status_unknown when the snapshot index >= entry index, return
   successfully. The entry's effects are included in the state machine's
   state via the snapshot.

3. drop_waiters: when called from load_snapshot, pass the snapshot term.
   Waiters whose term matches the snapshot term are resolved successfully
   (set_value) instead of failing with commit_status_unknown, since the
   Log Matching Property guarantees they were committed and included.
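
The drop_waiters rule from item 3 can be sketched as a simplified model. The `Waiter` and `SnapshotDescriptor` classes below are hypothetical, and the assumption that only waiters at or below the snapshot index are dropped is ours; the real Raft server code differs in detail.

```python
from dataclasses import dataclass


@dataclass
class Waiter:
    index: int
    term: int
    result: str = "pending"


@dataclass
class SnapshotDescriptor:
    idx: int
    term: int


def drop_waiters(waiters, snp):
    """Resolve waiters covered by snapshot `snp`; return those still pending."""
    remaining = []
    for w in waiters:
        if w.index > snp.idx:
            # Not covered by the snapshot: keep waiting (assumed behavior).
            remaining.append(w)
        elif w.term == snp.term:
            # Same term as the snapshot: the entry was committed and included.
            w.result = "ok"
        else:
            # Term mismatch: the entry may have been replaced.
            w.result = "commit_status_unknown"
    return remaining


waiters = [Waiter(3, 2), Waiter(4, 1), Waiter(9, 2)]
still_pending = drop_waiters(waiters, SnapshotDescriptor(idx=5, term=2))
```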

This deflakes test_frequent_snapshotting: the test uses aggressive
snapshot settings (snapshot_threshold=1) causing wait_for_entry to
occasionally find the snapshot covering its entry. Previously this
threw commit_status_unknown, failing the test. With this fix,
wait_for_entry returns success. Note that apply() is never actually
skipped in this test -- the leader always applies entries locally
before taking a snapshot.

The nemesis test is updated to handle the new behavior:
call() detects when add_entry succeeded but the output channel was
not written (apply() skipped locally) and returns apply_skipped instead
of hanging. The linearizability checker in basic_generator_test counts
skipped applies separately from failures. basic_generator_test
exercises this path: skipped_applies > 0 occurs in some runs.

Fixes: SCYLLADB-1264
2026-05-08 11:18:02 +02:00
Patryk Jędrzejczak
e2217c143f raft: change drop_waiters parameter from index to snapshot descriptor
Change drop_waiters(std::optional<index_t> idx) to
drop_waiters(const snapshot_descriptor* snp). The only caller that passes
an index is load_snapshot, which already has the full snapshot descriptor.
Using it directly makes the parameter self-documenting and prepares for the
following commit which will also need the snapshot term (a field of
snapshot_descriptor).
2026-05-08 11:18:02 +02:00
Patryk Jędrzejczak
3219786ab8 raft: server: fix a typo 2026-05-08 11:18:01 +02:00
Botond Dénes
a30ce98bc4 Merge 'test: speed up sstable compaction tests on remote storage (S3/GCS)' from Ernest Zaslavsky
Several sstable_compaction_test cases run prohibitively slowly on S3 and GCS backends — some taking 4+ minutes — because they create hundreds of SSTables sequentially over high-latency HTTP connections and perform redundant validation (checksumming) round-trips on every one. The twcs_reshape_with_disjoint_set S3 variant was even disabled entirely because of this.

The changes apply three complementary optimizations, per-test:

**Skip SSTable validation on remote storage.** The compaction tests verify strategy logic, not data integrity. SSTable validation triggers additional read-back I/O which is cheap on local disk but expensive over HTTP. A `do_validate` flag now conditionally skips validation when the storage backend is not local.

**Parallelize SSTable creation with async coroutines.** A new `make_sstable_containing_async` coroutine overload is added alongside the existing synchronous `make_sstable_containing`. Sequential creation loops are replaced with `parallel_for_each` using coroutine lambdas that call the async overload directly, overlapping S3/GCS uploads without spawning a dedicated Seastar thread per SSTable. The async validation path performs the same content checks as the synchronous version (mutation merging and `is_equal_to_compacted` assertions). Operations that depend on the created SSTables (e.g. `add_sstable_and_update_cache`, `owned_token_ranges` population) remain sequential.

**Reduce SSTable count for remote variants.** Tests like twcs_reshape_with_disjoint_set and stcs_reshape_overlapping used a hardcoded count of 256. The count is now a function parameter (default 256 for local, 64 for S3/GCS), which is sufficient to exercise the compaction strategy logic while avoiding excessive remote I/O.

Infrastructure changes: S3 endpoint max_connections raised from the default to 32 to support the higher upload concurrency, and trace-level logging added for s3, gcp_storage, http, and default_http_retry_strategy to aid future debugging.

The previously disabled twcs_reshape_with_disjoint_set_s3_test is re-enabled with these optimizations.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1428
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1843

No backport needed — this is a test-only performance improvement.

Closes scylladb/scylladb#29416

* github.com:scylladb/scylladb:
  test: optimize compaction_strategy_cleanup_method for remote storage
  test: optimize stcs_reshape_overlapping for remote storage
  test: optimize twcs_reshape_with_disjoint_set for remote storage
  test: parallelize SSTable creation in cleanup_during_offstrategy_incremental
  test: parallelize SSTable creation in run_incremental_compaction_test
  test: parallelize SSTable creation in offstrategy_sstable_compaction
  test: parallelize SSTable creation in twcs_partition_estimate
  test: add trace-level logging for S3 and HTTP in compaction tests
  test: make sstable test utilities natively async
    The original make_memtable used seastar::thread::yield() for preemption, which required all callers to run inside a seastar::thread context. This prevented the utilities from being used directly in coroutines or parallel_for_each lambdas.
    Make the primary functions (make_memtable, make_sstable_containing, and verify_mutation) return future<> directly. Callers now .get() explicitly when in seastar::thread context, or co_await when in a coroutine. make_memtable now uses coroutine::maybe_yield() instead of seastar::thread::yield(). verify_mutation is converted to coroutines as well.
    Requested in: https://github.com/scylladb/scylladb/pull/29416#pullrequestreview-4112296282
  test: move make_memtable out of external_updater in row_cache_test
  test: increase S3 max connections for compaction tests
2026-05-08 06:40:20 +03:00
Piotr Szymaniak
bc69fd7f11 alternator/streams: remove dead next_iter in get_records
The variable was constructed but never used — the original iterator
is returned instead. Fix the misleading comment to explain the
open-shard semantics of returning the original iterator.
2026-05-07 14:45:42 +02:00
Piotr Szymaniak
744848a85f test/alternator: fix stream wait timeouts to use wall-clock time
Both disable_stream and wait_for_active_stream used time.process_time()
for their timeouts, but process_time measures CPU time, not wall-clock
time. Since these loops spend most of their time sleeping and waiting on
API calls, the timeouts could last far longer than intended. Use
time.time() instead to enforce actual wall-clock deadlines.
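
The difference is easy to see in a sleep-heavy polling loop. The helper below is a minimal sketch with an illustrative name (`wait_until`), not the code from the test suite: the deadline is computed from time.time(), and the process_time() reading taken around the loop barely moves because sleeping consumes no CPU.

```python
import time


def wait_until(predicate, timeout_s, poll_s=0.05):
    """Poll `predicate` until it is true or `timeout_s` wall-clock seconds pass."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        if predicate():
            return True
        time.sleep(poll_s)  # Sleeping advances wall-clock time, not CPU time.
    return False


cpu_before = time.process_time()
wall_before = time.time()
result = wait_until(lambda: False, timeout_s=0.3)
cpu_spent = time.process_time() - cpu_before
wall_spent = time.time() - wall_before
```

Had the deadline been based on time.process_time(), the same loop would have kept running until about 0.3 seconds of CPU time had accumulated, which takes far longer in wall-clock terms.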
2026-05-07 14:45:42 +02:00
Piotr Szymaniak
04b9214cf5 docs/alternator: document stream disable/re-enable behavior 2026-05-07 14:45:42 +02:00
Piotr Szymaniak
38bd068f78 alternator/streams: keep disabled streams usable and purge on re-enable
Previously, disabling Alternator Streams would create a blank
cdc::options with only enabled=false, which also meant losing access
to the stored stream data (including preimage and postimage).

Now, when a stream is disabled:
- The existing CDC options are preserved (only 'enabled' is flipped to
  false), so StreamViewType remains available.
- DescribeStream enumerates all shards with EndingSequenceNumber set,
  indicating they are closed.
- GetRecords omits NextShardIterator for disabled streams.
- DescribeTable (supplement_table_stream_info) reports the stream ARN
  and StreamEnabled: false when the CDC log table still exists.
- ListStreams uses get_base_table instead of is_log_for_some_table so
  that disabled streams whose log table still exists are listed.

When a stream is re-enabled on an Alternator table that has an existing
(disabled) CDC log table, the old log table is dropped and a fresh one
is created with a new UUID, producing a new StreamArn. This is
Alternator-specific behavior; CQL CDC tables continue to reuse the
existing log table.

The old stream data is lost immediately upon re-enable. DynamoDB keeps
it readable for 24 hours.

Tests:
- test_streams_closed_read, test_streams_disabled_stream: remove xfail
  now that disabled streams are usable.
- test_streams_reenable: new test verifying that re-enabling produces
  a new ARN and the old data is still readable via the old ARN (xfail
  because Scylla currently purges old data on re-enable).

Fixes scylladb/scylladb#7239
2026-05-07 14:45:42 +02:00
Wojciech Mitros
ab12083525 test: propagate view update backlog before partition delete
In the test_delete_partition_rows_from_table_with_mv case we perform
a deletion of a large partition to verify that the deletion will
self-throttle when generating many view updates.
Before the deletion, we first build the materialized view, which causes
the view update backlog to grow. The backlog should be back to empty
when the view building finishes, and we do wait for that to happen, but
the information about the backlog drop may not be propagated to the
delete coordinator in time - the gossip interval is 1s and we perform
no other writes between the nodes in the meantime, so we don't make use
of the "piggyback" mechanism of propagating view backlog either. If the
coordinator thinks that the backlog is high on the replica, it may reject
the delete, failing this test.
We change this in this patch - after the view is built, we perform an
extra write from the coordinator. When the write finishes, the coordinator
will have the up-to-date view backlog and can proceed with the DELETE.
Additionally, we enable the "update_backlog_immediately" injection, which
makes the node backlog (the highest backlog across shards) update immediately
after each change.

Fixes: SCYLLADB-1795

Closes scylladb/scylladb#29775
2026-05-07 11:33:13 +03:00
Jenkins Promoter
454a8e6966 Update pgo profiles - aarch64 2026-05-07 10:09:36 +03:00
Andrzej Jackowski
eb241a7048 test: make preemptive abort coverage deterministic
The test used a real-time sleep to move the queued permit into the
preemptive-abort window. If the reactor did not get CPU for long
enough, admission could run only after the permit's timeout had
expired, making the expected abort path flaky.

The test also exhausted memory together with count resources, so the
queued permit could wait for memory. Preemptive abort is intentionally
not applied to permits waiting for memory, so keep enough memory
available and assert that the permit is queued only on count.

Use an immediate preemptive-abort threshold and a long finite timeout
to exercise admission-time abort without relying on scheduler timing.

Fixes: SCYLLADB-1796

Closes scylladb/scylladb#29736
2026-05-07 09:59:53 +03:00
Jenkins Promoter
5385df02ec Update pgo profiles - x86_64 2026-05-07 09:22:20 +03:00
Patryk Jędrzejczak
25fd1001c2 Merge 'alternator: improve CreateTable/UpdateTable schema agreement timeout' from Nadav Har'El
CreateTable and UpdateTable call wait_for_schema_agreement() after announcing the schema change, to ensure all live nodes have applied the new schema before returning to the user. This wait has a hard-coded 10 second timeout, and on some overloaded test machines we saw it not completing in time, and causing tests to become flaky.

This patch increases this timeout from 10 seconds to 30 seconds. It's still hard-coded and not configurable via alternator_timeout_in_ms because it is unlikely any user will want to change it - it just needs to be long.

The patch also improves the behavior of a schema-agreement timeout, when it happens:

1. Provide an InternalServerError with more descriptive text.
2. This InternalServerError tells the user that the result of the operation is unknown; So the user will repeat the CreateTable, and will get a ResourceInUseException because the table exists. In that case too, we need to wait for schema agreement. So we added this missing wait.

Fixes SCYLLADB-1804
Refs #5052 (claiming CreateTable shouldn't wait at all)

This patch is only important to improve test stability on extremely slow test machines where schema agreement sometimes (very rarely) takes over 10 seconds. It's not important to backport it to branches that don't run CI very often on slow machines.

Closes scylladb/scylladb#29744

* https://github.com/scylladb/scylladb:
  alternator: improve CreateTable/UpdateTable schema agreement timeout
  migration_manager: unique timeout exception for wait_for_schema_agreement()
2026-05-06 16:56:46 +02:00
Ferenc Szili
ec4b483e88 test: fix flaky test_tablets_split_merge_with_many_tables
In debug mode, this test can timeout during tablets merge. While the
test already decreases the number of tables in debug mode (20 tables,
instead of 200 for dev mode), this is not enough, and the test can still
timeout during merge. This change reduces the number of tables from 20
to 5 in debug mode.

It also drops the log level for load_balancer to debug. This should make
any potential future problems with this test easier to investigate.

Fixes: SCYLLADB-1717

Closes scylladb/scylladb#29682
2026-05-06 17:02:10 +03:00
Petr Gusev
cab043323d test/cluster: fix test_lwt_fencing_upgrade flakiness during rolling upgrade
Replace the naive host.is_up check with wait_for_cql_and_get_hosts() which
actually executes a query against each host, ensuring the driver's connection
pool is fully re-established before proceeding to stop the last server.

The is_up flag is set asynchronously via gossip and doesn't guarantee the
connection pool has live TCP connections. After a server restart, the flag
may be True while the pool still holds stale connections. When the pool
monitor later discovers them dead it briefly marks the host DOWN, causing
NoHostAvailable if another server is being stopped concurrently.

Fixes SCYLLADB-1840

Closes scylladb/scylladb#29769
2026-05-06 15:40:09 +03:00
Tomasz Grabiec
d6346e68c1 Merge 'prevent gossiper from marking nodes as down in tests unexpectedly' from Patryk Jędrzejczak
This PR includes two changes that make gossiper much less likely to mark
nodes as down in tests unexpectedly, and cause test flakiness in issues
like SCYLLADB-864:
- fixing false node conviction when echo succeeds,
- increasing the failure_detector_timeout fixture.

Fixes: SCYLLADB-864

No need for backport: related CI failures are rare, and merging #29522
made them even more unlikely (I haven't seen one since then, but it's
still possible to reproduce locally on dev machines).

Closes scylladb/scylladb#29755

* github.com:scylladb/scylladb:
  test/cluster: increase failure_detector_timeout
  gossiper: fix false node conviction when echo succeeds
2026-05-06 14:01:15 +02:00
Piotr Dulikowski
1dccfeb988 Merge 'vector_search: test: fix flaky test_dns_resolving_repeated' from Karol Nowacki
The `vector_store_client_test_dns_resolving_repeated` test was intermittently
timing out on CI. The exact root cause is not fully understood, but the
hypothesis is that a single trigger signal can be lost somewhere (not exactly
known where). This is not an issue for the production code because the refresh
trigger is called repeatedly whenever all configured nodes are
unreachable.

Fixes SCYLLADB-1794

Backport to 2026.1 and 2026.2, as the same CI flakiness can occur on these branches.

Closes scylladb/scylladb#29752

* github.com:scylladb/scylladb:
  vector_search: test: default timeout in test_dns_resolving_repeated
  vector_search: test: fix flaky test_dns_resolving_repeated
2026-05-06 13:46:36 +02:00
Botond Dénes
8d22ef3058 Merge 'commitlog_test.py: Fix size check aliasing, and threshold calc and fix CL chunk size est.' from Calle Wilund
Fixes: SCYLLADB-1815

If we're in a brand new chunk (no buffer yet allocated), we would miscalculate the actual size of an entry to write, possibly causing segment size overshoot. Break out some logic to share between this calc and new_buffer. Also remove redundant (and possibly wrong) constant in oversized allocation.

As for the test:
Checking segment sizes should not use a size filter that rounds (up) sizes.
More importantly, the estimate for what is acceptable limit for commitlog disk usage should be aligned. Simplified the calc, and also made logging more useful in case of failure.

Closes scylladb/scylladb#29753

* github.com:scylladb/scylladb:
  commitlog_test.py: Fix size check aliasing, and threshold calc.
  commitlog: Fix segment/chunk overhead maybe not included in next_position calculation
2026-05-06 13:48:41 +03:00
Piotr Dulikowski
321006ecbd Merge 'auth: fix crash on ghost rows in role_permissions' from Marcin Maliszkiewicz
The auth cache crashes when it encounters rows in role_permissions that have a live row marker but no permissions column. These “ghost rows” were created by the now-removed auth v2 migration, which used INSERT (creating row markers) instead of UPDATE.

When permissions were later revoked, the row marker remained while the permissions column became null. An empty collection appears as null, since its lifetime is based only on its elements' cells.

As a result, when the cache reloads and expects the permissions column to exist, it hits a missing_column exception.

The series removes dead code that was the primary crash site, adds has() guards to the remaining access paths, and includes a test reproducer.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1816

Backport: all supported versions 2026.1, 2025.4, 2025.1

Closes scylladb/scylladb#29757

* github.com:scylladb/scylladb:
  test: add reproducer for auth cache crash on missing permissions column
  auth: tolerate missing permissions column in authorize()
  auth: add defensive has() guard for role_attributes value column
  auth: remove unused permissions field from cache role_record
2026-05-06 12:00:17 +02:00
Yaron Kaikov
65eabda833 pgo: fix ModuleNotFoundError in exec_cql.py by reverting safe_driver_shutdown
Commit cf237e060a introduced 'from test.pylib.driver_utils import
safe_driver_shutdown' in pgo/exec_cql.py. This module runs during PGO
profile training (a build step) where the test package is not on the
Python path, causing an immediate ModuleNotFoundError on both x86 and
ARM. Revert to plain cluster.shutdown() which is sufficient for the
single-use PGO training scenario.

Fixes: SCYLLADB-1792

Closes scylladb/scylladb#29746
2026-05-06 11:22:23 +02:00
Yaniv Michael Kaul
7557c64f20 test/cqlpy: add tests for hyphenated column names
Verify that double-quoted column names with hyphens (e.g. "my-col")
work correctly for CREATE TABLE, INSERT, and SELECT. Also verify that
unquoted hyphenated names are rejected with a syntax error.
2026-05-06 11:32:04 +03:00
Yaniv Michael Kaul
d13a56be2e docs/cql: fix UDT example to use frozen<address>
The 'address' UDT contains a nested collection (map<text, frozen<phone>>),
so it must be frozen when used as a column type. Non-frozen UDTs with
nested non-frozen collections are not supported.
2026-05-06 11:32:04 +03:00
Yaniv Michael Kaul
5c528e4e02 docs/cql: fix CREATE TABLE example with hyphenated column names
Column names containing hyphens must be double-quoted. Also fix
the PRIMARY KEY reference from 'customer_id' (non-existent) to
'cust_id' (the actual column).
2026-05-06 11:32:04 +03:00
Yaniv Michael Kaul
3e2b0f844c docs/cql: fix missing opening quote in ALTER KEYSPACE example
The dc2 key was missing its opening single quote: dc2' should be 'dc2'.
2026-05-06 11:32:04 +03:00
Yaniv Michael Kaul
815aad50af docs/cql: fix INSERT example clause order (IF NOT EXISTS before USING)
The grammar requires IF NOT EXISTS to appear before USING TTL,
not after. The example had 'USING TTL 86400 IF NOT EXISTS' which
produces a syntax error.
2026-05-06 11:32:04 +03:00
Karol Nowacki
20b953ef8c vector_search: test: migrate paging warnings tests to Python
Move the paging warning related tests from C++ vector_store_client_test to
Python test_vector_search_with_vector_store_mock.
2026-05-05 18:23:30 +02:00
Karol Nowacki
84787ce6a5 vector_search: test: migrate local_vector_index to Python
Move the local vector index test from C++ vector_store_client_test to
Python test_vector_search_with_vector_store_mock.

The test creates a local vector index on ((pk1, pk2), embedding) and
verifies that SELECT with partition key restriction and ANN ordering
works correctly.
2026-05-05 18:23:30 +02:00
Karol Nowacki
0bb7e47090 vector_search: test: migrate vector_index_with_additional_filtering_column to Python
Move the SCYLLADB-635 regression test from C++ vector_store_client_test
to Python test_vector_search_with_vector_store_mock.

The test creates a vector index on (embedding, ck1) and verifies that
SELECT with ANN ordering works correctly when additional filtering
columns are included in the index definition.
2026-05-05 18:23:30 +02:00
Karol Nowacki
5a8af3c727 vector_search: test: migrate cql_error_contains_http_error_description to Python
Move the test that verifies HTTP error descriptions from the vector
store are propagated through CQL InvalidRequest messages from the C++
vector_store_client_test to the Python test_vector_search_with_vector_store_mock.

The test configures the mock to return HTTP 404 with 'index does not
exist' and asserts the CQL SELECT raises InvalidRequest containing '404'.
2026-05-05 18:23:30 +02:00
Karol Nowacki
b672972c5f vector_search: test: migrate pk in restriction test to Python
Move vector search (ANN ordered select query) with IN restrictions on
partition key from C++/Boost test suite to pytest (cqlpy).

Add VectorStoreMock server as pytest fixture to simulate vector store
responses.
2026-05-05 18:23:30 +02:00
Karol Nowacki
207de967fb vector_search: test: default timeout in test_dns_resolving_repeated
Replace explicit 1-second timeouts in repeat_until() with the default
STANDARD_WAIT (10s). The 1-second timeout could be too aggressive for
loaded CI environments where lowres_clock granularity (~10ms) combined
with OS scheduling delays and resource contention (-c2 -m2G) could cause
the loop to expire before the DNS refresh task completes its cycle.

This also unifies test timeouts across test cases.
2026-05-05 17:23:39 +02:00
Karol Nowacki
4722be1289 vector_search: test: fix flaky test_dns_resolving_repeated
Move trigger_dns_resolver() inside the repeat_until loop instead of
calling it once before the loop.

The test was intermittently timing out on CI. The exact root cause is not
fully understood, but the hypothesis is that a single trigger signal can
be lost somewhere (not exactly known where). This is not an issue for the
production code because the refresh trigger is called repeatedly, in
every query where all configured nodes are unreachable.

By triggering inside the loop, we ensure the signal is re-sent on
each iteration until the resolver actually performs the refresh and
picks up the new (failing) DNS resolution. This makes the test
resilient to timing-dependent signal loss without changing production
code.
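
The effect of moving the trigger into the loop can be illustrated with a toy model. Everything here is hypothetical: `LossyResolver` simply drops the first trigger it receives to imitate the suspected signal loss, and `repeat_until` is a bounded-iteration stand-in for the test helper of the same name.

```python
class LossyResolver:
    """Hypothetical resolver that loses the first trigger it receives."""

    def __init__(self):
        self.triggers_seen = 0
        self.refreshed = False

    def trigger(self):
        self.triggers_seen += 1
        if self.triggers_seen > 1:  # The first signal is "lost".
            self.refreshed = True


def repeat_until(condition, max_iterations=100):
    """Call `condition` until it returns True or the iterations run out."""
    for _ in range(max_iterations):
        if condition():
            return True
    return False


def trigger_and_check(resolver):
    resolver.trigger()  # Re-sent on every iteration.
    return resolver.refreshed


# Old shape: trigger once before the loop, then only poll.
once = LossyResolver()
once.trigger()
lost = repeat_until(lambda: once.refreshed)

# New shape: trigger inside the loop.
retried = LossyResolver()
ok = repeat_until(lambda: trigger_and_check(retried))
```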

Fixes: SCYLLADB-1794
2026-05-05 17:23:39 +02:00
Marcin Maliszkiewicz
5c5306c692 test: add reproducer for auth cache crash on missing permissions column 2026-05-05 17:16:25 +02:00
Marcin Maliszkiewicz
df69a5c79b auth: tolerate missing permissions column in authorize()
Ghost rows in role_permissions with a live row marker but no permissions
column can occur when permissions created via INSERT (e.g. by the removed
auth v2 migration) are later revoked. The row marker survives the revoke,
leaving a row visible to queries but with permissions=null.

Add a has() guard before accessing the permissions column, matching the
pattern already used in list_all(). Return NONE permissions for such
ghost rows instead of crashing.
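
As a rough illustration of the guard, the sketch below models a result row as a plain dict. The row shape and the names `NONE` and `permissions_of` are assumptions made for the example; the real code works on CQL result rows in C++.

```python
NONE = frozenset()  # Stand-in for the "no permissions" value.


def permissions_of(row):
    """Return the permissions stored in `row`, or NONE for a ghost row."""
    # Equivalent of the has() guard: the column may be absent or null.
    perms = row.get("permissions")
    if perms is None:
        return NONE
    return frozenset(perms)


ghost_row = {"role": "alice", "resource": "data/ks"}
live_row = {"role": "alice", "resource": "data/ks", "permissions": {"SELECT"}}
```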
2026-05-05 15:50:40 +02:00
Marcin Maliszkiewicz
c44625ebdf auth: add defensive has() guard for role_attributes value column
Add a has() check before accessing the value column in role_attributes
to tolerate ghost rows with missing regular columns. In practice this
is unlikely to be a problem since attributes are not typically revoked,
but the guard is added for consistency and defensive programming.
2026-05-05 15:48:01 +02:00
Marcin Maliszkiewicz
797bc28aae auth: remove unused permissions field from cache role_record
The permissions field in role_record was populated by fetch_role() but
never read. Authorization uses cached_permissions instead, which is
loaded via the permission_loader callback. Remove the dead field and
its fetch code.

The removed code also did not check for missing columns before accessing
the permissions set, which could crash on ghost rows left by the removed
auth v2 migration. The migration used INSERT (creating row markers),
and when permissions were later revoked, the row marker survived while
the permissions column became null.
2026-05-05 15:48:01 +02:00
Marcin Maliszkiewicz
c00fee0316 Merge 'utils: loading_cache: add insert() that is a no-op when caching is disabled' from Dario Mirovic
When `permissions_validity_in_ms` is set to 0, executing a prepared statement under authentication crashes with:
```
    Assertion `caching_enabled()' failed.
        at utils/loading_cache.hh:319
        in authorized_prepared_statements_cache::insert
```

`loading_cache::get_ptr()` asserts when caching is disabled (expiry == 0), but `authorized_prepared_statements_cache::insert()` was using it purely for its side effect of populating the cache, which is meaningless when caching is off.

Add a new `loading_cache::insert(k, load)` method that is a no-op when caching is disabled and otherwise forwards to `get_ptr()`. Switch `authorized_prepared_statements_cache::insert()` to use it. This
completes the disabled-mode safety contract of the cache for the write side, mirroring the fallback that `get()` already provides for the read side.
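
The contract described above can be sketched in a few lines. This is a simplified model, not the C++ `loading_cache`: the class below has no expiry or eviction logic and only mirrors the split between an asserting `get_ptr()` and a tolerant `insert()`.

```python
class LoadingCache:
    """Simplified model of a loading cache with a disabled mode (expiry == 0)."""

    def __init__(self, expiry_ms):
        self._expiry_ms = expiry_ms
        self._entries = {}

    def caching_enabled(self):
        return self._expiry_ms != 0

    def get_ptr(self, key, load):
        # Mirrors the assertion that fired in the crash.
        assert self.caching_enabled()
        if key not in self._entries:
            self._entries[key] = load(key)
        return self._entries[key]

    def insert(self, key, load):
        # No-op when caching is disabled; otherwise populate via get_ptr().
        if not self.caching_enabled():
            return
        self.get_ptr(key, load)


disabled = LoadingCache(expiry_ms=0)
disabled.insert("stmt", lambda k: "prepared")

enabled = LoadingCache(expiry_ms=1000)
enabled.insert("stmt", lambda k: "prepared")
```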

Includes a regression test in `test/boost/loading_cache_test.cc` plus a positive test for the new `insert()` overload.

Fixes SCYLLADB-1699

The crash was introduced a long time ago. It is present on all the live versions, from 2025.1 onward. No client tickets, but it should be backported.

Closes scylladb/scylladb#29638

* github.com:scylladb/scylladb:
  test: boost: regression test for loading_cache::insert with caching disabled
  utils: loading_cache: add insert() that is a no-op when caching is disabled
2026-05-05 15:33:49 +02:00
Patryk Jędrzejczak
9f692857be test/cluster: increase failure_detector_timeout
Scaling the timeout by build mode (#29522) turned out to be not sufficient.
Nodes can still be unexpectedly marked as down, even with a 4s timeout in
dev mode. I managed to reproduce SCYLLADB-864 in such conditions.

Increasing failure_detector_timeout will proportionally slow down tests
that use it. That's bad, but currently these tests' flakiness is a much
bigger problem than the tests' slowness. Also, not many tests use this
fixture, and we hope to make it unneeded eventually (see #28495).
2026-05-05 15:12:33 +02:00
Patryk Jędrzejczak
efe0e39d85 gossiper: fix false node conviction when echo succeeds
failure_detector_loop_for_node() could falsely convict a healthy node
even when the echo succeeded. The code computed diff = now - last
(time since last successful echo) and checked diff > max_duration
unconditionally, regardless of whether the current echo failed or
succeeded.

This caused flakiness in tests that decrease the failure detector
timeout. We currently run #CPUs tests concurrently, and since cluster
tests start multiple nodes with 2 shards, multiple shards contend for
one CPU. As a result, some tasks can become abnormally slow and block
the failure detector loop execution for a few seconds.

Fix by only checking diff > max_duration when the echo actually
failed.

Note that we send echo with the timeout equal to `max_duration` anyway,
so the receiver will be marked as down if it really doesn't respond.
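
The corrected decision can be written as a small pure function. This is a sketch of the logic only; the function name and the tuple it returns are illustrative and do not match the gossiper's C++ code.

```python
def should_convict(echo_ok, now, last_success, max_duration):
    """Return (convict, new_last_success) for one failure-detector iteration."""
    if echo_ok:
        # A successful echo proves the node is alive, however late the loop ran.
        return False, now
    # Only a failed echo makes the time since the last success relevant.
    return (now - last_success) > max_duration, last_success


# The loop was stalled for 10s, but the echo then succeeded: no conviction.
stalled_but_alive = should_convict(True, now=10.0, last_success=0.0, max_duration=2.0)
# The echo failed and the last success is too old: convict.
really_down = should_convict(False, now=10.0, last_success=0.0, max_duration=2.0)
```

Before the fix, the first case would also have produced a conviction, because the elapsed-time check ran regardless of the echo result.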
2026-05-05 15:12:32 +02:00
Patryk Jędrzejczak
b69d00b0a7 Merge 'Barrier and drain logging' from Gleb Natapov
Add more logging to barrier and drain rpc to try and pinpoint https://github.com/scylladb/scylladb/issues/26281

Backport since we want to have it if it happens in the field.

Fixes: SCYLLADB-1821
Refs: #26281

Closes scylladb/scylladb#29735

* https://github.com/scylladb/scylladb:
  session, raft_topology: add periodic warnings for hung drain and stale version waits
  session: add info-level logging to drain_closing_sessions
  raft_topology: log sub-step progress in local_topology_barrier
  raft_topology: log read_barrier progress in topology cmd handler
2026-05-05 15:04:50 +02:00
Calle Wilund
5cdfdd9ba3 commitlog_test.py: Fix size check aliasing, and threshold calc.
Fixes: SCYLLADB-1815

Checking segment sizes should not use a size filter that rounds
(up) sizes.
More importantly, the estimate for what is acceptable limit for
commitlog disk usage should be aligned. Simplified the calc, and
also made logging more useful in case of failure.
2026-05-05 14:42:55 +02:00
Nadav Har'El
b70beb3e13 alternator: improve CreateTable/UpdateTable schema agreement timeout
CreateTable and UpdateTable call wait_for_schema_agreement() after
announcing the schema change, to ensure all live nodes have applied
the new schema before returning to the user. This wait has a hard-
coded 10 second timeout, and on some overloaded test machines we
saw it not completing in time, and causing tests to become flaky.

This patch increases this timeout from 10 seconds to 30 seconds.
It's still hard-coded and not configurable via alternator_timeout_in_ms
because it is unlikely any user will want to change it - it just needs
to be long.

The patch also improves the behavior of a schema-agreement timeout,
when it happens:

1. Provide an InternalServerError with more descriptive text.
2. This InternalServerError tells the user that the result of the
   operation is unknown, so the user will repeat the CreateTable and
   will get a ResourceInUseException because the table exists. In that
   case too, we need to wait for schema agreement, so we added this
   missing wait.

Fixes SCYLLADB-1804
Refs #5052 (claiming CreateTable shouldn't wait at all)

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-05 15:41:06 +03:00
Calle Wilund
8d65a03951 commitlog: Fix segment/chunk overhead maybe not included in next_position calculation
Refs: SCYLLADB-1757
Refs: SCYLLADB-1815

If we're in a brand new chunk (no buffer yet allocated), we would miscalculate the
actual size of an entry to write, possibly causing segment size overshoot.

Break out some logic to share between this calc and new_buffer. Also remove redundant
(and possibly wrong) constant in oversized allocation.
2026-05-05 14:39:06 +02:00
Nadav Har'El
1f15e05946 test: fix replica_read_timeout_no_exception flakiness on slow systems
The test uses a 10ms read timeout to exercise code paths that handle
timed-out reads without throwing C++ exceptions.  As part of setup, it
inserts rows and flushes them to two SSTables, then runs a warm-up
SELECT to populate internal caches (e.g. the auth cache) before the
real test begins.

The reason for this warm-up read was the possibility that the first
read does additional operations (such as reading and caching
authentication) that might throw exceptions internally. I couldn't
verify that such exceptions actually happen in today's code, but
they might (re)appear in the future, so we should keep the warm-up
SELECT.

On slow CI machines (aarch64, debug build), that warm-up SELECT can
take longer than 10ms to read from the two SSTables.  When it does, the
read times out: the coordinator receives 0 responses from the local
replica within the deadline and propagates a read_timeout_exception.
Since the exception is not caught, it escapes the test lambda, is
logged as "cql env callback failed", and causes Boost.Test to report a
C++ failure at the do_with_cql_env_thread call site.  This matches the
CI failure seen in SCYLLADB-1774:

  ERROR ... replica_read_timeout_no_exception: cql env callback failed,
  error: exceptions::read_timeout_exception (Operation timed out for
  replica_read_timeout_no_exception.tbl - received only 0 responses
  from 1 CL=ONE.)

The CI log also shows that only 12 reads were admitted (the warm-up
read plus the 11 reads from the two prepare() calls and CREATE/INSERT
statements made earlier), and the current permit was stuck in
need_cpu state -- the reactor hadn't had a chance to schedule the read
before the 10ms window elapsed.

The fix catches read_timeout_exception from the warm-up SELECT and
retries until the read succeeds. The warm-up is required for
correctness: some lazy-init code paths (e.g. auth cache population)
use C++ exceptions for control flow internally. Those exceptions must
be absorbed before the cxx_exceptions baseline is sampled inside
execute_test(); otherwise they would appear in the delta and cause a
false test failure. Simply ignoring a timed-out warm-up is not safe,
because the lazy-init exceptions would then fire during the 1000 test
reads, inflating cxx_exceptions_after relative to
cxx_exceptions_before.
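
The retry described above can be sketched roughly as follows. This is an illustration in Python, not the actual C++ Boost test: `ReadTimeout`, `warm_up`, `make_flaky_select`, and the `max_attempts` bound are all hypothetical names and assumptions introduced for the example.

```python
class ReadTimeout(Exception):
    """Hypothetical stand-in for exceptions::read_timeout_exception."""


def warm_up(select, max_attempts=1000):
    """Call select() until it succeeds, absorbing timeouts.

    Returns the number of attempts that were needed. The attempt bound is
    an assumption of this sketch, added so the loop cannot spin forever.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            select()
            return attempt
        except ReadTimeout:
            continue  # slow machine: the short deadline expired, try again
    raise RuntimeError("warm-up read never succeeded")


def make_flaky_select(failures):
    """Build a fake SELECT that times out `failures` times, then succeeds."""
    state = {"left": failures}

    def select():
        if state["left"] > 0:
            state["left"] -= 1
            raise ReadTimeout()
        return "row"

    return select
```

The point of retrying rather than ignoring the timeout is that the warm-up only counts once a read has actually completed, so any lazy initialization has already run before the baseline is sampled.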

No other calls in setup are susceptible to the 10ms read timeout:
- CREATE KEYSPACE, CREATE TABLE, INSERT, and flush use the write
  timeout (10s) and are not reads.
- e.prepare() goes through the query processor without reading table
  data, so it is not subject to the read timeout.
- The semaphore manipulation in Test 2 is internal and has no timeout.
- All 1000 reads in execute_test() are expected to fail, so a timeout
  there is the happy path, not a failure.

The 10ms timeout itself is fine for the test's purpose: it is
deliberately aggressive so that reads reliably time out on the hot path
being tested.  The problem was only that the pre-test warm-up was not
guarded against the same timeout.

Fixes: SCYLLADB-1774

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#29731
2026-05-05 15:13:13 +03:00
Botond Dénes
afd9a55891 Merge 'test/cluster: wait for custom listener readiness' from Piotr Smaron
server_add() defaults to CQL_ALTERNATOR_QUERIED. That proves the regular CQL driver path is queryable, along with any regular Alternator ports listed in the YAML config. It does not prove that every custom listener the test will connect to is already accepting raw TCP connections.
test_proxy_protocol_ssl_shard_aware connects directly to the shard-aware TLS proxy-protocol CQL port immediately after server startup. Wait for ServerUpState.SERVING in the fixture so the custom proxy-protocol listener is registered before opening raw sockets.
test_uninitialized_conns_semaphore opens a raw TCP connection to native_shard_aware_transport_port immediately after startup. The default readiness check can succeed through native_transport_port while the shard-aware listener is still being started, because CQL listeners are registered independently. Wait for ServerUpState.SERVING before opening raw sockets.
test_perf_alternator_remote now asks server_add() to wait for SERVING and uses the returned server address directly. This removes the redundant running_servers() plus get_ready_cql() sequence noted in review.

Fixes: SCYLLADB-1797

No backport as of now, only appeared on master.

Closes scylladb/scylladb#29737

* github.com:scylladb/scylladb:
  test/cluster: avoid redundant perf alternator CQL wait
  test/cluster: wait for shard-aware CQL listener
  test/cluster: wait for proxy protocol ports to serve
2026-05-05 14:45:58 +03:00
Nadav Har'El
5895dff03b migration_manager: unique timeout exception for wait_for_schema_agreement()
Before this patch, if wait_for_schema_agreement() times out, it threw
a generic std::runtime_error, making it inconvenient for callers to
catch this error only. So in this patch we create and use a new exception
type, schema_agreement_timeout, based on seastar::timed_out_error.
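
As a rough analogy in Python (the real change is in C++), a dedicated exception type derived from a generic timeout lets a caller translate only this failure. All names below are hypothetical stand-ins, and the `agreed` flag replaces the real polling logic.

```python
class SchemaAgreementTimeout(TimeoutError):
    """Hypothetical Python analogue of schema_agreement_timeout."""


def wait_for_schema_agreement(agreed):
    # Toy stand-in: the real function waits until a deadline expires.
    if not agreed:
        raise SchemaAgreementTimeout("schema agreement not reached in time")


def create_table(agreed):
    try:
        wait_for_schema_agreement(agreed)
        return "ok"
    except SchemaAgreementTimeout:
        # Only this specific failure is translated; other errors propagate.
        return "result unknown, retry"
```

Because the new type still derives from the generic timeout type, existing handlers that catch the base class keep working.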

Although wait_for_schema_agreement(), added in commit
a429018a8a, was a utility function used in
a dozen places, it has become less interesting since we introduced schema
changes over Raft. Over the years most of the callers of this function
were removed, except one in view.cc which uses an infinite timeout and so
doesn't care about the timeout exception type.

In the next patch we want to add a new caller which *does* care about
the timeout exception type - hence this patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-05-05 10:38:38 +03:00
Piotr Dulikowski
efcc0b6376 Merge 'table_helper: fix use-after-free on prepared-statement invalidation' from Marcin Maliszkiewicz
insert() held no local strong ref to the prepared modification_statement
across the suspension in execute(). On a single shard:

1. Fiber A suspends inside _insert_stmt->execute().
2. DROP TABLE / DROP KEYSPACE on the target, or LRU eviction, removes
   the prepared_statements_cache entry, releasing its strong ref.
3. Fiber B re-enters cache_table_info(), sees _prepared_stmt
   (checked_weak_ptr) invalidated, and runs _insert_stmt = nullptr,
   releasing the last strong ref. The modification_statement is freed.
4. Fiber A resumes inside execute() and touches freed *this.

Pin strong ref to _insert_stmt locally before the suspension.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1667

Backport: all supported branches, it's a memory corruption bug, long present

Closes scylladb/scylladb#29588

* github.com:scylladb/scylladb:
  test/boost: add dummy case to table_helper_test for non-injection modes
  test/boost: add regression test for table_helper insert() UAF
  utils/error_injection: add waiters() API
  table_helper: fix use-after-free on prepared-statement invalidation
2026-05-04 17:21:05 +02:00
Piotr Smaron
a3360ee385 test/nodetool: fix mock server port race by using a fixed port on a unique IP
Symptom: the rest_api_mock subprocess exits with status 1 during fixture
setup, e.g.:

    subprocess.CalledProcessError: Command '[..., 'rest_api_mock.py',
        '127.29.88.1', '34093']' returned non-zero exit status 1

Root cause: aiohttp's TCPSite.start() raises OSError(EADDRINUSE) and the
process exits 1. The bind fails because of how the (ip, port) pair is
chosen across modules within one test.py process:

  * Each test module leases a 127.x.y.z IP from the host registry. The
    registry recycles released IPs, so the same IP is shared across
    modules sequentially.
  * The original code picked the port via random.randint(10000, 65535).
    A previous module on the same IP could have left that port in
    TIME_WAIT (or worse, still actively in use) when a later module
    happened to pick the same port.

SCYLLADB-1275 (PR 29314) tried to fix this by binding a probe socket to
(ip, 0) to obtain an OS-assigned free port, closing the probe, then
launching the mock server which would bind to that port. Two issues
remained:

  1. TOCTOU: between probe close and mock-server bind, any other process
     on the host could grab the just-freed port.
  2. TIME_WAIT could still bite if the host registry recycled an IP and
     the OS reused the same port number for the probe.

Fix: drop port discovery entirely. Use a fixed port (12345, matching the
unshare-namespace path already in this fixture) on the unique IP from
the host registry. Because IPs are unique per test module within one
test.py process, the (ip, 12345) pair is unique to each module, so no
port-collision dance is needed.

reuse_address=True on TCPSite handles the residual TIME_WAIT case when
the host registry recycles an IP within the same test.py process and
the previous mock server's socket has not finished TIME_WAIT yet.
reuse_port=True is dropped, as it was only useful while attempting to
have multiple processes share a single port.

This mirrors the design used in test/cqlpy/run.py: pick a unique IP,
keep the port fixed.
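
The address-reuse part of the fix can be illustrated with plain sockets. This is a minimal sketch, not the fixture's code: the fixture uses aiohttp's TCPSite, and `open_listener` is a hypothetical helper written for this example.

```python
import socket


def open_listener(ip, port):
    """Bind a listening TCP socket to (ip, port) with SO_REUSEADDR set."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allows binding while an earlier socket on the same address is
    # still in TIME_WAIT.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    try:
        sock.bind((ip, port))
        sock.listen()
    except OSError:
        sock.close()
        raise
    return sock
```

With a unique IP per module, a caller would pass the same fixed port every time, e.g. `open_listener(leased_ip, 12345)`, where `leased_ip` stands for whatever address the host registry handed out.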

Fixes: SCYLLADB-1718

Closes scylladb/scylladb#29656
2026-05-04 15:33:19 +02:00
Gleb Natapov
d2b695aa64 session, raft_topology: add periodic warnings for hung drain and stale version waits
Add periodic warning timers (every 5 minutes) to help diagnose hangs in
barrier_and_drain:

- drain_closing_sessions(): warn if semaphore acquisition or session gate
  close is taking too long, reporting the gate count to show how many
  guards are still alive.
- local_topology_barrier(): warn if stale_versions_in_use() is taking
  too long, reporting the current stale version trackers.
- session::gate_count(): new public accessor for diagnostic purposes.

These warnings help distinguish between the two possible hang points
in barrier_and_drain (stale versions vs session drain) and provide
ongoing visibility into what's blocking progress.
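
The general pattern of warning periodically while a wait is still pending can be sketched as below. This is a Python asyncio illustration under assumed names (`wait_with_warnings`, `warn`), not the Seastar timer code added by the patch.

```python
import asyncio


async def wait_with_warnings(awaitable, interval, warn):
    """Await `awaitable`, calling warn(elapsed) every `interval` seconds
    for as long as it is still pending."""
    task = asyncio.ensure_future(awaitable)
    elapsed = 0.0
    while True:
        # asyncio.wait() returns on timeout without cancelling the task.
        done, _ = await asyncio.wait({task}, timeout=interval)
        if done:
            return task.result()
        elapsed += interval
        warn(elapsed)
```

The wait itself is never interrupted; the warning only reports that it is still in progress, which is what makes a hang visible in the log.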
2026-05-04 15:58:45 +03:00
Gleb Natapov
385915c101 session: add info-level logging to drain_closing_sessions
drain_closing_sessions() is called as part of the barrier_and_drain
topology command and can block on two things: acquiring the drain
semaphore (if another drain is in progress) and waiting for individual
sessions to close (which blocks until all session guards are released).

Previously, all logging in this function was at debug level, making it
invisible in production logs. When barrier_and_drain hangs, there is no
way to tell whether the function is waiting for the semaphore, waiting
for a specific session to close, or was never called.

Promote logging to info level and add messages at each blocking point:
before/after semaphore acquisition (with count of sessions to drain),
before/after each individual session close (with session id), and at
function completion. This makes it possible to identify the exact
session blocking a topology operation from the node log alone.
2026-05-04 15:58:45 +03:00
Gleb Natapov
e88ce09372 raft_topology: log sub-step progress in local_topology_barrier
When a node processes a barrier_and_drain topology command, it performs
two potentially long-running operations inside local_topology_barrier():
waiting for stale token metadata versions to be released
(stale_versions_in_use) and draining closing sessions
(drain_closing_sessions). Either of these can hang indefinitely -- for
example, stale_versions_in_use blocks until all references to previous
token metadata versions are released, which depends on in-flight
requests completing.

Previously, the only logging was a single 'done' message at the end,
making it impossible to determine which sub-step was blocking when a
barrier_and_drain RPC appeared stuck on a node. In a recent CI failure,
a node never responded to barrier_and_drain during a removenode
operation, and the logs showed the RPC was received but nothing about
what it was waiting on internally.

Add info-level logging before each blocking sub-step, including the
topology version for correlation. This allows diagnosing hangs by
showing whether the node is stuck waiting for stale metadata versions,
stuck draining sessions, or never reached these steps at all.
2026-05-04 15:58:45 +03:00
Piotr Smaron
0a780d0ea1 test/cluster: avoid redundant perf alternator CQL wait
server_add() already waits for the requested server-up state. For the remote perf-alternator test, request SERVING from server_add() and use the returned server address directly instead of asking for running servers and then calling get_ready_cql() again.

This keeps the listener-readiness intent explicit while removing the redundant CQL readiness probe noted in review.
2026-05-04 14:09:28 +02:00
Piotr Smaron
c90012c22b test/cluster: wait for shard-aware CQL listener
server_add() defaults to CQL_ALTERNATOR_QUERIED. That proves the regular CQL driver path is queryable, along with any regular Alternator ports listed in the YAML config. It does not prove that every CQL listener configured for the process is already accepting raw TCP connections.

test_uninitialized_conns_semaphore opens a raw TCP connection to native_shard_aware_transport_port immediately after startup. The default readiness check can succeed through native_transport_port while the shard-aware listener is still being started, because CQL listeners are registered independently.

Wait for ServerUpState.SERVING before opening raw sockets. Scylla sends that notification only after protocol servers are registered, so this closes the startup window without adding sleeps or local retry loops.

Fixes: SCYLLADB-1797
2026-05-04 13:36:43 +02:00
Nadav Har'El
983eb5ab43 test/cluster/auth_cluster: use CREATE ROLE IF NOT EXISTS to fix flaky test
test_create_role_mixed_cluster calls servers_add(2) to bootstrap two old
nodes concurrently, then adds a new node before issuing CREATE ROLE.  The
concurrent bootstraps trigger the well-known Python driver bug
(scylladb/python-driver#317): two on_add notifications race in
update_created_pools, causing a second pool to be created for a host whose
pool was already established.  If CREATE ROLE is in-flight on the old pool
when it is closed, the driver retries on the new pool, executing the
statement twice.  The second execution fails with "Role ... already exists",
making the test flaky.

Fix by using CREATE ROLE IF NOT EXISTS.  This is safe because unique_name()
generates a timestamp+random suffix that is guaranteed to be unique; the
role can "already exist" only due to the driver double-execution bug, never
due to a real conflict.
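
A toy model of why IF NOT EXISTS makes the double execution harmless is sketched below. `RoleStore` and `execute_twice` are hypothetical and only mimic the described behaviour; they are not the driver or server code.

```python
class AlreadyExists(Exception):
    pass


class RoleStore:
    """Toy model of a server-side role table."""

    def __init__(self):
        self.roles = set()

    def create_role(self, name, if_not_exists=False):
        if name in self.roles:
            if if_not_exists:
                return  # idempotent: the second execution is a no-op
            raise AlreadyExists(f"Role {name} already exists")
        self.roles.add(name)


def execute_twice(statement):
    """Model the driver re-running an in-flight statement on a new pool."""
    statement()
    statement()
```

With `if_not_exists=True` the repeated call leaves the store unchanged, while the plain form raises on the second execution, which matches the flaky failure.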

This is the same workaround that has been applied many times elsewhere in
our test suite for exactly the same root cause:
- CREATE KEYSPACE was changed to CREATE KEYSPACE IF NOT EXISTS (scylladb#18368,
  later generalised in scylladb#22399 via new_test_keyspace helpers)
- DROP KEYSPACE was changed to DROP KEYSPACE IF EXISTS (scylladb#29487)

Fixes: SCYLLADB-1742

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#29732
2026-05-04 11:47:11 +02:00
Yaniv Michael Kaul
6179406467 raft/group0: fix destroy assertion on startup failure
If start_server_for_group0() successfully registers a server in
_raft_gr._servers but a subsequent step (e.g. enable_in_memory_state_machine())
throws, the server is never destroyed because abort_and_drain()/destroy()
check std::get_if<raft::group_id>(&_group0) which was only set after the
entire with_scheduling_group block completed.

Move _group0.emplace<raft::group_id>() inside the lambda, immediately after
start_server_for_group() succeeds, so that cleanup paths can always find
and destroy the registered server.

This fixes the assertion:
  "raft_group_registry - stop(): server for group ... is not destroyed"

which manifests during shutdown after an upgrade where topology_state_load()
fails due to netw::unknown_address.

Backport: Yes, to 2026.1, 2026.2, as it causes a crash on upgrades

Refs: SCYLLADB-1217
Refs: CUSTOMER-340
Refs: CUSTOMER-335
Fixes: SCYLLADB-1801
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
AI-assisted: Yes, Opencode/Opus 4.6

Closes scylladb/scylladb#29702
2026-05-04 11:25:46 +02:00
Piotr Smaron
689117f706 test/cluster: wait for proxy protocol ports to serve
server_add()'s default readiness only waits until CQL can be queried, but these tests immediately connect to custom proxy protocol listeners. Wait for SERVING so the shard-aware TLS proxy port is accepting connections before the test starts, matching the Alternator proxy protocol readiness fix.
2026-05-04 10:23:03 +02:00
Nadav Har'El
d33bb6ea00 Merge 'test: fix race window test flakiness from residual re-repair' from Avi Kivity
Fix the persistent flakiness in `test_incremental_repair_race_window_promotes_unrepaired_data` (SCYLLADB-1478, reopened).

After restarting servers[1], the topology coordinator can initiate a **residual re-repair** when it sees tablets stuck in the `repair` stage. This re-repair flushes memtables on all replicas and marks post-repair data as repaired, contaminating the test state and masking the compaction-merge bug the test is designed to detect. The assertion then fails on the *next* retry because the previous attempt's re-repair left behind repaired sstables containing post-repair keys.

1. **Propagating `current_key` through the exception** — correctly advanced the key counter on retry, but the contaminated tablet metadata from the prior re-repair (repaired sstables with post-repair keys) was still present, causing assertion failures on the next attempt.

2. **DROP TABLE + CREATE TABLE between retries** — the tablet metadata (sstables_repaired_at, repair stage) is tied to the tablet identity, and recreating the table in the same keyspace still showed residual state issues.

Instead of trying to clean up contaminated state, each retry creates a **completely fresh keyspace** (unique name via `create_new_test_keyspace`). This gives entirely new tablets with no residual repair metadata from prior attempts. Combined with broader detection of coordinator changes and residual re-repairs, the test reliably retries before any contamination can cause false failures.

The detection is now comprehensive:
- **Broadened coordinator check**: any coordinator change (`new_coord != coord`), not just migration to servers[1]
- **Re-repair detection** at three points: post-restart, during the compaction poll, and after injection release — grep for `"Initiating tablet repair host="` in the coordinator log

1. **`test: extract _setup_table_for_race_window helper`** — pure code-movement refactor that extracts keyspace+table+data+repair1+data+flush into a reusable helper. Easily verifiable as a no-op behavioral change.

2. **`test: fix race window test flakiness from residual re-repair`** — the actual fix: broadened detection logic + re-repair grep at 3 points + fresh-keyspace retry on exception.

Passed 1000 consecutive runs with the fix applied. Without the fix, about 2% flakiness was observed in debug mode.

Fixes: SCYLLADB-1478

So far, we haven't observed flakiness of this test on branches, so not backporting yet. Will backport if seen.

Closes scylladb/scylladb#29721

* github.com:scylladb/scylladb:
  test: fix race window test flakiness from residual re-repair
  test: extract _setup_table_for_race_window helper for race window test
2026-05-03 14:47:19 +03:00
Gleb Natapov
11b838e71e raft_topology: log read_barrier progress in topology cmd handler
When a raft topology command (e.g. barrier_and_drain) is received by a
node, the handler first performs a raft read_barrier to ensure it sees
the latest topology state. This read_barrier can hang indefinitely if
raft cannot achieve quorum, but there was no logging around it, making
it impossible to tell whether the handler was stuck at this step or
somewhere else.

Add info-level logging before and after the read_barrier call in
raft_topology_cmd_handler, including the command type, index, and term.
This allows diagnosing hangs by showing whether the node entered the
read_barrier and whether it completed, narrowing down the root cause
when a topology command RPC appears stuck on the receiver side.
2026-05-03 13:56:25 +03:00
Aleksandr Bykov
8afdae24d2 test: fix flaky test_kill_coordinator_during_op
The test hardcoded the expected number of coordinator elections
(2, 3, 4, 5) for each phase. If a prior phase triggered an extra
election, subsequent phases would wait for a count that was already
reached or would never match.

Fix by reading the current election count before each operation and
expecting exactly one more, making each phase independent of prior
history.
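
The baseline-relative check might look roughly like this. It is a simplified sketch: `FakeCluster` and `expect_one_more_election` are hypothetical, and the real test waits for the count to change rather than reading it once.

```python
class FakeCluster:
    """Toy cluster that counts coordinator elections."""

    def __init__(self, elections=0):
        self._elections = elections

    def count(self):
        return self._elections

    def kill_coordinator(self):
        self._elections += 1  # each kill triggers one new election


def expect_one_more_election(get_count, operation):
    """Run `operation` and report whether it caused exactly one election."""
    baseline = get_count()  # sampled now, instead of hardcoding 2, 3, 4, 5
    operation()
    return get_count() == baseline + 1
```

Because each phase samples its own baseline, an extra election in an earlier phase no longer shifts the expectation of later phases.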

Also add wait_for_no_pending_topology_transition() calls after each
coordinator election to ensure the topology state machine has fully
settled before proceeding with restarts and further operations.

Decrease the failure detector timeout (failure_detector_timeout_in_ms)
to 2000 ms on all test nodes so that coordinator crashes are detected
faster, reducing test wallclock time and timeout-related flakiness.

Enable raft_topology=trace logging on all test nodes to aid
post-failure diagnosis. Add diagnostic logging in
wait_new_coordinator_elected().

Fixes: SCYLLADB-1089

Closes scylladb/scylladb#29284
2026-04-30 21:27:56 +03:00
Avi Kivity
795478fa7a test: fix race window test flakiness from residual re-repair
The test_incremental_repair_race_window_promotes_unrepaired_data test
was still flaky because:

1. Only coordinator changes TO servers[1] were detected, but ANY
   coordinator change can trigger a residual re-repair that flushes
   memtables on all replicas and marks post-repair data as repaired.

2. Even without a coordinator change, the topology coordinator can
   initiate a residual re-repair when it sees tablets stuck in the
   repair stage after the servers[1] restart.  This re-repair
   contaminates the repaired set with post-repair data, masking the
   compaction-merge bug the test detects.

Fix by:
- Broadening the coordinator check from == servers[1] to != coord
- Adding re-repair detection (grep for 'Initiating tablet repair
  host=') at three points: post-restart, during the compaction poll,
  and after injection release
- On retry, creating a completely fresh keyspace+table via
  _setup_table_for_race_window() so the new attempt starts with
  clean tablet metadata uncontaminated by prior re-repairs

Fixes: SCYLLADB-1478
2026-04-30 18:40:18 +03:00
Avi Kivity
12d5e758ed test: extract _setup_table_for_race_window helper for race window test
Move the keyspace+table setup logic for
test_incremental_repair_race_window_promotes_unrepaired_data into a
dedicated helper function _setup_table_for_race_window().  The helper
creates a fresh keyspace (unique name via create_new_test_keyspace),
the table, configures STCS min_threshold=2, inserts baseline keys,
runs repair 1, inserts keys for repair 2, and flushes.

This is a pure refactor with no behavioral change: the test function
now calls the helper once instead of inlining the setup.  The
extraction enables a subsequent commit to call the helper again on
retry when a leadership transfer is detected.
2026-04-30 18:37:42 +03:00
Dario Mirovic
3875d79ac6 test: boost: regression test for loading_cache::insert with caching disabled
Add two test cases for the new loading_cache::insert() method:

 * test_loading_cache_insert verifies that insert() populates the cache
   and invokes the loader exactly once per key when caching is enabled.

 * test_loading_cache_insert_caching_disabled is a regression test for
   SCYLLADB-1699: when the cache is constructed with expiry == 0
   (caching disabled), insert() must be a no-op rather than asserting
   in loading_cache::get_ptr() via caching_enabled(). The loader must
   not be invoked and the cache must remain empty.

Refs SCYLLADB-1699
2026-04-30 16:52:51 +02:00
Dario Mirovic
918130befd utils: loading_cache: add insert() that is a no-op when caching is disabled
When the cache is constructed with expiry == 0 the underlying storage is
never instantiated and get_ptr() asserts via caching_enabled(). This is
fine for callers that need a handle into the cache, but it makes get_ptr()
unusable for write-only insertions on caches whose expiry is configurable
at runtime (e.g. caches driven by a LiveUpdate config option that the
operator may set to 0).

Add a new insert(k, load) method on loading_cache that returns a future<>
and is a no-op when caching is disabled, otherwise forwards to
get_ptr(k, load) and discards the resulting handle. This completes the
disabled-mode safety contract of the cache for the write side, mirroring
the fallback that get() already provides for the read side.
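
The contract described above can be modelled in a few lines of Python. This is a toy, synchronous analogue with assumed names and no futures; the real loading_cache is a C++ template.

```python
class LoadingCache:
    """Toy model: expiry == 0 means caching is disabled and no storage exists."""

    def __init__(self, expiry):
        self._storage = {} if expiry > 0 else None

    def caching_enabled(self):
        return self._storage is not None

    def get_ptr(self, key, load):
        # Mirrors the described contract: callers must not ask for a
        # handle into a disabled cache.
        assert self.caching_enabled()
        if key not in self._storage:
            self._storage[key] = load(key)
        return self._storage[key]

    def insert(self, key, load):
        if not self.caching_enabled():
            return  # no-op: the loader is not invoked
        self.get_ptr(key, load)  # the returned handle is discarded

    def size(self):
        return len(self._storage) if self.caching_enabled() else 0
```

Keeping the assertion in `get_ptr()` and putting the disabled-mode check in `insert()` preserves the strict contract for callers that need a handle while making write-only callers safe.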

Switch authorized_prepared_statements_cache::insert() from
get_ptr().discard_result() to the new insert(), which fixes the crash
'Assertion caching_enabled() failed' in
authorized_prepared_statements_cache::insert() that occurs when
permissions_validity_in_ms is set to 0 and a prepared statement is
executed under authentication.

Fixes SCYLLADB-1699
2026-04-30 16:51:23 +02:00
Marcin Maliszkiewicz
b08e0c67e4 test/boost: add dummy case to table_helper_test for non-injection modes
The only test requires SCYLLA_ENABLE_ERROR_INJECTION. In modes without it
(e.g. release) the suite was empty, so pytest exited with code 5
("no tests collected") and CI failed. Add a no-op case in that branch
so collection always yields at least one test.
2026-04-30 11:45:12 +02:00
Marcin Maliszkiewicz
515b5722fd test/boost: add regression test for table_helper insert() UAF
Deterministic reproducer using an error injection point placed in
table_helper::insert() between cache_table_info() and execute(). The
test parks fiber A at the injection, drops the target table (evicting
the prepared_statements_cache entry), runs fiber B which nulls
_insert_stmt, then releases fiber A. Without the fix this crashes in
execute(); with the fix fiber A holds a local strong ref and proceeds.

Uses the new waiters() API to synchronize with fiber A's entry into
the injection.
2026-04-30 11:45:12 +02:00
Marcin Maliszkiewicz
4d234aaaa5 utils/error_injection: add waiters() API
Returns the number of fibers currently suspended in wait_for_message()
for a named injection. Lets tests synchronize precisely with code parked
on an injection point.
2026-04-30 11:45:12 +02:00
Marcin Maliszkiewicz
aa18c3ed4a table_helper: fix use-after-free on prepared-statement invalidation
insert() held no local strong ref to the prepared modification_statement
across the suspension in execute(). On a single shard:

1. Fiber A suspends inside _insert_stmt->execute().
2. DROP TABLE / DROP KEYSPACE on the target, or LRU eviction, removes
   the prepared_statements_cache entry, releasing its strong ref.
3. Fiber B re-enters cache_table_info(), sees _prepared_stmt
   (checked_weak_ptr) invalidated, and runs _insert_stmt = nullptr,
   releasing the last strong ref. The modification_statement is freed.
4. Fiber A resumes inside execute() and touches freed *this.

Pin strong ref to _insert_stmt locally before the suspension.
2026-04-30 11:45:12 +02:00
Ernest Zaslavsky
1febfbd9b5 test: rename sstable_tablet_streaming.cc to match the naming convention
Apparently, a boost test's file name MUST end with "_test" to be executed by test.py.

Closes scylladb/scylladb#29693
2026-04-30 11:16:39 +03:00
Pavel Emelyanov
1ca97f0c0a Merge 'test: fix disabled test handling and deduplicate CLI test arguments' from Evgeniy Naydanov
- Revert the previous "test.py: fix test collection bug" commit (92c09d10) which worked around broken deduplication by filtering items without `BUILD_MODE` in `pytest_collection_modifyitems`. This approach masked the root cause and is superseded by the proper fixes below.
- Backport pytest 9.0.3's argument normalization algorithm into `test.py` to work around broken deduplication in pytest 8.3.5 ([pytest-dev/pytest#12083](https://github.com/pytest-dev/pytest/issues/12083)). Duplicate or subsumed test paths (e.g. `test/cql` and `test/cql/lua_test.cql`) are collapsed before invoking pytest. Revert when upgrading to pytest 9.x.
- Return a `DisabledFile` collector instead of an empty list in `pytest_collect_file` when all modes are disabled for a file, fixing a bug where subsequent files would not get their stash items set (`REPEATING_FILES`). Restructure `pytest_collect_file` to use a walrus operator (`if repeats := ...`) with a single `remove(file_path)` and `return collectors` at the end, eliminating the early return.
- Add `--keep-duplicates` CLI argument to bypass deduplication and forward to pytest.
- Move `RUN_ID` assignment from `pytest_collect_file` to `modify_pytest_item`. A shared `run_ids` cache (`defaultdict[tuple[str, str], count]`) is created in `pytest_collection_modifyitems` and passed to `modify_pytest_item`, keyed by `(build_mode, nodeid)` so each mode gets independent counters. This ensures unique run IDs even when `--keep-duplicates` causes the same file to be collected multiple times.
- Fix `--repeat` option default from string `"1"` to int `1` — argparse only applies `type=` to CLI-parsed values, not defaults.

pytest normally deduplicates overlapping test arguments — e.g. `test/cql test/cql/lua_test.cql` collects `lua_test.cql` only once. The original `test.py` never performed this deduplication, and the pytest version in the toolchain image (8.3.5) has a bug that breaks it ([pytest-dev/pytest#12083](https://github.com/pytest-dev/pytest/issues/12083)).

Since we are moving to bare pytest, `test.py` should match pytest's default behavior: deduplicate. Because we cannot easily upgrade pytest, commit 2 backports the deduplication logic from pytest 9.0.3.
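
The idea of collapsing duplicate or subsumed paths can be sketched as follows. This is a simplified illustration, not the backported pytest algorithm: it handles plain paths only (no `::` node IDs), and `deduplicate_test_args` is a name chosen for this example.

```python
from pathlib import PurePosixPath


def deduplicate_test_args(args):
    """Drop arguments that repeat, or are nested under, another argument."""
    paths = [PurePosixPath(a) for a in args]
    kept = []
    for i, path in enumerate(paths):
        subsumed = any(
            # An earlier identical argument, or any other argument that
            # is a parent directory of this one.
            (other == path and j < i)
            or (other != path and other in path.parents)
            for j, other in enumerate(paths)
            if j != i
        )
        if not subsumed:
            kept.append(args[i])
    return kept
```

For the example above, `deduplicate_test_args(["test/cql", "test/cql/lua_test.cql"])` keeps only `test/cql`, regardless of the order in which the two arguments were given.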

To match pytest's interface, `--keep-duplicates` is added as an opt-out. This lets a user intentionally run overlapping paths — e.g. `./test.py test/blah test/blah/test_foo.py --keep-duplicates` runs `test_foo.py` twice. The flag is forwarded to pytest and also skips the backported deduplication in `test.py`.

- Revert 92c09d10 which filtered items without `BUILD_MODE` in `pytest_collection_modifyitems` and added an early return in `CppFile.collect()`. This workaround is superseded by the proper deduplication and `DisabledFile` fixes.

- Add `_CollectionArgument` dataclass (`order=True`, `__contains__` for subsumption) and `_deduplicate_test_args()` function, adapted from pytest 9.0.3. Marked with a TODO to remove once we update to pytest 9.x.
- Call `_deduplicate_test_args()` on `options.name` before passing to pytest.

- Add `DisabledFile(pytest.File)` that skips collection with an informative message instead of returning an empty list.
- Restructure `pytest_collect_file` to use walrus operator: `if repeats := ...:` / `else:` — single `remove(file_path)` at end, no early return.

- Add `--keep-duplicates` argument that skips deduplication and is forwarded to pytest.
- Create a shared `run_ids` cache in `pytest_collection_modifyitems` and pass it to `modify_pytest_item`, which assigns unique sequential RUN_IDs via `itertools.count`. The cache is keyed by `(build_mode, nodeid)` so each mode gets independent counters.
- Remove `RUN_ID` from `_STASH_KEYS_TO_COPY` — it is no longer set on collectors.
- Remove `CppFile.run_id` cached_property. `CppTestCase` now reads `RUN_ID` from its own item stash.
- Fix `--repeat` option default from `"1"` to `1` and drop redundant `int()` cast.
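
The per-mode RUN_ID counters described above could look roughly like the following. This is a minimal sketch, not the code in `test/pylib`: the helper name `make_run_id_allocator` is hypothetical, and starting the counters at 1 is an assumption.

```python
import itertools
from collections import defaultdict


def make_run_id_allocator():
    """Return a function that hands out sequential IDs per (build_mode, nodeid)."""
    # One itertools.count per key, created lazily on first use.
    # Starting at 1 is an assumption made for this sketch.
    counters = defaultdict(lambda: itertools.count(1))

    def next_run_id(build_mode: str, nodeid: str) -> int:
        return next(counters[(build_mode, nodeid)])

    return next_run_id
```

Because the key includes the build mode, the same test collected twice in one mode gets two distinct IDs, while another mode starts counting independently.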

Closes SCYLLADB-1730

Closes scylladb/scylladb#29665

* github.com:scylladb/scylladb:
  test: add --keep-duplicates and assign RUN_ID via shared cache
  test/pylib/runner: fix disabled file collection
  test.py: deduplicate CLI test arguments before passing to pytest
  Revert "test.py: fix test collection bug"
2026-04-30 07:58:25 +03:00
Yaniv Michael Kaul
93722f2c89 gms/gossiper: fix use-after-move in do_send_ack2_msg
The second logger.debug() call accesses ack2_msg after it was moved
via std::move() in the co_await send_gossip_digest_ack2 call.
This is undefined behavior.

Fix by formatting ack2_msg to a string before the move, then using
that cached string in both debug log calls.

FIXES: https://scylladb.atlassian.net/browse/SCYLLADB-1778

Closes scylladb/scylladb#29227
2026-04-30 07:07:39 +03:00
Wojciech Mitros
ebaf536449 replica/database: fix cross-shard deadlock in lock_tables_metadata()
lock_tables_metadata() acquires a write lock on tables_metadata._cf_lock
on every shard.  It used invoke_on_all(), which dispatches lock
acquisitions to all shards in parallel via parallel_for_each +
smp::submit_to.

When two fibers call lock_tables_metadata() concurrently, this can
deadlock.  parallel_for_each starts all iterations unconditionally:
even when the local shard's lock attempt blocks (because the other
fiber already holds it), SMP messages are still sent to remote shards.
Both fibers' lock-acquisition messages land in the per-shard SMP
queues.  The SMP queue itself is FIFO, but process_incoming() drains
it and schedules each item as a reactor task via add_task(), which —
in debug and sanitize builds with SEASTAR_SHUFFLE_TASK_QUEUE — shuffles
each newly added task against all pending tasks in the same scheduling
group's reactor task queue.  This means fiber A's lock acquisition can
be reordered past fiber B's (and past unrelated tasks) on a given shard.
If fiber A wins the lock on shard X while fiber B wins on shard Y, this
creates a classic cross-shard lock-ordering deadlock (circular wait).

In production builds without SEASTAR_SHUFFLE_TASK_QUEUE, the reactor
task queue is FIFO. Even so, the SMP queues can still reorder messages
in release builds, so the deadlock remains possible there, only much
less likely. In debug and sanitize builds, the task-queue shuffle
makes the deadlock very likely whenever both fibers' lock-acquisition
tasks are pending simultaneously in the reactor task queue on any shard.

This deadlock was exposed by ce00d61917 ("db: implement large_data
virtual tables with feature flag gating", merged as 88a8324e68),
which introduced legacy_drop_table_on_all_shards as a second caller
of lock_tables_metadata().  When LARGE_DATA_VIRTUAL_TABLES is enabled
during topology_state_load (via feature_service::enable), two fibers
can race:

  1. activate_large_data_virtual_tables() — calls
     legacy_drop_table_on_all_shards() which calls
     lock_tables_metadata() synchronously via .get()

  2. reload_schema_in_bg() — fires as a background fiber from
     TABLE_DIGEST_INSENSITIVE_TO_EXPIRY, eventually reaches
     schema_applier::commit() which also calls lock_tables_metadata()

If both reach lock_tables_metadata() while the lock is free on all
shards, the parallel acquisition creates the deadlock opportunity.
The deadlock blocks topology_state_load() from completing, which
prevents the bootstrapping node from finishing its topology state
transitions.  The topology coordinator then waits for the node to
reach the expected state, but the node is stuck, so eventually the
read_barrier times out after 300 seconds.

Fix by acquiring the shard 0 lock first before attempting to
acquire any other lock. Whichever fiber wins shard 0 is
guaranteed to acquire all remaining shards before the other fiber
can proceed past shard 0, eliminating the circular-wait condition.
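
The ordering rule can be sketched with asyncio locks standing in for
the per-shard locks. This is only an analogy in Python, not the
Seastar code; the names lock_all_shards, unlock_all_shards and
run_two_callers are made up for the illustration.

```python
import asyncio


async def lock_all_shards(locks: list[asyncio.Lock]) -> None:
    """Acquire every per-shard lock, taking shard 0 before any other."""
    # Shard 0 acts as a gate: only one caller at a time gets past this line.
    await locks[0].acquire()
    # The rest can be taken concurrently, because any competing caller is
    # still waiting on shard 0 and therefore holds none of them.
    await asyncio.gather(*(lock.acquire() for lock in locks[1:]))


def unlock_all_shards(locks: list[asyncio.Lock]) -> None:
    for lock in reversed(locks):
        lock.release()


async def run_two_callers(shards: int = 4) -> list[str]:
    locks = [asyncio.Lock() for _ in range(shards)]
    finished: list[str] = []

    async def caller(name: str) -> None:
        await lock_all_shards(locks)
        try:
            await asyncio.sleep(0)  # yield while holding every lock
            finished.append(name)
        finally:
            unlock_all_shards(locks)

    # The timeout turns a would-be deadlock into an exception.
    await asyncio.wait_for(asyncio.gather(caller("A"), caller("B")), timeout=5)
    return finished
```

Running asyncio.run(run_two_callers()) completes with both callers
finished, since neither can hold a later shard while the other holds
shard 0.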

Tested manually with 2 approaches:
1. causing different shard locks to be acquired by different
lock_tables_metadata() calls by adding different sleeps depending
on the lock_tables_metadata() call and target shard - this reproduced
the issue consistently
2. matching the time point at which both fibers reach lock_tables_metadata()
by adding a single sleep to one of the fibers - this heavily depends on
the machine so we can't create a universal reproducer this way, but
it did result in the observed failure on my machine after finding the
right sleep time

Also added a unit test for concurrent lock_tables_metadata() calls.

Fixes: SCYLLADB-1694
Fixes: SCYLLADB-1644
Fixes: SCYLLADB-1684

Closes scylladb/scylladb#29678
2026-04-29 21:13:53 +02:00
Patryk Jędrzejczak
15f35577ed Merge 'paxos_state: keep prepared message alive across statement execution' from Petr Gusev
In do_execute_cql_with_timeout(), when the prepared statement was not found in the cache, we called qp.prepare() and stored the returned result_message::prepared in a local variable scoped to the 'if' block. We then extracted ps_ptr (a checked_weak_ptr to the prepared statement) from the message, let the message go out of scope at the end of the 'if', and used ps_ptr after a co_await on st->execute().

Since 3ac4e258e8 ("transport/messages: hold pinned prepared entry in PREPARE result"), result_message::prepared owns a strong pinned reference to the prepared cache entry. While qp.prepare() runs it also holds its own pin on the entry, so on return the entry has at least the pin owned by the returned message. As long as that message is alive, the cache entry cannot be purged and the weak handle inside ps_ptr remains promotable.

The lifetime gap manifested only in debug builds. qp.prepare() returns a ready future on the cache-miss path, so in release builds the co_await resumes synchronously: control flows from the assignment of ps_ptr straight into st->execute() with no opportunity for any other task (in particular, prepared cache invalidation triggered by a concurrent schema change) to run in between. Debug builds, however, force a reactor preemption point on every co_await even when the awaited future is ready. With prepared_msg already destroyed at the end of the 'if' block, the only remaining handle on the cache entry was the weak ps_ptr, and the preemption gave a concurrent cache purge (triggered, for example, by Raft schema changes received during a node restart) the chance to drop the entry. The subsequent execute() then failed when promoting the weak pointer with checked_ptr_is_null_exception.

The exception propagated out of the Paxos prepare path as a generic std::exception with no type information in the log, surfacing on the coordinator as:

  WriteFailure: Failed to prepare ballot ... Replica errors:
  host_id ... -> seastar::rpc::remote_verb_error (std::exception)

Hoist the result_message::prepared into the outer scope so the pinned cache entry stays alive across co_await st->execute(...), closing the window in which a concurrent cache purge could invalidate the weak handle.
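
The difference between holding the pin and holding only the weak handle can be illustrated with Python's `weakref` module. This is a loose analogy, not Scylla code: `PreparedEntry` and `purge_while_pinned` are hypothetical names, and a plain dict stands in for the prepared statements cache.

```python
import gc
import weakref


class PreparedEntry:
    """Stand-in for a prepared-statement cache entry."""

    def __init__(self, query: str) -> None:
        self.query = query


def purge_while_pinned(keep_pin: bool) -> bool:
    """Return True if the weak handle is still usable after a cache purge."""
    cache = {"q": PreparedEntry("SELECT 1")}
    pin = cache["q"]            # strong reference, playing the role of the message
    handle = weakref.ref(pin)   # weak handle, playing the role of ps_ptr
    if not keep_pin:
        del pin                 # the message went out of scope too early
    cache.clear()               # a concurrent purge drops the cache's reference
    gc.collect()
    return handle() is not None
```

`purge_while_pinned(True)` corresponds to the hoisted message: the entry survives the purge. `purge_while_pinned(False)` corresponds to the old scoping, where the weak handle can no longer be promoted.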

Fixes SCYLLADB-1173

backport: the patch is simple, we can backport it to all versions with "LWT over tablets" feature. Note that the problem is only in test runs in debug configuration, production is not affected.

Closes scylladb/scylladb#29675

* https://github.com/scylladb/scylladb:
  table_helper: retry insert prepare on concurrent cache invalidation
  paxos_state: keep prepared message alive across statement execution
2026-04-29 17:57:27 +02:00
Yaron Kaikov
d310e4b27d scylla-gdb: fix compaction-tasks command for intrusive list
Since commit e942c074f2 changed _tasks from std::list<shared_ptr<...>>
to a boost::intrusive_list, iterating yields raw compaction_task_executor
objects rather than shared_ptr wrappers. The GDB script was updated to
use intrusive_list() but still wrapped elements in seastar_shared_ptr(),
causing 'gdb.error: There is no member or method named _p' when
compaction tasks are active.

Move the seastar_shared_ptr unwrapping to the 6.2 compatibility
fallback path only, since the intrusive list path yields objects
directly.

Fixes: SCYLLADB-1762

Closes scylladb/scylladb#29690
2026-04-29 13:11:13 +03:00
Marcin Maliszkiewicz
45b4834ac4 Merge 'audit: fix maintenance socket startup/shutdown ordering' from Andrzej Jackowski
This series addresses three problems in the audit startup/shutdown
sequence:
1. [BUG] Shutdown SIGABRT. During graceful shutdown, deferred stops run in reverse order of construction. With the audit service constructed after the maintenance socket, audit was destroyed first, and in-flight queries on the maintenance socket could hit the destroyed audit service (assertion failure in sharded::local()).
2. [BUG] Startup audit bypass. The maintenance socket opened before audit storage was initialized, allowing queries (e.g. creating a superuser) to bypass auditing in that window.
3. [PROBLEM] Blocks SCYLLADB-1430. The existing order prevents audit configuration from being driven by group0 state, because audit started before group0.

The series is organized as: a test-helper refactor, a test for the audited maintenance-socket flow, a startup-phase split, the construction-order fix and its shutdown-race test, and finally the storage-before-socket fix and its startup-window test.

Fixes SCYLLADB-1615

No backport, bugs don't seem severe enough to justify backporting.

Closes scylladb/scylladb#29539

* github.com:scylladb/scylladb:
  audit: assert storage ordering invariants at runtime
  audit: start maintenance socket after audit storage
  audit: move audit construction before maintenance socket
  audit: split startup into construction and storage phases
  test: audit: verify maintenance socket operations are audited
  test: audit: parameterize source address in audit assertions
2026-04-29 10:37:38 +02:00
Łukasz Paszkowski
7e14ea5ac8 sstables: only wipe TemporaryHashes for sstable formats that have it
Commit 8d34127684 ("sstables: clean up TemporaryHashes file in wipe()")
unconditionally calls filename(..., component_type::TemporaryHashes)
inside filesystem_storage::wipe(). However, the TemporaryHashes
component is only registered in the component map of the 'ms' sstable
format. For older formats (ka, la, mc, md, me) the lookup goes through
sstable_version_constants::get_component_map(version).at(...) and throws
std::out_of_range.

The exception is then swallowed by the outer catch(...) in wipe(), which
just logs and ignores. As a side effect, the subsequent
remove_file(new_toc_name) is never reached and the TemporaryTOC
('*-TOC.txt.tmp') file is left as an orphan on disk after every unlink()
of a non-'ms' sstable.

Guard the lookup with get_component_map(version).contains() so the
cleanup is only attempted for formats that actually define the
component.
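
The failure mode and the guard can be sketched in Python as follows.
This is an illustration only: COMPONENT_MAPS, files_to_remove and the
TemporaryHashes file name are invented for the example, and the broad
except stands in for the catch(...) in wipe().

```python
# Hypothetical per-format component maps; only 'ms' defines TemporaryHashes.
COMPONENT_MAPS = {
    "me": {"TemporaryTOC": "TOC.txt.tmp"},
    "ms": {"TemporaryTOC": "TOC.txt.tmp", "TemporaryHashes": "Hashes.tmp"},
}


def files_to_remove(version: str, guard: bool) -> list[str]:
    """Return the temporary files a wipe would get around to removing."""
    components = COMPONENT_MAPS[version]
    removed: list[str] = []
    try:
        if not guard or "TemporaryHashes" in components:
            # Unguarded, this lookup raises KeyError for older formats.
            removed.append(components["TemporaryHashes"])
        removed.append(components["TemporaryTOC"])
    except Exception:
        # Mirrors a catch-all that logs and ignores: the TOC cleanup is skipped.
        pass
    return removed
```

Without the guard, an 'me' sstable removes nothing and the temporary
TOC is orphaned; with the guard, the TOC cleanup is reached.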

Add a regression test in test/boost/sstable_directory_test.cc that
creates an 'me'-format sstable, unlinks it and asserts that the sstable
directory is left empty. Without the fix the test fails with a leftover
'me-...-TOC.txt.tmp' file.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1697

Closes scylladb/scylladb#29620
2026-04-29 08:06:36 +03:00
Botond Dénes
809f12f988 Merge 'test/cluster/dtest: fix ScyllaNode state not persisting across nodelist() calls' from Benny Halevy
`ScyllaCluster.nodelist()` creates new `ScyllaNode` objects on every call,
so per-node state set via `set_smp()`, `set_log_level()`, and
`_adjust_smp_and_memory()` was lost. This meant `set_smp()` had no effect
when `cluster.start()` was called after it, since `start_nodes()` calls
`nodelist()` internally which creates fresh nodes with default values.

- Add debug logging for smp/memory in ScyllaNode
- Store per-node settings (smp, memory, log levels) in a
  `ScyllaCluster._node_resources` dict keyed by server_id, so they survive
  `nodelist()` reconstruction. `ScyllaNode` restores its state from this dict
  on construction and saves it back whenever `set_smp()`, `set_log_level()`,
  or `_adjust_smp_and_memory()` modifies it.
- Add a reproducer test verifying `set_smp()` takes effect on restart
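
A minimal sketch of the idea, assuming simplified stand-ins for the real classes: `Cluster` and `Node` below are hypothetical and model only the `smp` setting, with `node_resources` playing the role of `_node_resources`.

```python
class Cluster:
    def __init__(self, server_ids):
        self.server_ids = list(server_ids)
        # Per-node settings keyed by server_id; survives nodelist() calls.
        self.node_resources: dict[int, dict] = {}

    def nodelist(self):
        # Fresh Node objects on every call, as in the behaviour described above.
        return [Node(self, server_id) for server_id in self.server_ids]


class Node:
    DEFAULT_SMP = 2  # arbitrary default chosen for the sketch

    def __init__(self, cluster, server_id):
        self.cluster = cluster
        self.server_id = server_id
        # Restore previously saved state, falling back to the default.
        saved = cluster.node_resources.get(server_id, {})
        self.smp = saved.get("smp", self.DEFAULT_SMP)

    def set_smp(self, smp):
        self.smp = smp
        # Save back so the next reconstructed Node sees the same value.
        self.cluster.node_resources.setdefault(self.server_id, {})["smp"] = smp
```

Calling `set_smp(4)` on one node object and then asking `nodelist()` again yields a new object that still reports `smp == 4`.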

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1629

--

No backport needed: this only fixes dtest infrastructure, no production code
is affected.

Closes scylladb/scylladb#29549

* github.com:scylladb/scylladb:
  test/cluster/dtest: add test for node.set_smp() persistence
  test/cluster/dtest: cache ScyllaNode instances in ScyllaCluster
  test/cluster/dtest/ccmlib/scylla_node: add debug logging
2026-04-29 06:25:36 +03:00
Evgeniy Naydanov
96d3f13245 test: add --keep-duplicates and assign RUN_ID via shared cache
Add --keep-duplicates CLI argument to bypass deduplication and forward
to pytest, allowing duplicate test file arguments to be collected
multiple times.

Move RUN_ID assignment from pytest_collect_file to modify_pytest_item.
All File collectors for the same source file share a single run_ids
dict (via RUN_ID_CACHE stash key), so items from duplicate collection
arguments (e.g. with --keep-duplicates) automatically get unique IDs.

Remove CppFile.run_id cached_property — CppTestCase now reads RUN_ID
from its own item stash, which is set during modify_pytest_item.

Fix --repeat option default from string "1" to int 1 — argparse only
applies type= to CLI-parsed values, not defaults.

Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com>
2026-04-29 02:36:05 +00:00
Evgeniy Naydanov
497bd6b6c9 test/pylib/runner: fix disabled file collection
Return a DisabledFile collector instead of an empty list when all modes
are disabled for a file.  Returning an empty list caused subsequent
files to not get their stash items set because file_path was never
removed from REPEATING_FILES.

Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com>
2026-04-29 02:36:05 +00:00
Evgeniy Naydanov
43f06ed19d test.py: deduplicate CLI test arguments before passing to pytest
Backport the argument normalization algorithm from pytest 9.0.3 to
work around broken deduplication in pytest 8.3.5
(https://github.com/pytest-dev/pytest/issues/12083).

Duplicate or subsumed test paths (e.g. 'test/cql' and
'test/cql/lua_test.cql') are now collapsed before invoking pytest.

Revert this commit when upgrading to pytest 9.x.

Co-Authored-By: Claude Opus 4.6 (200K context) <noreply@anthropic.com>
2026-04-29 02:36:05 +00:00
Evgeniy Naydanov
05f2c53931 Revert "test.py: fix test collection bug"
This reverts commit 92c09d106d.
2026-04-29 02:35:00 +00:00
Andrzej Jackowski
3755c370ac audit: assert storage ordering invariants at runtime
Abort if audit storage fails to start rather than silently
running with an unaudited maintenance socket. Also assert
that storage is already stopped when the audit service is
destroyed, documenting the defer-stack ordering requirement.

Refs SCYLLADB-1615
Refs SCYLLADB-1695
2026-04-28 18:58:49 +02:00
Andrzej Jackowski
543fb6a2db audit: start maintenance socket after audit storage
Without this, there is a window after startup where queries on
the maintenance socket bypass auditing because audit storage
is not yet initialized.

Fixes SCYLLADB-1615
2026-04-28 18:58:49 +02:00
Andrzej Jackowski
b7bc2d89e6 audit: move audit construction before maintenance socket
During graceful shutdown, deferred stops run in reverse order of
construction.  When the audit service was constructed after the
maintenance socket, audit was destroyed first.  A DML query
still in-flight on the maintenance socket could then bypass
auditing entirely.

Move construction as early as possible so the audit service
outlives the maintenance socket on the defer stack, and to
maximise the window in which attempts to use audit before
storage is ready are caught with on_internal_error_noexcept.
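
The reverse-order rule behaves like Python's contextlib.ExitStack,
which also unwinds in LIFO order. The sketch below is an analogy
rather than the main.cc defer stack, and shutdown_order is a
hypothetical helper name.

```python
from contextlib import ExitStack


def shutdown_order(construction_order: list[str]) -> list[str]:
    """Return the order in which services registered on a defer stack stop."""
    stopped: list[str] = []
    with ExitStack() as stack:
        for name in construction_order:
            # Each service registers its stop action when it is constructed.
            stack.callback(stopped.append, name)
    # Leaving the with-block runs the callbacks last-in, first-out.
    return stopped
```

Constructing audit first therefore means it is stopped last, after
the maintenance socket.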

Refs SCYLLADB-1615
2026-04-28 18:58:49 +02:00
Andrzej Jackowski
bc67dd0b82 audit: split startup into construction and storage phases
The table-based audit backend needs Raft to create its keyspace,
but the audit service must exist earlier so that CQL paths don't
silently skip auditing.

Split startup into two phases: construction and storage
initialization.  Queries arriving between the two phases are
logged as errors.

This is a refactoring commit and the split sections will be
moved later in this patch series.

Refs SCYLLADB-1615
2026-04-28 18:58:42 +02:00
Andrzej Jackowski
1616c71bf0 test: audit: verify maintenance socket operations are audited
User creation via the maintenance socket should produce audit
entries, as this is the recommended flow for creating the
initial superuser when default credentials are disabled.

The test is parametrized by audit backend (table and syslog).
The maintenance socket source address is "::" because Seastar
returns a zero-initialised in6_addr for AF_UNIX sockets.

Test time in dev: 0.6s

Refs SCYLLADB-1615
2026-04-28 18:42:39 +02:00
Avi Kivity
c4de2b3c9d Merge 'test: fix flaky tablets test by using read barrier' from Aleksandra Martyniuk
Some tests in test_tablets.py read system_schema.keyspaces from an arbitrary node that may not have applied the latest schema change yet. Pin the read to a specific node and issue a read barrier before querying, ensuring the node has up-to-date data.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1700

Test fix; no backport

Closes scylladb/scylladb#29655

* github.com:scylladb/scylladb:
  test: fix flaky rack list conversion tests by using read barrier
  test: fix flaky test_enforce_rack_list_option by using read barrier
2026-04-28 17:15:59 +03:00
Petr Gusev
e6137ab11b table_helper: retry insert prepare on concurrent cache invalidation
table_helper::insert() retrieves the prepared statement via
cache_table_info() and then dereferences _prepared_stmt to read
bound_names. _prepared_stmt is a checked_weak_ptr into the prepared
statements cache and can be invalidated at any time by a concurrent
purge (for example, on a schema change).

cache_table_info() (re-)prepares the statement and assigns
_prepared_stmt before returning, and the strong pin held by the
result_message::prepared returned from qp.prepare() keeps the cache
entry alive only for the duration of try_prepare(). After try_prepare()
returns, the pin is gone and _prepared_stmt is the only remaining
handle on the entry.

In release builds this is fine: the chain of ready-future co_awaits
between try_prepare() finishing and _prepared_stmt->bound_names being
read resumes synchronously, so no other task -- in particular, no
cache purge -- can run in that window. In debug builds, however,
Seastar inserts a reactor preemption point on every co_await even when
the awaited future is ready. That preemption window is wide enough for
a concurrent invalidation to drop the freshly installed cache entry,
turning _prepared_stmt into a null weak handle and crashing the
subsequent dereference with checked_ptr_is_null_exception.

Wrap the cache_table_info() call in a loop that re-attempts the
preparation until a synchronous post-resume check finds _prepared_stmt
still valid. The check runs in the same task immediately after the
co_await resumes, with no co_await between the check and the
dereference, so a purge cannot slip in. _insert_stmt is a strong
shared_ptr to the statement object and is not affected by cache
invalidation, so it remains safe to use across the final co_await on
execute().

The other caller of cache_table_info(),
trace_keyspace_helper::apply_events_mutation(), accesses only the
strong _insert_stmt via insert_stmt() and never dereferences the weak
_prepared_stmt, so it is unaffected.

Refs SCYLLADB-1173
2026-04-28 16:03:06 +02:00
Ernest Zaslavsky
a97502920b test: optimize compaction_strategy_cleanup_method for remote storage
Parallelize SSTable creation using parallel_for_each. The file
count is made a parameter with a default of 64, allowing future
S3/GCS variants to use a smaller count if needed.
2026-04-28 16:59:38 +03:00
Ernest Zaslavsky
0b9a2844bd test: optimize stcs_reshape_overlapping for remote storage
Parallelize SSTable creation using parallel_for_each and reduce
the SSTable count from 256 to 64 for S3/GCS variants. The local
test variant retains the original 256 count.
2026-04-28 16:59:38 +03:00
Ernest Zaslavsky
ac89cffc9f test: optimize twcs_reshape_with_disjoint_set for remote storage
Parallelize SSTable creation across all sub-tests using
parallel_for_each and reduce the SSTable count from 256 to 64 for
S3/GCS variants.
Re-enable the S3 test variant that was previously disabled due to
taking 4+ minutes. With parallel creation and reduced count, the
test now completes in a reasonable time.
2026-04-28 16:59:37 +03:00
Ernest Zaslavsky
01b4292f87 test: parallelize SSTable creation in cleanup_during_offstrategy_incremental
Pre-extract mutation pairs and use parallel_for_each with
make_sstable_containing_async to create SSTables concurrently
instead of sequentially. The post-creation loop still runs serially
to collect token ranges and generations.
2026-04-28 16:59:37 +03:00
Ernest Zaslavsky
923ff9abc9 test: parallelize SSTable creation in run_incremental_compaction_test
Pre-extract mutation pairs and use parallel_for_each with
make_sstable_containing_async to create SSTables concurrently
instead of sequentially. The post-creation loop still runs serially
to collect token ranges and generations that depend on SSTable order.
2026-04-28 16:59:37 +03:00
Ernest Zaslavsky
6a25f52473 test: parallelize SSTable creation in offstrategy_sstable_compaction
Use parallel_for_each with make_sstable_containing_async to create
SSTables concurrently instead of sequentially, reducing wall-clock
time on remote storage backends (S3/GCS).
2026-04-28 16:59:37 +03:00
Ernest Zaslavsky
baca685629 test: parallelize SSTable creation in twcs_partition_estimate
Use parallel_for_each with make_sstable_containing_async to create
SSTables concurrently instead of sequentially, reducing wall-clock
time on remote storage backends (S3/GCS).
2026-04-28 16:59:37 +03:00
Ernest Zaslavsky
716202b839 test: add trace-level logging for S3 and HTTP in compaction tests
Raise log levels for s3 and gcp_storage from debug to trace, and add
trace-level logging for http and default_http_retry_strategy modules.
This provides better visibility into storage backend interactions
when debugging slow or failing compaction tests on remote storage.
2026-04-28 16:59:37 +03:00
Ernest Zaslavsky
a4ebe16517 test: make sstable test utilities natively async
The original make_memtable used seastar::thread::yield() for
preemption, which required all callers to run inside a
seastar::thread context. This prevented the utilities from being
used directly in coroutines or parallel_for_each lambdas.
Make the primary functions — make_memtable, make_sstable_containing,
and verify_mutation — return future<> directly. Callers now .get()
explicitly when in seastar::thread context, or co_await when in
a coroutine.
make_memtable now uses coroutine::maybe_yield() instead of
seastar::thread::yield(). verify_mutation is converted to
coroutines as well.
Requested in:
https://github.com/scylladb/scylladb/pull/29416#pullrequestreview-4112296282
2026-04-28 16:59:37 +03:00
Ernest Zaslavsky
4b637226a7 test: move make_memtable out of external_updater in row_cache_test
test_exception_safety_of_update_from_memtable called make_memtable
inside the row_cache::external_updater callback. external_updater
runs as a synchronous execute() call that must not yield, but
make_memtable calls seastar::thread::yield() every 10th mutation.

The bug was latent because the test only inserted 5 mutations, so
the yield was never reached. Move the call before the callback.

Prerequisite for the next patch, which changes make_memtable to
call make_memtable_async().get() -- that would yield on every
mutation via coroutine::maybe_yield(), making this bug visible.
2026-04-28 16:59:37 +03:00
Ernest Zaslavsky
7c09f35ddf test: increase S3 max connections for compaction tests
Increase max_connections from the default to 32 for the S3 endpoint
used in tests. This allows more concurrent HTTP connections to the S3
backend, which is needed to benefit from parallel SSTable creation
that will be introduced in subsequent commits.
2026-04-28 16:59:37 +03:00
Patryk Jędrzejczak
d9dd3bfe53 Merge 'topology_coordinator: join tablet load stats refresh in stop()' from Andrzej Jackowski
Commit 2b7aa32 (topology_coordinator: Refresh load stats after
table is created or altered) registered topology_coordinator as a
schema change listener and added on_create_column_family which
fire-and-forgets _tablet_load_stats_refresh.trigger(). The
triggered task runs on the gossip scheduling group via
with_scheduling_group and accesses the topology_coordinator via
'this'.

stop() unregisters the listener but does not wait for any
in-flight refresh task. If a notification fires between
_tablet_load_stats_refresh.join() in run() and unregister_listener
in stop(), the scheduled task can outlive the topology_coordinator
and access freed memory after run_topology_coordinator's coroutine
frame is destroyed.

Wait for the refresh to complete in stop() after unregistering the
listener, ensuring no task can fire after destruction.
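
The unregister-then-join pattern can be sketched with asyncio. This is
an analogy only, not the topology_coordinator code: the Coordinator
class, its listener set and the demo function are hypothetical.

```python
import asyncio


class Coordinator:
    def __init__(self, listeners: set) -> None:
        self._listeners = listeners
        self._refresh_tasks: set[asyncio.Task] = set()
        self.refreshed = 0
        listeners.add(self.on_schema_change)

    def on_schema_change(self) -> None:
        # Fire-and-forget refresh, tracked so that stop() can wait for it.
        task = asyncio.get_running_loop().create_task(self._refresh())
        self._refresh_tasks.add(task)
        task.add_done_callback(self._refresh_tasks.discard)

    async def _refresh(self) -> None:
        await asyncio.sleep(0.01)
        self.refreshed += 1

    async def stop(self) -> None:
        # First cut off the trigger source, then join in-flight work.
        self._listeners.discard(self.on_schema_change)
        if self._refresh_tasks:
            await asyncio.gather(*self._refresh_tasks)


async def demo() -> tuple[int, int]:
    listeners: set = set()
    coordinator = Coordinator(listeners)
    for notify in list(listeners):
        notify()  # a notification arrives just before stop()
    await coordinator.stop()
    return coordinator.refreshed, len(listeners)
```

After stop() returns, the refresh triggered by the late notification
has finished and no listener remains registered.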

Fixes SCYLLADB-1728

Backport to 2026.1 and 2026.2, because the issue was introduced in 2b7aa32

Closes scylladb/scylladb#29653

* https://github.com/scylladb/scylladb:
  test: tablet_stats: reproduce shutdown refresh race
  topology_coordinator: join tablet load stats refresh in stop()
2026-04-28 12:54:28 +02:00
Benny Halevy
5eaa979f35 test/cluster/dtest: add test for node.set_smp() persistence
Add a test that reproduces SCYLLADB-1629: set_smp() had no effect
because nodelist() created new ScyllaNode objects on every call,
losing the _smp_set_during_test value. The test fails without the
fix in the previous patch.
2026-04-28 12:34:08 +03:00
Benny Halevy
7430c1efd7 test/cluster/dtest: cache ScyllaNode instances in ScyllaCluster
ScyllaCluster.nodelist() was creating new ScyllaNode objects on every
call, so per-node state set via set_smp(), set_log_level(), and
_adjust_smp_and_memory() was lost between calls.

Fix by caching ScyllaNode instances in a list populated by
_add_nodes() using the list returned by servers_add() in populate().
Nodes are assigned monotonically increasing names (node1, node2, ...).
nodelist() simply returns the cached list.
2026-04-28 12:34:06 +03:00
Marcin Maliszkiewicz
b0f988afc4 Merge 'auth: fix shutdown and startup races in LDAP cache pruner' from Andrzej Jackowski
The LDAP role manager's `_cache_pruner` background fiber periodically calls cache::reload_all_permissions(). Two races cause it to hit SCYLLA_ASSERT(_permission_loader):
- Cross-shard race: The pruner used `_cache.container().invoke_on_all()` to reload permissions on every shard. Since both `service::start()` and `sharded<service>::stop()` execute per-shard in parallel, the pruner on one shard could call reload_all_permissions() on another shard before that shard set its loader (startup) or after it cleared its loader (shutdown). Each shard runs its own pruner instance, so reloading locally is sufficient; this also removes redundant N² reload calls.
- Intra-shard race: `service::stop()` cleared the permission loader and stopped the role manager concurrently (via when_all_succeed). A mid-reload pruner could yield and then call the now-null loader. Fixed by stopping the role manager first so the pruner is fully drained before the loader is cleared.

Fixes SCYLLADB-1679
Backport to 2026.2, introduced in 7eedf50c12

Closes scylladb/scylladb#29605

* github.com:scylladb/scylladb:
  auth: make shutdown the exact reverse of startup
  test: ldap: add test for pruner crash during shutdown
  auth: start authorizer and set permission loader before role manager
  auth: stop role manager before clearing permission loader
  auth: reload LDAP permission cache on local shard only
2026-04-28 11:16:07 +02:00
Botond Dénes
a7e9c0e6d2 Merge 'test.py: fix test collection bug' from Andrei Chekun
In certain circumstances, the current collection logic is error-prone. Collection can stop when the first file is skipped in the current mode, leaving the remaining files given on the CLI uncollected.
Another issue is that if a file is specified twice, once via its directory and once explicitly, an incorrect CppFile ends up in the stash, causing a KeyError.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1714

No backport, test framework bug fix only.

Closes scylladb/scylladb#29634

* github.com:scylladb/scylladb:
  test.py: fix framework test
  test.py: fix test collection bug
2026-04-28 11:52:35 +03:00
Petr Gusev
e39267b55f paxos_state: keep prepared message alive across statement execution
In do_execute_cql_with_timeout(), when the prepared statement was not
found in the cache, we called qp.prepare() and stored the returned
result_message::prepared in a local variable scoped to the 'if' block.
We then extracted ps_ptr (a checked_weak_ptr to the prepared statement)
from the message, let the message go out of scope at the end of the
'if', and used ps_ptr after a co_await on st->execute().

Since 3ac4e258e8 ("transport/messages: hold pinned prepared entry in
PREPARE result"), result_message::prepared owns a strong pinned
reference to the prepared cache entry. While qp.prepare() runs it also
holds its own pin on the entry, so on return the entry has at least
the pin owned by the returned message. As long as that message is
alive, the cache entry cannot be purged and the weak handle inside
ps_ptr remains promotable.

The lifetime gap manifested only in debug builds. qp.prepare() returns
a ready future on the cache-miss path, so in release builds the
co_await resumes synchronously: control flows from the assignment of
ps_ptr straight into st->execute() with no opportunity for any other
task (in particular, prepared cache invalidation triggered by a
concurrent schema change) to run in between. Debug builds, however,
force a reactor preemption point on every co_await even when the
awaited future is ready. With prepared_msg already destroyed at the
end of the 'if' block, the only remaining handle on the cache entry
was the weak ps_ptr, and the preemption gave a concurrent cache purge
(triggered, for example, by Raft schema changes received during a
node restart) the chance to drop the entry. The subsequent execute()
then failed when promoting the weak pointer with
checked_ptr_is_null_exception.

The exception propagated out of the Paxos prepare path as a generic
std::exception with no type information in the log, surfacing on the
coordinator as:

  WriteFailure: Failed to prepare ballot ... Replica errors:
  host_id ... -> seastar::rpc::remote_verb_error (std::exception)

Hoist the result_message::prepared into the outer scope so the pinned
cache entry stays alive across co_await st->execute(...), closing the
window in which a concurrent cache purge could invalidate the weak
handle.

Fixes SCYLLADB-1173
2026-04-28 10:42:13 +02:00
Botond Dénes
3ea4af1c8c Merge 'test/cluster/test_incremental_repair: fix flaky coordinator-change scenario' from Avi Kivity
- Ensure servers[1] is not the topology coordinator before restarting it, preventing the leader death + re-election + re-repair sequence that masked the compaction-merge bug
- Add a retry loop that detects post-restart leadership transfer to servers[1] via direct coordinator query, retrying up to 5 times

Fixes: SCYLLADB-1478

Backporting to 2026.2, which sees the failure regularly.

Closes scylladb/scylladb#29671

* github.com:scylladb/scylladb:
  test/cluster/test_incremental_repair: add retry for residual leadership race
  test/cluster/test_incremental_repair: fix flaky coordinator-change scenario
2026-04-28 09:05:02 +03:00
Andrzej Jackowski
459e3970cd test: tablet_stats: reproduce shutdown refresh race
The coordinator can receive a schema-change notification after run()
finishes but before stop() unregisters listeners. The test pins that
window with error injections and verifies stop() waits for the refresh
instead of letting it outlive the coordinator.

Test time in dev: 9.51s

Refs SCYLLADB-1728
2026-04-28 08:00:54 +02:00
Andrzej Jackowski
8756f7c068 topology_coordinator: join tablet load stats refresh in stop()
Commit 2b7aa3211d made schema changes trigger tablet load stats
refreshes in the background. A notification can still arrive after
run() stops the periodic refresher and before the coordinator object
is destroyed.

Move lifecycle subscription cleanup to stop() and join the serialized
refresh there after unregistering refresh trigger sources. This keeps
the coordinator alive until notification-triggered refresh work has
completed.

Fixes SCYLLADB-1728
2026-04-28 07:37:28 +02:00
Avi Kivity
2615d0e8d8 test/cluster/test_incremental_repair: add retry for residual leadership race
There is a small race window where Raft leadership could transfer back
to servers[1] between the ensure_group0_leader_on() check and the
actual restart.  If this happens, the new coordinator re-initiates
repair and masks the compaction-merge bug.

Extract the core test logic into _do_race_window_promotes_unrepaired_data()
which directly checks get_topology_coordinator() after restart and raises
_LeadershipTransferred if servers[1] became coordinator.  The test
function calls this helper in a retry loop (up to 5 attempts).

Refs: SCYLLADB-1478
2026-04-27 21:11:06 +03:00
Avi Kivity
914b70c75b test/cluster/test_incremental_repair: fix flaky coordinator-change scenario
The test_incremental_repair_race_window_promotes_unrepaired_data test
was flaky because it hardcoded servers[1] as the restart target but did
not ensure servers[1] was NOT the topology coordinator.

When servers[1] happened to be the Raft group0 leader (topology
coordinator), restarting it killed the leader, forced a new election,
and the new coordinator re-initiated tablet repair.  This re-repair
flushes memtables on all replicas via take_storage_snapshot() and marks
the resulting sstables as repaired -- causing post-repair keys to appear
in repaired sstables on servers[0] and servers[2].  The test then hit
the wrong assertion (servers[0]/[2] contaminated).

Fix: before starting the repair, check whether servers[1] is the
topology coordinator.  If so, move leadership to another server via
ensure_group0_leader_on() so that restarting servers[1] only kills a
follower -- which does not trigger an election or coordinator change.

Reproducibility was confirmed by forcing leadership to servers[1] via
ensure_group0_leader_on() and observing deterministic failure with all
three servers showing post-repair keys in repaired sstables (confirming
the re-repair scenario), then verifying the fix passes reliably.

Fixes: SCYLLADB-1478
2026-04-27 21:08:12 +03:00
Aleksandra Martyniuk
6b7ce5e244 test: fix flaky rack list conversion tests by using read barrier
test_numeric_rf_to_rack_list_conversion and
test_numeric_rf_to_rack_list_conversion_abort were reading
system_schema.keyspaces from an arbitrary node that may not have
applied the latest schema change yet. Pin the read to a specific node
and issue a read barrier before querying, ensuring the node has
up-to-date data.
2026-04-27 15:19:09 +02:00
Aleksandra Martyniuk
9d3d424d58 test: fix flaky test_enforce_rack_list_option by using read barrier
The test was reading system_schema.keyspaces from an arbitrary node
that may not have applied the latest schema change yet. Pin the read
to a specific node and issue a read barrier before querying, ensuring
the node has up-to-date data.
2026-04-27 14:44:38 +02:00
Ferenc Szili
6b3e18c4a9 test: verify load balancer handles dropped tables gracefully
Add test_load_balancing_with_dropped_table that simulates the race between
DROP TABLE and the load balancer by capturing a token metadata snapshot
before dropping the table, then passing the stale snapshot to
balance_tablets(). Verifies it completes without aborting and produces no
migrations for the dropped table.
2026-04-27 10:33:56 +02:00
Ferenc Szili
4987204f71 tablet_allocator: handle dropped tables gracefully in get_schema_and_rs
The load balancer's get_schema_and_rs() would trigger on_internal_error when
a table present in the token metadata snapshot had been concurrently dropped
from the live schema. This race is possible because the balancer coroutine
yields between building the candidate list and checking replication
constraints, allowing a DROP TABLE schema mutation to be applied by another
fiber in the meantime.

Change get_schema_and_rs() to return {nullptr, nullptr} for dropped tables
instead of aborting. Update all callers to skip dropped tables:
- make_sizing_plan: continue to next table
- make_resize_plan: continue to next table (merge suppression is moot)
- check_constraints: return skip_info with empty viable targets
- get_rs: return nullptr, checked by check_constraints
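
A Python analogue of the skip-on-dropped-table pattern, as a minimal sketch. The real code is C++; the dictionary-based `live_schema` and the plan layout here are assumptions made purely for illustration.

```python
def get_schema_and_rs(live_schema, table_id):
    # Hypothetical stand-in: (None, None) signals a concurrently dropped table.
    entry = live_schema.get(table_id)
    if entry is None:
        return None, None
    return entry

def make_sizing_plan(snapshot_tables, live_schema):
    plan = {}
    for table_id in snapshot_tables:
        schema, rs = get_schema_and_rs(live_schema, table_id)
        if schema is None:
            continue  # dropped since the snapshot was taken; skip, do not abort
        plan[table_id] = (schema, rs)
    return plan
```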
2026-04-27 10:33:53 +02:00
Anna Mikhlin
86472e43e1 Update ScyllaDB version to: 2026.3.0-dev 2026-04-26 15:30:13 +03:00
Andrei Chekun
f2f4915e09 test.py: fix framework test
The framework test was not skipping the unit directory where C++ tests
are located. Once the bug was fixed, the test started to fail. Ignore
this directory as well.
2026-04-25 18:04:55 +02:00
Piotr Szymaniak
d5efd1f676 test/cluster: wait for Alternator readiness in server startup
server_add() only waits for CQL readiness before returning. The
Alternator HTTP port may not be listening yet, causing
ConnectionRefused with Alternator tests.

Extend the ServerUpState enum and startup loop to also check Alternator
port readiness when configured. Whenever Alternator ports are
configured, each one is verified to be connectable and queryable,
similar to how CQL ports are probed.

Fixes SCYLLADB-1701

Closes scylladb/scylladb#29625
2026-04-25 16:35:44 +03:00
Piotr Smaron
d14d07a079 test: fix flaky test_sstable_write_large_{row,cell} by using a fixed partition key
Commit ce00d61917 ("db: implement large_data virtual tables with feature
flag gating") changed these two tests to construct their mutation with
a randomly generated partition key (simple_schema::make_pkey()) instead
of the previously fixed pk "pv", with the comment that this avoids a
"Failed to generate sharding metadata" error.

simple_schema::make_pkey() delegates to tests::generate_partition_key(),
which defaults to key_size{1, 128}, i.e. the partition key length is
uniformly random in [1, 128] bytes. That interacts badly with the fact
that both tests pick thresholds at exact byte boundaries of the MC
sstable row encoding:

  - The large-data handler records a row's size as
      _data_writer->offset() - current_pos
    (sstables/mx/writer.cc: collect_row_stats()), i.e. the number of
    bytes the row took on disk.
  - For the first clustering row, the body includes a vint-encoded
    prev_row_size = pos - _prev_row_start.
  - _prev_row_start is captured at the start of the partition
    (consume_new_partition()) before the partition key is written to
    the data stream, so prev_row_size rolls in the partition key's
    serialized length (2-byte prefix + pk bytes) + deletion_time +
    static row size.

A random-size partition key therefore perturbs the first clustering
row's encoded size by 1-2 bytes across runs (the vint of prev_row_size
crosses the 128 boundary), flipping the test's byte-exact threshold
comparison. On seed 2104744000 this produced:

  critical check row_size_count == expected.size() has failed [3 != 2]
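
The boundary effect can be illustrated with a small sketch of unsigned vint sizing (7 payload bits in one byte, 14 in two, and so on). `FIXED_OVERHEAD` is an assumed placeholder for deletion_time plus static row size, not a value taken from the writer code.

```python
def unsigned_vint_size(value: int) -> int:
    """Bytes needed to encode value as an unsigned vint (at most 9)."""
    bits = max(value.bit_length(), 1)
    return min((bits + 6) // 7, 9)

def prev_row_size(pk_len: int, fixed_overhead: int) -> int:
    # 2-byte length prefix + partition key bytes + whatever else precedes
    # the first clustering row (deletion_time, static row).
    return 2 + pk_len + fixed_overhead

FIXED_OVERHEAD = 12  # assumption, for illustration only

# Encoded size of prev_row_size for every key length the generator can pick.
vint_sizes = {
    pk_len: unsigned_vint_size(prev_row_size(pk_len, FIXED_OVERHEAD))
    for pk_len in range(1, 129)
}
```

Under this assumed overhead, key lengths up to 113 bytes keep the vint at one byte and longer keys push it to two, which is enough to flip a byte-exact threshold comparison.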

Fix the two byte-exact-sensitive tests by reverting their partition key
to the fixed s.new_mutation("pv") used before ce00d61917. Under smp=1
(which these tests run with, per -c1 in the test invocation) a fixed
key is always shard-local, so no sharding-metadata issue arises here.

The other tests modified by ce00d61917 (test_sstable_log_too_many_rows,
test_sstable_log_too_many_dead_rows, test_sstable_too_many_collection_elements,
test_large_data_records_round_trip, etc.) assert on row/element counts
or use thresholds with enough slack that the partition key size does
not matter, and are left unchanged.

Add an explanatory comment to each fixed site so the pitfall is not
re-introduced by a future refactor.

Verified stable with:
  ./test.py --mode=dev     test/boost/sstable_3_x_test.cc::test_sstable_write_large_row  --repeat 100 --max-failures 1
  ./test.py --mode=dev     test/boost/sstable_3_x_test.cc::test_sstable_write_large_cell --repeat 100 --max-failures 1
  ./test.py --mode=release test/boost/sstable_3_x_test.cc::test_sstable_write_large_row  --repeat 100 --max-failures 1
  ./test.py --mode=release test/boost/sstable_3_x_test.cc::test_sstable_write_large_cell --repeat 100 --max-failures 1

All four invocations: 100/100 passed.

Fixes: SCYLLADB-1685

Closes scylladb/scylladb#29621
2026-04-25 16:32:02 +03:00
Andrei Chekun
92c09d106d test.py: fix test collection bug
In certain circumstances the current way of collecting tests is error
prone. Collection can stop when the first file is skipped in the mode,
leaving the rest of the files given on the CLI uncollected.
Another issue is that if a file is specified twice, once via its
directory and once explicitly, an incorrect CppFile ends up in the
stash, causing a KeyError.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1714
2026-04-24 17:57:11 +02:00
Andrzej Jackowski
8855e77465 auth: make shutdown the exact reverse of startup
The previous parallel stop of the authenticator and authorizer
was a micro-optimization that obscured the lifecycle invariant
that shutdown should reverse startup.

Refs SCYLLADB-1679
2026-04-24 13:34:09 +02:00
Andrzej Jackowski
adf1e26bab test: ldap: add test for pruner crash during shutdown
Verify that service::stop() drains the LDAP pruner before
clearing the permission loader. The test installs a slow
permission loader and confirms the pruner is actively
reloading when teardown begins.

Refs SCYLLADB-1679
2026-04-24 13:34:09 +02:00
Andrzej Jackowski
37a547604f auth: start authorizer and set permission loader before role manager
LDAP role manager starts a pruner fiber that calls
reload_all_permissions() which asserts _permission_loader is set.
The permission loader calls _authorizer->authorize(), so the
authorizer must be started before the loader is set.

Start authorizer, then set the permission loader, then start the
role manager, ensuring both dependencies are satisfied before the
pruner can fire.
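
The ordering can be pictured with a toy lifecycle tracker. This is a sketch only: the `Lifecycle` class is hypothetical and models just the three steps named above, not the full auth::service startup sequence.

```python
class Lifecycle:
    """Records start/stop events and tears down in reverse start order."""

    def __init__(self):
        self.log = []
        self._started = []

    def start(self, name):
        self.log.append(f"start {name}")
        self._started.append(name)

    def stop_all(self):
        # Shutdown is the exact reverse of startup.
        while self._started:
            self.log.append(f"stop {self._started.pop()}")

# Dependencies first: the pruner in the role manager needs both of the others.
STARTUP_ORDER = ["authorizer", "permission_loader", "role_manager"]
```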

Fixes SCYLLADB-1679
2026-04-24 13:34:09 +02:00
Andrzej Jackowski
c3e5285d45 auth: stop role manager before clearing permission loader
service::stop() cleared the permission loader and stopped
the role manager concurrently (via when_all_succeed). The
LDAP pruner could be mid-reload at a yield point when the
loader was set to null, causing it to call a null function.

Stop the role manager first so the pruner is fully drained
before the loader is cleared.

Fixes SCYLLADB-1679
2026-04-24 13:34:09 +02:00
Andrzej Jackowski
f75e5ac65b auth: reload LDAP permission cache on local shard only
The LDAP role manager's _cache_pruner fiber used
invoke_on_all() to reload permissions on every shard.
Since auth::service::start() runs on all shards in
parallel via invoke_on_all(), the pruner on shard X
could call reload_all_permissions() on shard Y before
shard Y finished start() and set its permission loader,
hitting SCYLLA_ASSERT(_permission_loader). The same
cross-shard race existed during shutdown.

Each shard runs its own pruner instance, so reloading
locally is sufficient — all shards are still covered.
This also removes redundant N-squared reload calls.

Refs SCYLLADB-1679
2026-04-24 13:06:58 +02:00
Pavel Emelyanov
111165d9de view: Turn calculate_view_update_throttling_delay into node_update_backlog member
The free function calculate_view_update_throttling_delay() took the
view_flow_control_delay_limit_in_ms as a parameter, which forced its
two callers (storage_proxy and view_update_generator) to fish the
option out of db::config via database::get_config(). Now that the
option lives on node_update_backlog, make the throttling calculation a
member of node_update_backlog and have the callers invoke it on their
node_update_backlog reference.

This removes two database::get_config() call sites.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-24 13:52:12 +03:00
Pavel Emelyanov
855372db3c view: Place view_flow_control_delay_limit_in_ms on node_update_backlog
Store the view_flow_control_delay_limit_in_ms config option as an
updateable_value on node_update_backlog. The value is threaded from
main.cc into the backlog object at construction time. Existing call
sites (tests) that construct node_update_backlog without the option
continue to work via a default argument.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-24 13:47:54 +03:00
Pavel Emelyanov
ec2339e635 view: Add node_update_backlog reference to view_update_generator
Pass node_update_backlog explicitly to view_update_generator via its
constructor and start() call. This is plumbing only; no behavior change.
A subsequent patch will use this reference to compute view update
throttling delays without going through database::get_config().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-24 13:45:46 +03:00
Botond Dénes
70261dc674 Merge 'test/cluster: scale failure_detector_timeout_in_ms by build mode' from Marcin Maliszkiewicz
The failure_detector_timeout_in_ms override of 2000ms in 6 cluster test files is too aggressive for debug/sanitize builds. During node joins, the coordinator's failure detector times out on RPC pings to the joining node while it is still applying schema snapshots, marks it DOWN, and bans it — causing flaky test failures.

Scale the timeout by MODES_TIMEOUT_FACTOR (3x for debug/sanitize, 2x for dev, 1x for release) via a shared failure_detector_timeout fixture in conftest.py.
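
As a rough sketch of the scaling, using the factors and the 2000ms base quoted above. The plain function below is an illustrative stand-in for the pytest fixture, and the dictionary layout of `MODES_TIMEOUT_FACTOR` is an assumption.

```python
# Factors as stated in the description; the dict shape is assumed.
MODES_TIMEOUT_FACTOR = {"debug": 3, "sanitize": 3, "dev": 2, "release": 1}
BASE_FAILURE_DETECTOR_TIMEOUT_MS = 2000

def failure_detector_timeout(build_mode: str) -> int:
    """Return failure_detector_timeout_in_ms scaled for the build mode."""
    return BASE_FAILURE_DETECTOR_TIMEOUT_MS * MODES_TIMEOUT_FACTOR[build_mode]
```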

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1587
Backport: no, elasticsearch analyser shows only a single failure

Closes scylladb/scylladb#29522

* github.com:scylladb/scylladb:
  test/cluster: scale failure_detector_timeout_in_ms by build mode
  test/cluster: add failure_detector_timeout fixture
2026-04-24 09:10:43 +03:00
Botond Dénes
d280517e27 test/cluster/test_incremental_repair: fix flaky do_tablet_incremental_repair_and_ops
The log grep in get_sst_status searched from the beginning of the log
(no from_mark), so the second-repair assertions were checking cumulative
counts across both repairs rather than counts for the second repair alone.

The expected values (sst_add==2, sst_mark==2) relied on this cumulative
behaviour: 1 from the first repair + 1 from the second = 2. This works
when the second repair encounters exactly one unrepaired sstable, but
fails whenever the second repair sees two.

The second repair can see two unrepaired sstables when the 100 keys
inserted before it (via asyncio.gather) trigger a background auto-flush
before take_storage_snapshot runs. take_storage_snapshot always flushes
the memtable itself, so if an auto-flush already split the batch into two
sstables on disk, the second repair's snapshot contains both and logs
"Added sst" twice, making the cumulative count 3 instead of 2.

Fix: take a log mark per-server before each repair call and pass it to
get_sst_status so each check counts only the entries produced by that
repair. The expected values become 1/0/1 and 1/1/1 respectively,
independent of how many sstables happened to exist beforehand.

get_sst_status gains an optional from_mark parameter (default None)
which preserves existing call sites that intentionally grep from the
start of the log.
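
The difference between cumulative and per-repair counting can be sketched as below. `count_matches` is a hypothetical simplification operating on a list of lines; the real get_sst_status greps server logs and its signature is not reproduced here.

```python
import re

def count_matches(log_lines, pattern, from_mark=None):
    """Count lines matching pattern, starting at from_mark (None = log start)."""
    start = 0 if from_mark is None else from_mark
    regex = re.compile(pattern)
    return sum(1 for line in log_lines[start:] if regex.search(line))

log = ["Added sst first-repair"]   # entry logged by the first repair
mark = len(log)                    # mark taken just before the second repair
log.append("Added sst second-repair")

cumulative = count_matches(log, r"Added sst")            # both repairs
second_only = count_matches(log, r"Added sst", mark)     # second repair alone
```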

Fixes: SCYLLADB-1086

Closes scylladb/scylladb#29484
2026-04-23 17:17:16 +02:00
Wojciech Mitros
7634d3f7d4 test/cluster: fix flaky test_hints_consistency_during_replace
The test creates a sync point immediately after writing 100 rows
with CL=ANY, without waiting for pending hint writes to complete.

store_hint() is fire-and-forget: it submits do_store_hint() to a gate
and returns immediately. do_store_hint() updates _last_written_rp only
after writing to the commitlog. If create_sync_point() is called before
all do_store_hint() coroutines complete, the captured replay position
is stale, and await_sync_point() returns DONE before all hints are
replayed, leaving some rows missing.

Fix by waiting for the size_of_hints_in_progress metric to reach zero
before creating the sync point, ensuring all in-flight hint writes have
completed and _last_written_rp is up to date. This follows the same
pattern already used in test_sync_point.

Fixes: SCYLLADB-1560

Closes scylladb/scylladb#29623
2026-04-23 17:03:48 +02:00
Botond Dénes
b49cf6247f test: fix flaky test_read_repair_with_trace_logging by reading tracing with CL=ALL
Tracing events are written to system_traces.events with CL=ANY, so they
are only guaranteed to be present on the local node of the query
coordinator. Reading them back with the driver default (CL=LOCAL_ONE)
may route the query to a replica that has not yet received all events,
causing the assertion on 'digest mismatch, starting read repair' to fail
intermittently.

Fix execute_with_tracing() to read tracing via the ResponseFuture API
with query_cl=ConsistencyLevel.ALL, so events from all replicas are
merged before the caller inspects them.

Fixes: SCYLLADB-1633

Closes scylladb/scylladb#29566
2026-04-23 16:57:29 +02:00
Michał Jadwiszczak
878f341338 test/cluster/test_view_building_coordinator: fix view_updates_drained predicate
The previous fix for the flakiness in test_file_streaming waited for
the scylla_database_view_update_backlog metric to drop to 0 via
wait_for(view_updates_drained, ...). However, the predicate returned
True/False, while wait_for treats any non-None result as 'done' and
keeps retrying only on None. So when the backlog was non-zero the
predicate returned False, which wait_for interpreted as success and
returned immediately - the test could then stop servers[0]/servers[1]
before the view updates generated by new_server from the migrated
staging sstable were actually delivered, leading to a partially
populated MV (e.g. 431/1000 rows) and a failing assertion.

Fix the predicate to return None instead of False when the backlog is
not yet drained, so wait_for will actually retry until the metric
reaches 0 (or the deadline is hit).
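
The pitfall can be reproduced with a simplified stand-in for wait_for. This is a sketch, not the helper from the test library: the signature, the polling period, and `make_backlog_predicates` are assumptions.

```python
import asyncio
import time

async def wait_for(pred, deadline, period=0.001):
    # Simplified stand-in: any non-None result counts as "done".
    while True:
        result = await pred()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("predicate did not succeed before the deadline")
        await asyncio.sleep(period)

def make_backlog_predicates(samples):
    """Return (buggy, fixed) predicates reading successive backlog samples."""
    it = iter(samples)

    async def buggy():
        return next(it) == 0                    # False is non-None: stops at once

    async def fixed():
        return True if next(it) == 0 else None  # None: keep polling

    return buggy, fixed
```

With samples `[5, 3, 0]`, the buggy predicate makes wait_for return `False` after the first poll, while the fixed one keeps polling until the backlog reads 0.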

Fixes SCYLLADB-1182

Closes scylladb/scylladb#29587
2026-04-23 17:52:22 +03:00
Andrei Chekun
67b3ad94a0 test.py: enhance error output in case no tests were executed
By default, pytest produces an error if the provided file does not exist.
But coupled with xdist it produces no errors, because of how pytest
works with xdist. test.py always uses the -n parameter, so if something
goes wrong no errors are produced; pytest only returns exit code 5.
This PR prints a warning when pytest's exit code is 5.

Closes scylladb/scylladb#29584
2026-04-23 14:03:55 +02:00
Calle Wilund
c97ce32f47 Update position in dma_read(iovec) in create_file_for_seekable_source
Fixes: SCYLLADB-1523

The returned file object did not advance the file position after the read.
One-line fix. Added a test to make sure this read path works as expected.

Closes scylladb/scylladb#29456
2026-04-23 14:54:20 +03:00
Michael Litvak
3468e8de8b test/mv/test_mv_staging: wait for cql after restart
Wait for cql on all hosts after restarting a server in the test.

The problem that was observed is that the test restarts servers[1] and
doesn't wait for the cql to be ready on it. On test teardown it drops
the keyspace, trying to execute it on the host that is not ready, and
fails.

Fixes SCYLLADB-1632

Closes scylladb/scylladb#29562
2026-04-23 12:40:19 +02:00
Benny Halevy
6cb4c27f8c test/cluster/dtest/ccmlib/scylla_node: add debug logging
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-04-23 09:21:06 +03:00
Andrzej Jackowski
2503546251 test: audit: parameterize source address in audit assertions
Maintenance socket connections report a different source address
than regular CQL connections. Make the source field configurable
in the audit test helpers so that upcoming maintenance socket
tests can verify the correct address.

Also fix the syslog backend address parser to handle IPv6
addresses formatted as [ip]:port.

Refs SCYLLADB-1615
2026-04-23 07:02:02 +02:00
Marcin Maliszkiewicz
3df951bc9c Merge 'audit: set audit_info for native-protocol BATCH messages' from Andrzej Jackowski
Commit 16b56c2451 ("Audit: avoid dynamic_cast on a hot path") moved
audit info into batch_statement via set_audit_info(), but only wired it
for the CQL-text BATCH path (raw::batch_statement::prepare()).
Native-protocol BATCH messages (opcode 0x0D), handled by
process_batch_internal in transport/server.cc, construct a
batch_statement without setting audit_info. This causes audit to
silently skip the entire batch.

Set audit_info on the batch_statement so these batches are audited.

Fixes SCYLLADB-1652

No backport - bug introduced recently.

Closes scylladb/scylladb#29570

* github.com:scylladb/scylladb:
  test/audit: add reproducer for native-protocol batch not being audited
  audit: set audit_info for native-protocol BATCH messages
  test/audit: rename internal test methods to avoid CI misdetection
2026-04-22 18:56:28 +02:00
Piotr Szymaniak
9a86044c63 test: Stop providing alternator-streams experimental flag
Now that alternator-streams is no longer an experimental feature,
stop passing it in test configurations.
2026-04-22 15:25:37 +02:00
Piotr Szymaniak
870013b437 alternator: Graduate Alternator Streams from experimental
Alternator Streams were experimental until 2026.2, when they became GA.
Stop requiring `--experimental-features=alternator-streams` by:

- Removing ALTERNATOR_STREAMS from the experimental feature enum
- Mapping "alternator-streams" to UNUSED for backward compatibility
- Removing the gating that disabled the ALTERNATOR_STREAMS gossip
  feature when the experimental flag was absent
- Removing the runtime guard that rejected StreamSpecification requests
  without the feature flag
- Updating config_test to reflect the new UNUSED mapping

The gms::feature alternator_streams is kept for rolling upgrade
compatibility with older nodes.

Fixes SCYLLADB-1680
2026-04-22 15:22:15 +02:00
Botond Dénes
eb3326b417 Merge 'test.py: migrate all bare skips to typed skip markers' from Artsiom Mishuta
should be merged after #29235

Complete the typed skip markers migration started in the plugin PR.
Every bare `@pytest.mark.skip` decorator and `pytest.skip()` runtime call
across the test suite is replaced with a typed equivalent, making skip
reasons machine-readable in JUnit XML and Allure reports.

**62 files changed** across 8 commits, covering ~127 skip sites in total.

Bare `pytest.skip` provides only a free-text reason string. CI dashboards
(JUnit, Allure) cannot distinguish between a test skipped due to a known
bug, a missing feature, a slow test, or an environment limitation. This
makes it hard to track skip debt, prioritize fixes, or filter dashboards
by skip category.

The typed markers (`skip_bug`, `skip_not_implemented`, `skip_slow`,
`skip_env`) introduced by the `skip_reason_plugin` solve this by embedding
a `skip_type` field into every skip report entry.

| Type | Count | Files | Description |
|------|-------|-------|-------------|
| `skip_bug` | 24 | 16 | Skip reason references a known bug/issue |
| `skip_not_implemented` | 10 | 5 | Feature not yet implemented in Scylla |
| `skip_slow` | 4 | 3 | Test too slow for regular CI runs |
| `skip_not_implemented` (bare) | 2 | 1 | Bare `@pytest.mark.skip` with no reason (COMPACT STORAGE, #3882) |

| Type | Count | Files | Description |
|------|-------|-------|-------------|
| `skip_env` | ~85 | 34 | Feature/config/topology not available at runtime |
| `skip_bug` | 2 | 2 | Known bugs: Streams on tablets (#23838), coroutine task not found (#22501) |

- **Comments**: 7 comments/docstrings across 5 files updated from `pytest.skip()` to `skip()`
- **Plugin hardened**: `warnings.warn()` → `pytest.UsageError` for bare `@pytest.mark.skip` at collection time — bare skips are now a hard error, not a warning
- **Guard tests**: New `test/pylib_test/test_no_bare_skips.py` with 3 tests that prevent regression:
  - AST scan for bare `@pytest.mark.skip` decorators
  - AST scan for bare `pytest.skip()` runtime calls
  - Real `pytest --collect-only` against all Python test directories

Runtime skip sites use the convenience wrappers from `test.pylib.skip_types`:
```python
from test.pylib.skip_types import skip_env
```

Usage:
```python
skip_env("Tablets not enabled")
```

1. **test: migrate @pytest.mark.skip to @pytest.mark.skip_bug for known bugs** — 24 decorator sites, 16 files
2. **test: migrate @pytest.mark.skip to @pytest.mark.skip_not_implemented** — 10 decorator sites, 5 files
3. **test: migrate @pytest.mark.skip to @pytest.mark.skip_slow** — 4 decorator sites, 3 files
4. **test: migrate bare @pytest.mark.skip to skip_not_implemented** — 2 bare decorators, 1 file
5. **test: migrate runtime pytest.skip() to typed skip_env()** — ~85 sites, 34 files
6. **test: migrate runtime pytest.skip() to typed skip_bug()** — 2 sites, 2 files
7. **test: update comments referencing pytest.skip() to skip()** — 7 comments, 5 files
8. **test/pylib: reject bare pytest.mark.skip and add codebase guards** — plugin hardening + 3 guard tests

- All 60 plugin + guard tests pass (`test/pylib_test/`)
- No bare `@pytest.mark.skip` or `pytest.skip()` calls remain in the codebase
- `pytest --collect-only` succeeds across all test directories with the hardened plugin

SCYLLADB-1349

Closes scylladb/scylladb#29305

* github.com:scylladb/scylladb:
  test/alternator: replace bare pytest.skip() with typed skip helpers
  test: migrate new bare skips introduced by upstream after rebase
  test/pylib: reject bare pytest.mark.skip and add codebase guards
  test: update comments referencing pytest.skip() to skip_env()
  test: migrate runtime pytest.skip() to typed skip_bug()
  test: migrate runtime pytest.skip() to typed skip_env()
  test: migrate bare @pytest.mark.skip to skip_not_implemented
  test: migrate @pytest.mark.skip to @pytest.mark.skip_slow
  test: migrate @pytest.mark.skip to @pytest.mark.skip_not_implemented
  test: migrate @pytest.mark.skip to @pytest.mark.skip_bug for known bugs
2026-04-22 15:48:27 +03:00
Avi Kivity
e84e7dfb7a build: drop utils/rolling_max_tracker.hh from precompiled header
Added by mistake. Precompiled headers should only include library
headers that rarely change, since any dependency change causes a
full rebuild.

Closes scylladb/scylladb#29560
2026-04-22 15:46:50 +03:00
Botond Dénes
3aced88586 Merge 'audit: decrease allocations / instructions on will_log() fast path' from Marcin Maliszkiewicz
Audit::will_log() runs on every CQL/Alternator request. Since
9646ee05bd it constructs three temporary sstrings per call to look up
the audited keyspaces set / tables map with std::string_view keys,
costing ~180 insns/op and 2 allocations if sstring misses SSO.

This series switches the containers to std::less<> comparators to
enable heterogeneous lookup, then drops the sstring temporaries from
will_log().

perf-simple-query --smp 1 --duration 15 --audit "table"
                  --audit-keyspaces "ks-non-existing"
                  --audit-categories "DCL,DDL,AUTH,DML,QUERY"

  baseline       3d0582d51e   36777 insns/op
  regression     9646ee05bd   36952 insns/op  (+175)
  this series                 36768 insns/op  (-184, fixed)

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1616
Backport: no, offending commit is not backported

Closes scylladb/scylladb#29565

* github.com:scylladb/scylladb:
  audit: drop sstring temporaries on the will_log() fast path
  audit: enable heterogeneous lookup on audited keyspaces/tables
2026-04-22 15:46:16 +03:00
Marcin Maliszkiewicz
4043d95810 Merge 'storage_service: fix REST API races during shutdown and cross-shard forwarding' from Piotr Smaron
REST route removal unregisters handlers but does not wait for requests
that already entered storage_service.  A request can therefore suspend
inside an async operation, restart proceeds to tear the service down,
and the coroutine later resumes against destroyed members such as
_topology_state_machine, _group0, or _sys_ks — a use-after-destruction
bug that surfaces as UBSAN dynamic-type failures (e.g. the crash seen
from topology_state_load()).

Fix this by holding storage_service::_async_gate from the entry
boundary of every externally-triggered async operation so that stop()
drains them before teardown begins.  The gate is acquired in
run_with_api_lock, run_with_no_api_lock, and in individual REST
handlers that bypass those wrappers (reload_raft_topology_state,
mark_excluded, removenode, schema reload, topology-request
waits/abort, cleanup, ring/schema queries, SSTable dictionary
training/publish, and sampling).
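
The gate idea can be sketched in asyncio terms as follows. This is an illustrative analogue rather than Seastar's gate: the `Gate` class, `rest_handler`, and the polling `close()` are all assumptions made for the example.

```python
import asyncio

class GateClosed(Exception):
    pass

class Gate:
    """Counts in-flight operations; close() waits until they have all left."""

    def __init__(self):
        self._count = 0
        self._closed = False

    def hold(self):
        if self._closed:
            raise GateClosed()
        self._count += 1
        return self

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self._count -= 1

    async def close(self):
        self._closed = True      # reject new operations
        while self._count:       # drain the ones already inside
            await asyncio.sleep(0)

async def rest_handler(gate, events):
    with gate.hold():
        await asyncio.sleep(0.01)  # suspended inside an async operation
        events.append("handler finished")

async def demo():
    gate, events = Gate(), []
    task = asyncio.create_task(rest_handler(gate, events))
    await asyncio.sleep(0)         # let the handler enter the gate
    await gate.close()             # stop(): drain before teardown
    events.append("teardown")
    await task
    return events
```

Because the handler holds the gate from its entry point, teardown cannot begin while the handler is still suspended.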

Additionally, fix get_ownership() and abort_topology_request() which
forward work to shard 0 but were still referencing the caller-shard's
`this` pointer instead of the destination-shard instance, causing
silent cross-shard access to shard-local state.
Add a cluster regression test that repeatedly exercises the multi-shard
ownership REST path to cover the forwarding fix.

Fixes: SCYLLADB-1415

Should be backported to all branches, the code has been introduced around 2024.1 release.

Closes scylladb/scylladb#29373

* github.com:scylladb/scylladb:
  storage_service: fix shard-0 forwarding in REST helpers
  storage_service: gate REST-facing async operations during shutdown
  storage_service: prepare for async gate in REST handlers
2026-04-22 14:43:31 +02:00
Radosław Cybulski
cc39b54173 alternator: use stream_arn instead of std::string in list_streams
Use a `stream_arn` object instead of a raw `std::string` to store the
last stream returned to the user. `stream_arn` was already used for
parsing ARNs coming from the user, but `std::string` was used for the
returned value because the copy / move operations of `stream_arn` were
buggy. Those have been fixed, so fix the usage as well.

Fixes: SCYLLADB-1241

Closes scylladb/scylladb#29578
2026-04-22 14:02:53 +02:00
Artsiom Mishuta
183c6d120e test: exclude pylib_test from default test runs
Add pylib_test to norecursedirs in pytest.ini so it is not collected
during ./test.py or pytest test/ runs, but can still be run directly
via 'pytest test/pylib_test'.

Also fix pytest log cleanup: worker log files (pytest_gw*) were not
being deleted on success because cleanup was restricted to the main
process only. Now each process (main and workers) cleans up its own
log file on success.

Closes scylladb/scylladb#29551
2026-04-22 11:38:40 +02:00
Piotr Smaron
dffb266b79 storage_service: fix shard-0 forwarding in REST helpers
get_ownership() and abort_topology_request() forward work to shard 0
via container().invoke_on(0, ...) but the lambda captured 'this' and
accessed members through it instead of through the shard-0 'ss'
parameter.  This means the lambda used the caller-shard's instance,
defeating the purpose of the forwarding.

Use the 'ss' parameter consistently so the operations run against the
correct shard-0 state.
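
A Python analogue of the capture mistake, as a minimal sketch. `Sharded`, the `state` field, and the method names are hypothetical; the real code is C++ and uses container().invoke_on(0, ...).

```python
class Sharded:
    """Toy stand-in for a sharded service container."""

    def __init__(self, instances):
        self.instances = instances

    def invoke_on(self, shard, func):
        # Run func against the instance that lives on the destination shard.
        return func(self.instances[shard])

class StorageService:
    def __init__(self, shard_id, state):
        self.shard_id = shard_id
        self.state = state       # shard-local state
        self.container = None    # set once all instances exist

    def get_state_buggy(self):
        # Captures self: reads the caller shard's state despite forwarding.
        return self.container.invoke_on(0, lambda ss: self.state)

    def get_state_fixed(self):
        # Uses the ss parameter: reads shard 0's state.
        return self.container.invoke_on(0, lambda ss: ss.state)

def make_services(n):
    services = [StorageService(i, f"shard-{i}") for i in range(n)]
    container = Sharded(services)
    for svc in services:
        svc.container = container
    return services
```

Called from shard 1, the buggy variant still returns shard 1's state, which is the silent cross-shard access the fix removes.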
2026-04-22 10:30:33 +02:00
Piotr Smaron
6a91d046f3 storage_service: gate REST-facing async operations during shutdown
Hold _async_gate in all REST-facing async operations so that stop()
drains in-flight requests before teardown, preventing use-after-free
crashes when REST calls race with shutdown.

A centralized gated() wrapper in set_storage_service (api/storage_service.cc)
automatically holds the gate for every REST handler registered there,
so new handlers get shutdown-safety by default.

run_with_api_lock_internal and run_with_no_api_lock hold _async_gate on
shard 0 as well, because REST requests arriving on any shard are forwarded
there for execution.

Methods that previously self-forwarded to shard 0 (mark_excluded,
prepare_for_tablets_migration, set_node_intended_storage_mode,
get_tablets_migration_status, finalize_tablets_migration) now assert
this_shard_id() == 0.  Their REST handlers call them via
run_with_no_api_lock, which performs the shard-0 hop and gate hold
centrally.

Fixes: SCYLLADB-1415
2026-04-22 10:30:33 +02:00
Piotr Smaron
74dd33811e storage_service: prepare for async gate in REST handlers
Add hold_async_gate() public accessor for use by the REST registration
layer in a followup commit.

Convert run_with_no_api_lock to a coroutine so a followup commit can
hold the async gate across the entire forwarded operation.

No functional changes.
2026-04-22 10:28:54 +02:00
Botond Dénes
18ceeaf3ef Merge 'Restrict tombstone GC sstable set to repaired sstables for tombstone_gc=repair mode' from Raphael Raph Carvalho
When tombstone_gc=repair, the repaired compaction view's sstable_set_for_tombstone_gc()
previously returned all sstables across all three views (unrepaired, repairing, repaired).
This is correct but unnecessarily expensive: the unrepaired and repairing sets are never
the source of a GC-blocking shadow when tombstone_gc=repair, for base tables.

The key ordering guarantee that makes this safe is:
- topology_coordinator sends send_tablet_repair RPC and waits for it to complete.
  Inside that RPC, mark_sstable_as_repaired() runs on all replicas, moving D from
  repairing → repaired (repaired_at stamped on disk).
- Only after the RPC returns does the coordinator commit repair_time + sstables_repaired_at
  to Raft.
- gc_before = repair_time - propagation_delay only advances once that Raft commit applies.
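
As a minimal illustration of the last step, the eligibility check can be
sketched in Python. The function names and example values are hypothetical;
only the relation gc_before = repair_time - propagation_delay and the
deletion_time < gc_before test come from this description.

```python
from datetime import datetime, timedelta


def gc_before(repair_time, propagation_delay):
    # Tombstones deleted before this point may be considered for GC.
    return repair_time - propagation_delay


def is_gc_eligible(deletion_time, repair_time, propagation_delay):
    # A tombstone is GC-eligible only if it is older than gc_before.
    return deletion_time < gc_before(repair_time, propagation_delay)


# Hypothetical values: repair committed at 12:00 with a 1h propagation delay,
# so gc_before is 11:00.
repair_time = datetime(2026, 4, 1, 12, 0, 0)
propagation_delay = timedelta(hours=1)
old_tombstone = datetime(2026, 4, 1, 10, 0, 0)
recent_tombstone = datetime(2026, 4, 1, 11, 30, 0)

old_is_eligible = is_gc_eligible(old_tombstone, repair_time, propagation_delay)
recent_is_eligible = is_gc_eligible(recent_tombstone, repair_time, propagation_delay)
```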

Therefore, when a tombstone T in the repaired set first becomes GC-eligible (its
deletion_time < gc_before), any data D it shadows is already in the repaired set on
every replica. This holds because:
- The memtable is flushed before the repairing snapshot is taken (take_storage_snapshot
  calls sg->flush()), capturing all data present at repair time.
- Hints and batchlog are flushed before the snapshot, ensuring remotely-hinted writes
  arrive before the snapshot boundary.
- Legitimate unrepaired data has timestamps close to 'now', always newer than any
  GC-eligible tombstone (USING TIMESTAMP to write backdated data is user error / UB).

Excluding the repairing and unrepaired sets from the GC shadow check cannot cause any
tombstone to be wrongly collected. The memtable check is also skipped for the same
reason: memtable data is either newer than the GC-eligible tombstone, or was flushed
into the repairing/repaired set before gc_before advanced.

Safety restriction — materialized views:
The optimization IS applied to materialized view tables. Two possible paths could inject
D_view into the MV's unrepaired set after MV repair: view hints and staging via the
view-update-generator. Both are safe:

(1) View hints: flush_hints() creates a sync point covering BOTH _hints_manager (base
mutations) AND _hints_for_views_manager (view mutations). It waits until ALL pending view
hints — including D_view entries queued in _hints_for_views_manager while the target MV
replica was down — have been replayed to the target node before take_storage_snapshot() is
called. D_view therefore lands in the MV's repairing sstable and is promoted to repaired.
When a repaired compaction then checks for shadows it finds D_view in the repaired set,
keeping T_mv non-purgeable.

(2) View-update-generator staging path: Base table repair can write a missing D_base to a
replica via a staging sstable. The view-update-generator processes the staging sstable
ASYNCHRONOUSLY: it may fire arbitrarily later, even after MV repair has committed
repair_time and T_mv has been GC'd from the repaired set. However, the staging processor
calls stream_view_replica_updates() which performs a READ-BEFORE-WRITE via
as_mutation_source_excluding_staging(): it reads the CURRENT base table state before
building the view update. If T_base was written to the base table (as it always is before
the base replica can be repaired and the MV tombstone can become GC-eligible), the
view_update_builder sees T_base as the existing partition tombstone. D_base's row marker
(ts_d < ts_t) is expired by T_base, so the view update is a no-op: D_view is never
dispatched to the MV replica. No resurrection can occur regardless of how long staging is
delayed.

A potential sub-edge-case is T_base being purged BEFORE staging fires (leaving D_base as
the sole survivor, so stream_view_replica_updates would dispatch D_view). This is blocked
by an additional invariant: for tablet-based tables, the repair writer stamps repaired_at
on staging sstables (repair_writer_impl::create_writer sets mark_as_repaired = true and
perform_component_rewrite writes repaired_at = sstables_repaired_at + 1 on every staging
sstable). After base repair commits sstables_repaired_at to Raft, the staging sstable
satisfies is_repaired(sstables_repaired_at, staging_sst) and therefore appears in
make_repaired_sstable_set(). Any subsequent base repair that advances sstables_repaired_at
further still includes the staging sstable (its repaired_at ≤ new sstables_repaired_at).
D_base in the staging sstable thus shadows T_base in every repaired compaction's shadow
check, keeping T_base non-purgeable as long as D_base remains in staging.

A base table hint also cannot bypass this. A base hint is replayed as a base mutation. The
resulting view update is generated synchronously on the base replica and sent to the MV
replica via _hints_for_views_manager (path 1 above), not via staging.

USING TIMESTAMP with timestamps predating (gc_before + propagation_delay) is explicitly
UB and excluded from the safety argument.

For tombstone_gc modes other than repair (timeout, immediate, disabled) the invariant
does not hold for base tables either, so the full storage-group set is returned.

The expected gain is reduced bloom filter and memtable key-lookup I/O during repaired
compactions: the unrepaired set is typically the largest (it holds all recent writes),
yet for tombstone_gc=repair it never influences GC decisions.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-231.

Closes scylladb/scylladb#29310

* github.com:scylladb/scylladb:
  compaction: Restrict tombstone GC sstable set to repaired sstables for tombstone_gc=repair mode
  test/repair: Add tombstone GC safety tests for incremental repair
2026-04-22 10:21:37 +03:00
Avi Kivity
f5eb99f149 test: bump multishard_query_test querier_cache TTL to 60s to avoid flake
Three test cases in multishard_query_test.cc set the querier_cache entry
TTL to 2s and then assert, between pages of a stateful paged query, that
cached queriers are still present (population >= 1) and that
time_based_evictions stays 0.

The 2s TTL is not load-bearing for what these tests exercise — they are
checking the paging-cache handoff, not TTL semantics. But on busy CI
runners (SCYLLADB-1642 was observed on aarch64 release), scheduling
jitter between saving a reader and sampling the population can exceed
2s. When that happens, the TTL fires, both saved queriers are
time-evicted, population drops to 0, and the assertion
`require_greater_equal(saved_readers, 1u)` fails. The trailing
`require_equal(time_based_evictions, 0)` check never runs because the
earlier assertion has already aborted the iteration — which is why the
Jenkins failure surfaces only as a bare "C++ failure at seastar_test.cc:93".

Reproduced deterministically in test_read_with_partition_row_limits by
injecting a `seastar::sleep(2500ms)` between the save and the sample:
the hook then reports
  population=0 inserts=2 drops=0 time_based_evictions=2 resource_based_evictions=0
and the assertion fires — matching the Jenkins symptoms exactly.

Bump the TTL to 60s in all three affected tests:

  - test_read_with_partition_row_limits (confirmed repro for SCYLLADB-1642)
  - test_read_all                       (same pattern, same invariants — suspect)
  - test_read_all_multi_range           (same pattern, same invariants — suspect)

Leave test_abandoned_read (1s TTL, actually tests TTL-driven eviction)
and test_evict_a_shard_reader_on_each_page (tests manual eviction via
evict_one(); its TTL is not load-bearing but the fix is deferred for a
separate review) unchanged.

Fixes: SCYLLADB-1642

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Closes scylladb/scylladb#29564
2026-04-22 09:48:59 +03:00
Tomasz Grabiec
cddde464ca Merge 'service: Support adding/removing a datacenter with tablets by changing RF' from Aleksandra Martyniuk
With this change, you can add or remove one or more DCs in a single ALTER KEYSPACE statement. It requires the keyspace to use a rack-list replication factor.

In the existing approach, all tablet replicas are rebuilt at once during an RF change. This is no longer the case. In global_topology_request::keyspace_rf_change, the request is added to ongoing_rf_changes, a new column in the system.topology table. The target RF is kept in next_replication, a new column in system_schema.keyspaces.

In make_rf_change_plan, the load balancer schedules the necessary migrations, considering node load and other pending tablet transitions. Requests from ongoing_rf_changes are processed concurrently and independently of one another. Within each request, racks are processed concurrently. No tablet replica is removed until all required replicas have been added. While adding replicas to a rack, we always start with base tables and do not proceed to views until the base tables are done (when removing, the order is reversed). The intermediate steps are not reflected in the schema. When the RF change is finished:
- in system_schema.keyspaces:
  - next_replication is cleared;
  - new keyspace properties are saved;
- request is removed from ongoing_rf_changes;
- the request is marked as done in system.topology_requests.

Until the request is done, DESCRIBE KEYSPACE shows the replication_v2.

If a request hasn't started to remove replicas, it can be aborted using the task manager. system.topology_requests::error is set (but the request isn't marked as done) and next_replication is set equal to replication_v2. The load balancer interprets this as a signal to start rolling the request back. After the rollback is done, we set the relevant system.topology_requests entry as done (failed), clear the request id from system.topology::ongoing_rf_changes, and remove next_replication.

Fixes: SCYLLADB-567.

No backport needed; new feature.

Closes scylladb/scylladb#24421

* github.com:scylladb/scylladb:
  service: fix indentation
  docs: update documentation
  test: test multi RF changes
  service: tasks: allow aborting ongoing RF changes
  cql3: allow changing RF by more than one when adding or removing a DC
  service: handle multi_rf_change
  service: implement make_rf_change_plan
  service: add keyspace_rf_change_plan to migration_plan
  service: extend tablet_migration_info to handle rebuilds
  service: split update_node_load_on_migration
  service: rearrange keyspace_rf_change handler
  db: add columns to system_schema.keyspaces
  db: service: add ongoing_rf_changes to system.topology
  gms: add keyspace_multi_rf_change feature
2026-04-22 01:46:11 +02:00
Andrzej Jackowski
b6cb025e9b test/audit: add reproducer for native-protocol batch not being audited
The existing test_batch sends a textual BEGIN BATCH ... APPLY BATCH as a
QUERY message, which goes through the CQL parser and raw::batch_statement::
prepare() — a path that correctly sets audit_info. This missed the bug
where native-protocol BATCH messages (opcode 0x0D), handled by
process_batch_internal in transport/server.cc, construct a batch_statement
without setting audit_info, causing audit to silently skip the batch.

Add _test_batch_native_protocol which uses the driver's BatchStatement
(both unprepared and prepared variants) to exercise this code path.

Refs SCYLLADB-1652
2026-04-21 21:52:26 +02:00
Andrzej Jackowski
f5bb9b6282 audit: set audit_info for native-protocol BATCH messages
Commit 16b56c2451 ("Audit: avoid dynamic_cast on a hot path") moved
audit info into batch_statement via set_audit_info(), but only wired it
for the CQL-text BATCH path (raw::batch_statement::prepare()).
Native-protocol BATCH messages (opcode 0x0D), handled by
process_batch_internal in transport/server.cc, construct a
batch_statement without setting audit_info. This causes audit to
silently skip the entire batch.

Set audit_info on the batch_statement so these batches are audited.

Fixes SCYLLADB-1652
2026-04-21 21:52:26 +02:00
Andrzej Jackowski
5f93d57d6e test/audit: rename internal test methods to avoid CI misdetection
The CI heuristic picks up any function named test_* in changed files
and tries to run it as a standalone pytest test. The AuditTester class
methods (test_batch, test_dml, etc.) are not top-level pytest tests —
they are internal helpers called from the actual test functions.

Prefix them with underscore so CI does not mistake them for
standalone tests.
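
To show why the rename helps, here is a minimal sketch of a name-based
scan. The real CI heuristic is not described here, so
`find_test_functions` and the sample source are assumptions made purely
for illustration.

```python
import ast


def find_test_functions(source):
    """Return names of all functions, at any nesting level, named test_*."""
    tree = ast.parse(source)
    return sorted(
        node.name
        for node in ast.walk(tree)
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef))
        and node.name.startswith("test_")
    )


# Hypothetical file contents: one helper still named test_*, one renamed
# with a leading underscore, and one real top-level test.
SAMPLE = '''
class AuditTester:
    def test_batch(self):
        pass

    def _test_dml(self):
        pass

async def test_audit():
    pass
'''

detected = find_test_functions(SAMPLE)
```

A scan like this reports the class helper `test_batch` alongside the real
test, while the underscore-prefixed `_test_dml` is skipped.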
2026-04-21 21:52:26 +02:00
Dario Mirovic
cf237e060a test: auth_cluster: use safe_driver_shutdown() for Cluster teardown
A handful of cassandra-driver Cluster.shutdown() call sites in the
auth_cluster tests were missed by the previous sweep that introduced
safe_driver_shutdown(), because the local variable holding the Cluster
is named "c" rather than "cluster".

Direct Cluster.shutdown() is racy: the driver's "Task Scheduler"
thread may raise RuntimeError ("cannot schedule new futures after
shutdown") during or after the call, occasionally failing tests.
safe_driver_shutdown() suppresses this expected RuntimeError and
joins the scheduler thread.

Replace the remaining c.shutdown() calls in:
  - test/cluster/auth_cluster/test_startup_response.py
  - test/cluster/auth_cluster/test_maintenance_socket.py
with safe_driver_shutdown(c) and add the corresponding import from
test.pylib.driver_utils.

No behavioral change to the tests; only the driver teardown is
hardened against a known driver-side race.

Fixes SCYLLADB-1662

Closes scylladb/scylladb#29576
2026-04-21 17:45:11 +02:00
Radosław Cybulski
6f7bf30a14 alternator: increase wait time to tablet sync
When forcing a tablet count change via a CQL command, the underlying
tablet machinery takes some time to adjust. The original code waited
at most 0.1s for tablet data to be synchronized. This seems not to be
enough on debug builds, so we add exponential backoff and increase the
maximum waiting time. The code now waits 0.1s the first time and doubles
the wait on each retry, for up to 6 waits, or about 6s in total.
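
The retry schedule described above can be sketched as follows. This is
not the test's actual code: `wait_with_backoff`, its parameters, and the
injectable `sleep` are hypothetical names chosen for illustration.

```python
import time


def wait_with_backoff(is_ready, initial_delay=0.1, max_waits=6, sleep=time.sleep):
    """Poll is_ready(), sleeping with a doubling delay between polls."""
    delay = initial_delay
    for _ in range(max_waits):
        if is_ready():
            return True
        sleep(delay)
        delay *= 2
    # One final check after the last wait.
    return is_ready()


# Record the delays instead of sleeping, to inspect the schedule.
recorded = []
never_ready = wait_with_backoff(lambda: False, sleep=recorded.append)
```

With these defaults the six waits are 0.1s, 0.2s, 0.4s, 0.8s, 1.6s, and
3.2s, which add up to 6.3s.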

Fixes: SCYLLADB-1655

Closes scylladb/scylladb#29573
2026-04-21 17:38:07 +02:00
Radosław Cybulski
74b523ea20 treewide: fix spelling errors.
Fix various spelling errors.

Closes scylladb/scylladb#29574
2026-04-21 18:20:26 +03:00
Piotr Dulikowski
cb8253067d Merge 'strong_consistency: fix crash when DROP TABLE races with in-flight DML' from Petr Gusev
When DROP TABLE races with an in-flight DML on a strongly-consistent
table, the node aborts in `groups_manager::acquire_server()` because the
raft group has already been erased from `_raft_groups`.

A concurrent `DROP TABLE` may have already removed the table from database
registries and erased the raft group via `schedule_raft_group_deletion`.
However, `schema.table()` in `create_operation_ctx()` might still succeed,
because someone may be holding an `lw_shared_ptr<table>`: the table has
been dropped but the table object is still alive.

Fix by accepting table_id in acquire_server and checking that the table
still exists in the database via `find_column_family` before looking up
the raft group.  If the table has been dropped, find_column_family
throws no_such_column_family instead of the node aborting via
on_internal_error.  When the table does exist, acquire_server proceeds
to acquire state.gate; schedule_raft_group_deletion co_awaits
gate::close, so it will wait for the DML operation to complete before
erasing the group.

backport: not needed (not released feature)

Fixes SCYLLADB-1450

Closes scylladb/scylladb#29430

* github.com:scylladb/scylladb:
  strong_consistency: fix crash when DROP TABLE races with in-flight DML
  test: add regression test for DROP TABLE racing with in-flight DML
2026-04-21 16:54:20 +02:00
Dario Mirovic
bcda39f716 test: audit: use set diff to identify new audit rows
assert_entries_were_added asserted that new audit rows always appear at
the tail of each per-node, event_time-sorted sequence. That invariant
is not a property of the audit feature: audit writes are asynchronous
with respect to query completion, and on a multi-node cluster QUORUM
reads of audit.audit_log can reveal a row with an older event_time
after a row with a newer one has already been observed.

Replace the positional tail slice with a per-node set difference
between the rows observed before and after the audited operation.
The wait_for retry loop, noise filtering, and final by-value
comparison against expected_entries are unchanged, so the test still
verifies the real contract, that the expected audit entries appear,
without relying on a visibility-ordering invariant that the audit log
does not guarantee.
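
The difference between the two approaches can be sketched like this. The
helper name `new_rows_per_node` and the row shape (event_time, operation)
are hypothetical and not taken from the test code.

```python
def new_rows_per_node(before, after):
    """Return, for each node, the rows present in `after` but not in `before`.

    Both arguments map a node name to an iterable of hashable row tuples.
    """
    return {
        node: set(rows) - set(before.get(node, ()))
        for node, rows in after.items()
    }


# Rows sorted by event_time; the new row (20) is older than the last row
# that had already been observed (30).
before = {"node1": [(10, "SELECT"), (30, "INSERT")]}
after = {"node1": [(10, "SELECT"), (20, "UPDATE"), (30, "INSERT")]}

added = new_rows_per_node(before, after)

# The old positional approach takes the tail of the sorted sequence and
# picks the wrong row here.
tail_slice = after["node1"][len(before["node1"]):]
```

Note that a plain set difference ignores duplicate rows; that is
acceptable in this sketch but is an assumption about the data.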

Fixes SCYLLADB-1589

Closes scylladb/scylladb#29567
2026-04-21 15:33:36 +02:00
Nadav Har'El
6165124fcc Merge 'cql3: statement_restrictions: analyze during prepare time' from Avi Kivity
The statement_restrictions code is responsible for analyzing the WHERE
clause, deciding on the query plan (which index to use), and extracting
the partition and clustering keys to use for the index.

Currently, it suffers from repetition in making its decisions: there are 15
calls to expr::visit in statement_restrictions.cc, and 14 find_binop calls. This
series reduces them to 2 visits (one nested in the other) and 6 find_binop calls.
The analysis of binary operators is done once, then reused.

The key data structure introduced is the predicate. While an expression
takes inputs from the row evaluated, constants, and bind variables, and
produces a boolean result, predicates ask which values for a column (or
a number of columns) are needed to satisfy (part of) the WHERE clause.
The WHERE clause is then expressed as a conjunction of such predicates.
The analyzer uses the predicates to select the index, then uses the predicates
to compute the partition and clustering keys.

The refactoring is composed of these parts (but patches from different parts
are interspersed):

1. an exhaustive regression test is added as the first commit, to ensure behavior doesn't change
2. move computation from query time to prepare time
3. introduce, gradually enrich, and use predicates to implement the statement_restrictions API

Major refactoring, and no bugs fixed, so definitely not backporting.

Closes scylladb/scylladb#29114

* github.com:scylladb/scylladb:
  cql3: statement_restrictions: replace has_eq_restriction_on_column with precomputed set
  cql3: statement_restrictions: replace multi_column_range_accumulator_builder with direct predicate iteration
  cql3: statement_restrictions: use predicate fields in build_get_clustering_bounds_fn
  cql3: statement_restrictions: remove extract_single_column_restrictions_for_column
  cql3: statement_restrictions: use predicate vectors in prepare_indexed_local
  cql3: statement_restrictions: use predicate vector size for clustering prefix length
  cql3: statement_restrictions: replace do_find_idx and is_supported_by with predicate-based versions
  cql3: statement_restrictions: remove expression-based has_supporting_index and index_supports_some_column
  cql3: statement_restrictions: replace multi-column and PK index support checks with predicate-based versions
  cql3: statement_restrictions: add predicate-based index support checking
  cql3: statement_restrictions: use pre-built single-column maps for index support checks
  cql3: statement_restrictions: build clustering-prefix restrictions incrementally
  cql3: statement_restrictions: build partition-range restrictions incrementally
  cql3: statement_restrictions: build clustering-key single-column restrictions map incrementally
  cql3: statement_restrictions: build partition-key single-column restrictions map incrementally
  cql3: statement_restrictions: build non-primary-key single-column restrictions map incrementally
  cql3: statement_restrictions: use tracked has_mc_clustering for _has_multi_column
  cql3: statement_restrictions: track has-token state incrementally
  cql3: statement_restrictions: track partition-key-empty state incrementally
  cql3: statement_restrictions: track first multi-column predicate incrementally
  cql3: statement_restrictions: track last clustering column incrementally
  cql3: statement_restrictions: track clustering-has-slice incrementally
  cql3: statement_restrictions: track has-multi-column-clustering incrementally
  cql3: statement_restrictions: track clustering-empty state incrementally
  cql3: statement_restrictions: replace restr bridge variable with pred.filter
  cql3: statement_restrictions: convert single-column branch to use predicate properties
  cql3: statement_restrictions: convert multi-column branch to use predicate properties
  cql3: statement_restrictions: convert constructor loop to iterate over predicates
  cql3: statement_restrictions: annotate predicates with operator properties
  cql3: statement_restrictions: annotate predicates with is_not_null and is_multi_column
  cql3: statement_restrictions: complete preparation early
  cql3: statement_restrictions: convert expressions to predicates without being directed at a specific column
  cql3: statement_restrictions: refine possible_lhs_values() function_call processing
  cql3: statement_restrictions: return nullptr for function solver if not token
  cql3: statement_restrictions: refine possible_lhs_values() subscript solving
  cql3: statement_restrictions: return nullptr from possible_lhs_values instead of on_internal_error
  cql3: statement_restrictions: convert possible_lhs_values into a solver
  cql3: statement_restrictions: split _where to boolean factors in preparation for predicates conversion
  cql3: statement_restrictions: refactor IS NOT NULL processing
  cql3: statement_restrictions: fold add_single_column_nonprimary_key_restriction() into its caller
  cql3: statement_restrictions: fold add_single_column_clustering_key_restriction() into its caller
  cql3: statement_restrictions: fold add_single_column_partition_key_restriction() into its caller
  cql3: statement_restrictions: fold add_token_partition_key_restriction() into its caller
  cql3: statement_restrictions: fold add_multi_column_clustering_key_restriction() into its caller
  cql3: statement_restrictions: avoid early return in add_multi_column_clustering_key_restrictions
  cql3: statement_restrictions: fold add_is_not_restriction() into its caller
  cql3: statement_restrictions: fold add_restriction() into its caller
  cql3: statement_restrictions: remove possible_partition_token_values()
  cql3: statement_restrictions: remove possible_column_values
  cql3: statement_restrictions: pass schema to possible_column_values()
  cql3: statement_restrictions: remove fallback path in solve()
  cql3: statement_restrictions: reorder possible_lhs_column parameters
  cql3: statement_restrictions: prepare solver for multi-column restrictions
  cql3: statement_restrictions: add solver for token restriction on index
  cql3: statement_restrictions: pre-analyze column in value_for()
  cql3: statement_restrictions: don't handle boolean constants in multi_column_range_accumulator_builder
  cql3: statement_restrictions: split range_from_raw_bounds into prepare phase and query phase
  cql3: statement_restrictions: adjust signature of range_from_raw_bounds
  cql3: statement_restrictions: split multi_column_range_accumulator into prepare-time and query-time phases
  cql3: statement_restrictions: make get_multi_column_clustering_bounds a builder
  cql3: statement_restrictions: multi-key clustering restrictions one layer deeper
  cql3: statement_restrictions: push multi-column post-processing into get_multi_column_clustering_bounds()
  cql3: statement_restrictions: pre-analyze single-column clustering key restrictions
  cql3: statement_restrictions: wrap value_for_index_partition_key()
  cql3: statement_restrictions: hide value_for()
  cql3: statement_restrictions: push down clustering prefix wrapper one level
  cql3: statement_restrictions: wrap functions that return clustering ranges
  cql3: statement_restrictions: do not pass view schema back and forth
  cql3: statement_restrictions: pre-analyze token range restrictions
  cql3: statement_restrictions: pre-analyze partition key columns
  cql3: statement_restrictions: do not collect subscripted partition key columns
  cql3: statement_restrictions: split _partition_range_restrictions into three cases
  cql3: statement_restrictions: move value_list, value_set to header file
  cql3: statement_restrictions: wrap get_partition_key_ranges
  cql3: statement_restrictions: prepare statement_restrictions for capturing `this`
  test: statement_restrictions: add index_selection regression test
2026-04-21 15:44:06 +03:00
Anna Stuchlik
d222e6e2a4 doc: document support for OCI Object Storage
This commit extends the object storage configuration section
with support for OCI Object Storage.

Fixes SCYLLADB-502

Closes scylladb/scylladb#29503
2026-04-21 15:11:58 +03:00
Botond Dénes
cfebe17592 sstables: fix segfault in parse_assert() when message is nullptr
parse_assert() accepts an optional `message` parameter that defaults
to nullptr. When the assertion fails and message is nullptr, it is
implicitly converted to sstring via the sstring(const char*) constructor,
which calls strlen(nullptr) -- undefined behavior that manifests as a
segfault in __strlen_evex.

This turns what should be a graceful malformed_sstable_exception into a
fatal crash. In the case of CUSTOMER-279, a corrupt SSTable triggered
parse_assert() during streaming (in continuous_data_consumer::
fast_forward_to()), causing a crash loop on the affected node.

Fix by guarding the nullptr case with a ternary, passing an empty
sstring() when message is null. on_parse_error() already handles
the empty-message case by substituting "parse_assert() failed".

Fixes: SCYLLADB-1329

Closes scylladb/scylladb#29285
2026-04-21 12:40:33 +02:00
Marcin Maliszkiewicz
935e6a495d Merge 'transport: add per-service-level cql_requests_serving metric' from Piotr Smaron
The existing scylla_transport_requests_serving metric is a single global per-shard gauge counting outstanding CQL requests. When debugging latency spikes, it's useful to know which service level is contributing the most in-flight requests.
This PR adds a new per-scheduling-group gauge scylla_transport_cql_requests_serving (with the scheduling_group_name label), using the existing cql_sg_stats per-SG infrastructure. The cql_ prefix is intentional — it follows the convention of all other per-SG transport metrics (cql_requests_count, cql_request_bytes, etc.) and avoids Prometheus confusion with the global requests_serving metric (which lacks the scheduling_group_name label).

Fixes: SCYLLADB-1340

New feature, no backport.

Closes scylladb/scylladb#29493

* github.com:scylladb/scylladb:
  transport: add per-service-level cql_requests_serving metric
  transport: move requests_serving decrement to after response is sent
2026-04-21 12:35:50 +02:00
Aleksandra Martyniuk
cd79b99112 test: fix flaky test_alter_tablets_rf_dc_drop by using read barrier
The test was reading system_schema.keyspaces from an arbitrary node
that may not have applied the latest schema change yet. Pin the read
to a specific node and issue a read barrier before querying, ensuring
the node has up-to-date data.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1643.

Closes scylladb/scylladb#29563
2026-04-21 09:12:51 +03:00
Raphael S. Carvalho
474e962e01 compaction: Restrict tombstone GC sstable set to repaired sstables for tombstone_gc=repair mode
When tombstone_gc=repair, the repaired compaction view's sstable_set_for_tombstone_gc()
previously returned all sstables across all three views (unrepaired, repairing, repaired).
This is correct but unnecessarily expensive: the unrepaired and repairing sets are never
the source of a GC-blocking shadow when tombstone_gc=repair, for base tables.

The key ordering guarantee that makes this safe is:
- topology_coordinator sends send_tablet_repair RPC and waits for it to complete.
  Inside that RPC, mark_sstable_as_repaired() runs on all replicas, moving D from
  repairing → repaired (repaired_at stamped on disk).
- Only after the RPC returns does the coordinator commit repair_time + sstables_repaired_at
  to Raft.
- gc_before = repair_time - propagation_delay only advances once that Raft commit applies.

Therefore, when a tombstone T in the repaired set first becomes GC-eligible (its
deletion_time < gc_before), any data D it shadows is already in the repaired set on
every replica. This holds because:
- The memtable is flushed before the repairing snapshot is taken (take_storage_snapshot
  calls sg->flush()), capturing all data present at repair time.
- Hints and batchlog are flushed before the snapshot, ensuring remotely-hinted writes
  arrive before the snapshot boundary.
- Legitimate unrepaired data has timestamps close to 'now', always newer than any
  GC-eligible tombstone (USING TIMESTAMP to write backdated data is user error / UB).

Excluding the repairing and unrepaired sets from the GC shadow check cannot cause any
tombstone to be wrongly collected. The memtable check is also skipped for the same
reason: memtable data is either newer than the GC-eligible tombstone, or was flushed
into the repairing/repaired set before gc_before advanced.

Safety restriction — materialized views:
The optimization IS applied to materialized view tables. Two possible paths could inject
D_view into the MV's unrepaired set after MV repair: view hints and staging via the
view-update-generator. Both are safe:

(1) View hints: flush_hints() creates a sync point covering BOTH _hints_manager (base
mutations) AND _hints_for_views_manager (view mutations). It waits until ALL pending view
hints — including D_view entries queued in _hints_for_views_manager while the target MV
replica was down — have been replayed to the target node before take_storage_snapshot() is
called. D_view therefore lands in the MV's repairing sstable and is promoted to repaired.
When a repaired compaction then checks for shadows it finds D_view in the repaired set,
keeping T_mv non-purgeable.

(2) View-update-generator staging path: Base table repair can write a missing D_base to a
replica via a staging sstable. The view-update-generator processes the staging sstable
ASYNCHRONOUSLY: it may fire arbitrarily later, even after MV repair has committed
repair_time and T_mv has been GC'd from the repaired set. However, the staging processor
calls stream_view_replica_updates() which performs a READ-BEFORE-WRITE via
as_mutation_source_excluding_staging(): it reads the CURRENT base table state before
building the view update. If T_base was written to the base table (as it always is before
the base replica can be repaired and the MV tombstone can become GC-eligible), the
view_update_builder sees T_base as the existing partition tombstone. D_base's row marker
(ts_d < ts_t) is expired by T_base, so the view update is a no-op: D_view is never
dispatched to the MV replica. No resurrection can occur regardless of how long staging is
delayed.

A potential sub-edge-case is T_base being purged BEFORE staging fires (leaving D_base as
the sole survivor, so stream_view_replica_updates would dispatch D_view). This is blocked
by an additional invariant: for tablet-based tables, the repair writer stamps repaired_at
on staging sstables (repair_writer_impl::create_writer sets mark_as_repaired = true and
perform_component_rewrite writes repaired_at = sstables_repaired_at + 1 on every staging
sstable). After base repair commits sstables_repaired_at to Raft, the staging sstable
satisfies is_repaired(sstables_repaired_at, staging_sst) and therefore appears in
make_repaired_sstable_set(). Any subsequent base repair that advances sstables_repaired_at
further still includes the staging sstable (its repaired_at ≤ new sstables_repaired_at).
D_base in the staging sstable thus shadows T_base in every repaired compaction's shadow
check, keeping T_base non-purgeable as long as D_base remains in staging.
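
The inclusion argument reduces to a monotonicity property of the repaired predicate. A minimal sketch, assuming the predicate is "repaired_at is set (non-zero) and repaired_at <= sstables_repaired_at"; the non-zero check, the plain-integer signature and the sample counter values are assumptions of this example, not taken from repair::is_repaired().

```python
def is_repaired(sstables_repaired_at: int, repaired_at: int) -> bool:
    # Assumed predicate: 0 means "never repaired"; otherwise the sstable counts
    # as repaired once the committed counter has caught up with its stamp.
    return repaired_at != 0 and repaired_at <= sstables_repaired_at


committed = 4                  # hypothetical sstables_repaired_at before the repair
staging_stamp = committed + 1  # repaired_at = sstables_repaired_at + 1

print(is_repaired(committed, staging_stamp))      # before the repair commits
print(is_repaired(committed + 1, staging_stamp))  # after the commit
print(is_repaired(committed + 5, staging_stamp))  # after later repairs
```

Once the predicate turns true it stays true for every larger sstables_repaired_at, which is why the staging sstable remains in the repaired set.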

A base table hint also cannot bypass this. A base hint is replayed as a base mutation. The
resulting view update is generated synchronously on the base replica and sent to the MV
replica via _hints_for_views_manager (path 1 above), not via staging.

USING TIMESTAMP with timestamps predating (gc_before + propagation_delay) is explicitly
undefined behavior (UB) and excluded from the safety argument.

For tombstone_gc modes other than repair (timeout, immediate, disabled) the invariant
does not hold for base tables either, so the full storage-group set is returned.

Implementation:
- Add compaction_group::is_repaired_view(v): pointer comparison against _repaired_view.
- Add compaction_group::make_repaired_sstable_set(): iterates _main_sstables and inserts
  only sstables classified as repaired (repair::is_repaired(sstables_repaired_at, sst)).
- Add storage_group::make_repaired_sstable_set(): collects repaired sstables across all
  compaction groups in the storage group.
- Add table::make_repaired_sstable_set_for_tombstone_gc(): collects repaired sstables from
  all compaction groups across all storage groups (needed for multi-tablet tables).
- Add compaction_group_view::skip_memtable_for_tombstone_gc(): returns true iff the
  repaired-only optimization is active; used by get_max_purgeable_timestamp() in
  compaction.cc to bypass the memtable shadow check.
- is_tombstone_gc_repaired_only() private helper gates both methods: requires
  is_repaired_view(this) && tombstone_gc_mode == repair. No is_view() exclusion.
- Add error injection "view_update_generator_pause_before_processing" in
  process_staging_sstables() to support testing the staging-delay scenario.
- New test test_tombstone_gc_mv_optimization_safe_via_hints: stops servers[2], writes
  D_base + T_base (view hints queued for servers[2]'s MV replica), restarts, runs MV
  tablet repair (flush_hints delivers D_view + T_mv before snapshot), triggers repaired
  compaction, and asserts the MV row is NOT visible — T_mv preserved because D_view
  landed in the repaired set via the hints-before-snapshot path.
- New test test_tombstone_gc_mv_safe_staging_processor_delay: runs base repair before
  writing T_base so D_base is staged on servers[0] via row-sync; blocks the
  view-update-generator with an error injection; writes T_base + T_mv; runs MV repair
  (fast path, T_mv GC-eligible); triggers repaired compaction (T_mv purged — no D_view
  in repaired set); asserts no resurrection; releases injection; waits for staging to
  complete; asserts no resurrection after a second flush+compaction. Demonstrates that
  the read-before-write in stream_view_replica_updates() makes the optimization safe even
  when staging fires after T_mv has been GC'd.

The expected gain is reduced bloom filter and memtable key-lookup I/O during repaired
compactions: the unrepaired set is typically the largest (it holds all recent writes),
yet for tombstone_gc=repair it never influences GC decisions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-20 16:59:09 -03:00
Ferenc Szili
a50aa7e689 test/cluster: wait for ready CQL in cross-rack merge test
test_tablet_merge_cross_rack_migrations() starts issuing DDL immediately
after adding the new cross-rack nodes. In the failing runs the driver is
still converging on the updated topology at that point, so the control
connection sees incomplete peer metadata while schema changes are in
flight.

That leaves a race where CREATE TABLE is sent during topology churn and
the test can surface a misleading AlreadyExists error even though the
table creation has already been committed. Use get_ready_cql(servers)
here so the test waits for inter-node visibility and CQL readiness
before creating the keyspace and table.

Fixes: SCYLLADB-1635

Closes scylladb/scylladb#29561
2026-04-20 20:12:11 +02:00
Łukasz Paszkowski
d18eb9479f cql/statement: Create keyspace_metadata with correct initial_tablets count
In `ks_prop_defs::as_ks_metadata(...)` a default initial tablets count
is set to 0, when tablets are enabled and the replication strategy
is NetworkTopologyStrategy.

This effectively sets _uses_tablets = false in abstract_replication_strategy
for the remaining strategies when no `tablets = {...}` options are specified.
As a consequence, it is possible to create vnode-based keyspaces even
when tablets are enforced with `tablets_mode_for_new_keyspaces`.

The patch sets the default initial tablets count to zero regardless of
the chosen replication strategy. Each replication strategy then
validates the options and raises a configuration exception when tablets
are not supported.

All tests are altered in the following way:
+ whenever it was correct, SimpleStrategy was replaced with NetworkTopologyStrategy
+ otherwise, tablets were explicitly disabled with ` AND tablets = {'enabled': false}`

Fixes https://github.com/scylladb/scylladb/issues/25340

Closes scylladb/scylladb#25342
2026-04-20 17:57:38 +03:00
Botond Dénes
69c58c6589 Merge 'streaming: add oos protection in mutation based streaming' from Łukasz Paszkowski
The mutation-fragment-based streaming path in `stream_session.cc` did not check whether the receiving node was in critical disk utilization mode before accepting incoming mutation fragments. This meant that operations like `nodetool refresh --load-and-stream`, which stream data through the `STREAM_MUTATION_FRAGMENTS` RPC handler, could push data onto a node that had already reached critical disk usage.

The file-based streaming path in stream_blob.cc already had this protection, but the load&stream path was missing it.

This patch adds a check for `is_in_critical_disk_utilization_mode()` in the `stream_mutation_fragments` handler in `stream_session.cc`, throwing a `replica::critical_disk_utilization_exception` when the node is at critical disk usage. This mirrors the existing protection in the blob streaming path and closes the gap that allowed data to be written to a node that should have been rejecting all incoming writes.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-901

The out of space prevention mechanism was introduced in 2025.4. The fix should be backported there and all later versions.

Closes scylladb/scylladb#28873

* github.com:scylladb/scylladb:
  streaming: reject mutation fragments on critical disk utilization
  test/cluster/storage: Add a reproducer for load-and-stream out-of-space rejection
  sstables: clean up TemporaryHashes file in wipe()
  sstables: add error injection point in write_components
  test/cluster/storage: extract validate_data_existence to module scope
  test/cluster: enable suppress_disk_space_threshold_checks in tests using data_file_capacity
  utils/disk_space_monitor: add error injection to suppress threshold checks
2026-04-20 17:56:36 +03:00
David Garcia
16ed338a89 Fix CODEOWNERS to cover nested docs subfolders
The `docs/*` pattern only matches files directly inside `docs/`,
not files in nested subfolders like `docs/folder_b/test.md` or
`docs/alternator/setup.md`. Those files currently have no code
owner assigned.

Replace with `/docs/` and `/docs/alternator/` which match the
directories and all their subdirectories recursively, per GitHub's
CODEOWNERS syntax.
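
The difference between the two pattern shapes can be sketched with a toy matcher. This is an illustration only: `matches` is a hypothetical helper that handles just the `dir/*` and `/dir/` shapes discussed here, and it is not GitHub's CODEOWNERS implementation.

```python
from pathlib import PurePosixPath


def matches(pattern: str, path: str) -> bool:
    """Toy matcher covering only the two pattern shapes discussed here."""
    if pattern.endswith("/*"):
        # "dir/*": files directly inside dir, no deeper nesting.
        return str(PurePosixPath(path).parent) == pattern[:-2].lstrip("/")
    if pattern.endswith("/"):
        # "/dir/": everything under dir, recursively.
        return path.startswith(pattern.lstrip("/"))
    raise ValueError("pattern shape not covered by this sketch")


for path in ("docs/README.md", "docs/alternator/setup.md"):
    print(path, matches("docs/*", path), matches("/docs/", path))
```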

Ref: https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-code-owners

Closes scylladb/scylladb#29521
2026-04-20 17:55:43 +03:00
Avi Kivity
5687a4840d conf: pair sstable_format=ms with column_index_size_in_kb=1
One of the advantages of Trie indexes (with sstable_format=ms) is that
the index is more compact, and more suitable for paging from disk
(fewer pages required per search). We can exploit it by setting
column_index_size_in_kb to 1 rather than 64, increasing the index file
size (and requiring more index pages to be loaded and parsed) in return
for smaller data file reads.

To test this, I created a 1M row partition with 300-byte rows, compacted
it into a single sstable, and tested reads to a single row.

With column_index_size_in_kb=64:

Rows.db file size 60k
3 pages read from Rows.db (4k each)
2x 32k read from Data.db

With column_index_size_in_kb=1:

Rows.db file size 2MB (33X)
5 pages read from Rows.db (4k each, 1.7X)
1x 4107 bytes read from Data.db (0.5X IOPS, 0.06X bandwidth)

Given that Rows.db will be typically cached, or at least all but one of the
levels (its size is 157X smaller than Data.db), we win on both IOPS
and bandwidth.
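
The ratios quoted above can be re-derived from the file sizes and read sizes reported in this message. A small sketch of the arithmetic; the variable names are made up for the example, and all numbers are taken from the raw data and traces.

```python
# File sizes in bytes, from the `ls -l` listing in the raw data.
ROWS_DB_BYTES = {"w1": 2_001_227, "w64": 59_989}
DATA_DB_BYTES_W1 = 314_964_958

# Per-query I/O, from the two traces.
ROWS_PAGES = {"w1": 5, "w64": 3}                      # 4k pages read from Rows.db
DATA_READS = {"w1": 1, "w64": 2}                      # reads issued against Data.db
DATA_READ_BYTES = {"w1": 4107, "w64": 2 * 32 * 1024}  # bytes requested from Data.db

index_growth = ROWS_DB_BYTES["w1"] / ROWS_DB_BYTES["w64"]
data_to_index = DATA_DB_BYTES_W1 / ROWS_DB_BYTES["w1"]
page_ratio = ROWS_PAGES["w1"] / ROWS_PAGES["w64"]
iops_ratio = DATA_READS["w1"] / DATA_READS["w64"]
bandwidth_ratio = DATA_READ_BYTES["w1"] / DATA_READ_BYTES["w64"]

print(f"Rows.db growth:        {index_growth:.1f}X")
print(f"Data.db / Rows.db:     {data_to_index:.1f}X")
print(f"Rows.db pages read:    {page_ratio:.1f}X")
print(f"Data.db IOPS:          {iops_ratio:.1f}X")
print(f"Data.db bandwidth:     {bandwidth_ratio:.2f}X")
```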

I would have expected the Data.db read to be closer to 1k, but this
is already an improvement.

Given that, set column_index_size_in_kb=1, but only for new clusters
where we also select sstable_format=ms.

Raw data (w1, w64 are working directories with different
column_index_size_in_kb):

```console
$ ls -l w*/data/bench/wide_partition-*/*{Rows,Data}.db
-rw-r--r-- 1 avi avi 314964958 Apr 19 16:17 w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Data.db
-rw-r--r-- 1 avi avi   2001227 Apr 19 16:17 w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db
-rw-r--r-- 1 avi avi 314963261 Apr 19 16:18 w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db
-rw-r--r-- 1 avi avi     59989 Apr 19 16:18 w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db
```

column_index_size_in_kb=64 trace:

```
cqlsh> SELECT * FROM bench.wide_partition WHERE pk = 0 AND ck = 654321 BYPASS CACHE;

 pk | ck     | v
----+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  0 | 654321 | 9OXdwmDHRapL2w5YruWLTOtiC3PKbyctSDdQ8YpuPKtWkSYBF10G7bKo2rdnxSAd52HLI21568YM7OwK05B6qAF7X2b6910qsJEA106QBEcFWQVybMCkxkpO4VDRcAVNLRgjB3vygcDBP17GBTb2s7l47UOloy3KtZ7J5YQgKcf7zlFSKGHa49vnRrzoXZCdYexOpix6jcSV2SiwRNqgv6XmYhx43ZwGa4zUtOe0eIKJj7KTxu5bzyWUWGW7US4NLFZRD8Vdb6EasIFkOfVKdiFp2LZHMXGRvtvdF93UTFUb

(1 rows)

Tracing session: 19219900-3bf3-11f1-bc43-c0a4e62b53d1

 activity                                                                                                                                                                                                                 | timestamp                        | source    | source_elapsed | client
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+-----------+----------------+-----------
                                                                                                                                                                                                       Execute CQL3 query |       2026-04-19 16:24:30.992000 | 127.0.0.1 |              0 | 127.0.0.1
                                                                                                                                                                                 Parsing a statement [shard 0/sl:default] | 2026-04-19 16:24:30.992643+00:00 | 127.0.0.1 |              1 | 127.0.0.1
                                                                                                                                            Processing a statement for authenticated user: anonymous [shard 0/sl:default] | 2026-04-19 16:24:30.992738+00:00 | 127.0.0.1 |             96 | 127.0.0.1
                                                                                                                                                               Executing read query (reversed false) [shard 0/sl:default] | 2026-04-19 16:24:30.992765+00:00 | 127.0.0.1 |            123 | 127.0.0.1
                        Creating read executor for token -3485513579396041028 with all: [cf134ebd-5f1b-4844-94e3-e5c7ad9421f0] targets: [cf134ebd-5f1b-4844-94e3-e5c7ad9421f0] repair decision: NONE [shard 0/sl:default] | 2026-04-19 16:24:30.992781+00:00 | 127.0.0.1 |            139 | 127.0.0.1
                                                                           Creating never_speculating_read_executor - speculative retry is disabled or there are no extra replicas to speculate with [shard 0/sl:default] | 2026-04-19 16:24:30.992782+00:00 | 127.0.0.1 |            140 | 127.0.0.1
                                                                                                                                                                         read_data: querying locally [shard 0/sl:default] | 2026-04-19 16:24:30.992795+00:00 | 127.0.0.1 |            153 | 127.0.0.1
                                                                                                                            Start querying singular range {{-3485513579396041028, pk{000400000000}}} [shard 0/sl:default] | 2026-04-19 16:24:30.992801+00:00 | 127.0.0.1 |            160 | 127.0.0.1
                                                                                                                                      [reader concurrency semaphore sl:default] admitted immediately [shard 0/sl:default] | 2026-04-19 16:24:30.992805+00:00 | 127.0.0.1 |            163 | 127.0.0.1
                                                                                                                                            [reader concurrency semaphore sl:default] executing read [shard 0/sl:default] | 2026-04-19 16:24:30.992814+00:00 | 127.0.0.1 |            172 | 127.0.0.1
                        Reading key {-3485513579396041028, pk{000400000000}} from sstable w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db [shard 0/sl:default] | 2026-04-19 16:24:30.992837+00:00 | 127.0.0.1 |            195 | 127.0.0.1
                                         page cache miss: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Partitions.db, page=0, readahead=1 [shard 0/sl:default] | 2026-04-19 16:24:30.992851+00:00 | 127.0.0.1 |            209 | 127.0.0.1
                                              page cache miss: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db, page=14, readahead=1 [shard 0/sl:default] | 2026-04-19 16:24:30.995294+00:00 | 127.0.0.1 |           2653 | 127.0.0.1
                                                            page cache hit: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db, page=14 [shard 0/sl:default] | 2026-04-19 16:24:30.995375+00:00 | 127.0.0.1 |           2733 | 127.0.0.1
                                               page cache miss: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db, page=2, readahead=1 [shard 0/sl:default] | 2026-04-19 16:24:30.995376+00:00 | 127.0.0.1 |           2734 | 127.0.0.1
                                                            page cache hit: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db, page=14 [shard 0/sl:default] | 2026-04-19 16:24:30.995463+00:00 | 127.0.0.1 |           2821 | 127.0.0.1
                                                             page cache hit: file=w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Rows.db, page=2 [shard 0/sl:default] | 2026-04-19 16:24:30.995463+00:00 | 127.0.0.1 |           2821 | 127.0.0.1
                              w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db: scheduling bulk DMA read of size 32768 at offset 206057984 [shard 0/sl:default] | 2026-04-19 16:24:30.995471+00:00 | 127.0.0.1 |           2829 | 127.0.0.1
                              w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db: scheduling bulk DMA read of size 32768 at offset 206090752 [shard 0/sl:default] | 2026-04-19 16:24:30.995475+00:00 | 127.0.0.1 |           2833 | 127.0.0.1
 w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db: finished bulk DMA read of size 32768 at offset 206057984, successfully read 32768 bytes [shard 0/sl:default] | 2026-04-19 16:24:30.995586+00:00 | 127.0.0.1 |           2945 | 127.0.0.1
                            Page stats: 1 partition(s) (1 live, 0 dead), 0 static row(s) (0 live, 0 dead), 1 clustering row(s) (1 live, 0 dead), 0 range tombstone(s) and 1 cell(s) (1 live, 0 dead) [shard 0/sl:default] | 2026-04-19 16:24:30.995637+00:00 | 127.0.0.1 |           2995 | 127.0.0.1
 w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db: finished bulk DMA read of size 32768 at offset 206090752, successfully read 32768 bytes [shard 0/sl:default] | 2026-04-19 16:24:30.995645+00:00 | 127.0.0.1 |           3003 | 127.0.0.1
                                                                                                                                                                                    Querying is done [shard 0/sl:default] | 2026-04-19 16:24:30.995653+00:00 | 127.0.0.1 |           3012 | 127.0.0.1
                                                                                                                                                                Done processing - preparing a result [shard 0/sl:default] | 2026-04-19 16:24:30.995670+00:00 | 127.0.0.1 |           3028 | 127.0.0.1
                                                                                                                                                                                                         Request complete |       2026-04-19 16:24:30.995039 | 127.0.0.1 |           3039 | 127.0.0.1

                              w64/data/bench/wide_partition-69d6adb03bf111f1865f3b0b343d3479/ms-3gzp_10y7_514282x1o2bojimy0q-big-Data.db: scheduling bulk DMA read of size 32768 at offset 206090752 [shard 0/sl:default] | 2026-04-19 16:22:43.107215+00:00 | 127.0.0.1 |           8685 | 127.0.0.1
```

column_index_size_in_kb=1 trace:

```
cqlsh> SELECT * FROM bench.wide_partition WHERE pk = 0 AND ck = 654321 BYPASS CACHE;

 pk | ck     | v
----+--------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
  0 | 654321 | FIA7X52ZqYwvDxEGlmWJUSy1I94WTuWZTdLwXr9HBQ90RJLqYKr5nInTADSI6hzofwawaXphAQK07YMoyzFfRaGeKPQPKUb35XpLEGvLJ4xu9r4es8wUEHPXaFBGdMcWUkyDJSTYCFzZAPCzUHEuPJHMXVrI6UExWrIR0Xujg4GZa9UciU9rbEvrSBwSzoPEfbXJ6qZSGiTD8gcXz5kdAblLxsAeWug8tZqslsTu04HMLKfZ8WopQvHbpR6YlGSnM99CiBgz30LMmllULV4VA4u9kMpzsRV2IE2tKmJOddEl

(1 rows)

Tracing session: 3953a1f0-3bf3-11f1-b976-4a3dc2a7a57f

 activity                                                                                                                                                                                                              | timestamp                        | source    | source_elapsed | client
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------+-----------+----------------+-----------
                                                                                                                                                                                                    Execute CQL3 query |       2026-04-19 16:25:25.007000 | 127.0.0.1 |              0 | 127.0.0.1
                                                                                                                                                                              Parsing a statement [shard 0/sl:default] | 2026-04-19 16:25:25.007423+00:00 | 127.0.0.1 |              1 | 127.0.0.1
                                                                                                                                         Processing a statement for authenticated user: anonymous [shard 0/sl:default] | 2026-04-19 16:25:25.007511+00:00 | 127.0.0.1 |             89 | 127.0.0.1
                                                                                                                                                            Executing read query (reversed false) [shard 0/sl:default] | 2026-04-19 16:25:25.007536+00:00 | 127.0.0.1 |            114 | 127.0.0.1
                     Creating read executor for token -3485513579396041028 with all: [e7bd75e7-6d2a-46dc-9f66-430524f40e0d] targets: [e7bd75e7-6d2a-46dc-9f66-430524f40e0d] repair decision: NONE [shard 0/sl:default] | 2026-04-19 16:25:25.007551+00:00 | 127.0.0.1 |            129 | 127.0.0.1
                                                                        Creating never_speculating_read_executor - speculative retry is disabled or there are no extra replicas to speculate with [shard 0/sl:default] | 2026-04-19 16:25:25.007553+00:00 | 127.0.0.1 |            131 | 127.0.0.1
                                                                                                                                                                      read_data: querying locally [shard 0/sl:default] | 2026-04-19 16:25:25.007556+00:00 | 127.0.0.1 |            134 | 127.0.0.1
                                                                                                                         Start querying singular range {{-3485513579396041028, pk{000400000000}}} [shard 0/sl:default] | 2026-04-19 16:25:25.007562+00:00 | 127.0.0.1 |            139 | 127.0.0.1
                                                                                                                                   [reader concurrency semaphore sl:default] admitted immediately [shard 0/sl:default] | 2026-04-19 16:25:25.007564+00:00 | 127.0.0.1 |            142 | 127.0.0.1
                                                                                                                                         [reader concurrency semaphore sl:default] executing read [shard 0/sl:default] | 2026-04-19 16:25:25.007573+00:00 | 127.0.0.1 |            151 | 127.0.0.1
                      Reading key {-3485513579396041028, pk{000400000000}} from sstable w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Data.db [shard 0/sl:default] | 2026-04-19 16:25:25.007594+00:00 | 127.0.0.1 |            172 | 127.0.0.1
                                       page cache miss: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Partitions.db, page=0, readahead=1 [shard 0/sl:default] | 2026-04-19 16:25:25.007607+00:00 | 127.0.0.1 |            184 | 127.0.0.1
                                           page cache miss: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=488, readahead=1 [shard 0/sl:default] | 2026-04-19 16:25:25.016029+00:00 | 127.0.0.1 |           8607 | 127.0.0.1
                                                         page cache hit: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=488 [shard 0/sl:default] | 2026-04-19 16:25:25.016109+00:00 | 127.0.0.1 |           8687 | 127.0.0.1
                                           page cache miss: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=486, readahead=1 [shard 0/sl:default] | 2026-04-19 16:25:25.016111+00:00 | 127.0.0.1 |           8688 | 127.0.0.1
                                           page cache miss: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=285, readahead=1 [shard 0/sl:default] | 2026-04-19 16:25:25.016176+00:00 | 127.0.0.1 |           8754 | 127.0.0.1
                                                         page cache hit: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=488 [shard 0/sl:default] | 2026-04-19 16:25:25.016260+00:00 | 127.0.0.1 |           8838 | 127.0.0.1
                                                         page cache hit: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=486 [shard 0/sl:default] | 2026-04-19 16:25:25.016261+00:00 | 127.0.0.1 |           8839 | 127.0.0.1
                                                         page cache hit: file=w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Rows.db, page=285 [shard 0/sl:default] | 2026-04-19 16:25:25.016261+00:00 | 127.0.0.1 |           8839 | 127.0.0.1
                             w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Data.db: scheduling bulk DMA read of size 4107 at offset 206086656 [shard 0/sl:default] | 2026-04-19 16:25:25.016268+00:00 | 127.0.0.1 |           8846 | 127.0.0.1
 w1/data/bench/wide_partition-e0b436a03bf111f18587cc3d55b31baf/ms-3gzp_10x9_373io213ox3uf4irhr-big-Data.db: finished bulk DMA read of size 4107 at offset 206086656, successfully read 4608 bytes [shard 0/sl:default] | 2026-04-19 16:25:25.016340+00:00 | 127.0.0.1 |           8918 | 127.0.0.1
                         Page stats: 1 partition(s) (1 live, 0 dead), 0 static row(s) (0 live, 0 dead), 1 clustering row(s) (1 live, 0 dead), 0 range tombstone(s) and 1 cell(s) (1 live, 0 dead) [shard 0/sl:default] | 2026-04-19 16:25:25.016367+00:00 | 127.0.0.1 |           8945 | 127.0.0.1
                                                                                                                                                                                 Querying is done [shard 0/sl:default] | 2026-04-19 16:25:25.016385+00:00 | 127.0.0.1 |           8963 | 127.0.0.1
                                                                                                                                                             Done processing - preparing a result [shard 0/sl:default] | 2026-04-19 16:25:25.016401+00:00 | 127.0.0.1 |           8979 | 127.0.0.1
                                                                                                                                                                                                      Request complete |       2026-04-19 16:25:25.015989 | 127.0.0.1 |           8989 | 127.0.0.1
```

Closes scylladb/scylladb#29552
2026-04-20 17:53:56 +03:00
Marcin Maliszkiewicz
e414b2b0b9 test/cluster: scale failure_detector_timeout_in_ms by build mode
Six cluster test files override failure_detector_timeout_in_ms to 2000ms
for faster failure detection. In debug and sanitize builds, this causes
flaky node join failures. The following log analysis shows how.

The coordinator (server 614, IP 127.2.115.3) accepts the joining node
(server 615, host_id 53b01f0b, IP 127.2.115.2) into group0:

  20:10:57,049 [shard 0] raft_group0 - server 614 entered
    'join group0' transition state for 53b01f0b

The joining node begins receiving the raft snapshot 100ms later:

  20:10:57,150 [shard 0] raft_group0 - transfer snapshot from 9fa48539

It then spends ~280ms applying schema changes -- creating 6 keyspaces
and 12+ tables from the snapshot:

  20:10:57,511 [shard 0] migration_manager - Creating keyspace
    system_auth_v2
  ...
  20:10:57,788 [shard 0] migration_manager - Creating
    system_auth_v2.role_members

Meanwhile, the coordinator's failure detector pings the joining node.
Under debug+ASan load the RPC call times out after ~4.6 seconds:

  20:11:01,643 [shard 0] direct_failure_detector - unexpected exception
    when pinging 53b01f0b: seastar::rpc::timeout_error
    (rpc call timed out)

25ms later, the coordinator marks the joining node DOWN and removes it:

  20:11:01,668 [shard 0] raft_group0 - failure_detector_loop:
    Mark node 53b01f0b as DOWN
  20:11:01,717 [shard 0] raft_group0 - bootstrap: failed to accept
    53b01f0b

The joining node was still retrying the snapshot transfer at that point:

  20:11:01,745 [shard 0] raft_group0 - transfer snapshot from 9fa48539

It then receives the ban notification and aborts:

  20:11:01,844 [shard 0] raft_group0 - received notification of being
    banned from the cluster
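
The intervals quoted in the analysis above follow directly from the log timestamps. A short sketch of that arithmetic; `elapsed_s` is a helper written for this example, and the timestamps are copied from the log excerpts.

```python
from datetime import datetime

FMT = "%H:%M:%S,%f"  # the log uses a comma before the milliseconds


def elapsed_s(start: str, end: str) -> float:
    """Seconds between two same-day log timestamps."""
    return (datetime.strptime(end, FMT) - datetime.strptime(start, FMT)).total_seconds()


join_accepted = "20:10:57,049"
ping_timeout = "20:11:01,643"

print(elapsed_s(join_accepted, "20:10:57,150"))   # until snapshot transfer starts
print(elapsed_s("20:10:57,511", "20:10:57,788"))  # schema application
print(elapsed_s(join_accepted, ping_timeout))     # until the ping times out
print(elapsed_s(ping_timeout, "20:11:01,668"))    # until the node is marked DOWN
```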

Replace the hardcoded 2000ms with the failure_detector_timeout fixture
from conftest.py, which scales by MODES_TIMEOUT_FACTOR: 3x for
debug/sanitize (6000ms), 2x for dev (4000ms), 1x for release (2000ms).
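
The scaling itself is a single multiplication. A minimal sketch, assuming the factors listed above; the function name and the dictionary literal are written for this example and are not copied from conftest.py.

```python
BASE_TIMEOUT_MS = 2000  # the value the tests previously hardcoded

# Factors as listed in this commit message (assumed layout, not the real table).
MODES_TIMEOUT_FACTOR = {"debug": 3, "sanitize": 3, "dev": 2, "release": 1}


def scaled_failure_detector_timeout_ms(mode: str) -> int:
    """Failure detector timeout for a build mode, scaled from the 2000ms base."""
    return BASE_TIMEOUT_MS * MODES_TIMEOUT_FACTOR[mode]


for mode in MODES_TIMEOUT_FACTOR:
    print(mode, scaled_failure_detector_timeout_ms(mode))
```

With the debug factor the timeout becomes 6000ms, which exceeds the ~4.6 seconds observed in the failing run.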

Test measurements (before -> after fix):

  debug mode:
  test_replace_with_same_ip_twice           24.02s ->  25.02s
  test_banned_node_notification            217.22s -> 221.72s
  test_kill_coordinator_during_op          116.11s -> 127.13s
  test_node_failure_during_tablet_migration
    [streaming-source]                     183.25s -> 192.69s
  test_replace (4 tests)        skipped in debug (skip_in_debug)
  test_raft_replace_ignore_nodes  skipped in debug (run_in_dev only)

  dev mode:
  test_replace_different_ip                 10.51s ->  11.50s
  test_replace_different_ip_using_host_id   10.01s ->  12.01s
  test_replace_reuse_ip                     10.51s ->  12.03s
  test_replace_reuse_ip_using_host_id       13.01s ->  12.01s
  test_raft_replace_ignore_nodes            19.52s ->  19.52s
2026-04-20 15:28:34 +02:00
Marcin Maliszkiewicz
99ac36b353 test/cluster: add failure_detector_timeout fixture
Add a shared pytest fixture that scales the failure detector timeout
by build mode factor (e.g. 3x for debug/sanitize, 2x for dev).
2026-04-20 15:28:33 +02:00
Marcin Maliszkiewicz
c136b2e640 audit: drop sstring temporaries on the will_log() fast path
audit::will_log() is called for every CQL/Alternator request. With
non-empty keyspace it does:

    _audited_keyspaces.find(sstring(keyspace))
    should_log_table(sstring(keyspace), sstring(table))

constructing three temporary sstrings from the std::string_view
arguments on every call. Now that the underlying associative containers
use std::less<> as comparator (previous commit), find() accepts the
string_view directly. Switch should_log_table() to take string_view as
well so the temporaries disappear entirely.

For short keyspace names the temporaries stay in SSO so allocs/op is
unchanged at 58.1, but each construction still costs ~60 instructions.

perf-simple-query --smp 1 --duration 15 --audit "table"
                  --audit-keyspaces "ks-non-existing"
                  --audit-categories "DCL,DDL,AUTH,DML,QUERY"

build: --mode=release --use-profile="" (no PGO)

Before (regression introduced in 9646ee05bd):
    instructions_per_op: 36952

After:
    instructions_per_op: 36768

Brings insns/op back to the pre-regression baseline 3d0582d51e
(insns/op ~36777) within the per-run noise of ~15 insns standard
deviation, eliminating the ~180 insns/op regression.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1616
2026-04-20 15:18:22 +02:00
Marcin Maliszkiewicz
724b9e66ea audit: enable heterogeneous lookup on audited keyspaces/tables
Replace the bare std::set<sstring>/std::map<sstring, std::set<sstring>>
member types with named aliases that use std::less<> as the comparator.
The transparent comparator enables heterogeneous lookup with
string_view keys.

This commit is a pure refactor with no behavioral change: the parser
return types, constructor parameters, observer template instantiations,
and start_audit() locals are all updated to use the aliases.
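
The heterogeneous lookup described above can be sketched as follows. This is a minimal illustration, not the audit code itself: std::string stands in for sstring, and the alias and function names (keyspace_set, table_map, is_audited_keyspace, is_audited_table) are hypothetical.

```cpp
#include <functional>
#include <map>
#include <set>
#include <string>
#include <string_view>

// Hypothetical aliases; std::string stands in for sstring.
// std::less<> is a transparent comparator, so find() accepts any type
// that can be compared against the stored key.
using keyspace_set = std::set<std::string, std::less<>>;
using table_map = std::map<std::string, std::set<std::string, std::less<>>, std::less<>>;

// Looks up a keyspace by string_view without building a temporary string.
inline bool is_audited_keyspace(const keyspace_set& keyspaces, std::string_view ks) {
    return keyspaces.find(ks) != keyspaces.end();
}

// Same idea for the keyspace -> tables map: both lookups take string_view.
inline bool is_audited_table(const table_map& tables, std::string_view ks, std::string_view table) {
    auto it = tables.find(ks);
    return it != tables.end() && it->second.find(table) != it->second.end();
}
```

With the default std::less<std::string> comparator, the same find() calls would need a std::string argument, which is where the temporaries came from.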
2026-04-20 15:14:58 +02:00
Marcin Maliszkiewicz
9f11920b15 Merge 'alternator: fix remaining problems with new Stream ARN format' from Nadav Har'El
This small series includes a few followups to the patch that changed Alternator Stream ARNs from using our own UUID format to something that resembles Amazon's Stream ARNs (and that the KCL library won't reject as bogus-looking).

The first patch is the most important one, fixing ListStreams's LastEvaluatedStreamArn to also use the new ARN format. It fixes SCYLLADB-539.

The following patches are additional cleanups and tests for the new ARN code.

Closes scylladb/scylladb#29474

* github.com:scylladb/scylladb:
  alternator: fix ListStreams paging if table is deleted during paging
  test/alternator: test DescribeStream on non-existent table
  alternator: ListStreams: on last page, avoid LastEvaluatedStreamArn
  alternator: remove dead code stream_shard_id
  alternator: fix ListStreams to return real ARN as LastEvaluatedStreamArn
2026-04-20 14:42:28 +02:00
Raphael S. Carvalho
a50e6215aa test/repair: Add tombstone GC safety tests for incremental repair
Add three cluster tests that verify no data resurrection occurs when
tombstone GC runs on the repaired sstable set under incremental repair
with tombstone_gc=repair mode.

All tests use propagation_delay_in_seconds=0 so that tombstones become
GC-eligible immediately after repair_time is committed (gc_before =
repair_time), allowing the scenarios to exercise the actual GC eligibility
path without artificial sleeps.

  (test_tombstone_gc_no_resurrection_basic_ordering)

Data D (ts=1) and tombstone T (ts=2) are written to all replicas and
flushed before repair.  Repair captures both in the repairing snapshot
and promotes them to repaired.  Once repair_time is committed, T is
GC-eligible (T.deletion_time < gc_before = repair_time).

The test verifies that compaction on the repaired set does NOT purge T,
because D is already in repaired (mark_sstable_as_repaired() completes
on all replicas before repair_time is committed to Raft) and clamps
max_purgeable to D.timestamp=1 < T.timestamp=2.

  (test_tombstone_gc_no_resurrection_hints_flush_failure)

The repair_flush_hints_batchlog_handler_bm_uninitialized injection causes
hints flush to fail on one node.  When hints flush fails, flush_time stays
at gc_clock::time_point{} (epoch).  This propagates as repair_time=epoch
committed to system.tablets, so gc_before = epoch - propagation_delay is
effectively the minimum possible time.  No tombstone has a deletion_time
older than epoch, so T is never GC-eligible from this repair.

The test verifies that repair_time does not advance to a meaningful value
after a failed hints flush, and that compaction on the repaired set does
not purge T (key remains deleted, no resurrection).

  (test_tombstone_gc_no_resurrection_propagation_delay)

Simulates a write D carrying an old CQL USING TIMESTAMP (ts_d = now-2h)
that was stored as a hint while a replica was down, and a tombstone T
with a higher timestamp (ts_t = now-90min, ts_t > ts_d) that was written
to all live replicas.  After the replica restarts, repair flushes hints
synchronously before taking the repairing snapshot, guaranteeing D is
delivered and captured in repairing before the snapshot.

After mark_sstable_as_repaired() promotes D to repaired, the coordinator
commits repair_time.  gc_before = repair_time > T.deletion_time so T is
GC-eligible.  The test verifies that compaction on the repaired set does
NOT purge T: D (ts_d < ts_t) is already in repaired, clamping
max_purgeable = ts_d < ts_t = T.timestamp, so T is not purgeable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-20 09:09:39 -03:00
Wojciech Mitros
6011cb8a4c db/view: track range tombstones in update stream during view update building
The view update builder ignored range tombstone changes from the update
stream when all existing mutation fragments were already consumed.
The old code assumed range tombstones 'remove nothing pre-existing, so
we can ignore it', but this failed to update _update_current_tombstone.
Consequently, when a range delete and an insert within that range appeared
in the same batch, the range tombstone was not applied to the inserted row,
or was applied to a row outside the range that it covered, causing the row
to incorrectly survive or be deleted in the materialized view.

Fix by handling is_range_tombstone_change() fragments in the update-only
branch, updating _update_current_tombstone so subsequent clustering rows
correctly have the range tombstone applied to them.
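
The idea of tracking the current tombstone while consuming only the update stream can be sketched like this. It is a simplified model under stated assumptions, not the view_update_builder code: fragment_sketch, no_tombstone and surviving_rows are hypothetical names, and a row is assumed to survive only if its write timestamp is strictly greater than the active tombstone's timestamp.

```cpp
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical, simplified fragment: either a range tombstone change or a row.
struct fragment_sketch {
    enum class kind { range_tombstone_change, clustering_row } type;
    int key;                 // clustering position of the row or of the change
    std::int64_t timestamp;  // row write timestamp, or the new tombstone timestamp
};

// Marks "no active range tombstone" (assumption of this sketch).
constexpr std::int64_t no_tombstone = std::numeric_limits<std::int64_t>::min();

// Walks the update stream in clustering order and returns the keys of rows
// that are not covered by the range tombstone active at their position.
inline std::vector<int> surviving_rows(const std::vector<fragment_sketch>& update_stream) {
    std::int64_t current_tombstone = no_tombstone;
    std::vector<int> live;
    for (const auto& f : update_stream) {
        if (f.type == fragment_sketch::kind::range_tombstone_change) {
            // Skipping this update is the kind of bug the commit describes:
            // later rows would then be checked against stale tombstone state.
            current_tombstone = f.timestamp;
            continue;
        }
        if (f.timestamp > current_tombstone) {
            live.push_back(f.key);
        }
    }
    return live;
}
```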

Fixes SCYLLADB-1555

Closes scylladb/scylladb#29483
2026-04-20 13:38:52 +02:00
Wojciech Mitros
073710a661 view: apply existing range tombstones after exhausting the update reader
When view_update_builder::on_results() hits the path where the update
fragment reader is already exhausted, it still needs to keep tracking
existing range tombstones and apply them to encountered rows.
Otherwise a row covered by an existing range tombstone can appear
alive while generating the view update and create a spurious view row.

Update the existing tombstone state even on the exhausted-reader path
and apply the effective tombstone to clustering rows before generating
the row tombstone update. Add a cqlpy regression test covering the
partition-delete-after-range-tombstone case.

Fixes: SCYLLADB-1554

Closes scylladb/scylladb#29481
2026-04-20 13:29:05 +02:00
Dario Mirovic
40740104ab test: use DROP KEYSPACE IF EXISTS in new_test_keyspace cleanup
The new_test_keyspace context manager in test/cluster/util.py uses
DROP KEYSPACE without IF EXISTS during cleanup. The Python driver
has a known bug (scylladb/python-driver#317) where connection pool
renewal after concurrent node bootstraps causes double statement
execution. The DROP succeeds server-side, but the response is lost
when the old pool is closed. The driver retries on the new pool, and
gets a ConfigurationException with the message "Cannot drop non existing keyspace".

The CREATE KEYSPACE in create_new_test_keyspace already uses IF NOT
EXISTS as a workaround for the same driver bug. This patch applies
the same approach to fix DROP KEYSPACE.

Fixes SCYLLADB-1538

Closes scylladb/scylladb#29487
2026-04-20 12:51:17 +02:00
Gleb Natapov
133768a1f0 db/system_distributed_keyspace: remove system_distributed_everywhere since it is unused
2026-04-20 12:52:25 +03:00
Botond Dénes
ad7647c3c7 test/commitlog: reduce resource usage in test_commitlog_handle_replayed_segments
The test was using max_size_mb = 8*1024 (8 GB) with 100 iterations,
causing it to create up to 260 files of 32 MB each per iteration via
fallocate. On a loaded CI machine this totals hundreds of GB of file
operations, easily exceeding the 15-minute test timeout (SCYLLADB-1496).

The test only needs enough files to verify that delete_segments keeps
the disk footprint within [shard_size, shard_size + seg_size]. Reduce
max_size_mb to 128 (8 files of 32 MB per iteration) and the iteration
count to 10, which is sufficient to exercise the serialized-deletion
and recycle logic without imposing excessive I/O load.

Closes scylladb/scylladb#29510
2026-04-20 11:02:25 +03:00
Ernest Zaslavsky
e5e6608f20 sstables_loader: prevent use-after-free on table drop during streaming
sstables_loader::load_and_stream holds a replica::table& reference via
the sstable_streamer for the entire streaming operation.  If the table
is dropped concurrently (e.g. DROP TABLE or DROP KEYSPACE), the
reference becomes dangling and the next access crashes with SEGV.

This was observed in a longevity-50gb-12h-master test run where a
keyspace was dropped while load_and_stream was still streaming SSTables
from a previous batch.

Fix by acquiring a stream_in_progress() phaser guard in load_and_stream
before creating the streamer.  table::stop() calls
_pending_streams_phaser.close() which blocks until all outstanding
guards are released, keeping the table alive for the duration of the
streaming operation.
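
The guard pattern can be illustrated with a small single-threaded sketch. This is not the Seastar phaser or the replica::table code: simple_phaser is a hypothetical stand-in, and a completion callback replaces the future that a real close() would return.

```cpp
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical stand-in for a phaser: counts outstanding operations and
// defers "closing" until every guard has been released.
class simple_phaser {
    std::size_t _outstanding = 0;
    std::function<void()> _on_drained;
public:
    // RAII guard: holding one keeps the owner from completing its stop.
    class guard {
        simple_phaser* _p;
    public:
        explicit guard(simple_phaser& p) : _p(&p) { ++_p->_outstanding; }
        guard(guard&& o) noexcept : _p(std::exchange(o._p, nullptr)) {}
        guard(const guard&) = delete;
        guard& operator=(const guard&) = delete;
        guard& operator=(guard&&) = delete;
        ~guard() { if (_p) { _p->release(); } }
    };

    guard start() { return guard(*this); }

    // Runs on_drained once no guards remain (immediately if none are held).
    void close(std::function<void()> on_drained) {
        if (_outstanding == 0) {
            on_drained();
        } else {
            _on_drained = std::move(on_drained);
        }
    }
private:
    void release() {
        if (--_outstanding == 0 && _on_drained) {
            std::exchange(_on_drained, nullptr)();
        }
    }
};
```

In this model, the streaming operation would take a guard before touching the table, and the table's stop path would call close() and destroy the table only from the completion callback.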

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1352

Closes scylladb/scylladb#29403
2026-04-20 07:39:51 +03:00
Benny Halevy
34adb0e069 test/cluster/dtest: fix test_scrub_static_table flakiness
Pass jvm_args=["--smp", "1"] on both cluster.start() calls to
ensure consistent shard count across restarts, avoiding resharding
on restart. Also pass wait_for_binary_proto=True to cluster.start()
to ensure the CQL port is ready before connecting.

Fixes: SCYLLADB-824

Closes scylladb/scylladb#29548
2026-04-20 06:53:49 +03:00
Avi Kivity
d584bd7358 cql3: statement_restrictions: replace has_eq_restriction_on_column with precomputed set
has_eq_restriction_on_column() walked expression trees at prepare time to
find binary_operators with op==EQ that mention a given column on the LHS.
Its only caller is ORDER BY validation in select_statement, which checks
that clustering columns without an explicit ordering have an EQ restriction.

Replace the 50-line expression-walking free function with a precomputed
unordered_set<const column_definition*> (_columns_with_eq) populated during
the main predicate loop in analyze_statement_restrictions.  For single-column
EQ predicates the column is taken from on_column; for multi-column EQ like
(ck1, ck2) = (1, 2), all columns in on_clustering_key_prefix are included.

The member function becomes a single set::contains() call.
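
A minimal sketch of the precomputed-set approach, assuming simplified stand-in types (column_definition_sketch, predicate_sketch and the helper names are hypothetical, not the real cql3 declarations):

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical, simplified stand-ins for the real ScyllaDB types.
struct column_definition_sketch {
    std::string name;
};

struct predicate_sketch {
    bool equality;
    // One column for `c = 1`; several for `(ck1, ck2) = (1, 2)`.
    std::vector<const column_definition_sketch*> columns;
};

using eq_column_set = std::unordered_set<const column_definition_sketch*>;

// Single pass over the predicates, done once at prepare time.
inline eq_column_set collect_columns_with_eq(const std::vector<predicate_sketch>& preds) {
    eq_column_set result;
    for (const auto& pred : preds) {
        if (pred.equality) {
            result.insert(pred.columns.begin(), pred.columns.end());
        }
    }
    return result;
}

// The query becomes one hash lookup instead of an expression-tree walk.
inline bool has_eq_restriction_on_column(const eq_column_set& cols,
                                         const column_definition_sketch& col) {
    return cols.count(&col) != 0;
}
```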
2026-04-19 20:57:09 +03:00
Avi Kivity
b7f86eaabc cql3: statement_restrictions: replace multi_column_range_accumulator_builder with direct predicate iteration
build_get_multi_column_clustering_bounds_fn() used expr::visit() to dispatch
each restriction through a 15-handler visitor struct.  Only the
binary_operator handler did real work; the conjunction handler just
recursed, and the remaining 13 handlers were dead-code on_internal_error
calls (the filter expression of each predicate is always a binary_operator).

Replace the visitor with a loop over predicates that does
as<binary_operator>(pred.filter) directly, building the same query-time
lambda inline.

Promote intersect_all() and process_in_values() from static methods of
the deleted struct to free functions in the anonymous namespace -- they
are still called from the query-time lambda.
2026-04-19 20:57:09 +03:00
Avi Kivity
ece9af229d cql3: statement_restrictions: use predicate fields in build_get_clustering_bounds_fn
Replace find_binop(..., is_multi_column) with pred.is_multi_column in
build_get_clustering_bounds_fn() and add_clustering_restrictions_to_idx_ck_prefix().

Replace is_clustering_order(binop) with pred.order == comparison_order::clustering
and iterate predicates directly instead of extracting filter expressions.

Remove the now-dead is_multi_column() free function.
2026-04-19 20:57:09 +03:00
Avi Kivity
72da1207d7 cql3: statement_restrictions: remove extract_single_column_restrictions_for_column
The previous commit made prepare_indexed_local() use the pre-built
predicate vectors instead of calling extract_single_column_restrictions_for_column().
That was the last production caller.

Remove the function definition (65 lines of expression-walking visitor)
and its declaration/doc-comment from the header.

Replace the unit test (expression_extract_column_restrictions) which
directly called the removed function with synthetic column_definitions,
with per_column_restriction_routing which exercises the same routing
logic through the public analyze_statement_restrictions() API.  The new
test verifies not just factor counts but the exact (column_name, oper_t)
pairs in each per-column entry, catching misrouted restrictions that a
count-only check would miss.
2026-04-19 20:57:09 +03:00
Avi Kivity
b093477cf7 cql3: statement_restrictions: use predicate vectors in prepare_indexed_local
Replace the extract_single_column_restrictions_for_column(_where, ...) call
in prepare_indexed_local() with a direct lookup in the pre-built predicate
vectors.

The old code walked the entire WHERE expression tree to extract binary
operators mentioning the indexed column, wrapped them in a conjunction,
translated column definitions to the index schema, then called
to_predicate_on_column() which walked the expression *again* to convert
back to predicates.

The new code selects the appropriate predicate vector map (PK, CK, or
non-PK) based on the indexed column's kind, looks up the column's
predicates directly, applies replace_column_def to each, and folds them
with make_conjunction -- producing the same result without any expression
tree walks.

This removes the last production caller of
extract_single_column_restrictions_for_column (unit tests in
statement_restrictions_test.cc still exercise it).
2026-04-19 20:57:09 +03:00
Avi Kivity
a725e39218 cql3: statement_restrictions: use predicate vector size for clustering prefix length
Replace the body of num_clustering_prefix_columns_that_need_not_be_filtered()
with a single return of _clustering_prefix_restrictions.size().

The old implementation called get_single_column_restrictions_map() to rebuild
a per-column map from the clustering expression tree, then iterated it in
schema order counting columns until it hit a gap, a needs-filtering predicate,
or a slice.  But _clustering_prefix_restrictions is already built with exactly
that same logic during the constructor (lines 1234-1248): it iterates CK
columns in schema order, appending predicates until it encounters a gap in
column_id, a predicate that needs_filtering, or a slice -- at which point it
stops.  So the vector's size is, by construction, the answer to the same
question the old code was re-deriving at query time.

This makes four helper functions dead code:

- get_single_column_restrictions_map(): walked the expression tree to build
  a map<column_definition*, expression> of per-column restrictions.  Was a
  ~15-line function that called get_sorted_column_defs() and
  extract_single_column_restrictions_for_column() for each column.

- get_the_only_column(): extracted the single column_value from a restriction
  expression, asserting it was single-column.  Called by the old loop body.

- is_single_column_restriction(): thin wrapper around
  get_single_column_restriction_column().

- get_single_column_restriction_column(): ~25-line function that walked an
  expression tree with for_each_expression<column_value> to determine whether
  all column_value nodes refer to the same column.  Called by the above two.

Remove all four functions and their forward declarations (-95 lines).
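
The prefix rule described above (stop at a gap, at a needs-filtering predicate, or after a slice) can be sketched as below. The types and names are hypothetical, the sketch assumes at most one predicate per column, and it assumes the slice column itself still counts toward the prefix.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical, simplified clustering-key predicate.
struct ck_predicate_sketch {
    unsigned column_id;    // position of the column within the clustering key
    bool needs_filtering;  // cannot be turned into a clustering range
    bool is_slice;         // e.g. c > 5; nothing after it can extend the prefix
};

// preds must be sorted by column_id (schema order).
// Returns the length of the longest usable clustering prefix.
inline std::size_t clustering_prefix_length(const std::vector<ck_predicate_sketch>& preds) {
    std::size_t length = 0;
    unsigned expected_id = 0;
    for (const auto& p : preds) {
        if (p.column_id != expected_id || p.needs_filtering) {
            break;  // gap in the key, or a predicate that must be filtered
        }
        ++length;
        ++expected_id;
        if (p.is_slice) {
            break;  // a slice ends the prefix (assumed to be included in it)
        }
    }
    return length;
}
```

Computing this once while the prefix vector is built is what lets the query-time function return the stored size directly.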
2026-04-19 20:57:08 +03:00
Avi Kivity
68c2e292ac cql3: statement_restrictions: replace do_find_idx and is_supported_by with predicate-based versions
Convert do_find_idx() from a member function that walks expression trees
via index_restrictions()/for_each_expression/extract_single_column_restrictions
to a static free function that iterates index_search_group spans using
are_predicates_supported_by().

Convert calculate_column_defs_for_filtering_and_erase_restrictions_used_for_index()
to use predicate vectors instead of expression-based is_supported_by().

Remove now-dead code: is_supported_by(), is_supported_by_helper(), score()
member function, and do_find_idx() member function.
2026-04-19 20:57:08 +03:00
Avi Kivity
c42397e995 cql3: statement_restrictions: remove expression-based has_supporting_index and index_supports_some_column
These functions are no longer called now that all index support checks
in the constructor use predicate-based alternatives. The expression-based
is_supported_by and is_supported_by_helper are still needed by choose_idx()
and calculate_column_defs_for_filtering_and_erase_restrictions_used_for_index().
2026-04-19 20:57:08 +03:00
Avi Kivity
1aafe0708a cql3: statement_restrictions: replace multi-column and PK index support checks with predicate-based versions
Replace clustering_columns_restrictions_have_supporting_index(),
multi_column_clustering_restrictions_are_supported_by(),
get_clustering_slice(), and partition_key_restrictions_have_supporting_index()
with predicate-based equivalents that use the already-accumulated mc_ck_preds
and sc_pk_pred_vectors locals.

The new multi_column_predicates_have_supporting_index() checks each
multi-column predicate's columns list directly against indexes, avoiding
expression tree walks through find_in_expression and bounds_slice.
2026-04-19 20:57:08 +03:00
Avi Kivity
fa6f239cc7 cql3: statement_restrictions: add predicate-based index support checking
Add `op` and `is_subscript` fields to `struct predicate` and populate them
in all predicate creation sites in `to_predicates()`. These fields record the
binary operator and whether the LHS is a subscript (map element access), which
are the two pieces of information needed to query index support.

Add `is_predicate_supported_by()` which mirrors `is_supported_by_helper()`
but operates on a single predicate's fields instead of walking the expression
tree.

Add a predicate-vector overload of `index_supports_some_column()` and use it
in the constructor to replace expression-based index support checks for
single-column partition key, clustering key, and non-primary-key restrictions.
The multi-column clustering key case still uses the existing expression-based
path.
2026-04-19 20:57:08 +03:00
Avi Kivity
25ba3bd649 cql3: statement_restrictions: use pre-built single-column maps for index support checks
Replace index_supports_some_column(expression, ...) with
index_supports_some_column(single_column_restrictions_map, ...) to
eliminate get_single_column_restrictions_map() tree walks when checking
index support.  The three call sites now use the maps already built
incrementally in the constructor loop:
_single_column_nonprimary_key_restrictions,
_single_column_clustering_key_restrictions, and
_single_column_partition_key_restrictions.

Also replace contains_multi_column_restriction() tree walk in
clustering_columns_restrictions_have_supporting_index() with
_has_multi_column.
2026-04-19 20:57:08 +03:00
Avi Kivity
fab90224b3 cql3: statement_restrictions: build clustering-prefix restrictions incrementally
Replace the extract_clustering_prefix_restrictions() tree walk with
incremental collection during the main loop.  Two new locals --
mc_ck_preds and sc_ck_preds -- accumulate multi-column and single-column
clustering key predicates respectively.  A short post-loop block
computes the longest contiguous prefix from sc_ck_preds (or uses
mc_ck_preds directly for multi-column), replacing the removed function.

Also remove the now-unused to_predicate_on_clustering_key_prefix(),
with_current_binary_operator() helper, and the
visitor_with_binary_operator_context concept.
2026-04-19 20:57:08 +03:00
Avi Kivity
3bd308986a cql3: statement_restrictions: build partition-range restrictions incrementally
Replace the extract_partition_range() tree walk with incremental
collection during the main loop.  Two new locals before the loop --
token_pred and pk_range_preds -- accumulate token and single-column
EQ/IN partition key predicates respectively.  A short post-loop block
materializes _partition_range_restrictions from these locals, replacing
the removed function.

This removes the last tree walk over partition-key restrictions.
2026-04-19 20:57:08 +03:00
Avi Kivity
db28411548 cql3: statement_restrictions: build clustering-key single-column restrictions map incrementally
Instead of accumulating all clustering-key restrictions into a
conjunction tree and then decomposing it by column via
get_single_column_restrictions_map() post-loop, build the
per-column map incrementally as each single-column clustering-key
predicate is processed.

The post-loop guard (!has_mc_clustering) is no longer needed:
multi-column predicates go through the is_multi_column branch
and never insert into this map, and mixing multi with single-column
is rejected with an exception.

This eliminates a post-loop tree walk over
_clustering_columns_restrictions.
2026-04-19 20:57:08 +03:00
Avi Kivity
a4608804d8 cql3: statement_restrictions: build partition-key single-column restrictions map incrementally
Instead of accumulating all partition-key restrictions into a
conjunction tree and then decomposing it by column via
get_single_column_restrictions_map() post-loop, build the
per-column map incrementally as each single-column partition-key
predicate is processed.

The post-loop guard (!has_token_restrictions()) is no longer needed:
token predicates go through the on_partition_key_token branch and
never insert into this map, and mixing token with non-token is
rejected with an exception.

This eliminates a post-loop tree walk over
_partition_key_restrictions.
2026-04-19 20:57:08 +03:00
Avi Kivity
e9b16a11ba cql3: statement_restrictions: build non-primary-key single-column restrictions map incrementally
Instead of accumulating all non-primary-key restrictions into a
conjunction tree and then decomposing it by column via
get_single_column_restrictions_map() post-loop, build the
per-column map incrementally as each non-primary-key predicate
is processed.

This eliminates a post-loop tree walk over _nonprimary_key_restrictions.
2026-04-19 20:57:08 +03:00
Avi Kivity
701366a8d1 cql3: statement_restrictions: use tracked has_mc_clustering for _has_multi_column
Replace the two post-loop find_binop(_clustering_columns_restrictions,
is_multi_column) tree walks and the contains_multi_column_restriction()
tree walk with the already-tracked local has_mc_clustering.

The redundant second assignment inside the _check_indexes block is
removed entirely.
2026-04-19 20:57:08 +03:00
Avi Kivity
da438507d0 cql3: statement_restrictions: track has-token state incrementally
Replace the two in-loop calls to has_token_restrictions() (which
walks the _partition_key_restrictions expression tree looking for
token function calls) with a local bool has_token, set to true
when a token predicate is processed.

The member function is retained since it's used outside the
constructor.

With this change, the constructor loop's non-error control flow
performs zero expression tree scanning.  The only remaining tree
walks are on error paths (get_sorted_column_defs,
get_columns_in_commons for formatting exception messages) and
structural (make_conjunction for building accumulated expressions).
2026-04-19 20:57:07 +03:00
Avi Kivity
1344278a19 cql3: statement_restrictions: track partition-key-empty state incrementally
Replace the in-loop call to partition_key_restrictions_is_empty()
(which walks the _partition_key_restrictions expression tree via
is_empty_restriction()) with a local bool pk_is_empty, set to false
at the two sites where partition key restrictions are added.

The member function is retained since it's used outside the
constructor.
2026-04-19 20:57:07 +03:00
Avi Kivity
14812ea1e0 cql3: statement_restrictions: track first multi-column predicate incrementally
Replace find_in_expression<binary_operator>(_clustering_columns_restrictions,
always_true), which walks the accumulated expression tree to find the
first binary_operator, with a tracked pointer first_mc_pred set when
the first multi-column predicate is added. This eliminates the tree
scan, the null check, and the is_lower_bound/is_upper_bound lambdas,
replacing them with direct predicate field accesses: first_mc_pred->order,
first_mc_pred->is_lower_bound, first_mc_pred->is_upper_bound, and
first_mc_pred->filter for error messages.
2026-04-19 20:57:07 +03:00
Avi Kivity
ef005c10ba cql3: statement_restrictions: track last clustering column incrementally
Replace get_last_column_def(_clustering_columns_restrictions), which
walks the entire accumulated expression tree to collect and sort all
column definitions, with a local pointer ck_last_column that tracks
the column with the highest schema position as single-column
clustering restrictions are added.
2026-04-19 20:57:07 +03:00
Avi Kivity
88bd5ea1b7 cql3: statement_restrictions: track clustering-has-slice incrementally
Replace has_slice(_clustering_columns_restrictions), which walks the
accumulated expression tree looking for slice operators, with a local
bool ck_has_slice set when any clustering predicate with is_slice is
added. Updated at all three clustering insertion points: multi-column
first assignment, multi-column slice conjunction, and single-column
conjunction.
2026-04-19 20:57:07 +03:00
Avi Kivity
1071c39f17 cql3: statement_restrictions: track has-multi-column-clustering incrementally
Replace find_binop(_clustering_columns_restrictions, is_tuple_constructor),
which walks the accumulated expression tree looking for multi-column
restrictions, with a local bool has_mc_clustering set when a multi-column
predicate is first added. This serves both the multi-column branch
(checking existing restrictions are also multi-column) and the
single-column branch (checking no multi-column restrictions exist).
2026-04-19 20:57:07 +03:00
Avi Kivity
aa6a0ad326 cql3: statement_restrictions: track clustering-empty state incrementally
Replace is_empty_restriction(_clustering_columns_restrictions), which
recursively walks the accumulated expression tree, with a local bool
ck_is_empty that is set to false when a clustering restriction is
first added. Updated at both insertion points: multi-column first
assignment and single-column make_conjunction.
2026-04-19 20:57:07 +03:00
Avi Kivity
d4ff613c0a cql3: statement_restrictions: replace restr bridge variable with pred.filter
The constructor loop no longer needs to extract a binary_operator
reference from each predicate. All remaining uses (make_conjunction,
get_columns_in_commons, assignment to accumulated restriction members,
_where.push_back, and error formatting) accept expression directly,
which is what pred.filter already is. This eliminates the unnecessary
as<binary_operator> cast at the top of the loop.
2026-04-19 20:57:07 +03:00
Avi Kivity
44b18f3399 cql3: statement_restrictions: convert single-column branch to use predicate properties
In the single-column partition-key and clustering-key sub-branches,
replace direct binary_operator field inspections with pre-computed
predicate booleans: !pred.equality && !pred.is_in instead of
restr.op != EQ && restr.op != IN, pred.is_in instead of
find(restr, IN), and pred.is_slice instead of has_slice(restr).
Also fix a leftover restr.order in the multi-column branch error
message.
2026-04-19 20:57:07 +03:00
Avi Kivity
b0c5eed384 cql3: statement_restrictions: convert multi-column branch to use predicate properties
Replace direct operator comparisons with predicate boolean fields:
pred.equality, pred.is_in, pred.is_slice, pred.is_lower_bound,
pred.is_upper_bound, and pred.order.
2026-04-19 20:57:07 +03:00
Avi Kivity
afd68187ea cql3: statement_restrictions: convert constructor loop to iterate over predicates
Convert the constructor loop to first build predicates from the
prepared where clause, then iterate over the predicates.

The IS_NOT branch now uses pred.is_not_null_single_column and pred.on
instead of inspecting the expression directly. The branch conditions
for multi-column (pred.is_multi_column), token
(on_partition_key_token), and single-column (on_column) now use
predicate properties instead of expression helpers.

Remove extract_column_from_is_not_null_restriction() which is no
longer needed.
2026-04-19 20:57:07 +03:00
Avi Kivity
440d9f2d82 cql3: statement_restrictions: annotate predicates with operator properties
Add boolean fields to struct predicate that describe the operator:
equality, is_in, is_slice, is_upper_bound, is_lower_bound, and
comparison_order. Populate them in all to_predicates() return sites.

These fields will allow the constructor loop to inspect predicate
properties directly instead of re-examining the expression.
2026-04-19 20:57:07 +03:00
Avi Kivity
e0eb3bde8d cql3: statement_restrictions: annotate predicates with is_not_null and is_multi_column
To avoid having to dig deep into the expression, compute is_not_null
and is_multi_column early and store them in the predicate.
2026-04-19 20:57:06 +03:00
Avi Kivity
6892642176 cql3: statement_restrictions: complete preparation early
We want to move away from the unprepared domain to the prepared
domain to avoid confusion. Ideally we'd receive prepared expressions
via the constructor, but that is left for later.
2026-04-19 20:57:06 +03:00
Avi Kivity
ed5dd645e8 cql3: statement_restrictions: convert expressions to predicates without being directed at a specific column
Currently, possible_lhs_values accepts a column_definition parameter
that tells it which column we are interested in. This works
because callers pre-analyze the expression and only pass a
subexpression that contains the specified columns.

We wish to convert expressions to predicates early, and so won't
have the benefit of knowing which columns we're interested in.

Generally, this is simple: a binary operator contains a column on the
left-hand side, so use that. If the expression is on a token, use that.

When the expression is a boolean constant (not expressible by
the grammar, but somehow found its way into the code), we invent
a new `on_row` designator meaning it's not about a specific column.
It will be useful one day when we allow things like
`WHERE some_boolean_function(c1, c2)` that aren't specific to any
single column.

Finally, we introduce helpers that, given such an expression decomposed
into predicates and a column_definition, extract the predicate related
to the given column. This mimics the possible_lhs_values API and allows
us to make minimal changes to callers, deferring that until later.

possible_lhs_values() is renamed to to_predicates() and loses the
column_definition parameter to indicate its new role.
2026-04-19 20:57:06 +03:00
Avi Kivity
bfd1302311 cql3: statement_restrictions: refine possible_lhs_values() function_call processing
Currently, we are careful to call possible_lhs_values() for a token
function only when slice/equality operators are used. We wish to relax
this, so return nullptr (must filter) for the other cases instead of
raising an internal error.
2026-04-19 20:57:06 +03:00
Avi Kivity
736011b663 cql3: statement_restrictions: return nullptr for function solver if not token
Currently, possible_lhs_values() for a function call expression will
only be called when we're sure it's the token() function. But soon this
will no longer be the case. Return nullptr for non-token functions to
indicate we can't solve for a column value instead of an internal
error.
2026-04-19 20:57:06 +03:00
Avi Kivity
8faf62a1aa cql3: statement_restrictions: refine possible_lhs_values() subscript solving
Do more work at prepare time.
2026-04-19 20:57:06 +03:00
Avi Kivity
a28689a99a cql3: statement_restrictions: return nullptr from possible_lhs_values instead of on_internal_error
Since we're a first-resort call now, and there's a last-resort fallback
(evaluate), there is no need to raise an internal error.

Logically should be part of previous patch, but the rest of the code is still
careful enough not to call here when not expecting a solution, so the split
is not breaking bisectability.
2026-04-19 20:57:06 +03:00
Avi Kivity
370f3fd2e8 cql3: statement_restrictions: convert possible_lhs_values into a solver
Convert from an execute-time function to a prepare-time function
by returning a solver function instead of directly solving.

When not possible to solve, but still possible to evaluate (filter),
return nullptr.
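
The prepare-time/execute-time split can be sketched with a function that returns a solver, or nullptr when the caller must fall back to filtering. All names here (oper_sketch, solver_fn, make_solver) are hypothetical and the value model is reduced to integers.

```cpp
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical operator kinds; only equality is "solvable" in this sketch.
enum class oper_sketch { eq, like };

using value_list = std::vector<int>;
// A solver maps execute-time bind values to the possible column values.
using solver_fn = std::function<value_list(const std::vector<int>&)>;

// Prepare time: decide once whether the restriction can be solved.
// Returns nullptr when it can only be evaluated (filtered) row by row.
inline solver_fn make_solver(oper_sketch op, std::size_t bind_index) {
    if (op == oper_sketch::eq) {
        return [bind_index](const std::vector<int>& binds) {
            return value_list{binds.at(bind_index)};
        };
    }
    return nullptr;
}
```

At execute time the caller checks the returned std::function: a non-empty one is invoked with the bind values, while an empty one signals that filtering is required.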
2026-04-19 20:57:06 +03:00
Avi Kivity
92a43557dc cql3: statement_restrictions: split _where to boolean factors in preparation for predicates conversion
Expressions are a tree-like structure so a single expression is sufficient
(for complicated ones, a conjunction is used), but predicates are flat.
Prepare for conversion to predicates by storing the expressions that
will correspond to predicates, namely the boolean factors of the WHERE
clause.
2026-04-19 20:57:06 +03:00
Avi Kivity
694c1aed98 cql3: statement_restrictions: refactor IS NOT NULL processing
Move some code to a helper, but don't let it mutate state.
2026-04-19 20:57:06 +03:00
Avi Kivity
35f14544dc cql3: statement_restrictions: fold add_single_column_nonprimary_key_restriction() into its caller
The goal is to simplify flow-control where the order in which
variables are updated depends on their location in the source.
With functions, this is difficult.
2026-04-19 20:57:06 +03:00
Avi Kivity
1965741914 cql3: statement_restrictions: fold add_single_column_clustering_key_restriction() into its caller
The goal is to simplify flow-control where the order in which
variables are updated depends on their location in the source.
With functions, this is difficult.
2026-04-19 20:57:06 +03:00
Avi Kivity
1d631f7bac cql3: statement_restrictions: fold add_single_column_partition_key_restriction() into its caller
The goal is to simplify flow-control where the order in which
variables are updated depends on their location in the source.
With functions, this is difficult.
2026-04-19 20:57:05 +03:00
Avi Kivity
24cd98e454 cql3: statement_restrictions: fold add_token_partition_key_restriction() into its caller
The goal is to simplify flow-control where the order in which
variables are updated depends on their location in the source.
With functions, this is difficult.
2026-04-19 20:57:05 +03:00
Avi Kivity
be3239fc58 cql3: statement_restrictions: fold add_multi_column_clustering_key_restriction() into its caller
The goal is to simplify flow-control where the order in which
variables are updated depends on their location in the source.
With functions, this is difficult.
2026-04-19 20:57:05 +03:00
Avi Kivity
8990346c75 cql3: statement_restrictions: avoid early return in add_multi_column_clustering_key_restrictions
Prepare for inlining it into its caller, which doesn't work easily if there's
an early return.
2026-04-19 20:57:05 +03:00
Avi Kivity
fa130051a6 cql3: statement_restrictions: fold add_is_not_restriction() into its caller
The goal is to simplify flow-control where the order in which
variables are updated depends on their location in the source.
With functions, this is difficult.
2026-04-19 20:57:05 +03:00
Avi Kivity
63f9362c89 cql3: statement_restrictions: fold add_restriction() into its caller
The goal is to simplify flow-control where the order in which
variables are updated depends on their location in the source.
With functions, this is difficult.
2026-04-19 20:57:05 +03:00
Avi Kivity
9cbb1b851e cql3: statement_restrictions: remove possible_partition_token_values()
It's just a call to possible_lhs_values() with a different signature.

Now possible_lhs_values() is our only solver.
2026-04-19 20:57:05 +03:00
Avi Kivity
c1fc596203 cql3: statement_restrictions: remove possible_column_values
replace with now-identical possible_lhs_values. This paves the way
to have only one solver function (after we remove
possible_partition_token_values).
2026-04-19 20:57:05 +03:00
Avi Kivity
b26e6f7330 cql3: statement_restrictions: pass schema to possible_column_values()
This unifies the signature with possible_lhs_values(), paving the way
to deduplicating the two functions. We always have the schema and may as
well pass it.
2026-04-19 20:57:05 +03:00
Avi Kivity
c6f6e81fe5 cql3: statement_restrictions: remove fallback path in solve()
All query plans that try to solve for the possible values a column
(or token, or column-tuple) can take have been converted to set
analyzed_column::solve_for. Recognize that by removing the
fallback path.

This removes the last possible_column_values() call that isn't bound
(using std::bind_front), and will allow moving it to prepare time.
2026-04-19 20:57:05 +03:00
Avi Kivity
e0445269e5 cql3: statement_restrictions: reorder possible_lhs_column parameters
By moving query_options to the end, we can use std::bind_front to
convert it from a build-time to a run-time function that depends
only on the query_options.
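
As an illustration of why the parameter order matters, here is a small Python analogy in which functools.partial stands in for std::bind_front (both bind leading arguments). The function possible_values and its arguments are hypothetical and only mimic the shape of the real solver.

```python
from functools import partial


def possible_values(column, operator, marker, query_options):
    """Hypothetical solver with the per-query argument placed last."""
    value = query_options[marker]
    if operator == "=":
        return [value]
    if operator == "IN":
        return sorted(set(value))
    raise ValueError(f"unsupported operator {operator!r}")


# Prepare time: bind everything that is known from the statement text.
solve = partial(possible_values, "ck", "IN", "vals")

# Execute time: supply only the query options.
result = solve({"vals": [3, 1, 3]})
```

Had query_options been the first parameter, binding the leading arguments would not have been possible without a wrapper lambda.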
2026-04-19 20:57:05 +03:00
Avi Kivity
e42ad62561 cql3: statement_restrictions: prepare solver for multi-column restrictions
Multi-column restrictions (a, b) > (:v1, :v2) do not obey normal
comparison rules. For example, given

 (a, b) > (5, 1) AND a <= 5

We see that (a, b) = (5, 2) satisfies the constraint, but if we tried
to solve for the interval

 ( (5, 1), (5) ]

We'd have to conclude that (5,1) <= (5).

It's possible to extend the CQL type system to support this, but
that would be a lot of work, and in fact the current code doesn't
depend on it (by solving these intersections in its own code path,
multi_column_range_accumulator_builder's prefix3cmp).

So, we just mark such solvers as non-comparable, and generate an
internal error if we try to compare them in make_conjunction.
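
The mismatch can be checked with a few lines of Python, using Python's tuple ordering as a stand-in for ordinary lexicographic comparison (an assumption for illustration; this is not CQL's type system):

```python
def satisfies(a: int, b: int) -> bool:
    """The example restriction: (a, b) > (5, 1) AND a <= 5."""
    return (a, b) > (5, 1) and a <= 5


# A row that satisfies both restrictions exists.
has_solution = satisfies(5, 2)

# Treating the two bounds as endpoints of one interval ((5, 1), (5)] under
# plain lexicographic ordering (a prefix sorts before its extensions)
# would require (5, 1) <= (5,), which does not hold.
lower, upper = (5, 1), (5,)
interval_looks_nonempty = lower <= upper
```

The interval looks empty under the ordinary ordering even though a solution exists, which is why such solvers are not compared with the normal rules.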
2026-04-19 20:57:05 +03:00
Avi Kivity
96e8414963 cql3: statement_restrictions: add solver for token restriction on index
possible_column_values() knows how to find the values that the token can
take, so add a solve_for implementation for tokens.
2026-04-19 20:57:04 +03:00
Avi Kivity
135809d97b cql3: statement_restrictions: pre-analyze column in value_for()
Since we pre-analyze the column, return a built function, and remove
the corresponding lambda from the caller.
2026-04-19 20:57:04 +03:00
Avi Kivity
0a16d90acb cql3: statement_restrictions: don't handle boolean constants in multi_column_range_accumulator_builder
In statement_restrictions' constructor, we check that all the boolean factors
are relations. This means the code to handle a constant here is dead code.

Remove it; while it's good to handle it, it should be handled at the top
level, not in multi-column restriction processing.
2026-04-19 20:57:04 +03:00
Avi Kivity
56ae02d8a3 cql3: statement_restrictions: split range_from_raw_bounds into prepare phase and query phase
range_from_raw_bounds processes restrictions of the form

   (a, b) > SCYLLA_CLUSTERING_BOUND(?, ?)

indicating that comparisons respect whether columns are reversed or not.

Iterate over expressions during the prepare phase only; generating
"builder" functions to be executed during the query phase.
2026-04-19 20:57:04 +03:00
Avi Kivity
2c75123bbd cql3: statement_restrictions: adjust signature of range_from_raw_bounds
The get_clustering_bounds() family works in terms of vectors of
clustering ranges (to support IN) and in fact the only caller converts
it to a vector. Converting it immediately simplifies later patching.
2026-04-19 20:57:04 +03:00
Avi Kivity
e646b763e7 cql3: statement_restrictions: split multi_column_range_accumulator into prepare-time and query-time phases
multi_column_range_accumulator analyzes an expression containing multi-column
restrictions of the form (a, b) > (?, ?) and simultaneously analyzes
them and solves for the set of intervals that satisfy those restrictions.

Split this into a prepare-time phase (that generates "builders", functions
that operate on the accumulator), and a query phase that executes
the builders. Importantly, the expression visitor ends up on the prepare
phase, so it can be merged with other parts of the analysis.

Helper functions of the visitor are made static, since they need to
run during the query phase but the visitor only exists during the
prepare phase.
2026-04-19 20:57:04 +03:00
Avi Kivity
ea26186043 cql3: statement_restrictions: make get_multi_column_clustering_bounds a builder
Lay the groundwork for analyzing multi column clustering bounds by
splitting the function into prepare-time and execute-time parts.
To start with, all of the work is done at query time, but later
patches will move bits into prepare time.
2026-04-19 20:57:04 +03:00
Avi Kivity
c60e3d5cf7 cql3: statement_restrictions: multi-key clustering restrictions one layer deeper
For the multi column binary operator case, perform more of the work at
prepare time in preparation for consolidating the analysis.
2026-04-19 20:57:04 +03:00
Avi Kivity
b520e74128 cql3: statement_restrictions: push multi-column post-processing into get_multi_column_clustering_bounds()
Doing this splits the multi-column processing code into a preparation
phase and an evaluation phase in a single call, making it easier to
further split prepare/evaluate.
2026-04-19 20:57:04 +03:00
Avi Kivity
c4ab0ddb85 cql3: statement_restrictions: pre-analyze single-column clustering key restrictions
Change _clustering_prefix_restrictions and _idx_tbl_ck_prefix
(the latter is the equivalent of the former, for indexed queries),
to use predicate instead of expressions. This lets us do
more of the work of solving restrictions during prepare time.

We only handle single-column restrictions here. Multi-column
restrictions use the existing path.

We introduce two helpers:
 - value_set_to_singleton() converts a restriction solution to a singleton
   when we know that's the only possible answer
 - replace_column_def() overload for predicate, similar to the
   existing overload for expressions

There is a wart in get_single_column_clustering_bounds(): we arrive at
this point with the two vectors possibly pointing at different
columns. Previously, possible_lhs_values() did this check while solving.
We now check for it here.

The predicate::on variant gets another member, for clustering key prefixes.
Since everything is still handled by the legacy paths, we mostly
error out.
2026-04-19 20:57:04 +03:00
Avi Kivity
201ed53837 cql3: statement_restrictions: wrap value_for_index_partition_key()
To allow more work to be carried out during prepare time, wrap
the body in an std::function, which will be called at execution time.

Currently we actually do the work during execution time; but the
way is prepared.
2026-04-19 20:57:04 +03:00
Avi Kivity
325497d460 cql3: statement_restrictions: hide value_for()
value_for() is a general function that solves for values that
satisfy an expression set to TRUE. This goes against our goal to
prepare solvers for all the expressions we use. Fortunately, it's only
called with one expression, which comes from statement_restrictions, so
we can add an accessor that provides the expression from our own state.
Later, we'll be able to do prepare-time work on it.
2026-04-19 20:57:04 +03:00
Avi Kivity
dcdd2f7e72 cql3: statement_restrictions: push down clustering prefix wrapper one level
This allows us to tackle each case separately.
2026-04-19 20:57:03 +03:00
Avi Kivity
1039ed9ed2 cql3: statement_restrictions: wrap functions that return clustering ranges
During prepare time, build functions for use during execution time.

Currently, the wrappers are very shallow, and practically all the
work is done at execution time. But the stage is set for more peeling.

The index clustering ranges had on_internal_error()s if an index
was not used. They're converted to returning a null function. If
executed (which is never supposed to happen), it will throw
a bad_function_call.
2026-04-19 20:57:03 +03:00
Avi Kivity
620df7103f cql3: statement_restrictions: do not pass view schema back and forth
For indexed queries, statement_restrictions calculates _view_schema,
which is passed via get_view_schema() to indexed_select_statement(),
which passes it right back to statement_restrictions via one of three
functions to calculate clustering ranges.

Avoid the back-and-forth and use the stored value. Using a different
value would be broken.

This change allows unifying the signatures of the four functions that
get clustering ranges.
2026-04-19 20:57:03 +03:00
Avi Kivity
6fce090e30 cql3: statement_restrictions: pre-analyze token range restrictions
Convert token range restrictions to the predicate format we
introduced earlier, where we have a function to solve for the token
range rather than running the analysis at runtime. Again the truth is
that the function will delegate to possible_partition_token_values()
which actually will do the analysis at runtime, but it's one step closer.

We add a new variant element for predicate::on, since it doesn't
fit the existing element (the token isn't a column).
2026-04-19 20:57:03 +03:00
Avi Kivity
941011bb4a cql3: statement_restrictions: pre-analyze partition key columns
The expression tree for partition keys is analyzed during runtime:
in partition_range_from_singles() (for example), we call find_binop
and get_subscripted_column() to understand the expression structure.

This analysis is problematic because it has to match the analysis
during prepare time; and they have to evolve in lock step.

Here, we move the analysis to the prepare stage. This is done
by augmenting the expression into a new predicate struct. It
contains the original expression (as a fallback for paths not yet
converted), as well as a solve_for function which contains
a function built at prepare time that embeds all the necessary analysis.

We introduce the `predicate` type which is an augmentation
of boolean expressions. In addition to the expression, we remember
what column the expression is on, and a function that computes
what values the column can take on that would make the expression
true.

The field that says what column the predicate is about is typed
as a variant since later on we will have predicates on non-columns
(the token, or a clustering prefix).

Note that currently the function engages in some run-time analysis of
its own, since it calls possible_lhs_values that itself does analysis,
but this is a step in the right direction.
2026-04-19 20:57:03 +03:00
Avi Kivity
c73f3ac55f cql3: statement_restrictions: do not collect subscripted partition key columns
An indexed SELECT of the form

SELECT ...
WHERE pk['sub'] = ?

is impossible because our indexes do not support frozen maps, and
partition key collections must be frozen. Stop collecting such constructs
for the purpose of determining the partition range. This reduces having
to deal with combinations of restrictions on the column and its entries
later on.

In case we start supporting indexes on frozen maps, leave an
on_internal_error to remind us.
2026-04-19 20:57:03 +03:00
Avi Kivity
531f137ed3 cql3: statement_restrictions: split _partition_range_restrictions into three cases
_partition_range_restrictions are a vector of expressions, one per
partition key column, except that it can be empty if there is no
restriction on the partition that can be translated to a read command,
and if the restriction is on a token range, the first element only
is used.

Separate the three cases into distinct structs. After this, additional
work can be done utilizing the specialization.
2026-04-19 20:57:03 +03:00
Avi Kivity
fcf7c4c90d cql3: statement_restrictions: move value_list, value_set to header file
They don't really need to be public, but will be used in intermediate
storage.
2026-04-19 20:57:03 +03:00
Avi Kivity
926886fcfb cql3: statement_restrictions: wrap get_partition_key_ranges
statement_restrictions::get_partition_key_ranges() re-interprets
the expressions used to specify the partition key. This means that
the analysis phase (determining what those expressions are and how
they are to be used) and the execution phase (using them) are in separate
places. This makes it very hard to refactor while preserving correctness.

As a first step in unifying the two phases, we move the selection
of the strategy (using token, cartesian product, or single partition)
from execution to analysis, by making the if-tree return a function to
be executed at execution time, rather than running the if-tree itself
at execution time.
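
A rough Python sketch of moving the if-tree to analysis time. Everything here (prepare_partition_ranges, the restriction encoding, the returned tuples) is a hypothetical simplification and does not reflect the real data structures.

```python
from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

RangeFn = Callable[[dict], List[tuple]]


def prepare_partition_ranges(token_marker: Optional[str],
                             pk_restrictions: Dict[str, Tuple[str, str]],
                             pk_columns: List[str]) -> RangeFn:
    """Run the if-tree once, at prepare time, and return the chosen strategy."""
    if token_marker is not None:
        return lambda opts: [("token", opts[token_marker])]
    if not all(c in pk_restrictions for c in pk_columns):
        return lambda opts: [("all",)]
    ops = [pk_restrictions[c] for c in pk_columns]
    if all(op == "=" for op, _ in ops):
        # Single partition: one value per partition key column.
        return lambda opts: [tuple(opts[m] for _, m in ops)]

    def cartesian(opts: dict) -> List[tuple]:
        # Cartesian product of the candidate values of each column.
        per_column = [opts[m] if op == "IN" else [opts[m]] for op, m in ops]
        return list(product(*per_column))
    return cartesian


single = prepare_partition_ranges(
    None, {"p1": ("=", "a"), "p2": ("=", "b")}, ["p1", "p2"])
multi = prepare_partition_ranges(
    None, {"p1": ("IN", "a"), "p2": ("=", "b")}, ["p1", "p2"])
```

At execution time the stored function is simply called with the query options; no branch of the if-tree is re-evaluated.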
2026-04-19 20:57:03 +03:00
Avi Kivity
eec0b20dbc cql3: statement_restrictions: prepare statement_restrictions for capturing this
Prevent copying/moving, that can change the address, and instead enforce
using shared_ptr. Most of the code is already using shared_ptr, so the
changes aren't very large.

To forbid non-shared_ptr construction, the constructors are annotated
with a private_tag tag class.
2026-04-19 20:57:03 +03:00
Avi Kivity
374be94faa test: statement_restrictions: add index_selection regression test
In preparation for refactoring statement_restrictions, add a simple
and an exhaustive regression test, encoding the index selection
algorithm into the test. We cannot change the index selection algorithm
because then mixed-node clusters will alter the sorting key mid-query
(if paging takes place).

Because the exhaustive space has such a large stack frame, and
because Address Sanitizer bloats the stack frame, increase it
for debug builds.
2026-04-19 20:57:01 +03:00
Artsiom Mishuta
dce0c24a02 test/alternator: replace bare pytest.skip() with typed skip helpers 2026-04-19 17:34:41 +02:00
Artsiom Mishuta
b078cd1e72 test: migrate new bare skips introduced by upstream after rebase
Migrate 3 bare skip sites that appeared in upstream/master after the
initial migration:

- test/cluster/test_strong_consistency.py: 2 @pytest.mark.skip →
  @pytest.mark.skip_bug (SCYLLADB-1056)
- test/cqlpy/conftest.py: pytest.skip() → skip_env() in
  skip_on_scylla_vnodes fixture
2026-04-19 17:34:41 +02:00
Artsiom Mishuta
9c4d3ce097 test/pylib: reject bare pytest.mark.skip and add codebase guards
Harden the skip_reason_plugin to reject bare @pytest.mark.skip at
collection time with pytest.UsageError instead of warnings.warn().

Add test/pylib_test/test_no_bare_skips.py with three guard tests:
- AST scan for bare pytest.skip() runtime calls
- Real pytest --collect-only against all Python test directories
2026-04-19 17:34:31 +02:00
Artsiom Mishuta
0b6b380b80 test: update comments referencing pytest.skip() to skip_env()
Update 7 comments/docstrings across 5 files that still referenced
pytest.skip() to reference the typed skip_env() wrapper for
consistency with the migrated code.
2026-04-19 11:14:03 +02:00
Artsiom Mishuta
b10028e556 test: migrate runtime pytest.skip() to typed skip_bug()
Migrate 2 runtime pytest.skip() calls referencing known bugs to use
the typed skip_bug() wrapper from test.pylib.skip_types:

- test/alternator/test_ttl.py: Streams on tablets (#23838)
- test/scylla_gdb/test_task_commands.py: coroutine task not found (#22501)
2026-04-19 11:10:42 +02:00
Artsiom Mishuta
8a80e2c3be test: migrate runtime pytest.skip() to typed skip_env()
Migrate runtime pytest.skip() calls across 34 files to use the typed
skip_env() wrapper from test.pylib.skip_types.

These sites skip at runtime because a required feature, config option,
library version, build mode, or runtime topology is not available.

Also fixes 'raise pytest.skip(...)' in test_audit.py — skip_env()
already raises internally, so the explicit raise was incorrect.

Each file gains one new import:
  from test.pylib.skip_types import skip_env
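
Why the explicit raise was redundant can be shown with a self-contained stand-in. The Skipped class and this local skip_env are hypothetical substitutes written for illustration; the real helper lives in test.pylib.skip_types and its exact signature is not shown here.

```python
class Skipped(Exception):
    """Stand-in for the exception a test framework uses to signal a skip."""


def skip_env(reason: str) -> None:
    """Hypothetical stand-in: raises internally, so it never returns."""
    raise Skipped(reason)


def check_feature(available: bool) -> str:
    if not available:
        # No 'raise' needed: skip_env() never returns, so a surrounding
        # 'raise skip_env(...)' would never receive a value to raise.
        skip_env("feature not available in this environment")
    return "ran"


def outcome(available: bool) -> str:
    try:
        return check_feature(available)
    except Skipped as e:
        return f"skipped: {e}"
```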
2026-04-19 11:09:29 +02:00
Artsiom Mishuta
fb0974a329 test: migrate bare @pytest.mark.skip to skip_not_implemented
Migrate 2 bare @pytest.mark.skip decorators (no reason string) to
@pytest.mark.skip_not_implemented with an explicit reason referencing
issue #3882 (COMPACT STORAGE not implemented).
2026-04-19 11:06:30 +02:00
Artsiom Mishuta
a39fb9d29a test: migrate @pytest.mark.skip to @pytest.mark.skip_slow
Migrate 4 @pytest.mark.skip decorator sites to @pytest.mark.skip_slow
across 3 test files where the skip reason indicates a slow test.
2026-04-19 11:06:30 +02:00
Artsiom Mishuta
638efedc3c test: migrate @pytest.mark.skip to @pytest.mark.skip_not_implemented
Migrate 10 @pytest.mark.skip decorator sites to
@pytest.mark.skip_not_implemented across 5 test files where the
skip reason indicates a feature not yet implemented.
2026-04-19 11:06:30 +02:00
Artsiom Mishuta
465636bc53 test: migrate @pytest.mark.skip to @pytest.mark.skip_bug for known bugs
Migrate 24 @pytest.mark.skip decorator sites to @pytest.mark.skip_bug
across 16 test files where the reason references a known bug or issue.
2026-04-19 11:06:30 +02:00
Nadav Har'El
0d05e3b4a4 alternator: fix ListStreams paging if table is deleted during paging
Currently, ListStreams paging works by looking in the list of tables
for ExclusiveStartStreamArn and starting there. But it's possible
that during the paging process, one of the tables got deleted and
ExclusiveStartStreamArn no longer points to an existing table. In
the current implementation this caused the paging to stop (thinking
it had reached the end).

The solution is simple: ListStreams will now sort the list of tables
by name (it anyway needs to be sorted by something to be consistent
across pages), and will look with std::upper_bound for the first
table *after* the ExclusiveStartStreamArn - we don't need to find
that table name itself.
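
The idea can be sketched in Python, with bisect_right playing the role of std::upper_bound. The helper tables_after and the use of plain table names instead of full stream ARNs are simplifications for illustration.

```python
from bisect import bisect_right
from typing import List, Optional


def tables_after(sorted_tables: List[str],
                 exclusive_start: Optional[str]) -> List[str]:
    """Return the tables that sort strictly after the paging cookie.

    bisect_right finds the first position whose name sorts after
    exclusive_start, whether or not that name is still in the list.
    """
    if exclusive_start is None:
        return sorted_tables
    return sorted_tables[bisect_right(sorted_tables, exclusive_start):]


tables = sorted(["orders", "users", "events"])
# "orders" was the last table returned on the previous page, then got deleted.
tables.remove("orders")
remaining = tables_after(tables, "orders")
```

Paging continues with "users" even though the cookie no longer names an existing table.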

The patch also includes a test reproducing this bug. As usual, the
test passes on DynamoDB, fails on Alternator before this patch,
and passes with the patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-19 09:12:02 +03:00
Nadav Har'El
930fb4c330 test/alternator: test DescribeStream on non-existent table
We already had a test verifying that DescribeStream called on a bogus ARN
returns a ValidationException. But if the stream ARN is more legitimate-
looking but refers to a non-existent table (e.g., an ARN taken in the
past from a table that no longer exists), we should return
ResourceNotFoundException. In this patch we add a test that verifies
we indeed do this correctly.

Moreover, Alternator's current stream ARNs include both a keyspace
name and a table name, and either one being incorrect should lead
to ResourceNotFoundException, and indeed the new test validates
that it works as expected - there is no bug here (AI guessed we
have a bug in the missing *keyspace* case, but this guess was wrong).
2026-04-19 09:12:02 +03:00
Nadav Har'El
02d474fca8 alternator: ListStreams: on last page, avoid LastEvaluatedStreamArn
When ListStreams is on its last page and has run out of streams to list,
it shouldn't return a paging cookie (LastEvaluatedStreamArn) at all.
Before this patch it does, and forces the user to make another call
just to get another empty page, which is silly.

This patch includes a fix and a reproducer test (that, as usual, passes
on DynamoDB and fails on Alternator before the patch and succeeds
after).
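
A minimal sketch of the intended behavior, assuming a pre-sorted list of placeholder ARNs and a hypothetical list_streams_page helper (the real handler's structure is not shown here):

```python
from typing import Dict, List


def list_streams_page(sorted_arns: List[str], start: int,
                      limit: int) -> Dict[str, object]:
    """Return one page; include the cookie only if more streams remain."""
    page = sorted_arns[start:start + limit]
    response: Dict[str, object] = {"Streams": page}
    if start + limit < len(sorted_arns):
        # More streams remain: the cookie is the last ARN on this page.
        response["LastEvaluatedStreamArn"] = page[-1]
    return response


arns = ["arn-a", "arn-b", "arn-c"]  # placeholder values, not real ARNs
first = list_streams_page(arns, 0, 2)
last = list_streams_page(arns, 2, 2)
```

The client stops as soon as a response lacks LastEvaluatedStreamArn, so no extra request for an empty page is needed.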

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-19 09:12:02 +03:00
Nadav Har'El
68b783103e alternator: remove dead code stream_shard_id
The class "stream_shard_id" was used in the past (with the old name
stream_arn) for representing stream ARNs. It was renamed
"stream_shard_id" under the mistaken belief that it would be used to
represent DynamoDB Streams "shards" - but it wasn't used for that
either (we have a separate "struct shard_id" in the code).

So this class is now dead code and can be removed.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-19 09:12:01 +03:00
Nadav Har'El
1ac910c2ab alternator: fix ListStreams to return real ARN as LastEvaluatedStreamArn
Alternator Streams' "ListStreams" does paging by returning a "cookie"
LastEvaluatedStreamArn from one request, that the user passes to the
next request as ExclusiveStartStreamArn.

In the past, Alternator's stream ARNs were UUIDs, but we recently
changed them to match DynamoDB's ARN format which the KCL library
requires. However, we didn't change ListStreams' cookie format,
and it remained UUIDs.

This, however, goes against the documentation of DynamoDB, which
states that LastEvaluatedStreamArn should be "the stream ARN of
the item where the operation stopped". It shouldn't be some weird
opaque cookie.

So in this patch we add a test that confirms that indeed, in DynamoDB
the LastEvaluatedStreamArn is really the last returned ARN and not
an opaque cookie. The new test passes on DynamoDB, and fails on
Alternator before the simple fix that this patch then does.

Fixes SCYLLADB-539.
2026-04-19 09:12:01 +03:00
Piotr Smaron
218f8adc8f transport: add per-service-level cql_requests_serving metric
Add a per-scheduling-group gauge that tracks the number of in-flight CQL
requests for each service level. The existing scylla_transport_requests_serving
metric is a single global per-shard counter; the new metric breaks it down
by scheduling group so operators can see which service level contributes
the most in-flight requests when debugging latency.

The metric is named cql_requests_serving (exposed as
scylla_transport_cql_requests_serving) following the cql_ prefix convention
used by all other per-scheduling-group transport metrics (cql_requests_count,
cql_request_bytes, cql_response_bytes, cql_pending_response_memory). Using
a cql_ prefix avoids Prometheus confusion with the global requests_serving
metric, which lacks the scheduling_group_name label.

The counter is incremented when a request enters process_request() and
decremented in the same 'leave' defer block as the global requests_serving,
ensuring the request is counted as in-flight until the response is sent.
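
The increment/decrement pairing can be illustrated in Python, where a context manager's finally clause stands in for the deferred 'leave' action. The Counter, the serving() helper, and the group name "gold" are hypothetical and exist only for this sketch.

```python
from collections import Counter
from contextlib import contextmanager

# Hypothetical in-flight gauges, keyed by scheduling group name.
requests_serving = Counter()


@contextmanager
def serving(group: str):
    """Count a request as in flight until the block exits, even on error."""
    requests_serving[group] += 1
    try:
        yield
    finally:
        # Mirrors a deferred 'leave' action: runs once the work inside the
        # block, including sending the response, is finished.
        requests_serving[group] -= 1


def process_request(group: str, observed: list) -> None:
    with serving(group):
        observed.append(requests_serving[group])  # value seen while in flight


seen = []
process_request("gold", seen)
```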
2026-04-17 15:07:14 +02:00
Piotr Smaron
4988077249 transport: move requests_serving decrement to after response is sent
The requests_serving metric was decremented right after query processing
completed, but before the response was written to the client. This means
requests whose responses were queued in the write pipeline were no longer
counted as in-flight, understating the actual load.

Move the decrement into the 'leave' defer block, which fires after the
response is fully sent via _ready_to_respond. This makes the shedding
check (max_concurrent_requests_per_shard) more accurate: requests that
have finished processing but are still waiting in the response queue now
correctly count toward the in-flight limit.
2026-04-17 15:05:29 +02:00
Aleksandra Martyniuk
b4c0ad20cf service: fix indentation 2026-04-17 09:58:08 +02:00
Aleksandra Martyniuk
88c55cf7ed docs: update documentation 2026-04-17 09:58:08 +02:00
Aleksandra Martyniuk
2c0de7d9b3 test: test multi RF changes 2026-04-17 09:58:08 +02:00
Aleksandra Martyniuk
1b2b453782 service: tasks: allow aborting ongoing RF changes
Allow aborting an ongoing RF change using task manager.

RF change can only be aborted if:
- it is currently paused (existing);
- it is a multi-RF change that still has replicas to be added.

In the second case, we set error for the request in system.topology_requests
and set next_replication to replication_v2. This makes load balancer
roll back the RF change.
2026-04-17 09:58:08 +02:00
Aleksandra Martyniuk
38bad5f316 cql3: allow changing RF by more than one when adding or removing a DC
rf_rack_valid_keyspaces relies on the fact that replicas of base
table and mv are streamed concurrently. This is no longer true
for the newly introduced method of adding a DC. Disable rf_rack_valid_keyspaces
in test_mv_first_replica_in_dc to force the old method.
2026-04-17 09:58:08 +02:00
Aleksandra Martyniuk
1bafc8394c service: handle multi_rf_change
Extend keyspace_rf_change handler to handle multi_rf_change.
multi_rf_change is allowed only if we add or remove DCs and
the keyspace uses rack list replication factor. The handler
adds the request id to topology::ongoing_rf_changes.
The request is further processed by load balancer.
2026-04-17 09:58:07 +02:00
Aleksandra Martyniuk
8fb91e245f service: implement make_rf_change_plan
In make_rf_change_plan, load balancer schedules necessary migrations,
considering the load of nodes and other pending tablet transitions.
Requests from ongoing_rf_changes are processed concurrently, independently
from one another. In each request racks are processed concurrently.
No tablet replica will be removed until all required replicas are added.
While adding replicas to each rack we always start with base tables
and won't proceed with views until they are done (while removing - the other
way around).

Node availability is checked at two levels for extending actions:

1) In prepare_per_rack_rf_change_plan: the entire RF change request is
   aborted if any node in the target dc+rack is down, or if there are
   no live (non-excluded) nodes at all. Shrinking is never aborted.

2) In make_rf_change_plan: extending is skipped for a given round if
   any normal, non-excluded node in the target dc+rack is missing from
   the balanced node set. Shrinking always proceeds regardless.

The resulting behavior per node state combination (extending only):
  - all up                  -> proceed
  - some excluded + some up -> proceed (excluded nodes are skipped)
  - any down node           -> abort
  - all excluded (no live)  -> abort
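
The table above can be written as a small function. This is a simplified model in which each node in the target dc+rack is in exactly one of three hypothetical states ("up", "down", "excluded"); the real checks are split across two places as described.

```python
from typing import Iterable


def extending_decision(node_states: Iterable[str]) -> str:
    """Decide whether an extending RF change may proceed for one dc+rack."""
    states = list(node_states)
    if any(s == "down" for s in states):
        return "abort"      # any down node
    if not any(s == "up" for s in states):
        return "abort"      # all excluded: no live nodes at all
    return "proceed"        # excluded nodes, if any, are skipped
```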

When the last step is finished:
- in system_schema.keyspaces:
  - next_replication is cleared;
  - new keyspace properties are saved (if request succeeded);
- request is removed from ongoing_rf_changes;
- the request is marked as done in system.topology_requests.
2026-04-17 09:58:07 +02:00
Aleksandra Martyniuk
89a17491db service: add keyspace_rf_change_plan to migration_plan
Add keyspace_rf_change_plan to migration_plan.

The keyspace_rf_change_plan consists of:
- completion - info about the request for which all migrations are done. Only one
  request can be completed at a time, even if more have finished migrations
  (the rest will be completed later). Based on it:
    - next_replication is cleared;
    - new keyspace properties are saved (only if succeeded);
    - request is removed from ongoing_rf_changes;
    - the request is marked as done in system.topology_requests.
- aborts - info about requests that cannot complete because the required
  rf change is impossible (e.g. no available nodes in a required rack).
  Multiple requests can be aborted in a single plan. Based on each:
    - next_replication is set to current_replication (rolling back);
    - the request is marked as aborted with an error in system.topology_requests.

The scheduled rebuilds will be kept in migration_plan::_migrations.

Based on that the canonical_mutations are generated.

Add update_topology_state_with_mixed_change and use it if any schema
changes are required, i.e. if plan contains keyspace_rf_change_plan::completion.
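
The shape of the plan described above, as a Python sketch. The class and field names are hypothetical and only mirror the description; the real types are C++ and may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Completion:
    request_id: str
    succeeded: bool  # new keyspace properties are saved only on success


@dataclass
class Abort:
    request_id: str
    error: str


@dataclass
class KeyspaceRfChangePlan:
    # At most one request is completed per plan.
    completion: Optional[Completion] = None
    # Any number of requests may be aborted in a single plan.
    aborts: List[Abort] = field(default_factory=list)

    def needs_schema_change(self) -> bool:
        """Schema changes are required only when a completion is present."""
        return self.completion is not None


plan = KeyspaceRfChangePlan(aborts=[Abort("req-1", "no available nodes in rack")])
```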
2026-04-17 09:58:07 +02:00
Aleksandra Martyniuk
bcdab2e012 service: extend tablet_migration_info to handle rebuilds
Make tablet_migration_info::{src,dst} optional, so that it can be
reused by rebuild, for respectively leaving and pending replica.
2026-04-17 09:58:07 +02:00
Aleksandra Martyniuk
d41c5a7db4 service: split update_node_load_on_migration
Split update_node_load_on_migration into decrease_node_load and
increase_node_load - in the following changes for rebuilds we will
need only one of those at a time.
2026-04-17 09:58:07 +02:00
Aleksandra Martyniuk
dd83666733 service: rearrange keyspace_rf_change handler
In the following changes, keyspace_rf_change handler will also consider
a change of RF by more than one. Rearrange the handler, so that it
first chooses a kind of RF change and then creates relevant updates.

Do not wrap the code in schedule_migration function, as we no longer
need a quick return possibility.
2026-04-17 09:58:07 +02:00
Aleksandra Martyniuk
72bb3113ac db: add columns to system_schema.keyspaces
Add a new next_replication column to system_schema.keyspaces table.

While there is an ongoing RF change:
- next_replication keeps the target RF values;
- existing replication_v2 column keeps initial RF values - the ones we
  started the RF change with.

DESCRIBE KEYSPACE statement shows replication_v2.

When there is no ongoing RF change for this keyspace, its
next_replication is empty.

In this commit no data is kept in the new column.
2026-04-17 09:58:07 +02:00
Aleksandra Martyniuk
751af38f2a db: service: add ongoing_rf_changes to system.topology
Following changes will allow adding or removing all keyspace
replicas in a DC with a single ALTER KEYSPACE. For such operations,
the tablet load balancer needs to schedule rebuilds. To track
which RF change requests require rebuilds, we maintain a vector
of RF changes along with their ongoing rebuild phases.

Add a new ongoing_rf_changes column to system.topology to keep track
of those requests.

In this commit no data is kept in the new column.
2026-04-17 09:58:07 +02:00
Aleksandra Martyniuk
7cdf7d62a2 gms: add keyspace_multi_rf_change feature 2026-04-17 09:58:05 +02:00
Łukasz Paszkowski
4657d9e32c streaming: reject mutation fragments on critical disk utilization
The stream_mutation_fragments RPC handler did not check
is_in_critical_disk_utilization_mode before accepting incoming mutation
fragments. This meant load-and-stream (nodetool refresh --load-and-stream)
could push data onto a node at critical disk utilization, potentially
filling the disk completely.

Add a critical disk utilization check in the get_next_mutation_fragment
lambda, throwing critical_disk_utilization_exception when the node is in
critical mode. This mirrors the existing protection in stream_blob.cc.

Also remove the xfail marker from the corresponding test added in the
previous commit.
2026-04-17 09:31:26 +02:00
Gleb Natapov
66b3fc4e2c db/system_distributed_keyspace: drop CDC_TOPOLOGY_DESCRIPTION and CDC_GENERATIONS_V2
They are used only to prevent permission changes, but since the tables
are unused even if they exist, there is no problem with changing their
permissions, so there is no point keeping the definitions just for that.
2026-04-16 14:11:01 +03:00
Łukasz Paszkowski
61877e9dfb test/cluster/storage: Add a reproducer for load-and-stream out-of-space rejection
Add `test_load_and_stream_rejected_on_critical_disk` which verifies
that `nodetool refresh --load-and-stream` is rejected when the target
node reaches critical disk utilization during streaming. The test is
marked xfail because the stream_mutation_fragments handler does not
yet check whether the node is in the critical disk utilization mode
(introduced in the next patch).

The test sets up a 3-node cluster, writes data and snapshots SSTables
on one node, wipes another node's data, and copies the snapshot to its
upload directory. It then starts load-and-stream and uses the
`write_components_writer_created` error injection to pause SSTable writing.
While paused, the test fills the disk past the critical threshold, then
releases the injection. The next streamed mutation fragment is rejected
with critical_disk_utilization_exception.

The test verifies that:

- The operation fails with the expected error.
- No data is persisted on the target node.
- Partial SSTable files created during streaming are deleted (via the
  implicit mark-for-deletion mechanism in the SSTable lifecycle).
2026-04-16 08:38:34 +02:00
Łukasz Paszkowski
8d34127684 sstables: clean up TemporaryHashes file in wipe()
The TemporaryHashes.db.tmp file is created during SSTable writing to
store intermediate bloom filter hashes and is deleted before the SSTable
is sealed. Since it is not tracked in the TOC, it is also absent from
_recognized_components and all_components().

When an SSTable write fails before sealing (e.g. streaming rejected due
to critical disk utilization), wipe() is called to clean up the partial
SSTable. However, wipe() only iterates over all_components(), so the
TemporaryHashes file was left behind as an orphan.

Previously, the only cleanup mechanism for this file was the
startup-time directory scanner in sstable_directory, which would not
help when the orphan needs to be cleaned up at runtime.

Explicitly remove the TemporaryHashes file in wipe(), ignoring ENOENT
for the common case where the file was already removed before sealing.
2026-04-16 08:38:34 +02:00
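The ENOENT-tolerant removal described in this commit can be sketched with the standard library. This is only an illustration under stated assumptions: `remove_ignoring_enoent` is a hypothetical helper, not the code in wipe(), and the real implementation presumably goes through Seastar's asynchronous file API rather than std::filesystem.

```cpp
#include <cassert>
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical helper (not ScyllaDB's wipe()): remove `path`, treating
// "no such file" as success, because the temporary file is normally
// deleted before the SSTable is sealed.
// Returns true if the file was removed or was already absent,
// false on any other error.
bool remove_ignoring_enoent(const fs::path& path) {
    assert(!path.empty());
    std::error_code ec;
    // fs::remove() returns false and leaves `ec` clear when the file is absent.
    fs::remove(path, ec);
    if (ec && ec != std::errc::no_such_file_or_directory) {
        return false;
    }
    return true;
}
```

Calling the helper twice on the same path succeeds both times, which is the property wipe() needs when the file may or may not have been removed already.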
Łukasz Paszkowski
159675e975 sstables: add error injection point in write_components
Add a `write_components_writer_created` error injection point in
`sstable::write_components()` between writer creation and fragment
consumption.

This injection is needed by the out-of-space streaming test (added in
the next patch) to reliably pause SSTable writing at the right moment:
after the SSTable writer has been created and files exist on disk, but
before mutation fragments are consumed.

Pausing earlier (before writer creation) would not work because there
are no files on disk yet, while pausing later (after consuming fragments)
would be too late to reliably push the node into critical disk utilization.
2026-04-16 08:38:34 +02:00
Łukasz Paszkowski
d1a24aa16a test/cluster/storage: extract validate_data_existence to module scope
Move validate_data_existence out of test_user_writes_rejection into
module scope so it can be reused by other tests in the file. No
functional change.
2026-04-16 08:38:33 +02:00
Łukasz Paszkowski
9c82b76755 test/cluster: enable suppress_disk_space_threshold_checks in tests using data_file_capacity
Tests that override disk capacity via the data_file_capacity config
option trigger the disk space monitor's critical utilization mode and
as a consequence activate out-of-space prevention mechanisms.

This will cause bootstrap failures with critical_disk_utilization_exception
during mutation-based streaming introduced later in the series.

Enable the `suppress_disk_space_threshold_checks` error injection at
startup in the affected tests to prevent the disk space monitor from
interfering with the test-configured capacity values.

Affected tests:
- test_balance_empty_tablets (test/cluster/test_size_based_load_balancing.py)
- test_load_stats_on_coordinator_failover (test/cluster/test_tablet_stats.py)
2026-04-16 08:38:33 +02:00
Łukasz Paszkowski
3726e31c03 utils/disk_space_monitor: add error injection to suppress threshold checks
Add the `suppress_disk_space_threshold_checks` error injection point
to the disk space monitor. When enabled, the threshold listener
short-circuits without evaluating disk utilization.

This is useful for tests that override disk capacity via `data_file_capacity`,
where the real disk usage causes the monitor to incorrectly report
critical utilization and activate out-of-space prevention mechanisms.
2026-04-16 08:38:33 +02:00
Gleb Natapov
b55748ec54 db/system_distributed_keyspace: remove unused code
system_distributed_keyspace::CDC_TOPOLOGY_DESCRIPTION table is not
created since scylla-4.6.0 but we still have code to mark it as sync.
2026-04-15 15:48:49 +03:00
Gleb Natapov
8713eda271 db/system_distributed_keyspace: drop old cdc_generation_descriptions_v2 table
The generation management moved to raft and old table is no longer
used.
2026-04-15 15:48:48 +03:00
Gleb Natapov
24171ce62b db/system_distributed_keyspace: drop old service_levels table
Service level management moved to raft and old table is no longer
supported.
2026-04-15 15:48:48 +03:00
Gleb Natapov
6a03768b35 fix indent after the previous patch 2026-04-15 15:48:48 +03:00
Gleb Natapov
0ef06a34ed group0: call setup_group0 only when needed
setup_group0 and setup_group0_if_exist have hidden conditions inside that
make them no-ops. It is not clear at the call site that the functions may
do nothing. Change the code to check the conditions at the call site
instead.
2026-04-15 15:48:48 +03:00
Botond Dénes
9999e7b642 test/perf/perf_simple_query: add --collection=N
Defaults to 0. When N > 0, adds a map<blob, blob> collection column to
the schema. Each row will have a collection cell with N elements.
Allows benchmarking collection handling.
2026-04-15 09:46:54 +03:00
Botond Dénes
4ff96f0092 test/boost/frozen_mutation_test: add freeze/unfreeze test for large collections 2026-04-15 09:46:54 +03:00
Botond Dénes
4eef5e1c65 mutation/mutation_partition_view: use read_from_collection_cell_view() to read collections
This cuts back on the number of allocations required for deserializing
collections, from O(num_cells) to O(1).

The visitor now receives an rvalue, so update all callers of
read_and_visit_row(), patching their visitors to take advantage of this
and move the serialized collection instead of copying it.
2026-04-15 09:46:54 +03:00
Botond Dénes
1bb04824a8 mutation/collection_mutation: introduce read_from_collection_cell_view()
Reads a collection_mutation directly from the IDL representation of a
collection. This cuts down the number of allocations required
drastically compared to the current method of:

    IDL -> collection_mutation_description -> collection_mutation

Intended to be used in frozen_mutation::unfreeze() and similar use-cases.
2026-04-15 09:46:54 +03:00
Botond Dénes
5f2c003445 mutation/atomic_cell: atomic_cell_type: add write*() and *serialized_size()
atomic_cell_type has various static make_*() methods which create a
serialized cell based on the parameters. This patch adds write_*()
methods which mirror the existing make_*() ones, with the exception that
the write methods write into a caller-provided buffer. The make methods
are refactored to call the appropriate write overload.
*_serialized_size() methods are added as well, to calculate how many
bytes the serialized data will take after the appropriate write call.
This allows code to write cells directly into a pre-arranged buffer,
perhaps even multiple ones into the same one.

Since the intended use-case this patch prepares for is serializing an
entire collection directly into a single buffer, only make variants
which are legal in collections are handled. I.e. counters are not.
2026-04-15 09:46:54 +03:00
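The size-then-write pattern this commit describes can be sketched as below. Everything here is an assumption made for illustration: `toy_cell`, its layout (an 8-byte timestamp followed by the value bytes) and all function names are hypothetical and do not reflect the actual atomic_cell_type format or signatures.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string_view>
#include <vector>

// Hypothetical cell: 8-byte timestamp followed by the raw value bytes.
struct toy_cell {
    std::int64_t timestamp;
    std::string_view value;
};

// How many bytes write_toy_cell() will produce for this cell.
std::size_t toy_cell_serialized_size(const toy_cell& c) {
    return sizeof(c.timestamp) + c.value.size();
}

// Writes into a caller-provided buffer and returns the position
// just past the written bytes.
std::byte* write_toy_cell(std::byte* out, const toy_cell& c) {
    std::memcpy(out, &c.timestamp, sizeof(c.timestamp));
    out += sizeof(c.timestamp);
    if (!c.value.empty()) {
        std::memcpy(out, c.value.data(), c.value.size());
    }
    return out + c.value.size();
}

// The make variant is expressed in terms of the write variant.
std::vector<std::byte> make_toy_cell(const toy_cell& c) {
    std::vector<std::byte> buf(toy_cell_serialized_size(c));
    write_toy_cell(buf.data(), c);
    return buf;
}

// Many cells serialized into one buffer with a single allocation.
std::vector<std::byte> serialize_all(const std::vector<toy_cell>& cells) {
    std::size_t total = 0;
    for (const auto& c : cells) {
        total += toy_cell_serialized_size(c);
    }
    std::vector<std::byte> buf(total);
    std::byte* out = buf.data();
    for (const auto& c : cells) {
        out = write_toy_cell(out, c);
    }
    assert(out == buf.data() + total);
    return buf;
}
```

Computing the total size first is what lets a whole collection land in one pre-sized buffer instead of one allocation per cell.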
Botond Dénes
5b5cb94115 mutation/collection_mutation: generalize serialize_collection_mutation
This is already a template on Iterator, but generalize it further by
adding an Adaptor template which adapts the Iterator::value_type to the
requirements of the method. This allows passing Iterators with
value_type other than atomic_cell[_view].
2026-04-15 09:46:54 +03:00
Botond Dénes
17ac9da5d2 mutation/mutation_partition_view: avoid copying collection 2026-04-15 09:46:54 +03:00
Botond Dénes
aab336eb77 mutation/mutation_partition_view: accept collection_mutation in the consume API
Instead of collection_mutation_view. Follow suit with the atomic_cell
overloads, which already accept a value, to allow the caller to move the
value along. The current interface forces collections to be copied.
2026-04-15 09:46:54 +03:00
Botond Dénes
652676e563 partition_builder: add move variant of accept_*_cell() collection overloads
Atomic cell overloads already have it, add it for the collection ones
too. Will be used to help avoid copying collections unnecessarily.
2026-04-15 09:46:53 +03:00
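A minimal sketch of what a "move variant" of an accept overload buys, assuming a toy builder that stores serialized cells as strings. `toy_builder` and `accept_cell` are hypothetical names, not the partition_builder interface; the counters exist only to make the chosen overload observable.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical builder, for illustration only.
struct toy_builder {
    std::vector<std::string> cells;
    int copies = 0;
    int moves = 0;

    // Copying overload: the caller keeps its value, so the bytes are duplicated.
    void accept_cell(const std::string& serialized) {
        cells.push_back(serialized);
        ++copies;
    }

    // Move overload: takes over the caller's buffer instead of duplicating it.
    void accept_cell(std::string&& serialized) {
        cells.push_back(std::move(serialized));
        ++moves;
    }
};
```

A caller that no longer needs its value passes `std::move(value)` and the second overload is selected, so a large collection is handed over without duplicating its bytes.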
Petr Gusev
8a16746e55 strong_consistency: fix crash when DROP TABLE races with in-flight DML
When DROP TABLE races with an in-flight DML on a strongly-consistent
table, the node aborts in groups_manager::acquire_server() because the
raft group has already been erased from _raft_groups.

A concurrent DROP TABLE may have already removed the table from database
registries and erased the raft group via schedule_raft_group_deletion.
The schema.table() in create_operation_ctx() might not fail though
because someone might be holding lw_shared_ptr<table>, so that the
table is dropped but the table object is still alive.

Fix by accepting table_id in acquire_server and checking that the table
still exists in the database via find_column_family before looking up
the raft group.  If the table has been dropped, find_column_family
throws no_such_column_family instead of the node aborting via
on_internal_error.  When the table does exist, acquire_server proceeds
to acquire state.gate; schedule_raft_group_deletion co_awaits
gate::close, so it will wait for the DML operation to complete before
erasing the group.

Fixes SCYLLADB-1450
2026-04-10 22:56:16 +02:00
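The ordering of checks in this fix can be illustrated with a synchronous toy. It is a sketch under loud assumptions: `toy_groups_manager`, `no_such_table` and all member names are hypothetical, a holder count stands in for the gate, and "waiting for gate::close" is modelled as deferring the erase until the last holder releases.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <set>
#include <stdexcept>

// Hypothetical names throughout; this is not the groups_manager code.
struct no_such_table : std::runtime_error {
    using std::runtime_error::runtime_error;
};

class toy_groups_manager {
    std::set<std::uint64_t> _live_tables;         // stand-in for the database registry
    std::map<std::uint64_t, int> _group_holders;  // table id -> in-flight operations
public:
    void create_table(std::uint64_t id) {
        _live_tables.insert(id);
        _group_holders.emplace(id, 0);
    }

    // Check the registry first, so that a dropped table produces a regular
    // exception instead of an internal error about a missing group.
    void acquire(std::uint64_t id) {
        if (_live_tables.count(id) == 0) {
            throw no_such_table("table was dropped");
        }
        auto it = _group_holders.find(id);
        assert(it != _group_holders.end());
        ++it->second;
    }

    void release(std::uint64_t id) {
        auto it = _group_holders.find(id);
        assert(it != _group_holders.end() && it->second > 0);
        if (--it->second == 0 && _live_tables.count(id) == 0) {
            _group_holders.erase(it);  // deferred deletion after the last holder
        }
    }

    // Analogue of waiting on the gate: the group is erased only once
    // no operation holds it.
    void drop_table(std::uint64_t id) {
        _live_tables.erase(id);
        auto it = _group_holders.find(id);
        if (it != _group_holders.end() && it->second == 0) {
            _group_holders.erase(it);
        }
    }

    bool has_group(std::uint64_t id) const {
        return _group_holders.count(id) != 0;
    }
};
```

An operation that acquired before the drop keeps the group alive until it releases, while an operation arriving after the drop gets `no_such_table`.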
Petr Gusev
82460e7a38 test: add regression test for DROP TABLE racing with in-flight DML
Add test_drop_table_during_insert that reproduces a crash when DROP TABLE
races with an in-flight INSERT on a strongly-consistent table.  The test
uses an error injection to pause INSERT between obtaining the ERM and
calling acquire_server, then drops the table (which destroys the raft
group), then resumes the INSERT.  Without a fix, the node aborts in
acquire_server via on_internal_error.

The test is marked as skip until the fix is in place.
2026-04-10 22:56:16 +02:00
335 changed files with 9865 additions and 4229 deletions

4
.github/CODEOWNERS vendored
View File

@@ -32,8 +32,8 @@ counters* @nuivall
tests/counter_test* @nuivall
# DOCS
docs/* @annastuchlik @tzach
docs/alternator @annastuchlik @tzach @nyh
/docs/ @annastuchlik @tzach
/docs/alternator/ @annastuchlik @tzach @nyh
# GOSSIP
gms/* @tgrabiec @asias @kbr-scylla

4
.gitignore vendored
View File

@@ -36,4 +36,6 @@ compile_commands.json
clang_build
.idea/
nuke
rust/target
rust/**/target
rust/**/Cargo.lock
test/resource/wasm/rust/target

View File

@@ -78,7 +78,7 @@ fi
# Default scylla product/version tags
PRODUCT=scylla
VERSION=2026.2.0-dev
VERSION=2026.3.0-dev
if test -f version
then

View File

@@ -681,7 +681,7 @@ static bool calculate_primitive_condition(const parsed::primitive_condition& con
case parsed::primitive_condition::type::VALUE:
if (calculated_values.size() != 1) {
// Shouldn't happen unless we have a bug in the parser
throw std::logic_error(format("Unexpected values in primitive_condition", cond._values.size()));
throw std::logic_error(format("Unexpected values {} in primitive_condition", cond._values.size()));
}
// Unwrap the boolean wrapped as the value (if it is a boolean)
if (calculated_values[0].IsObject() && calculated_values[0].MemberCount() == 1) {

View File

@@ -1362,6 +1362,33 @@ static int get_dimensions(const rjson::value& vector_attribute, std::string_view
return dimensions_v->GetInt();
}
// As noted in issue #5052, in Alternator the CreateTable and UpdateTable are
// currently synchronous - they return only after the operation is complete.
// After announce() of the new schema finished, the schema change is committed
// and a majority of nodes know it - but it's possible that some live nodes
// have not yet applied the new schema. If we return to the user now, and the
// user sends a node request that relies on the new schema, it might fail.
// So before returning, we must verify that *all* nodes have applied the new
// schema. This is what wait_for_schema_agreement_after_ddl() does.
//
// Note that wait_for_schema_agreement_after_ddl() has a timeout (currently
// hard-coded to 30 seconds). If the timeout is reached an InternalServerError
// is returned. The user, who doesn't know if the CreateTable succeeded or not,
// can retry the request and will get a ResourceInUseException and know the
// table already exists. So a CreateTable that returns a ResourceInUseException
// should also call wait_for_schema_agreement_after_ddl().
//
// When issue #5052 is resolved, this function can be removed - we will need
// to check if we reached schema agreement, but not to *wait* for it.
static future<> wait_for_schema_agreement_after_ddl(service::migration_manager& mm, const replica::database& db) {
static constexpr auto schema_agreement_seconds = 30;
try {
co_await mm.wait_for_schema_agreement(db, db::timeout_clock::now() + std::chrono::seconds(schema_agreement_seconds), nullptr);
} catch (const service::migration_manager::schema_agreement_timeout&) {
throw api_error::internal(fmt::format("The operation was successful, but unable to confirm cluster-wide schema agreement after {} seconds. Please retry the operation, and wait for the retry to report an error since the operation was already done.", schema_agreement_seconds));
}
}
future<executor::request_return_type> executor::create_table_on_shard0(service::client_state&& client_state, tracing::trace_state_ptr trace_state, rjson::value request, bool enforce_authorization, bool warn_authorization,
const db::tablets_mode_t::mode tablets_mode, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
throwing_assert(this_shard_id() == 0);
@@ -1695,13 +1722,26 @@ future<executor::request_return_type> executor::create_table_on_shard0(service::
}
}
}
bool table_already_exists = false;
try {
schema_mutations = service::prepare_new_keyspace_announcement(_proxy.local_db(), ksm, ts);
} catch (exceptions::already_exists_exception&) {
if (_proxy.data_dictionary().has_schema(keyspace_name, table_name)) {
co_return api_error::resource_in_use(fmt::format("Table {} already exists", table_name));
table_already_exists = true;
}
}
if (table_already_exists) {
// The user may have retried a CreateTable operation after it timed
// out in wait_for_schema_agreement_after_ddl(). So before we may
// return ResourceInUseException (which can lead the user to start
// using the table which it now knows exists), we need to wait for
// schema agreement, just like the original CreateTable did. Again
// we fail with InternalServerError if schema agreement still cannot
// be reached. We can release group0_guard before waiting.
release_guard(std::move(group0_guard));
co_await wait_for_schema_agreement_after_ddl(_mm, _proxy.local_db());
co_return api_error::resource_in_use(fmt::format("Table {} already exists", table_name));
}
if (_proxy.data_dictionary().try_find_table(schema->id())) {
// This should never happen, the ID is supposed to be unique
co_return api_error::internal(format("Table with ID {} already exists", schema->id()));
@@ -1750,7 +1790,7 @@ future<executor::request_return_type> executor::create_table_on_shard0(service::
}
}
co_await _mm.wait_for_schema_agreement(_proxy.local_db(), db::timeout_clock::now() + 10s, nullptr);
co_await wait_for_schema_agreement_after_ddl(_mm, _proxy.local_db());
rjson::value status = rjson::empty_object();
executor::supplement_table_info(request, *schema, _proxy);
rjson::add(status, "TableDescription", std::move(request));
@@ -1860,7 +1900,7 @@ future<executor::request_return_type> executor::update_table(client_state& clien
rjson::value* stream_specification = rjson::find(request, "StreamSpecification");
if (stream_specification && stream_specification->IsObject()) {
empty_request = false;
if (add_stream_options(*stream_specification, builder, p.local())) {
if (add_stream_options(*stream_specification, builder, p.local(), tab->cdc_options())) {
validate_cdc_log_name_length(builder.cf_name());
// On tablet tables, defer stream enablement and block
// tablet merges (see defer_enabling_streams_block_tablet_merges).
@@ -1875,6 +1915,23 @@ future<executor::request_return_type> executor::update_table(client_state& clien
if (tab->cdc_options().enabled() || tab->cdc_options().enable_requested()) {
co_return api_error::validation("Table already has an enabled stream: TableName: " + tab->cf_name());
}
// When re-enabling streams on an Alternator table, drop the old
// CDC log table first as a separate schema change, so the
// subsequent UpdateTable creates a fresh one with a new UUID
// (= new StreamArn). See #7239.
auto logname = cdc::log_name(tab->cf_name());
auto& local_db = p.local().local_db();
if (local_db.has_schema(tab->ks_name(), logname)
&& cdc::is_log_schema(*local_db.find_schema(tab->ks_name(), logname))) {
auto drop_m = co_await service::prepare_column_family_drop_announcement(
p.local(), tab->ks_name(), logname,
group0_guard.write_timestamp());
co_await mm.announce(std::move(drop_m), std::move(group0_guard),
format("alternator-executor: drop old CDC log for {}", tab->cf_name()));
co_await mm.wait_for_schema_agreement(
p.local().local_db(), db::timeout_clock::now() + 10s, nullptr);
continue;
}
}
else if (!tab->cdc_options().enabled() && !tab->cdc_options().enable_requested()) {
co_return api_error::validation("Table has no stream to disable: TableName: " + tab->cf_name());
@@ -1892,7 +1949,7 @@ future<executor::request_return_type> executor::update_table(client_state& clien
}
if (vector_index_updates->Size() > 1) {
// VectorIndexUpdates mirrors GlobalSecondaryIndexUpdates.
// Since DynamoDB artifically limits the latter to just a
// Since DynamoDB artificially limits the latter to just a
// single operation (one Create or one Delete), we also
// place the same artificial limit on VectorIndexUpdates,
// and throw the same LimitExceeded error if the client
@@ -2189,7 +2246,7 @@ future<executor::request_return_type> executor::update_table(client_state& clien
throw;
}
}
co_await mm.wait_for_schema_agreement(p.local().local_db(), db::timeout_clock::now() + 10s, nullptr);
co_await wait_for_schema_agreement_after_ddl(mm, p.local().local_db());
rjson::value status = rjson::empty_object();
supplement_table_info(request, *schema, p.local());

View File

@@ -30,6 +30,7 @@
#include "utils/updateable_value.hh"
#include "tracing/trace_state.hh"
#include "cdc/cdc_options.hh"
namespace db {
@@ -199,7 +200,7 @@ private:
tracing::trace_state_ptr trace_state, service_permit permit);
public:
static bool add_stream_options(const rjson::value& stream_spec, schema_builder&, service::storage_proxy& sp);
static bool add_stream_options(const rjson::value& stream_spec, schema_builder&, service::storage_proxy& sp, const cdc::options& existing_cdc_opts = {});
static void supplement_table_info(rjson::value& descr, const schema& schema, service::storage_proxy& sp);
static void supplement_table_stream_info(rjson::value& descr, const schema& schema, const service::storage_proxy& sp);
};

View File

@@ -1354,7 +1354,7 @@ static future<executor::request_return_type> query_vector(
std::unordered_set<std::string> used_attribute_values;
// Parse the Select parameter and determine which attributes to return.
// For a vector index, the default Select is ALL_ATTRIBUTES (full items).
// ALL_PROJECTED_ATTRIBUTES is significantly more efficent because it
// ALL_PROJECTED_ATTRIBUTES is significantly more efficient because it
// returns what the vector store returned without looking up additional
// base-table data. Currently only the primary key attributes are projected
// but in the future we'll implement projecting additional attributes into

View File

@@ -167,46 +167,8 @@ static schema_ptr get_schema_from_arn(service::storage_proxy& proxy, const strea
}
}
// ShardId. Must be between 28 and 65 characters inclusive.
// UUID is 36 bytes as string (including dashes).
// Prepend a version/type marker (`S`) -> 37
class stream_shard_id : public utils::UUID {
public:
using UUID = utils::UUID;
static constexpr char marker = 'S';
stream_shard_id() = default;
stream_shard_id(const UUID& uuid)
: UUID(uuid)
{}
stream_shard_id(const table_id& tid)
: UUID(tid.uuid())
{}
stream_shard_id(std::string_view v)
: UUID(v.substr(1))
{
if (v[0] != marker) {
throw std::invalid_argument(std::string(v));
}
}
friend std::ostream& operator<<(std::ostream& os, const stream_shard_id& arn) {
const UUID& uuid = arn;
return os << marker << uuid;
}
friend std::istream& operator>>(std::istream& is, stream_shard_id& arn) {
std::string s;
is >> s;
arn = stream_shard_id(s);
return is;
}
};
} // namespace alternator
template<typename ValueType>
struct rapidjson::internal::TypeHelper<ValueType, alternator::stream_shard_id>
: public from_string_helper<ValueType, alternator::stream_shard_id>
{};
template<typename ValueType>
struct rapidjson::internal::TypeHelper<ValueType, alternator::stream_arn>
: public from_string_helper<ValueType, alternator::stream_arn>
@@ -218,7 +180,8 @@ future<alternator::executor::request_return_type> alternator::executor::list_str
_stats.api_operations.list_streams++;
auto limit = rjson::get_opt<int>(request, "Limit").value_or(100);
auto streams_start = rjson::get_opt<stream_shard_id>(request, "ExclusiveStartStreamArn");
auto streams_start = rjson::get_opt<stream_arn>(request, "ExclusiveStartStreamArn");
auto table = find_table(_proxy, request);
auto db = _proxy.data_dictionary();
@@ -244,34 +207,34 @@ future<alternator::executor::request_return_type> alternator::executor::list_str
cfs = db.get_tables();
}
// # 12601 (maybe?) - sort the set of tables on ID. This should ensure we never
// generate duplicates in a paged listing here. Can obviously miss things if they
// are added between paged calls and end up with a "smaller" UUID/ARN, but that
// is to be expected.
// We need to sort the tables to ensure a stable order for paging.
// We sort by keyspace and table name, which will also allow us to skip to
// the right position by ExclusiveStartStreamArn.
auto cmp = [](std::string_view ks1, std::string_view cf1, std::string_view ks2, std::string_view cf2) {
return ks1 == ks2 ? cf1 < cf2 : ks1 < ks2;
};
if (std::cmp_less(limit, cfs.size()) || streams_start) {
std::sort(cfs.begin(), cfs.end(), [](const data_dictionary::table& t1, const data_dictionary::table& t2) {
return t1.schema()->id().uuid() < t2.schema()->id().uuid();
});
std::sort(cfs.begin(), cfs.end(),
[&cmp](const data_dictionary::table& t1, const data_dictionary::table& t2) {
return cmp(t1.schema()->ks_name(), t1.schema()->cf_name(),
t2.schema()->ks_name(), t2.schema()->cf_name());
});
}
auto i = cfs.begin();
auto e = cfs.end();
if (streams_start) {
i = std::find_if(i, e, [&](const data_dictionary::table& t) {
return t.schema()->id().uuid() == streams_start
&& cdc::get_base_table(db.real_database(), *t.schema())
&& is_alternator_keyspace(t.schema()->ks_name())
;
});
if (i != e) {
++i;
}
i = std::upper_bound(i, e, *streams_start,
[&cmp](const stream_arn& arn, const data_dictionary::table& t) {
return cmp(arn.keyspace_name(), arn.table_name(),
t.schema()->ks_name(), t.schema()->cf_name());
});
}
auto ret = rjson::empty_object();
auto streams = rjson::empty_array();
std::optional<stream_shard_id> last;
std::optional<stream_arn> last;
for (;limit > 0 && i != e; ++i) {
auto s = i->schema();
@@ -280,21 +243,29 @@ future<alternator::executor::request_return_type> alternator::executor::list_str
if (!is_alternator_keyspace(ks_name)) {
continue;
}
if (cdc::is_log_for_some_table(db.real_database(), ks_name, cf_name)) {
// Use get_base_table instead of is_log_for_some_table because the
// latter requires CDC to be enabled, but we want to list streams
// that have been disabled but whose log table still exists (#7239).
if (cdc::get_base_table(db.real_database(), ks_name, cf_name)) {
rjson::value new_entry = rjson::empty_object();
last = i->schema()->id();
auto arn = stream_arn{ i->schema(), cdc::get_base_table(db.real_database(), *i->schema()) };
rjson::add(new_entry, "StreamArn", arn);
rjson::add(new_entry, "StreamLabel", rjson::from_string(stream_label(*s)));
rjson::add(new_entry, "TableName", rjson::from_string(cdc::base_name(s->cf_name())));
rjson::push_back(streams, std::move(new_entry));
last = std::move(arn);
--limit;
}
}
rjson::add(ret, "Streams", std::move(streams));
if (last) {
// Only emit LastEvaluatedStreamArn when we stopped because we hit the
// limit (limit == 0), meaning there may be more streams to list.
// If we exhausted all tables naturally (limit > 0), there are no more
// streams, so we must not emit a cookie.
if (last && limit == 0) {
rjson::add(ret, "LastEvaluatedStreamArn", *last);
}
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
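The paging scheme in this hunk (sort by keyspace and table name, resume with upper_bound, emit a cookie only when the limit was hit) can be sketched outside ScyllaDB as follows. The types `table_name` and `page` and the function `list_page` are hypothetical stand-ins; the filtering of non-Alternator and non-CDC tables done by the real loop is left out.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string>
#include <tuple>
#include <vector>

// Hypothetical stand-in for a stream ARN: keyspace plus table name.
struct table_name {
    std::string ks;
    std::string cf;
};

// Lexicographic order on (keyspace, table), as in the cmp lambda above.
bool name_less(const table_name& a, const table_name& b) {
    return std::tie(a.ks, a.cf) < std::tie(b.ks, b.cf);
}

struct page {
    std::vector<table_name> items;
    std::optional<table_name> cookie;  // plays the role of LastEvaluatedStreamArn
};

page list_page(std::vector<table_name> tables, std::size_t limit,
               const std::optional<table_name>& start) {
    std::sort(tables.begin(), tables.end(), name_less);
    auto i = tables.begin();
    if (start) {
        // First entry strictly greater than the cookie. This works even if
        // the table named by the cookie no longer exists.
        i = std::upper_bound(i, tables.end(), *start, name_less);
    }
    page out;
    for (; limit > 0 && i != tables.end(); ++i) {
        out.items.push_back(*i);
        --limit;
    }
    // Emit a cookie only when we stopped because the limit was reached.
    if (limit == 0 && !out.items.empty()) {
        out.cookie = out.items.back();
    }
    return out;
}
```

Because the resume position is found by ordering rather than by exact match, a cookie naming a table that was dropped between calls still yields the correct next page.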
@@ -424,7 +395,7 @@ std::istream& operator>>(std::istream& is, stream_view_type& type) {
return is;
}
static stream_view_type cdc_options_to_steam_view_type(const cdc::options& opts) {
static stream_view_type cdc_options_to_stream_view_type(const cdc::options& opts) {
stream_view_type type = stream_view_type::KEYS_ONLY;
if (opts.preimage() && opts.postimage()) {
type = stream_view_type::NEW_AND_OLD_IMAGES;
@@ -614,7 +585,7 @@ void stream_id_range::prepare_for_iterating()
// the function returns `stream_id_range` that will allow iteration over children Streams shards for the Streams shard `parent`
// a child Streams shard is defined as a Streams shard that touches token range that was previously covered by `parent` Streams shard
// Streams shard contains a token, that represents end of the token range for that Streams shard (inclusive)
// begginning of the token range is defined by previous Streams shard's token + 1
// beginning of the token range is defined by previous Streams shard's token + 1
// NOTE: With vnodes, ranges of Streams' shards wrap, while with tablets the biggest allowed token number is always a range end.
// NOTE: both streams generation are guaranteed to cover whole range and be non-empty
// NOTE: it's possible to get more than one stream shard with the same token value (thus some of those stream shards will be empty) -
@@ -870,6 +841,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
auto& opts = bs->cdc_options();
auto status = "DISABLED";
bool stream_disabled = !opts.enabled();
if (opts.enabled()) {
if (!_cdc_metadata.streams_available()) {
@@ -885,7 +857,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
rjson::add(stream_desc, "StreamStatus", rjson::from_string(status));
stream_view_type type = cdc_options_to_steam_view_type(opts);
stream_view_type type = cdc_options_to_stream_view_type(opts);
rjson::add(stream_desc, "StreamArn", stream_arn);
rjson::add(stream_desc, "StreamViewType", type);
@@ -893,10 +865,9 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
describe_key_schema(stream_desc, *bs);
if (!opts.enabled()) {
rjson::add(ret, "StreamDescription", std::move(stream_desc));
co_return rjson::print(std::move(ret));
}
// For disabled streams, we still fall through to enumerate shards
// below. All shards will have EndingSequenceNumber set, indicating
// they are closed. See issue #7239.
// TODO: label
// TODO: creation time
@@ -979,6 +950,12 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
auto expired = [&]() -> std::optional<db_clock::time_point> {
auto j = std::next(i);
if (j == e) {
// For a disabled stream, all shards are closed (#7239).
// Use "now" as the ending sequence number for the last
// generation's shards.
if (stream_disabled) {
return db_clock::now();
}
return std::nullopt;
}
// add this so we sort of match potential
@@ -1329,7 +1306,7 @@ future<executor::request_return_type> executor::get_records(client_state& client
| std::ranges::to<query::column_id_vector>()
;
stream_view_type type = cdc_options_to_steam_view_type(base->cdc_options());
stream_view_type type = cdc_options_to_stream_view_type(base->cdc_options());
auto selection = cql3::selection::selection::for_columns(schema, std::move(columns));
auto partition_slice = query::partition_slice(
@@ -1513,17 +1490,17 @@ future<executor::request_return_type> executor::get_records(client_state& client
auto& shard = iter.shard;
if (shard.time < ts && ts < high_ts) {
if (!base->cdc_options().enabled()) {
// Stream is disabled -- all shards are closed (#7239).
// Don't return NextShardIterator.
} else if (shard.time < ts && ts < high_ts) {
// The DynamoDB documentation states that when a shard is
// closed, reading it until the end has NextShardIterator
// "set to null". Our test test_streams_closed_read
// confirms that by "null" they meant not set at all.
} else {
// We could have return the same iterator again, but we did
// a search from it until high_ts and found nothing, so we
// can also start the next search from high_ts.
// TODO: but why? It's simpler just to leave the iterator be.
shard_iterator next_iter(iter.table, iter.shard, utils::UUID_gen::min_time_UUID(high_ts.time_since_epoch()), true);
// Shard is still open with no records in the scanned window.
// Return the original iterator so the client can poll again.
rjson::add(ret, "NextShardIterator", iter);
}
_stats.api_operations.get_records_latency.mark(std::chrono::steady_clock::now() - start_time);
@@ -1533,17 +1510,13 @@ future<executor::request_return_type> executor::get_records(client_state& client
co_return rjson::print(std::move(ret));
}
bool executor::add_stream_options(const rjson::value& stream_specification, schema_builder& builder, service::storage_proxy& sp) {
bool executor::add_stream_options(const rjson::value& stream_specification, schema_builder& builder, service::storage_proxy& sp, const cdc::options& existing_cdc_opts) {
auto stream_enabled = rjson::find(stream_specification, "StreamEnabled");
if (!stream_enabled || !stream_enabled->IsBool()) {
throw api_error::validation("StreamSpecification needs boolean StreamEnabled");
}
if (stream_enabled->GetBool()) {
if (!sp.features().alternator_streams) {
throw api_error::validation("StreamSpecification: alternator streams feature not enabled in cluster.");
}
cdc::options opts;
opts.enabled(true);
opts.tablet_merge_blocked(true);
@@ -1569,8 +1542,13 @@ bool executor::add_stream_options(const rjson::value& stream_specification, sche
builder.with_cdc_options(opts);
return true;
} else {
cdc::options opts;
// When disabling, preserve the existing CDC options (preimage,
// postimage, ttl, etc.) so that DescribeStream can still report
// the correct StreamViewType on a disabled stream.
cdc::options opts = existing_cdc_opts;
opts.enabled(false);
opts.enable_requested(false);
opts.tablet_merge_blocked(false);
builder.with_cdc_options(opts);
return false;
}
@@ -1578,33 +1556,36 @@ bool executor::add_stream_options(const rjson::value& stream_specification, sche
void executor::supplement_table_stream_info(rjson::value& descr, const schema& schema, const service::storage_proxy& sp) {
auto& opts = schema.cdc_options();
if (opts.enabled()) {
auto db = sp.data_dictionary();
auto cf = db.find_table(schema.ks_name(), cdc::log_name(schema.cf_name()));
stream_arn arn(cf.schema(), cdc::get_base_table(db.real_database(), *cf.schema()));
// Report stream info when:
// 1. Log table exists (covers both enabled and disabled-but-readable).
// 2. enable_requested (ENABLING state, log not yet created).
auto db = sp.data_dictionary();
auto log_name = cdc::log_name(schema.cf_name());
auto log_cf = db.try_find_table(schema.ks_name(), log_name);
if (log_cf) {
auto log_schema = log_cf->schema();
stream_arn arn(log_schema, cdc::get_base_table(db.real_database(), *log_schema));
rjson::add(descr, "LatestStreamArn", arn);
rjson::add(descr, "LatestStreamLabel", rjson::from_string(stream_label(*cf.schema())));
} else if (!opts.enable_requested()) {
return;
}
// For both enabled() and enable_requested():
// DynamoDB returns StreamEnabled=true in StreamSpecification even when
// the stream status is ENABLING (not yet fully active). We mirror this
// behavior: enable_requested means the user asked for streams but CDC
// is not yet finalized, so we still report StreamEnabled=true.
auto stream_desc = rjson::empty_object();
rjson::add(stream_desc, "StreamEnabled", true);
rjson::add(descr, "LatestStreamLabel", rjson::from_string(stream_label(*log_schema)));
auto mode = stream_view_type::KEYS_ONLY;
if (opts.preimage() && opts.postimage()) {
mode = stream_view_type::NEW_AND_OLD_IMAGES;
} else if (opts.preimage()) {
mode = stream_view_type::OLD_IMAGE;
} else if (opts.postimage()) {
mode = stream_view_type::NEW_IMAGE;
auto stream_desc = rjson::empty_object();
rjson::add(stream_desc, "StreamEnabled", opts.enabled());
stream_view_type mode = cdc_options_to_stream_view_type(opts);
rjson::add(stream_desc, "StreamViewType", mode);
rjson::add(descr, "StreamSpecification", std::move(stream_desc));
} else if (opts.enable_requested()) {
// DynamoDB returns StreamEnabled=true in StreamSpecification even when
// the stream status is ENABLING (not yet fully active). We mirror this
// behavior: enable_requested means the user asked for streams but CDC
// is not yet finalized, so we still report StreamEnabled=true.
auto stream_desc = rjson::empty_object();
rjson::add(stream_desc, "StreamEnabled", true);
stream_view_type mode = cdc_options_to_stream_view_type(opts);
rjson::add(stream_desc, "StreamViewType", mode);
rjson::add(descr, "StreamSpecification", std::move(stream_desc));
}
rjson::add(stream_desc, "StreamViewType", mode);
rjson::add(descr, "StreamSpecification", std::move(stream_desc));
}
} // namespace alternator
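
The preimage/postimage branches in the hunk above reduce to a small mapping from two booleans to a view type. The diff calls `cdc_options_to_stream_view_type` but does not show its body, so the following is only a sketch of what such a mapping might look like, based on the if/else chain the diff removes. `cdc_image_options` and `to_stream_view_type` are hypothetical names, and the enum is a stand-in for the real alternator type.

```cpp
#include <cassert>

// Stand-in for the alternator enum; the enumerator names follow the diff.
enum class stream_view_type { KEYS_ONLY, OLD_IMAGE, NEW_IMAGE, NEW_AND_OLD_IMAGES };

// Hypothetical reduced view of cdc::options: only the two flags that
// decide the view type.
struct cdc_image_options {
    bool preimage = false;
    bool postimage = false;
};

// Same decision order as the removed if/else chain: both flags first,
// then each flag alone, otherwise KEYS_ONLY.
constexpr stream_view_type to_stream_view_type(cdc_image_options opts) {
    if (opts.preimage && opts.postimage) {
        return stream_view_type::NEW_AND_OLD_IMAGES;
    }
    if (opts.preimage) {
        return stream_view_type::OLD_IMAGE;
    }
    if (opts.postimage) {
        return stream_view_type::NEW_IMAGE;
    }
    return stream_view_type::KEYS_ONLY;
}
```

Because the mapping reads only the image flags, it gives the same answer whether or not the stream is enabled. That appears to be why the disable path copies the existing options before clearing `enabled`.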


@@ -856,7 +856,9 @@ rest_exclude_node(sharded<service::storage_service>& ss, std::unique_ptr<http::r
}
apilog.info("exclude_node: hosts={}", hosts);
co_await ss.local().mark_excluded(hosts);
co_await ss.local().run_with_no_api_lock([hosts = std::move(hosts)] (service::storage_service& ss) {
return ss.mark_excluded(hosts);
});
co_return json_void();
}
@@ -1731,7 +1733,9 @@ rest_create_vnode_tablet_migration(http_context& ctx, sharded<service::storage_s
throw std::runtime_error("vnodes-to-tablets migration requires all nodes to support the VNODES_TO_TABLETS_MIGRATIONS cluster feature");
}
auto keyspace = validate_keyspace(ctx, req);
co_await ss.local().prepare_for_tablets_migration(keyspace);
co_await ss.local().run_with_no_api_lock([keyspace] (service::storage_service& ss) {
return ss.prepare_for_tablets_migration(keyspace);
});
co_return json_void();
}
@@ -1743,7 +1747,9 @@ rest_get_vnode_tablet_migration(http_context& ctx, sharded<service::storage_serv
throw std::runtime_error("vnodes-to-tablets migration requires all nodes to support the VNODES_TO_TABLETS_MIGRATIONS cluster feature");
}
auto keyspace = validate_keyspace(ctx, req);
auto status = co_await ss.local().get_tablets_migration_status_with_node_details(keyspace);
auto status = co_await ss.local().run_with_no_api_lock([keyspace] (service::storage_service& ss) {
return ss.get_tablets_migration_status_with_node_details(keyspace);
});
ss::vnode_tablet_migration_status result;
result.keyspace = status.keyspace;
@@ -1768,7 +1774,9 @@ rest_set_vnode_tablet_migration_node_storage_mode(http_context& ctx, sharded<ser
}
auto mode_str = req->get_query_param("intended_mode");
auto mode = service::intended_storage_mode_from_string(mode_str);
co_await ss.local().set_node_intended_storage_mode(mode);
co_await ss.local().run_with_no_api_lock([mode] (service::storage_service& ss) {
return ss.set_node_intended_storage_mode(mode);
});
co_return json_void();
}
@@ -1782,7 +1790,9 @@ rest_finalize_vnode_tablet_migration(http_context& ctx, sharded<service::storage
auto keyspace = validate_keyspace(ctx, req);
validate_keyspace(ctx, keyspace);
co_await ss.local().finalize_tablets_migration(keyspace);
co_await ss.local().run_with_no_api_lock([keyspace] (service::storage_service& ss) {
return ss.finalize_tablets_migration(keyspace);
});
co_return json_void();
}
@@ -1859,90 +1869,106 @@ rest_bind(FuncType func, BindArgs&... args) {
return std::bind_front(func, std::ref(args)...);
}
// Hold the storage_service async gate for the duration of async REST
// handlers so stop() drains in-flight requests before teardown.
// Synchronous handlers don't yield and need no gate.
static seastar::httpd::future_json_function
gated(sharded<service::storage_service>& ss, seastar::httpd::future_json_function fn) {
return [fn = std::move(fn), &ss](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto holder = ss.local().hold_async_gate();
co_return co_await fn(std::move(req));
};
}
static seastar::httpd::json_request_function
gated(sharded<service::storage_service>&, seastar::httpd::json_request_function fn) {
return fn;
}
void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>& ssc, service::raft_group0_client& group0_client) {
ss::get_token_endpoint.set(r, rest_bind(rest_get_token_endpoint, ctx, ss));
ss::get_release_version.set(r, rest_bind(rest_get_release_version, ss));
ss::get_scylla_release_version.set(r, rest_bind(rest_get_scylla_release_version, ss));
ss::get_schema_version.set(r, rest_bind(rest_get_schema_version, ss));
ss::get_range_to_endpoint_map.set(r, rest_bind(rest_get_range_to_endpoint_map, ctx, ss));
ss::get_pending_range_to_endpoint_map.set(r, rest_bind(rest_get_pending_range_to_endpoint_map, ctx));
ss::describe_ring.set(r, rest_bind(rest_describe_ring, ctx, ss));
ss::get_current_generation_number.set(r, rest_bind(rest_get_current_generation_number, ss));
ss::get_natural_endpoints.set(r, rest_bind(rest_get_natural_endpoints, ctx, ss));
ss::get_natural_endpoints_v2.set(r, rest_bind(rest_get_natural_endpoints_v2, ctx, ss));
ss::cdc_streams_check_and_repair.set(r, rest_bind(rest_cdc_streams_check_and_repair, ss));
ss::cleanup_all.set(r, rest_bind(rest_cleanup_all, ctx, ss));
ss::reset_cleanup_needed.set(r, rest_bind(rest_reset_cleanup_needed, ctx, ss));
ss::force_flush.set(r, rest_bind(rest_force_flush, ctx));
ss::force_keyspace_flush.set(r, rest_bind(rest_force_keyspace_flush, ctx));
ss::decommission.set(r, rest_bind(rest_decommission, ss, ssc));
ss::logstor_compaction.set(r, rest_bind(rest_logstor_compaction, ctx));
ss::logstor_flush.set(r, rest_bind(rest_logstor_flush, ctx));
ss::move.set(r, rest_bind(rest_move, ss));
ss::remove_node.set(r, rest_bind(rest_remove_node, ss));
ss::exclude_node.set(r, rest_bind(rest_exclude_node, ss));
ss::get_removal_status.set(r, rest_bind(rest_get_removal_status, ss));
ss::force_remove_completion.set(r, rest_bind(rest_force_remove_completion, ss));
ss::set_logging_level.set(r, rest_bind(rest_set_logging_level));
ss::get_logging_levels.set(r, rest_bind(rest_get_logging_levels));
ss::get_operation_mode.set(r, rest_bind(rest_get_operation_mode, ss));
ss::is_starting.set(r, rest_bind(rest_is_starting, ss));
ss::get_drain_progress.set(r, rest_bind(rest_get_drain_progress, ss));
ss::drain.set(r, rest_bind(rest_drain, ss));
ss::stop_gossiping.set(r, rest_bind(rest_stop_gossiping, ss));
ss::start_gossiping.set(r, rest_bind(rest_start_gossiping, ss));
ss::is_gossip_running.set(r, rest_bind(rest_is_gossip_running, ss));
ss::stop_daemon.set(r, rest_bind(rest_stop_daemon));
ss::is_initialized.set(r, rest_bind(rest_is_initialized, ss));
ss::join_ring.set(r, rest_bind(rest_join_ring));
ss::is_joined.set(r, rest_bind(rest_is_joined, ss));
ss::is_incremental_backups_enabled.set(r, rest_bind(rest_is_incremental_backups_enabled, ctx));
ss::set_incremental_backups_enabled.set(r, rest_bind(rest_set_incremental_backups_enabled, ctx));
ss::rebuild.set(r, rest_bind(rest_rebuild, ss));
ss::bulk_load.set(r, rest_bind(rest_bulk_load));
ss::bulk_load_async.set(r, rest_bind(rest_bulk_load_async));
ss::reschedule_failed_deletions.set(r, rest_bind(rest_reschedule_failed_deletions));
ss::sample_key_range.set(r, rest_bind(rest_sample_key_range));
ss::reset_local_schema.set(r, rest_bind(rest_reset_local_schema, ss));
ss::set_trace_probability.set(r, rest_bind(rest_set_trace_probability));
ss::get_trace_probability.set(r, rest_bind(rest_get_trace_probability));
ss::get_slow_query_info.set(r, rest_bind(rest_get_slow_query_info));
ss::set_slow_query.set(r, rest_bind(rest_set_slow_query));
ss::deliver_hints.set(r, rest_bind(rest_deliver_hints));
ss::get_cluster_name.set(r, rest_bind(rest_get_cluster_name, ss));
ss::get_partitioner_name.set(r, rest_bind(rest_get_partitioner_name, ss));
ss::get_tombstone_warn_threshold.set(r, rest_bind(rest_get_tombstone_warn_threshold));
ss::set_tombstone_warn_threshold.set(r, rest_bind(rest_set_tombstone_warn_threshold));
ss::get_tombstone_failure_threshold.set(r, rest_bind(rest_get_tombstone_failure_threshold));
ss::set_tombstone_failure_threshold.set(r, rest_bind(rest_set_tombstone_failure_threshold));
ss::get_batch_size_failure_threshold.set(r, rest_bind(rest_get_batch_size_failure_threshold));
ss::set_batch_size_failure_threshold.set(r, rest_bind(rest_set_batch_size_failure_threshold));
ss::set_hinted_handoff_throttle_in_kb.set(r, rest_bind(rest_set_hinted_handoff_throttle_in_kb));
ss::get_exceptions.set(r, rest_bind(rest_get_exceptions, ss));
ss::get_total_hints_in_progress.set(r, rest_bind(rest_get_total_hints_in_progress));
ss::get_total_hints.set(r, rest_bind(rest_get_total_hints));
ss::get_ownership.set(r, rest_bind(rest_get_ownership, ctx, ss));
ss::get_effective_ownership.set(r, rest_bind(rest_get_effective_ownership, ctx, ss));
ss::retrain_dict.set(r, rest_bind(rest_retrain_dict, ctx, ss, group0_client));
ss::estimate_compression_ratios.set(r, rest_bind(rest_estimate_compression_ratios, ctx, ss));
ss::sstable_info.set(r, rest_bind(rest_sstable_info, ctx));
ss::logstor_info.set(r, rest_bind(rest_logstor_info, ctx));
ss::reload_raft_topology_state.set(r, rest_bind(rest_reload_raft_topology_state, ss, group0_client));
ss::upgrade_to_raft_topology.set(r, rest_bind(rest_upgrade_to_raft_topology, ss));
ss::raft_topology_upgrade_status.set(r, rest_bind(rest_raft_topology_upgrade_status, ss));
ss::raft_topology_get_cmd_status.set(r, rest_bind(rest_raft_topology_get_cmd_status, ss));
ss::move_tablet.set(r, rest_bind(rest_move_tablet, ctx, ss));
ss::add_tablet_replica.set(r, rest_bind(rest_add_tablet_replica, ctx, ss));
ss::del_tablet_replica.set(r, rest_bind(rest_del_tablet_replica, ctx, ss));
ss::repair_tablet.set(r, rest_bind(rest_repair_tablet, ctx, ss));
ss::tablet_balancing_enable.set(r, rest_bind(rest_tablet_balancing_enable, ss));
ss::create_vnode_tablet_migration.set(r, rest_bind(rest_create_vnode_tablet_migration, ctx, ss));
ss::get_vnode_tablet_migration.set(r, rest_bind(rest_get_vnode_tablet_migration, ctx, ss));
ss::set_vnode_tablet_migration_node_storage_mode.set(r, rest_bind(rest_set_vnode_tablet_migration_node_storage_mode, ctx, ss));
ss::finalize_vnode_tablet_migration.set(r, rest_bind(rest_finalize_vnode_tablet_migration, ctx, ss));
ss::quiesce_topology.set(r, rest_bind(rest_quiesce_topology, ss));
sp::get_schema_versions.set(r, rest_bind(rest_get_schema_versions, ss));
ss::drop_quarantined_sstables.set(r, rest_bind(rest_drop_quarantined_sstables, ctx, ss));
ss::get_token_endpoint.set(r, gated(ss, rest_bind(rest_get_token_endpoint, ctx, ss)));
ss::get_release_version.set(r, gated(ss, rest_bind(rest_get_release_version, ss)));
ss::get_scylla_release_version.set(r, gated(ss, rest_bind(rest_get_scylla_release_version, ss)));
ss::get_schema_version.set(r, gated(ss, rest_bind(rest_get_schema_version, ss)));
ss::get_range_to_endpoint_map.set(r, gated(ss, rest_bind(rest_get_range_to_endpoint_map, ctx, ss)));
ss::get_pending_range_to_endpoint_map.set(r, gated(ss, rest_bind(rest_get_pending_range_to_endpoint_map, ctx)));
ss::describe_ring.set(r, gated(ss, rest_bind(rest_describe_ring, ctx, ss)));
ss::get_current_generation_number.set(r, gated(ss, rest_bind(rest_get_current_generation_number, ss)));
ss::get_natural_endpoints.set(r, gated(ss, rest_bind(rest_get_natural_endpoints, ctx, ss)));
ss::get_natural_endpoints_v2.set(r, gated(ss, rest_bind(rest_get_natural_endpoints_v2, ctx, ss)));
ss::cdc_streams_check_and_repair.set(r, gated(ss, rest_bind(rest_cdc_streams_check_and_repair, ss)));
ss::cleanup_all.set(r, gated(ss, rest_bind(rest_cleanup_all, ctx, ss)));
ss::reset_cleanup_needed.set(r, gated(ss, rest_bind(rest_reset_cleanup_needed, ctx, ss)));
ss::force_flush.set(r, gated(ss, rest_bind(rest_force_flush, ctx)));
ss::force_keyspace_flush.set(r, gated(ss, rest_bind(rest_force_keyspace_flush, ctx)));
ss::decommission.set(r, gated(ss, rest_bind(rest_decommission, ss, ssc)));
ss::logstor_compaction.set(r, gated(ss, rest_bind(rest_logstor_compaction, ctx)));
ss::logstor_flush.set(r, gated(ss, rest_bind(rest_logstor_flush, ctx)));
ss::move.set(r, gated(ss, rest_bind(rest_move, ss)));
ss::remove_node.set(r, gated(ss, rest_bind(rest_remove_node, ss)));
ss::exclude_node.set(r, gated(ss, rest_bind(rest_exclude_node, ss)));
ss::get_removal_status.set(r, gated(ss, rest_bind(rest_get_removal_status, ss)));
ss::force_remove_completion.set(r, gated(ss, rest_bind(rest_force_remove_completion, ss)));
ss::set_logging_level.set(r, gated(ss, rest_bind(rest_set_logging_level)));
ss::get_logging_levels.set(r, gated(ss, rest_bind(rest_get_logging_levels)));
ss::get_operation_mode.set(r, gated(ss, rest_bind(rest_get_operation_mode, ss)));
ss::is_starting.set(r, gated(ss, rest_bind(rest_is_starting, ss)));
ss::get_drain_progress.set(r, gated(ss, rest_bind(rest_get_drain_progress, ss)));
ss::drain.set(r, gated(ss, rest_bind(rest_drain, ss)));
ss::stop_gossiping.set(r, gated(ss, rest_bind(rest_stop_gossiping, ss)));
ss::start_gossiping.set(r, gated(ss, rest_bind(rest_start_gossiping, ss)));
ss::is_gossip_running.set(r, gated(ss, rest_bind(rest_is_gossip_running, ss)));
ss::stop_daemon.set(r, gated(ss, rest_bind(rest_stop_daemon)));
ss::is_initialized.set(r, gated(ss, rest_bind(rest_is_initialized, ss)));
ss::join_ring.set(r, gated(ss, rest_bind(rest_join_ring)));
ss::is_joined.set(r, gated(ss, rest_bind(rest_is_joined, ss)));
ss::is_incremental_backups_enabled.set(r, gated(ss, rest_bind(rest_is_incremental_backups_enabled, ctx)));
ss::set_incremental_backups_enabled.set(r, gated(ss, rest_bind(rest_set_incremental_backups_enabled, ctx)));
ss::rebuild.set(r, gated(ss, rest_bind(rest_rebuild, ss)));
ss::bulk_load.set(r, gated(ss, rest_bind(rest_bulk_load)));
ss::bulk_load_async.set(r, gated(ss, rest_bind(rest_bulk_load_async)));
ss::reschedule_failed_deletions.set(r, gated(ss, rest_bind(rest_reschedule_failed_deletions)));
ss::sample_key_range.set(r, gated(ss, rest_bind(rest_sample_key_range)));
ss::reset_local_schema.set(r, gated(ss, rest_bind(rest_reset_local_schema, ss)));
ss::set_trace_probability.set(r, gated(ss, rest_bind(rest_set_trace_probability)));
ss::get_trace_probability.set(r, gated(ss, rest_bind(rest_get_trace_probability)));
ss::get_slow_query_info.set(r, gated(ss, rest_bind(rest_get_slow_query_info)));
ss::set_slow_query.set(r, gated(ss, rest_bind(rest_set_slow_query)));
ss::deliver_hints.set(r, gated(ss, rest_bind(rest_deliver_hints)));
ss::get_cluster_name.set(r, gated(ss, rest_bind(rest_get_cluster_name, ss)));
ss::get_partitioner_name.set(r, gated(ss, rest_bind(rest_get_partitioner_name, ss)));
ss::get_tombstone_warn_threshold.set(r, gated(ss, rest_bind(rest_get_tombstone_warn_threshold)));
ss::set_tombstone_warn_threshold.set(r, gated(ss, rest_bind(rest_set_tombstone_warn_threshold)));
ss::get_tombstone_failure_threshold.set(r, gated(ss, rest_bind(rest_get_tombstone_failure_threshold)));
ss::set_tombstone_failure_threshold.set(r, gated(ss, rest_bind(rest_set_tombstone_failure_threshold)));
ss::get_batch_size_failure_threshold.set(r, gated(ss, rest_bind(rest_get_batch_size_failure_threshold)));
ss::set_batch_size_failure_threshold.set(r, gated(ss, rest_bind(rest_set_batch_size_failure_threshold)));
ss::set_hinted_handoff_throttle_in_kb.set(r, gated(ss, rest_bind(rest_set_hinted_handoff_throttle_in_kb)));
ss::get_exceptions.set(r, gated(ss, rest_bind(rest_get_exceptions, ss)));
ss::get_total_hints_in_progress.set(r, gated(ss, rest_bind(rest_get_total_hints_in_progress)));
ss::get_total_hints.set(r, gated(ss, rest_bind(rest_get_total_hints)));
ss::get_ownership.set(r, gated(ss, rest_bind(rest_get_ownership, ctx, ss)));
ss::get_effective_ownership.set(r, gated(ss, rest_bind(rest_get_effective_ownership, ctx, ss)));
ss::retrain_dict.set(r, gated(ss, rest_bind(rest_retrain_dict, ctx, ss, group0_client)));
ss::estimate_compression_ratios.set(r, gated(ss, rest_bind(rest_estimate_compression_ratios, ctx, ss)));
ss::sstable_info.set(r, gated(ss, rest_bind(rest_sstable_info, ctx)));
ss::logstor_info.set(r, gated(ss, rest_bind(rest_logstor_info, ctx)));
ss::reload_raft_topology_state.set(r, gated(ss, rest_bind(rest_reload_raft_topology_state, ss, group0_client)));
ss::upgrade_to_raft_topology.set(r, gated(ss, rest_bind(rest_upgrade_to_raft_topology, ss)));
ss::raft_topology_upgrade_status.set(r, gated(ss, rest_bind(rest_raft_topology_upgrade_status, ss)));
ss::raft_topology_get_cmd_status.set(r, gated(ss, rest_bind(rest_raft_topology_get_cmd_status, ss)));
ss::move_tablet.set(r, gated(ss, rest_bind(rest_move_tablet, ctx, ss)));
ss::add_tablet_replica.set(r, gated(ss, rest_bind(rest_add_tablet_replica, ctx, ss)));
ss::del_tablet_replica.set(r, gated(ss, rest_bind(rest_del_tablet_replica, ctx, ss)));
ss::repair_tablet.set(r, gated(ss, rest_bind(rest_repair_tablet, ctx, ss)));
ss::tablet_balancing_enable.set(r, gated(ss, rest_bind(rest_tablet_balancing_enable, ss)));
ss::create_vnode_tablet_migration.set(r, gated(ss, rest_bind(rest_create_vnode_tablet_migration, ctx, ss)));
ss::get_vnode_tablet_migration.set(r, gated(ss, rest_bind(rest_get_vnode_tablet_migration, ctx, ss)));
ss::set_vnode_tablet_migration_node_storage_mode.set(r, gated(ss, rest_bind(rest_set_vnode_tablet_migration_node_storage_mode, ctx, ss)));
ss::finalize_vnode_tablet_migration.set(r, gated(ss, rest_bind(rest_finalize_vnode_tablet_migration, ctx, ss)));
ss::quiesce_topology.set(r, gated(ss, rest_bind(rest_quiesce_topology, ss)));
sp::get_schema_versions.set(r, gated(ss, rest_bind(rest_get_schema_versions, ss)));
ss::drop_quarantined_sstables.set(r, gated(ss, rest_bind(rest_drop_quarantined_sstables, ctx, ss)));
}
void unset_storage_service(http_context& ctx, routes& r) {
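
The `gated()` wrapper in the hunk above holds the storage_service gate while an async handler runs, so that stop() can drain in-flight requests. The sketch below shows the same idea with a synchronous, single-threaded toy gate built only on the standard library. It is not Seastar's gate: `toy_gate`, `holder` and `in_flight` are hypothetical names, and the real code returns futures and holds the gate across suspension points.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <utility>

// Toy gate: counts in-flight holders and refuses new ones once closed.
class toy_gate {
    std::size_t _count = 0;
    bool _closed = false;
public:
    // RAII holder: the count is incremented on construction and
    // decremented on destruction, unless the holder was moved from.
    class holder {
        toy_gate* _g;
    public:
        explicit holder(toy_gate& g) : _g(&g) { ++_g->_count; }
        holder(holder&& o) noexcept : _g(std::exchange(o._g, nullptr)) {}
        holder(const holder&) = delete;
        holder& operator=(const holder&) = delete;
        holder& operator=(holder&&) = delete;
        ~holder() { if (_g) { --_g->_count; } }
    };

    holder hold() {
        if (_closed) {
            throw std::runtime_error("gate closed");
        }
        return holder(*this);
    }
    void close() { _closed = true; }
    std::size_t in_flight() const { return _count; }
};

// Wrap a handler so that it holds the gate for the duration of the call.
template <typename Fn>
auto gated(toy_gate& g, Fn fn) {
    return [&g, fn = std::move(fn)](auto&&... args) {
        auto h = g.hold();  // released when the handler returns or throws
        return fn(std::forward<decltype(args)>(args)...);
    };
}
```

Wrapping at registration time keeps the handlers themselves unchanged, which matches how the diff applies `gated(ss, rest_bind(...))` at every route.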


@@ -113,8 +113,8 @@ static category_set parse_audit_categories(const sstring& data) {
return result;
}
static std::map<sstring, std::set<sstring>> parse_audit_tables(const sstring& data) {
std::map<sstring, std::set<sstring>> result;
static audit::audited_tables_t parse_audit_tables(const sstring& data) {
audit::audited_tables_t result;
if (!data.empty()) {
std::vector<sstring> tokens;
boost::split(tokens, data, boost::is_any_of(","));
@@ -139,8 +139,8 @@ static std::map<sstring, std::set<sstring>> parse_audit_tables(const sstring& da
return result;
}
static std::set<sstring> parse_audit_keyspaces(const sstring& data) {
std::set<sstring> result;
static audit::audited_keyspaces_t parse_audit_keyspaces(const sstring& data) {
audit::audited_keyspaces_t result;
if (!data.empty()) {
std::vector<sstring> tokens;
boost::split(tokens, data, boost::is_any_of(","));
@@ -156,8 +156,8 @@ audit::audit(locator::shared_token_metadata& token_metadata,
cql3::query_processor& qp,
service::migration_manager& mm,
std::set<sstring>&& audit_modes,
std::set<sstring>&& audited_keyspaces,
std::map<sstring, std::set<sstring>>&& audited_tables,
audited_keyspaces_t&& audited_keyspaces,
audited_tables_t&& audited_tables,
category_set&& audited_categories,
const db::config& cfg)
: _token_metadata(token_metadata)
@@ -165,8 +165,8 @@ audit::audit(locator::shared_token_metadata& token_metadata,
, _audited_tables(std::move(audited_tables))
, _audited_categories(std::move(audited_categories))
, _cfg(cfg)
, _cfg_keyspaces_observer(cfg.audit_keyspaces.observe([this] (sstring const& new_value){ update_config<std::set<sstring>>(new_value, parse_audit_keyspaces, _audited_keyspaces); }))
, _cfg_tables_observer(cfg.audit_tables.observe([this] (sstring const& new_value){ update_config<std::map<sstring, std::set<sstring>>>(new_value, parse_audit_tables, _audited_tables); }))
, _cfg_keyspaces_observer(cfg.audit_keyspaces.observe([this] (sstring const& new_value){ update_config<audited_keyspaces_t>(new_value, parse_audit_keyspaces, _audited_keyspaces); }))
, _cfg_tables_observer(cfg.audit_tables.observe([this] (sstring const& new_value){ update_config<audited_tables_t>(new_value, parse_audit_tables, _audited_tables); }))
, _cfg_categories_observer(cfg.audit_categories.observe([this] (sstring const& new_value){ update_config<category_set>(new_value, parse_audit_categories, _audited_categories); }))
{
_storage_helper_ptr = create_storage_helper(std::move(audit_modes), qp, mm);
@@ -181,8 +181,8 @@ future<> audit::start_audit(const db::config& cfg, sharded<locator::shared_token
return make_ready_future<>();
}
category_set audited_categories = parse_audit_categories(cfg.audit_categories());
std::map<sstring, std::set<sstring>> audited_tables = parse_audit_tables(cfg.audit_tables());
std::set<sstring> audited_keyspaces = parse_audit_keyspaces(cfg.audit_keyspaces());
audit::audited_tables_t audited_tables = parse_audit_tables(cfg.audit_tables());
audit::audited_keyspaces_t audited_keyspaces = parse_audit_keyspaces(cfg.audit_keyspaces());
logger.info("Audit is enabled. Auditing to: \"{}\", with the following categories: \"{}\", keyspaces: \"{}\", and tables: \"{}\"",
cfg.audit(), cfg.audit_categories(), cfg.audit_keyspaces(), cfg.audit_tables());
@@ -194,22 +194,36 @@ future<> audit::start_audit(const db::config& cfg, sharded<locator::shared_token
std::move(audited_keyspaces),
std::move(audited_tables),
std::move(audited_categories),
std::cref(cfg))
.then([&cfg] {
if (!audit_instance().local_is_initialized()) {
return make_ready_future<>();
}
return audit_instance().invoke_on_all([&cfg] (audit& local_audit) {
return local_audit.start(cfg);
std::cref(cfg));
}
future<> audit::start_storage(const db::config& cfg) {
if (!audit_instance().local_is_initialized()) {
return make_ready_future<>();
}
return audit_instance().invoke_on_all([&cfg] (audit& local_audit) {
return local_audit._storage_helper_ptr->start(cfg).then([&local_audit] {
local_audit._storage_running = true;
});
});
}
future<> audit::stop_storage() {
if (!audit_instance().local_is_initialized()) {
return make_ready_future<>();
}
return audit_instance().invoke_on_all([] (audit& local_audit) {
local_audit._storage_running = false;
return local_audit._storage_helper_ptr->stop();
});
}
future<> audit::stop_audit() {
if (!audit_instance().local_is_initialized()) {
return make_ready_future<>();
}
return audit::audit::audit_instance().invoke_on_all([] (auto& local_audit) {
SCYLLA_ASSERT(!local_audit._storage_running);
return local_audit.shutdown();
}).then([] {
return audit::audit::audit_instance().stop();
@@ -223,14 +237,6 @@ audit_info_ptr audit::create_audit_info(statement_category cat, const sstring& k
return std::make_unique<audit_info>(cat, keyspace, table, batch);
}
future<> audit::start(const db::config& cfg) {
return _storage_helper_ptr->start(cfg);
}
future<> audit::stop() {
return _storage_helper_ptr->stop();
}
future<> audit::shutdown() {
return make_ready_future<>();
}
@@ -241,6 +247,12 @@ future<> audit::log(const audit_info& audit_info, const service::client_state& c
const sstring& username = client_state.user() ? client_state.user()->name.value_or(anonymous_username) : no_username;
socket_address client_ip = client_state.get_client_address().addr();
socket_address node_ip = _token_metadata.get()->get_topology().my_address().addr();
if (!_storage_running) {
on_internal_error_noexcept(logger, fmt::format("Audit log dropped (storage not ready): node_ip {} category {} cl {} error {} keyspace {} query '{}' client_ip {} table {} username {}",
node_ip, audit_info.category_string(), cl, error, audit_info.keyspace(),
audit_info.query(), client_ip, audit_info.table(), username));
return make_ready_future<>();
}
if (logger.is_enabled(logging::log_level::debug)) {
logger.debug("Log written: node_ip {} category {} cl {} error {} keyspace {} query '{}' client_ip {} table {} username {}",
node_ip, audit_info.category_string(), cl, error, audit_info.keyspace(),
@@ -286,6 +298,11 @@ future<> inspect(const audit_info_alternator& ai, const service::client_state& c
future<> audit::log_login(const sstring& username, socket_address client_ip, bool error) noexcept {
socket_address node_ip = _token_metadata.get()->get_topology().my_address().addr();
if (!_storage_running) {
on_internal_error_noexcept(logger, fmt::format("Audit login log dropped (storage not ready): node_ip {} client_ip {} username {} error {}",
node_ip, client_ip, username, error ? "true" : "false"));
return make_ready_future<>();
}
if (logger.is_enabled(logging::log_level::debug)) {
logger.debug("Login log written: node_ip {}, client_ip {}, username {}, error {}",
node_ip, client_ip, username, error ? "true" : "false");
@@ -304,7 +321,7 @@ future<> inspect_login(const sstring& username, socket_address client_ip, bool e
return audit::local_audit_instance().log_login(username, client_ip, error);
}
bool audit::should_log_table(const sstring& keyspace, const sstring& name) const {
bool audit::should_log_table(std::string_view keyspace, std::string_view name) const {
auto keyspace_it = _audited_tables.find(keyspace);
return keyspace_it != _audited_tables.cend() && keyspace_it->second.find(name) != keyspace_it->second.cend();
}
@@ -319,8 +336,8 @@ bool audit::will_log(statement_category cat, std::string_view keyspace, std::str
// so it is logged whenever the category matches.
return _audited_categories.contains(cat)
&& (keyspace.empty()
|| _audited_keyspaces.find(sstring(keyspace)) != _audited_keyspaces.cend()
|| should_log_table(sstring(keyspace), sstring(table))
|| _audited_keyspaces.find(keyspace) != _audited_keyspaces.cend()
|| should_log_table(keyspace, table)
|| cat == statement_category::AUTH
|| cat == statement_category::ADMIN
|| cat == statement_category::DCL);


@@ -129,13 +129,19 @@ public:
class storage_helper;
class audit final : public seastar::async_sharded_service<audit> {
public:
// Transparent comparator (std::less<>) enables heterogeneous lookup with
// string_view keys.
using audited_keyspaces_t = std::set<sstring, std::less<>>;
using audited_tables_t = std::map<sstring, std::set<sstring, std::less<>>, std::less<>>;
private:
locator::shared_token_metadata& _token_metadata;
std::set<sstring> _audited_keyspaces;
// Maps keyspace name to set of table names in that keyspace
std::map<sstring, std::set<sstring>> _audited_tables;
audited_keyspaces_t _audited_keyspaces;
audited_tables_t _audited_tables;
category_set _audited_categories;
std::unique_ptr<storage_helper> _storage_helper_ptr;
bool _storage_running = false;
const db::config& _cfg;
utils::observer<sstring> _cfg_keyspaces_observer;
@@ -145,7 +151,7 @@ class audit final : public seastar::async_sharded_service<audit> {
template<class T>
void update_config(const sstring & new_value, std::function<T(const sstring&)> parse_func, T& cfg_parameter);
bool should_log_table(const sstring& keyspace, const sstring& name) const;
bool should_log_table(std::string_view keyspace, std::string_view name) const;
public:
static seastar::sharded<audit>& audit_instance() {
// FIXME: leaked intentionally to avoid shutdown problems, see #293
@@ -158,19 +164,19 @@ public:
return audit_instance().local();
}
static future<> start_audit(const db::config& cfg, sharded<locator::shared_token_metadata>& stm, sharded<cql3::query_processor>& qp, sharded<service::migration_manager>& mm);
static future<> start_storage(const db::config& cfg);
static future<> stop_storage();
static future<> stop_audit();
static audit_info_ptr create_audit_info(statement_category cat, const sstring& keyspace, const sstring& table, bool batch = false);
audit(locator::shared_token_metadata& stm,
cql3::query_processor& qp,
service::migration_manager& mm,
std::set<sstring>&& audit_modes,
std::set<sstring>&& audited_keyspaces,
std::map<sstring, std::set<sstring>>&& audited_tables,
audited_keyspaces_t&& audited_keyspaces,
audited_tables_t&& audited_tables,
category_set&& audited_categories,
const db::config& cfg);
~audit();
future<> start(const db::config& cfg);
future<> stop();
future<> shutdown();
bool should_log(const audit_info& audit_info) const;
bool will_log(statement_category cat, std::string_view keyspace = {}, std::string_view table = {}) const;
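
The aliases introduced in this header rely on transparent comparators. A minimal sketch of the effect follows, using `std::string` in place of `sstring` (an assumption made so the example is self-contained) and a hypothetical free function `table_is_audited` that mirrors the shape of `should_log_table`. With `std::less<>`, `find()` accepts a `std::string_view` directly and does not construct a temporary string.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <set>
#include <string>
#include <string_view>

// Stand-ins for the aliases in the diff, with std::string replacing sstring.
using audited_keyspaces_t = std::set<std::string, std::less<>>;
using audited_tables_t =
    std::map<std::string, std::set<std::string, std::less<>>, std::less<>>;

// Both lookups take string_view keys; the transparent comparator lets
// find() compare them against the stored strings without a conversion.
bool table_is_audited(const audited_tables_t& tables,
                      std::string_view keyspace, std::string_view name) {
    auto it = tables.find(keyspace);
    return it != tables.cend() && it->second.find(name) != it->second.cend();
}
```

Without the `std::less<>` argument, `find()` only accepts the container's key type, which is why the old `will_log` built `sstring(keyspace)` and `sstring(table)` on every call.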


@@ -185,24 +185,14 @@ future<lw_shared_ptr<cache::role_record>> cache::fetch_role(const role_name_t& r
static const sstring q = format("SELECT role, name, value FROM {}.{} WHERE role = ?", db::system_keyspace::NAME, ROLE_ATTRIBUTES_CF);
auto rs = co_await fetch(q);
for (const auto& r : *rs) {
if (!r.has("value")) {
continue;
}
rec->attributes[r.get_as<sstring>("name")] =
r.get_as<sstring>("value");
co_await coroutine::maybe_yield();
}
}
// permissions
{
static const sstring q = format("SELECT role, resource, permissions FROM {}.{} WHERE role = ?", db::system_keyspace::NAME, PERMISSIONS_CF);
auto rs = co_await fetch(q);
for (const auto& r : *rs) {
auto resource = r.get_as<sstring>("resource");
auto perms_strings = r.get_set<sstring>("permissions");
std::unordered_set<sstring> perms_set(perms_strings.begin(), perms_strings.end());
auto pset = permissions::from_strings(perms_set);
rec->permissions[std::move(resource)] = std::move(pset);
co_await coroutine::maybe_yield();
}
}
co_return rec;
}


@@ -44,7 +44,6 @@ public:
std::unordered_set<role_name_t> members;
sstring salted_hash;
std::unordered_map<sstring, sstring, sstring_hash, sstring_eq> attributes;
std::unordered_map<sstring, permission_set, sstring_hash, sstring_eq> permissions;
private:
friend cache;
// cached permissions include effects of role's inheritance


@@ -76,7 +76,11 @@ default_authorizer::authorize(const role_or_anonymous& maybe_role, const resourc
if (results->empty()) {
co_return permissions::NONE;
}
co_return permissions::from_strings(results->one().get_set<sstring>(PERMISSIONS_NAME));
const auto& row = results->one();
if (!row.has(PERMISSIONS_NAME)) {
co_return permissions::NONE;
}
co_return permissions::from_strings(row.get_set<sstring>(PERMISSIONS_NAME));
}
future<>

View File

@@ -258,13 +258,11 @@ future<> ldap_role_manager::start() {
} catch (const seastar::sleep_aborted&) {
co_return; // ignore
}
co_await _cache.container().invoke_on_all([] (cache& c) -> future<> {
try {
co_await c.reload_all_permissions();
} catch (...) {
mylog.warn("Cache reload all permissions failed: {}", std::current_exception());
}
});
try {
co_await _cache.reload_all_permissions();
} catch (...) {
mylog.warn("Cache reload all permissions failed: {}", std::current_exception());
}
}
});
return _std_mgr.start();

View File

@@ -157,15 +157,12 @@ future<> service::start(::service::migration_manager& mm, db::system_keyspace& s
return create_legacy_keyspace_if_missing(mm);
});
}
co_await _role_manager->start();
if (this_shard_id() == 0) {
// Role manager and password authenticator have this odd startup
// mechanism where they asynchronously create the superuser role
// in the background. Correct password creation depends on role
// creation therefore we need to wait here.
co_await _role_manager->ensure_superuser_is_created();
}
co_await when_all_succeed(_authorizer->start(), _authenticator->start()).discard_result();
// Authorizer must be started before the permission loader is set,
// because the loader calls _authorizer->authorize().
// The loader must be set before starting the role manager, because
// LDAP role manager starts a pruner fiber that calls
// reload_all_permissions() which asserts _permission_loader is set.
co_await _authorizer->start();
if (!_used_by_maintenance_socket) {
// Maintenance socket mode can't cache permissions because it has
// different authorizer. We can't mix cached permissions, they could be
@@ -174,12 +171,27 @@ future<> service::start(::service::migration_manager& mm, db::system_keyspace& s
&service::get_uncached_permissions,
this, std::placeholders::_1, std::placeholders::_2));
}
co_await _role_manager->start();
if (this_shard_id() == 0) {
// Role manager and password authenticator have this odd startup
// mechanism where they asynchronously create the superuser role
// in the background. Correct password creation depends on role
// creation therefore we need to wait here.
co_await _role_manager->ensure_superuser_is_created();
}
// Authenticator must be started after ensure_superuser_is_created()
// because password_authenticator queries system.roles for the
// superuser entry created by the role manager.
co_await _authenticator->start();
}
future<> service::stop() {
_as.request_abort();
// Reverse of start() order.
co_await _authenticator->stop();
co_await _role_manager->stop();
_cache.set_permission_loader(nullptr);
return when_all_succeed(_role_manager->stop(), _authorizer->stop(), _authenticator->stop()).discard_result();
co_await _authorizer->stop();
}
future<> service::ensure_superuser_is_created() {
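
Review note: the start/stop ordering described in the comments of this hunk can be sketched in isolation. This is a minimal illustration only; `component`, `lifecycle`, and `make_auth_lifecycle` are hypothetical names, not ScyllaDB types, and the real code uses coroutines and futures rather than a plain log.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for one startup step; not a ScyllaDB type.
struct component {
    std::string name;
};

// Starts components in list order and stops them in reverse order,
// recording each action so the ordering can be inspected.
struct lifecycle {
    std::vector<component> components; // listed in start order
    std::vector<std::string> log;

    void start() {
        for (const auto& c : components) {
            log.push_back("start " + c.name);
        }
    }

    // Reverse order: a component is torn down before anything it depends on.
    void stop() {
        for (auto it = components.rbegin(); it != components.rend(); ++it) {
            log.push_back("stop " + it->name);
        }
    }
};

// Order taken from the comments in the hunk: authorizer, then the permission
// loader, then the role manager, then the authenticator.
inline lifecycle make_auth_lifecycle() {
    lifecycle lc;
    lc.components = {component{"authorizer"}, component{"permission_loader"},
                     component{"role_manager"}, component{"authenticator"}};
    return lc;
}
```

Keeping a single ordered list and walking it backwards on shutdown makes it harder for the two sequences to drift apart, which is the property the "Reverse of start() order" comment relies on.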

View File

@@ -1625,7 +1625,7 @@ struct process_change_visitor {
if (_enable_updating_state) {
if (_request_options.alternator && _alternator_schema_has_no_clustering_key && _clustering_row_states.empty()) {
// Alternator's table can be with or without clustering key. If the clustering key exists,
// delete request will be `clustered_row_delete` and will be hanlded there.
// delete request will be `clustered_row_delete` and will be handled there.
// If the clustering key doesn't exist, delete request will be `partition_delete` and will be handled here.
// The no-clustering-key case is slightly tricky, because insert of such item is handled by `clustered_row_cells`
// and has some value as clustering_key (the value currently seems to be empty bytes object).
@@ -1933,7 +1933,7 @@ public:
if (_options.alternator && !_alternator_clustering_keys_to_ignore.empty()) {
// we filter mutations for Alternator's changes here.
// We do it per mutation object (user might submit a batch of those in one go
// and some might be splitted because of different timestamps),
// and some might be split because of different timestamps),
// ignore key set is cleared afterwards.
// If single mutation object contains two separate changes to the same row
// and at least one of them is ignored, all of them will be ignored.

View File

@@ -267,7 +267,7 @@ struct extract_row_visitor {
visit_collection(v);
},
[&] (const abstract_type& o) {
throw std::runtime_error(format("extract_changes: unknown collection type:", o.name()));
throw std::runtime_error(format("extract_changes: unknown collection type: {}", o.name()));
}
));
}

View File

@@ -137,6 +137,24 @@ endfunction()
option(Scylla_WITH_DEBUG_INFO "Enable debug info" OFF)
# Time trace profiling: adds -ftime-trace to all C++ compilations (Clang only).
# Each .o produces a companion .json file in the build directory that can be
# analyzed with ClangBuildAnalyzer or loaded in chrome://tracing.
#
# Usage:
# cmake -DScylla_TIME_TRACE=ON ...
# ninja
# # Analyze results (requires ClangBuildAnalyzer):
# ClangBuildAnalyzer --all <build-dir> capture.bin
# ClangBuildAnalyzer --analyze capture.bin
option(Scylla_TIME_TRACE "Enable Clang -ftime-trace for build profiling" OFF)
if(Scylla_TIME_TRACE)
if(NOT CMAKE_CXX_COMPILER_ID STREQUAL "Clang")
message(FATAL_ERROR "Scylla_TIME_TRACE requires Clang (found ${CMAKE_CXX_COMPILER_ID})")
endif()
add_compile_options(-ftime-trace)
endif()
macro(update_build_flags config)
cmake_parse_arguments (
parsed_args

View File

@@ -240,7 +240,7 @@ static max_purgeable get_max_purgeable_timestamp(const compaction_group_view& ta
// and if the memtable also contains the key we're calculating max purgeable timestamp for.
// First condition helps to not penalize the common scenario where memtable only contains
// newer data.
if (memtable_min_timestamp <= compacting_max_timestamp && table_s.memtable_has_key(dk)) {
if (!table_s.skip_memtable_for_tombstone_gc() && memtable_min_timestamp <= compacting_max_timestamp && table_s.memtable_has_key(dk)) {
timestamp = memtable_min_timestamp;
source = max_purgeable::timestamp_source::memtable_possibly_shadowing_data;
}
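
Review note: the new condition can be read as a three-part predicate. The sketch below restates it with simplified, hypothetical inputs (`memtable_view` and `memtable_limits_purge` are illustrative names; the real code queries `compaction_group_view`).

```cpp
#include <cassert>
#include <cstdint>

using timestamp_t = std::int64_t;

// Hypothetical, flattened view of the memtable state consulted by the check.
struct memtable_view {
    bool skip_for_tombstone_gc; // mirrors skip_memtable_for_tombstone_gc()
    timestamp_t min_timestamp;  // mirrors memtable_min_timestamp
    bool has_key;               // mirrors memtable_has_key(dk)
};

// Returns true when the memtable's minimum timestamp should become the
// max purgeable timestamp, i.e. when the memtable may shadow compacted data.
inline bool memtable_limits_purge(const memtable_view& m, timestamp_t compacting_max_timestamp) {
    return !m.skip_for_tombstone_gc
        && m.min_timestamp <= compacting_max_timestamp
        && m.has_key;
}
```

With the skip flag set, the memtable is never consulted, regardless of timestamps or key presence.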

View File

@@ -39,6 +39,9 @@ public:
virtual future<lw_shared_ptr<const sstables::sstable_set>> main_sstable_set() const = 0;
virtual future<lw_shared_ptr<const sstables::sstable_set>> maintenance_sstable_set() const = 0;
virtual lw_shared_ptr<const sstables::sstable_set> sstable_set_for_tombstone_gc() const = 0;
// Returns true when tombstone GC considers only the repaired sstable set, meaning the
// memtable does not need to be consulted (its data is always newer than any GC-eligible tombstone).
virtual bool skip_memtable_for_tombstone_gc() const noexcept = 0;
virtual std::unordered_set<sstables::shared_sstable> fully_expired_sstables(const std::vector<sstables::shared_sstable>& sstables, gc_clock::time_point compaction_time) const = 0;
virtual const std::vector<sstables::shared_sstable>& compacted_undeleted_sstables() const noexcept = 0;
virtual compaction_strategy& get_compaction_strategy() const noexcept = 0;

View File

@@ -1088,7 +1088,7 @@ void compaction_manager::register_metrics() {
sm::make_gauge("normalized_backlog", [this] { return _last_backlog / available_memory(); },
sm::description("Holds the sum of normalized compaction backlog for all tables in the system. Backlog is normalized by dividing backlog by shard's available memory.")),
sm::make_counter("validation_errors", [this] { return _validation_errors; },
sm::description("Holds the number of encountered validation errors.")),
sm::description("Holds the number of encountered validation errors.")).set_skip_when_empty(),
});
}

View File

@@ -406,7 +406,11 @@ commitlog_total_space_in_mb: -1
# In short, `ms` needs more CPU during sstable writes,
# but should behave better during reads,
# although it might behave worse for very long clustering keys.
#
# `ms` sstable format works even better with `column_index_size_in_kb` set to 1,
# so keep those two settings in sync (either both set, or both unset).
sstable_format: ms
column_index_size_in_kb: 1
# Auto-scaling of the promoted index prevents running out of memory
# when the promoted index grows too large (due to partitions with many rows

View File

@@ -285,8 +285,12 @@ def generate_compdb(compdb, ninja, buildfile, modes):
os.symlink(compdb_target, compdb)
except FileExistsError:
# if there is already a valid compile_commands.json link in the
# source root, we are done.
pass
# source root, we are done. if it's a stale link, update it.
if os.path.islink(compdb):
current_target = os.readlink(compdb)
if not os.path.exists(current_target):
os.unlink(compdb)
os.symlink(compdb_target, compdb)
return
@@ -593,6 +597,7 @@ scylla_tests = set([
'test/boost/linearizing_input_stream_test',
'test/boost/lister_test',
'test/boost/locator_topology_test',
'test/boost/lock_tables_metadata_test',
'test/boost/log_heap_test',
'test/boost/logalloc_standard_allocator_segment_pool_backend_test',
'test/boost/logalloc_test',
@@ -853,6 +858,10 @@ arg_parser.add_argument('--coverage', action = 'store_true', help = 'Compile scy
arg_parser.add_argument('--build-dir', action='store', default='build',
help='Build directory path')
arg_parser.add_argument('--disable-precompiled-header', action='store_true', default=False, help='Disable precompiled header for scylla binary')
arg_parser.add_argument('--time-trace', action='store_true', default=False,
help='Enable Clang -ftime-trace for build profiling. '
'Each .o produces a .json file analyzable with '
'ClangBuildAnalyzer or chrome://tracing')
arg_parser.add_argument('-h', '--help', action='store_true', help='show this help message and exit')
args = arg_parser.parse_args()
if args.help:
@@ -1659,6 +1668,7 @@ deps['test/boost/combined_tests'] += [
'test/boost/auth_cache_test.cc',
'test/boost/auth_test.cc',
'test/boost/batchlog_manager_test.cc',
'test/boost/table_helper_test.cc',
'test/boost/cache_algorithm_test.cc',
'test/boost/castas_fcts_test.cc',
'test/boost/cdc_test.cc',
@@ -1710,7 +1720,7 @@ deps['test/boost/combined_tests'] += [
'test/boost/sstable_compression_config_test.cc',
'test/boost/sstable_directory_test.cc',
'test/boost/sstable_set_test.cc',
'test/boost/sstable_tablet_streaming.cc',
'test/boost/sstable_tablet_streaming_test.cc',
'test/boost/statement_restrictions_test.cc',
'test/boost/storage_proxy_test.cc',
'test/boost/tablets_test.cc',
@@ -1965,6 +1975,9 @@ user_cflags += ' -fextend-variable-liveness=none'
if args.target != '':
user_cflags += ' -march=' + args.target
if args.time_trace:
user_cflags += ' -ftime-trace'
for mode in modes:
# Those flags are passed not only to Scylla objects, but also to libraries
# that we compile ourselves.
@@ -2457,6 +2470,9 @@ def write_build_file(f,
command = reloc/build_deb.sh --reloc-pkg $in --builddir $out
rule unified
command = unified/build_unified.sh --build-dir $builddir/$mode --unified-pkg $out
rule collect_pkgs
command = rm -rf $out && mkdir -p $out && cp $pkgs $out/
description = COLLECT $out
rule rust_header
command = cxxbridge --include rust/cxx.h --header $in > $out
description = RUST_HEADER $out
@@ -2942,6 +2958,8 @@ def write_build_file(f,
build dist-tar: phony dist-unified-tar dist-server-tar dist-python3-tar dist-cqlsh-tar
build dist: phony dist-unified dist-server dist-python3 dist-cqlsh
build collect-dist: phony {' '.join([f'collect-dist-{mode}' for mode in default_modes])}
'''))
f.write(textwrap.dedent(f'''\
@@ -2949,7 +2967,28 @@ def write_build_file(f,
rule dist-check
command = ./tools/testing/dist-check/dist-check.sh --mode $mode
'''))
deb_arch = {'x86_64': 'amd64', 'aarch64': 'arm64'}[arch]
deb_ver = f'{scylla_version}-{scylla_release}-1'
rpm_ver = f'{scylla_version}-{scylla_release}'
for mode in build_modes:
server_rpms_dir = f'$builddir/dist/{mode}/redhat/RPMS/{arch}'
server_rpms = [f'{server_rpms_dir}/{scylla_product}{suffix}-{rpm_ver}.{arch}.rpm'
for suffix in ['', '-server', '-server-debuginfo', '-conf', '-kernel-conf', '-node-exporter']]
cqlsh_rpms = [f'tools/cqlsh/build/redhat/RPMS/{arch}/{scylla_product}-cqlsh-{rpm_ver}.{arch}.rpm']
python3_rpms = [f'tools/python3/build/redhat/RPMS/{arch}/{scylla_product}-python3-{rpm_ver}.{arch}.rpm']
all_rpms = server_rpms + cqlsh_rpms + python3_rpms
server_deb_dir = f'$builddir/dist/{mode}/debian'
server_debs = [f'{server_deb_dir}/{scylla_product}{suffix}_{deb_ver}_{deb_arch}.deb'
for suffix in ['', '-server', '-server-dbg', '-conf', '-kernel-conf', '-node-exporter']]
server_debs += [f'{server_deb_dir}/scylla-enterprise{suffix}_{deb_ver}_all.deb'
for suffix in ['', '-server', '-conf', '-kernel-conf', '-node-exporter']]
cqlsh_debs = [f'tools/cqlsh/build/debian/{scylla_product}-cqlsh_{deb_ver}_{deb_arch}.deb',
f'tools/cqlsh/build/debian/scylla-enterprise-cqlsh_{deb_ver}_all.deb']
python3_debs = [f'tools/python3/build/debian/{scylla_product}-python3_{deb_ver}_{deb_arch}.deb',
f'tools/python3/build/debian/scylla-enterprise-python3_{deb_ver}_all.deb']
all_debs = server_debs + cqlsh_debs + python3_debs
f.write(textwrap.dedent(f'''\
build $builddir/{mode}/dist/tar/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz: copy tools/python3/build/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz
build $builddir/{mode}/dist/tar/{scylla_product}-python3-package.tar.gz: copy tools/python3/build/{scylla_product}-python3-{scylla_version}-{scylla_release}.{arch}.tar.gz
@@ -2957,6 +2996,11 @@ def write_build_file(f,
build $builddir/{mode}/dist/tar/{scylla_product}-cqlsh-{scylla_version}-{scylla_release}.{arch}.tar.gz: copy tools/cqlsh/build/{scylla_product}-cqlsh-{scylla_version}-{scylla_release}.{arch}.tar.gz
build $builddir/{mode}/dist/tar/{scylla_product}-cqlsh-package.tar.gz: copy tools/cqlsh/build/{scylla_product}-cqlsh-{scylla_version}-{scylla_release}.{arch}.tar.gz
build $builddir/{mode}/dist/rpm: collect_pkgs | {' '.join(all_rpms)} $builddir/dist/{mode}/redhat dist-cqlsh-rpm dist-python3-rpm
pkgs = {' '.join(all_rpms)}
build $builddir/{mode}/dist/deb: collect_pkgs | {' '.join(all_debs)} $builddir/dist/{mode}/debian dist-cqlsh-deb dist-python3-deb
pkgs = {' '.join(all_debs)}
build collect-dist-{mode}: phony $builddir/{mode}/dist/rpm $builddir/{mode}/dist/deb
build {mode}-dist: phony dist-server-{mode} dist-server-debuginfo-{mode} dist-python3-{mode} dist-unified-{mode} dist-cqlsh-{mode}
build dist-{mode}: phony {mode}-dist
build dist-check-{mode}: dist-check

View File

@@ -136,9 +136,9 @@ public:
{}
future<> insert(auth::authenticated_user user, cql3::prepared_cache_key_type prep_cache_key, value_type v) noexcept {
return _cache.get_ptr(key_type(std::move(user), std::move(prep_cache_key)), [v = std::move(v)] (const cache_key_type&) mutable {
return _cache.insert(key_type(std::move(user), std::move(prep_cache_key)), [v = std::move(v)] (const cache_key_type&) mutable {
return make_ready_future<value_type>(std::move(v));
}).discard_result();
});
}
value_ptr find(const auth::authenticated_user& user, const cql3::prepared_cache_key_type& prep_cache_key) {

View File

@@ -1070,7 +1070,7 @@ try_prepare_count_rows(const expr::function_call& fc, data_dictionary::database
.args = {},
};
} else {
throw exceptions::invalid_request_exception(format("count() expects a column or the literal 1 as an argument", fc.args[0]));
throw exceptions::invalid_request_exception(format("count() expects a column or the literal 1 as an argument, got {}", fc.args[0]));
}
}
}

File diff suppressed because it is too large

View File

@@ -23,15 +23,113 @@ namespace cql3 {
namespace restrictions {
/// A set of discrete values.
using value_list = std::vector<managed_bytes>; // Sorted and deduped using value comparator.
/// General set of values. Empty set and single-element sets are always value_list. interval is
never singular and never has start > end. Universal set is an interval with both bounds null.
using value_set = std::variant<value_list, interval<managed_bytes>>;
// For some boolean expression (say, (X = 3) = TRUE), this represents a function that solves for X
// (here, it would return 3). The expression is obtained by equating some factors of the WHERE
// clause to TRUE.
using solve_for_t = std::function<value_set (const query_options&)>;
struct on_row {
bool operator==(const on_row&) const = default;
};
struct on_column {
const column_definition* column;
bool operator==(const on_column&) const = default;
};
// Placeholder type indicating we're solving for the partition key token.
struct on_partition_key_token {
const ::schema* schema;
bool operator==(const on_partition_key_token&) const = default;
};
struct on_clustering_key_prefix {
std::vector<const column_definition*> columns;
bool operator==(const on_clustering_key_prefix&) const = default;
};
// A predicate on a column or a combination of columns. The WHERE clause analyzer
// will attempt to convert predicates (that return true or false for a particular row)
// to solvers (that return the set of column values that satisfy the predicate) when possible.
struct predicate {
// A function that returns the set of values that satisfy the filter. Can be unset,
// in which case the filter must be interpreted.
solve_for_t solve_for;
// The original filter for this column.
expr::expression filter;
// What column the predicate can be solved for
std::variant<
on_row, // cannot determine, so predicate is on entire row
on_column, // solving for a single column: e.g. c1 = 3
on_partition_key_token, // solving for the token, e.g. token(pk1, pk2) >= :var
on_clustering_key_prefix // solving for a clustering key prefix: e.g. (ck1, ck2) >= (3, 4)
> on;
// Whether the returned value_set will resolve to a single value.
bool is_singleton = false;
// Whether the returned value_set follows CQL comparison semantics
bool comparable = true;
bool is_multi_column = false;
bool is_not_null_single_column = false;
bool equality = false; // operator is EQ
bool is_in = false; // operator is IN
bool is_slice = false; // operator is LT/LTE/GT/GTE
bool is_upper_bound = false; // operator is LT/LTE
bool is_lower_bound = false; // operator is GT/GTE
expr::comparison_order order = expr::comparison_order::cql;
std::optional<expr::oper_t> op; // the binary operator, if any
bool is_subscript = false; // whether the LHS is a subscript (map element access)
};
///In some cases checking if columns have indexes is undesired or even
///impossible, because e.g. the query runs on a pseudo-table, which does not
///have an index-manager, or even a table object.
using check_indexes = bool_class<class check_indexes_tag>;
// A function that returns the partition key ranges for a query. It is the solver of
// WHERE clause fragments such as WHERE token(pk) > 1 or WHERE pk1 IN :list1 AND pk2 IN :list2.
using get_partition_key_ranges_fn_t = std::function<dht::partition_range_vector (const query_options&)>;
// A function that returns the clustering key ranges for a query. It is the solver of
// WHERE clause fragments such as WHERE ck > 1 or WHERE (ck1, ck2) > (1, 2).
using get_clustering_bounds_fn_t = std::function<std::vector<query::clustering_range> (const query_options& options)>;
// A function that returns a singleton value, usable for a key (e.g. bytes_opt)
using get_singleton_value_fn_t = std::function<bytes_opt (const query_options&)>;
struct no_partition_range_restrictions {
};
struct token_range_restrictions {
predicate token_restrictions;
};
struct single_column_partition_range_restrictions {
std::vector<predicate> per_column_restrictions;
};
using partition_range_restrictions = std::variant<
no_partition_range_restrictions,
token_range_restrictions,
single_column_partition_range_restrictions>;
// A map of per-column predicate vectors, ordered by schema position.
using single_column_predicate_vectors = std::map<const column_definition*, std::vector<predicate>, expr::schema_pos_column_definition_comparator>;
/**
* The restrictions corresponding to the relations specified on the where-clause of CQL query.
*/
class statement_restrictions {
struct private_tag {}; // Tag for private constructor
private:
schema_ptr _schema;
@@ -81,7 +179,7 @@ private:
bool _has_queriable_regular_index = false, _has_queriable_pk_index = false, _has_queriable_ck_index = false;
bool _has_multi_column; ///< True iff _clustering_columns_restrictions has a multi-column restriction.
std::optional<expr::expression> _where; ///< The entire WHERE clause.
std::vector<expr::expression> _where; ///< The entire WHERE clause (factorized).
/// Parts of _where defining the clustering slice.
///
@@ -96,7 +194,7 @@ private:
/// 4.4 elements other than the last have only EQ or IN atoms
/// 4.5 the last element has only EQ, IN, or is_slice() atoms
/// 5. if multi-column, then each element is a binary_operator
std::vector<expr::expression> _clustering_prefix_restrictions;
std::vector<predicate> _clustering_prefix_restrictions;
/// Like _clustering_prefix_restrictions, but for the indexing table (if this is an index-reading statement).
/// Recall that the index-table CK is (token, PK, CK) of the base table for a global index and (indexed column,
@@ -105,7 +203,7 @@ private:
/// Elements are conjunctions of single-column binary operators with the same LHS.
/// Element order follows the indexing-table clustering key.
/// In case of a global index the first element's (token restriction) RHS is a dummy value, it is filled later.
std::optional<std::vector<expr::expression>> _idx_tbl_ck_prefix;
std::optional<std::vector<predicate>> _idx_tbl_ck_prefix;
/// Parts of _where defining the partition range.
///
@@ -113,16 +211,25 @@ private:
/// binary_operators on token. If single-column restrictions define the partition range, each element holds
/// restrictions for one partition column. Each partition column has a corresponding element, but the elements
/// are in arbitrary order.
std::vector<expr::expression> _partition_range_restrictions;
partition_range_restrictions _partition_range_restrictions;
bool _partition_range_is_simple; ///< False iff _partition_range_restrictions imply a Cartesian product.
check_indexes _check_indexes = check_indexes::yes;
/// Columns that appear on the LHS of an EQ restriction (not IN).
/// For multi-column EQ like (ck1, ck2) = (1, 2), all columns in the tuple are included.
std::unordered_set<const column_definition*> _columns_with_eq;
std::vector<const column_definition*> _column_defs_for_filtering;
schema_ptr _view_schema;
std::optional<secondary_index::index> _idx_opt;
expr::expression _idx_restrictions = expr::conjunction({});
get_partition_key_ranges_fn_t _get_partition_key_ranges_fn;
get_clustering_bounds_fn_t _get_clustering_bounds_fn;
get_clustering_bounds_fn_t _get_global_index_clustering_ranges_fn;
get_clustering_bounds_fn_t _get_global_index_token_clustering_ranges_fn;
get_clustering_bounds_fn_t _get_local_index_clustering_ranges_fn;
get_singleton_value_fn_t _value_for_index_partition_key_fn;
public:
/**
* Creates a new empty <code>StatementRestrictions</code>.
@@ -130,9 +237,10 @@ public:
* @param cfm the column family meta data
* @return a new empty <code>StatementRestrictions</code>.
*/
statement_restrictions(schema_ptr schema, bool allow_filtering);
statement_restrictions(private_tag, schema_ptr schema, bool allow_filtering);
friend statement_restrictions analyze_statement_restrictions(
public:
friend shared_ptr<const statement_restrictions> analyze_statement_restrictions(
data_dictionary::database db,
schema_ptr schema,
statements::statement_type type,
@@ -142,9 +250,15 @@ public:
bool for_view,
bool allow_filtering,
check_indexes do_check_indexes);
friend shared_ptr<const statement_restrictions> make_trivial_statement_restrictions(
schema_ptr schema,
bool allow_filtering);
private:
statement_restrictions(data_dictionary::database db,
// Important: objects of this class capture `this` extensively and so must remain non-copyable.
statement_restrictions(const statement_restrictions&) = delete;
statement_restrictions& operator=(const statement_restrictions&) = delete;
statement_restrictions(private_tag,
data_dictionary::database db,
schema_ptr schema,
statements::statement_type type,
const expr::expression& where_clause,
@@ -211,10 +325,7 @@ public:
bool has_token_restrictions() const;
// Checks whether the given column has an EQ restriction.
// EQ restriction is `col = ...` or `(col, col2) = ...`
// IN restriction is NOT an EQ restriction, this function will not look for IN restrictions.
// Uses column_defintion::operator== for comparison, columns with the same name but different schema will not be equal.
// Checks whether the given column has an EQ restriction (not IN).
bool has_eq_restriction_on_column(const column_definition&) const;
/**
@@ -224,12 +335,6 @@ public:
*/
std::vector<const column_definition*> get_column_defs_for_filtering(data_dictionary::database db) const;
/**
* Gives a score that the index has - index with the highest score will be chosen
* in find_idx()
*/
int score(const secondary_index::index& index) const;
/**
* Determines the index to be used with the restriction.
* @param db - the data_dictionary::database context (for extracting index manager)
@@ -250,18 +355,8 @@ public:
size_t partition_key_restrictions_size() const;
bool parition_key_restrictions_have_supporting_index(const secondary_index::secondary_index_manager& index_manager, expr::allow_local_index allow_local) const;
size_t clustering_columns_restrictions_size() const;
bool clustering_columns_restrictions_have_supporting_index(
const secondary_index::secondary_index_manager& index_manager,
expr::allow_local_index allow_local) const;
bool multi_column_clustering_restrictions_are_supported_by(const secondary_index::index& index) const;
bounds_slice get_clustering_slice() const;
/**
* Checks if the clustering key has some unrestricted components.
* @return <code>true</code> if the clustering key has some unrestricted components, <code>false</code> otherwise.
@@ -279,15 +374,6 @@ public:
schema_ptr get_view_schema() const { return _view_schema; }
private:
std::pair<std::optional<secondary_index::index>, expr::expression> do_find_idx(const secondary_index::secondary_index_manager& sim) const;
void add_restriction(const expr::binary_operator& restr, schema_ptr schema, bool allow_filtering, bool for_view);
void add_is_not_restriction(const expr::binary_operator& restr, schema_ptr schema, bool for_view);
void add_single_column_parition_key_restriction(const expr::binary_operator& restr, schema_ptr schema, bool allow_filtering, bool for_view);
void add_token_partition_key_restriction(const expr::binary_operator& restr);
void add_single_column_clustering_key_restriction(const expr::binary_operator& restr, schema_ptr schema, bool allow_filtering);
void add_multi_column_clustering_key_restriction(const expr::binary_operator& restr);
void add_single_column_nonprimary_key_restriction(const expr::binary_operator& restr);
void process_partition_key_restrictions(bool for_view, bool allow_filtering, statements::statement_type type);
/**
@@ -315,7 +401,17 @@ private:
void add_clustering_restrictions_to_idx_ck_prefix(const schema& idx_tbl_schema);
unsigned int num_clustering_prefix_columns_that_need_not_be_filtered() const;
void calculate_column_defs_for_filtering_and_erase_restrictions_used_for_index(data_dictionary::database db);
void calculate_column_defs_for_filtering_and_erase_restrictions_used_for_index(
data_dictionary::database db,
const single_column_predicate_vectors& sc_pk_pred_vectors,
const single_column_predicate_vectors& sc_ck_pred_vectors,
const single_column_predicate_vectors& sc_nonpk_pred_vectors);
get_partition_key_ranges_fn_t build_partition_key_ranges_fn() const;
get_clustering_bounds_fn_t build_get_clustering_bounds_fn() const;
get_clustering_bounds_fn_t build_get_global_index_clustering_ranges_fn() const;
get_clustering_bounds_fn_t build_get_global_index_token_clustering_ranges_fn() const;
get_clustering_bounds_fn_t build_get_local_index_clustering_ranges_fn() const;
get_singleton_value_fn_t build_value_for_index_partition_key_fn() const;
public:
/**
* Returns the specified range of the partition key.
@@ -389,7 +485,10 @@ public:
private:
/// Prepares internal data for evaluating index-table queries. Must be called before
/// get_local_index_clustering_ranges().
void prepare_indexed_local(const schema& idx_tbl_schema);
void prepare_indexed_local(const schema& idx_tbl_schema,
const single_column_predicate_vectors& sc_pk_pred_vectors,
const single_column_predicate_vectors& sc_ck_pred_vectors,
const single_column_predicate_vectors& sc_nonpk_pred_vectors);
/// Prepares internal data for evaluating index-table queries. Must be called before
/// get_global_index_clustering_ranges() or get_global_index_token_clustering_ranges().
@@ -398,15 +497,18 @@ private:
public:
/// Calculates clustering ranges for querying a global-index table.
std::vector<query::clustering_range> get_global_index_clustering_ranges(
const query_options& options, const schema& idx_tbl_schema) const;
const query_options& options) const;
/// Calculates clustering ranges for querying a global-index table for queries with token restrictions present.
std::vector<query::clustering_range> get_global_index_token_clustering_ranges(
const query_options& options, const schema& idx_tbl_schema) const;
const query_options& options) const;
/// Calculates clustering ranges for querying a local-index table.
std::vector<query::clustering_range> get_local_index_clustering_ranges(
const query_options& options, const schema& idx_tbl_schema) const;
const query_options& options) const;
/// Finds the value of partition key of the index table
bytes_opt value_for_index_partition_key(const query_options&) const;
sstring to_string() const;
@@ -416,7 +518,7 @@ public:
bool is_empty() const;
};
statement_restrictions analyze_statement_restrictions(
shared_ptr<const statement_restrictions> analyze_statement_restrictions(
data_dictionary::database db,
schema_ptr schema,
statements::statement_type type,
@@ -427,23 +529,14 @@ statement_restrictions analyze_statement_restrictions(
bool allow_filtering,
check_indexes do_check_indexes);
// Extracts all binary operators which have the given column on their left hand side.
// Extracts only single-column restrictions.
// Does not include multi-column restrictions.
// Does not include token() restrictions.
// Does not include boolean constant restrictions.
// For example "WHERE c = 1 AND (a, c) = (2, 1) AND token(p) < 2 AND FALSE" will return {"c = 1"}.
std::vector<expr::expression> extract_single_column_restrictions_for_column(const expr::expression&, const column_definition&);
shared_ptr<const statement_restrictions> make_trivial_statement_restrictions(
schema_ptr schema,
bool allow_filtering);
// Checks whether this expression is empty - doesn't restrict anything
bool is_empty_restriction(const expr::expression&);
// Finds the value of the given column in the expression
// In case of multpiple possible values calls on_internal_error
bytes_opt value_for(const column_definition&, const expr::expression&, const query_options&);
}
}
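
Review note: the `value_set` / `solve_for_t` idea declared in this header (a predicate is turned into a function that returns the set of satisfying values) can be sketched with simplified types. Everything here is an assumption for illustration: `int` stands in for `managed_bytes`, `params` for `query_options`, `int_interval` for `interval<managed_bytes>`, and the `solve_*` helpers and `contains` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <optional>
#include <utility>
#include <variant>
#include <vector>

using value_list = std::vector<int>; // kept sorted and deduplicated

struct int_interval {
    std::optional<int> lower; // inclusive bound; nullopt means unbounded
    std::optional<int> upper; // inclusive bound; nullopt means unbounded
};

using value_set = std::variant<value_list, int_interval>;
using params = std::vector<int>; // bound values, indexed by marker position
using solve_for_t = std::function<value_set(const params&)>;

// Solver for `col = ?`: a single-element list.
inline solve_for_t solve_eq(std::size_t marker) {
    return [marker](const params& p) -> value_set { return value_list{p.at(marker)}; };
}

// Solver for `col IN (?, ?, ...)`: a sorted, deduplicated list.
inline solve_for_t solve_in(std::vector<std::size_t> markers) {
    return [markers = std::move(markers)](const params& p) -> value_set {
        value_list out;
        for (auto m : markers) {
            out.push_back(p.at(m));
        }
        std::sort(out.begin(), out.end());
        out.erase(std::unique(out.begin(), out.end()), out.end());
        return out;
    };
}

// Solver for `col >= ?`: an interval with only a lower bound.
inline solve_for_t solve_ge(std::size_t marker) {
    return [marker](const params& p) -> value_set {
        return int_interval{p.at(marker), std::nullopt};
    };
}

// Membership test over either alternative of the variant.
inline bool contains(const value_set& vs, int v) {
    if (const auto* list = std::get_if<value_list>(&vs)) {
        return std::binary_search(list->begin(), list->end(), v);
    }
    const auto& iv = std::get<int_interval>(vs);
    return (!iv.lower || *iv.lower <= v) && (!iv.upper || v <= *iv.upper);
}
```

The solver is built once at prepare time and evaluated per execution with the bound values, which appears to be the motivation for storing `solve_for` on each `predicate` rather than re-analyzing the expression on every query.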

View File

@@ -90,6 +90,20 @@ void cql3::statements::alter_keyspace_statement::validate(query_processor& qp, c
auto& current_rf_per_dc = ks.metadata()->strategy_options();
auto new_rf_per_dc = _attrs->get_replication_options();
new_rf_per_dc.erase(ks_prop_defs::REPLICATION_STRATEGY_CLASS_KEY);
// Check if multi-RF change is allowed: all DC changes must be 0->N or N->0.
auto all_changes_are_0_N = [&] {
for (const auto& [dc, new_rf] : new_rf_per_dc) {
auto old_rf_val = size_t(0);
if (auto it = current_rf_per_dc.find(dc); it != current_rf_per_dc.end()) {
old_rf_val = locator::get_replication_factor(it->second);
}
auto new_rf_val = locator::get_replication_factor(new_rf);
if (old_rf_val != new_rf_val && old_rf_val != 0 && new_rf_val != 0) {
return false;
}
}
return true;
};
unsigned total_abs_rfs_diff = 0;
for (const auto& [new_dc, new_rf] : new_rf_per_dc) {
auto old_rf = locator::replication_strategy_config_option(sstring("0"));
@@ -103,7 +117,9 @@ void cql3::statements::alter_keyspace_statement::validate(query_processor& qp, c
// first we need to report non-existing DCs, then if RFs aren't changed by too much.
continue;
}
if (total_abs_rfs_diff += get_abs_rf_diff(old_rf, new_rf); total_abs_rfs_diff >= 2) {
if (total_abs_rfs_diff += get_abs_rf_diff(old_rf, new_rf); total_abs_rfs_diff >= 2 &&
!(qp.proxy().features().keyspace_multi_rf_change && locator::uses_rack_list_exclusively(current_rf_per_dc)
&& locator::uses_rack_list_exclusively(new_ks->strategy_options()) && all_changes_are_0_N())) {
throw exceptions::invalid_request_exception("Only one DC's RF can be changed at a time and not by more than 1");
}
}
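
As a rough illustration of the 0->N / N->0 rule checked in the hunk above, here is a minimal standalone sketch. It assumes plain numeric RFs held in a std::map keyed by DC name; the `rf_per_dc` alias and the free-function form are hypothetical stand-ins for locator::replication_strategy_config_options and the lambda, not the actual implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>

// Hypothetical stand-in for the per-DC replication options: DC name -> numeric RF.
using rf_per_dc = std::map<std::string, std::size_t>;

// Returns true when every DC listed in `next` either keeps its RF,
// goes from 0 to N, or goes from N to 0.
// A DC missing from `current` is treated as having RF 0, as in the hunk above.
bool all_changes_are_0_N(const rf_per_dc& current, const rf_per_dc& next) {
    for (const auto& [dc, new_rf] : next) {
        std::size_t old_rf = 0;
        if (auto it = current.find(dc); it != current.end()) {
            old_rf = it->second;
        }
        if (old_rf != new_rf && old_rf != 0 && new_rf != 0) {
            return false; // e.g. 3 -> 2: a partial change, not 0->N or N->0
        }
    }
    return true;
}
```

Like the lambda, the sketch only iterates over the DCs named in the new options, so a DC that is present in the current options but absent from the new ones is not examined here.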

View File

@@ -89,6 +89,10 @@ public:
const std::vector<single_statement>& statements() const { return _statements; }
audit::audit_info_ptr audit_info() const {
return audit::audit::create_audit_info(audit::statement_category::DML, sstring(), sstring(), true);
}
virtual bool depends_on(std::string_view ks_name, std::optional<std::string_view> cf_name) const override;
virtual uint32_t get_bound_terms() const override;

View File

@@ -411,10 +411,10 @@ bool ks_prop_defs::get_durable_writes() const {
lw_shared_ptr<data_dictionary::keyspace_metadata> ks_prop_defs::as_ks_metadata(sstring ks_name, const locator::token_metadata& tm, const gms::feature_service& feat, const db::config& cfg) {
auto sc = get_replication_strategy_class().value();
// if tablets options have not been specified, but tablets are globally enabled, set the value to 0 for N.T.S. only
// if tablets options have not been specified, but tablets are globally enabled, set the value to 0. The strategy will
// validate it and throw an error if it does not support tablets.
auto enable_tablets = feat.tablets && cfg.enable_tablets_by_default();
std::optional<unsigned> default_initial_tablets = enable_tablets && locator::abstract_replication_strategy::to_qualified_class_name(sc) == "org.apache.cassandra.locator.NetworkTopologyStrategy"
? std::optional<unsigned>(0) : std::nullopt;
std::optional<unsigned> default_initial_tablets = enable_tablets ? std::optional<unsigned>(0) : std::nullopt;
auto initial_tablets = get_initial_tablets(default_initial_tablets, cfg.enforce_tablets());
bool uses_tablets = initial_tablets.has_value();
bool rack_list_enabled = utils::get_local_injector().enter("create_with_numeric") ? false : feat.rack_list_rf;
@@ -440,7 +440,7 @@ lw_shared_ptr<data_dictionary::keyspace_metadata> ks_prop_defs::as_ks_metadata_u
sc = old->strategy_name();
options = old_options;
}
return data_dictionary::keyspace_metadata::new_keyspace(old->name(), *sc, options, initial_tablets, get_consistency_option(), get_boolean(KW_DURABLE_WRITES, true), get_storage_options());
return data_dictionary::keyspace_metadata::new_keyspace(old->name(), *sc, options, initial_tablets, get_consistency_option(), get_boolean(KW_DURABLE_WRITES, true), get_storage_options(), {}, old->next_strategy_options_opt());
}
namespace {

View File

@@ -626,7 +626,7 @@ modification_statement::prepare(data_dictionary::database db, prepare_context& c
// Since this cache is only meaningful for LWT queries, just clear the ids
// if it's not a conditional statement so that the AST nodes don't
// participate in the caching mechanism later.
if (!prepared_stmt->has_conditions() && prepared_stmt->_restrictions.has_value()) {
if (!prepared_stmt->has_conditions() && prepared_stmt->_restrictions) {
ctx.clear_pk_function_calls_cache();
}
prepared_stmt->_may_use_token_aware_routing = ctx.get_partition_key_bind_indexes(*schema).size() != 0;

View File

@@ -94,7 +94,7 @@ private:
std::optional<bool> _is_raw_counter_shard_write;
protected:
std::optional<restrictions::statement_restrictions> _restrictions;
shared_ptr<const restrictions::statement_restrictions> _restrictions;
public:
typedef std::optional<std::unordered_map<sstring, bytes_opt>> json_cache_opt;

View File

@@ -19,7 +19,7 @@ public:
uint32_t bound_terms,
lw_shared_ptr<const parameters> parameters,
::shared_ptr<selection::selection> selection,
::shared_ptr<restrictions::statement_restrictions> restrictions,
::shared_ptr<const restrictions::statement_restrictions> restrictions,
::shared_ptr<std::vector<size_t>> group_by_cell_indices,
bool is_reversed,
ordering_comparator_type ordering_comparator,

View File

@@ -109,7 +109,7 @@ public:
std::unique_ptr<prepared_statement> prepare(data_dictionary::database db, cql_stats& stats, const cql_config& cfg, bool for_view);
private:
std::vector<selection::prepared_selector> maybe_jsonize_select_clause(std::vector<selection::prepared_selector> select, data_dictionary::database db, schema_ptr schema);
::shared_ptr<restrictions::statement_restrictions> prepare_restrictions(
::shared_ptr<const restrictions::statement_restrictions> prepare_restrictions(
data_dictionary::database db,
schema_ptr schema,
prepare_context& ctx,

View File

@@ -1027,7 +1027,7 @@ view_indexed_table_select_statement::prepare(data_dictionary::database db,
uint32_t bound_terms,
lw_shared_ptr<const parameters> parameters,
::shared_ptr<selection::selection> selection,
::shared_ptr<restrictions::statement_restrictions> restrictions,
::shared_ptr<const restrictions::statement_restrictions> restrictions,
::shared_ptr<std::vector<size_t>> group_by_cell_indices,
bool is_reversed,
ordering_comparator_type ordering_comparator,
@@ -1139,7 +1139,7 @@ lw_shared_ptr<const service::pager::paging_state> view_indexed_table_select_stat
auto& last_base_pk = last_pos.partition;
auto* last_base_ck = last_pos.position.has_key() ? &last_pos.position.key() : nullptr;
bytes_opt indexed_column_value = restrictions::value_for(*cdef, _used_index_restrictions, options);
bytes_opt indexed_column_value = _restrictions->value_for_index_partition_key(options);
auto index_pk = [&]() {
if (_index.metadata().local()) {
@@ -1350,12 +1350,7 @@ dht::partition_range_vector view_indexed_table_select_statement::get_partition_r
dht::partition_range_vector view_indexed_table_select_statement::get_partition_ranges_for_global_index_posting_list(const query_options& options) const {
dht::partition_range_vector partition_ranges;
const column_definition* cdef = _schema->get_column_definition(to_bytes(_index.target_column()));
if (!cdef) {
throw exceptions::invalid_request_exception("Indexed column not found in schema");
}
bytes_opt value = restrictions::value_for(*cdef, _used_index_restrictions, options);
bytes_opt value = _restrictions->value_for_index_partition_key(options);
if (value) {
auto pk = partition_key::from_single_value(*_view_schema, *value);
auto dk = dht::decorate_key(*_view_schema, pk);
@@ -1374,11 +1369,11 @@ query::partition_slice view_indexed_table_select_statement::get_partition_slice_
// Only EQ restrictions on base partition key can be used in an index view query
if (pk_restrictions_is_single && _restrictions->partition_key_restrictions_is_all_eq()) {
partition_slice_builder.with_ranges(
_restrictions->get_global_index_clustering_ranges(options, *_view_schema));
_restrictions->get_global_index_clustering_ranges(options));
} else if (_restrictions->has_token_restrictions()) {
// Restrictions like token(p1, p2) < 0 have all partition key components restricted, but require special handling.
partition_slice_builder.with_ranges(
_restrictions->get_global_index_token_clustering_ranges(options, *_view_schema));
_restrictions->get_global_index_token_clustering_ranges(options));
}
}
@@ -1389,7 +1384,7 @@ query::partition_slice view_indexed_table_select_statement::get_partition_slice_
partition_slice_builder partition_slice_builder{*_view_schema};
partition_slice_builder.with_ranges(
_restrictions->get_local_index_clustering_ranges(options, *_view_schema));
_restrictions->get_local_index_clustering_ranges(options));
return partition_slice_builder.build();
}
@@ -1607,7 +1602,7 @@ public:
uint32_t bound_terms,
lw_shared_ptr<const parameters> parameters,
::shared_ptr<selection::selection> selection,
::shared_ptr<restrictions::statement_restrictions> restrictions,
::shared_ptr<const restrictions::statement_restrictions> restrictions,
::shared_ptr<std::vector<size_t>> group_by_cell_indices,
bool is_reversed,
ordering_comparator_type ordering_comparator,
@@ -1645,7 +1640,7 @@ private:
uint32_t bound_terms,
lw_shared_ptr<const select_statement::parameters> parameters,
::shared_ptr<selection::selection> selection,
::shared_ptr<restrictions::statement_restrictions> restrictions,
::shared_ptr<const restrictions::statement_restrictions> restrictions,
::shared_ptr<std::vector<size_t>> group_by_cell_indices,
bool is_reversed,
parallelized_select_statement::ordering_comparator_type ordering_comparator,
@@ -2076,7 +2071,7 @@ static select_statement::ordering_comparator_type get_similarity_ordering_compar
::shared_ptr<cql3::statements::select_statement> vector_indexed_table_select_statement::prepare(data_dictionary::database db, schema_ptr schema,
uint32_t bound_terms, lw_shared_ptr<const parameters> parameters, ::shared_ptr<selection::selection> selection,
::shared_ptr<restrictions::statement_restrictions> restrictions, ::shared_ptr<std::vector<size_t>> group_by_cell_indices, bool is_reversed,
::shared_ptr<const restrictions::statement_restrictions> restrictions, ::shared_ptr<std::vector<size_t>> group_by_cell_indices, bool is_reversed,
ordering_comparator_type ordering_comparator, prepared_ann_ordering_type prepared_ann_ordering, std::optional<expr::expression> limit,
std::optional<expr::expression> per_partition_limit, cql_stats& stats, const secondary_index::index& index, std::unique_ptr<attributes> attrs) {
@@ -2589,7 +2584,7 @@ std::unique_ptr<prepared_statement> select_statement::prepare(data_dictionary::d
return make_unique<prepared_statement>(audit_info(), std::move(stmt), ctx, std::move(partition_key_bind_indices), std::move(warnings));
}
::shared_ptr<restrictions::statement_restrictions>
::shared_ptr<const restrictions::statement_restrictions>
select_statement::prepare_restrictions(data_dictionary::database db,
schema_ptr schema,
prepare_context& ctx,
@@ -2599,8 +2594,8 @@ select_statement::prepare_restrictions(data_dictionary::database db,
restrictions::check_indexes do_check_indexes)
{
try {
return ::make_shared<restrictions::statement_restrictions>(restrictions::analyze_statement_restrictions(db, schema, statement_type::SELECT, _where_clause, ctx,
selection->contains_only_static_columns(), for_view, allow_filtering, do_check_indexes));
return restrictions::analyze_statement_restrictions(db, schema, statement_type::SELECT, _where_clause, ctx,
selection->contains_only_static_columns(), for_view, allow_filtering, do_check_indexes);
} catch (const exceptions::unrecognized_entity_exception& e) {
if (contains_alias(e.entity)) {
throw exceptions::invalid_request_exception(format("Aliases aren't allowed in the WHERE clause (name: '{}')", e.entity));

View File

@@ -200,7 +200,7 @@ public:
uint32_t bound_terms,
lw_shared_ptr<const parameters> parameters,
::shared_ptr<selection::selection> selection,
::shared_ptr<restrictions::statement_restrictions> restrictions,
::shared_ptr<const restrictions::statement_restrictions> restrictions,
::shared_ptr<std::vector<size_t>> group_by_cell_indices,
bool is_reversed,
ordering_comparator_type ordering_comparator,
@@ -372,7 +372,7 @@ public:
static ::shared_ptr<cql3::statements::select_statement> prepare(data_dictionary::database db, schema_ptr schema, uint32_t bound_terms,
lw_shared_ptr<const parameters> parameters, ::shared_ptr<selection::selection> selection,
::shared_ptr<restrictions::statement_restrictions> restrictions, ::shared_ptr<std::vector<size_t>> group_by_cell_indices, bool is_reversed,
::shared_ptr<const restrictions::statement_restrictions> restrictions, ::shared_ptr<std::vector<size_t>> group_by_cell_indices, bool is_reversed,
ordering_comparator_type ordering_comparator, prepared_ann_ordering_type prepared_ann_ordering, std::optional<expr::expression> limit,
std::optional<expr::expression> per_partition_limit, cql_stats& stats, const secondary_index::index& index, std::unique_ptr<cql3::attributes> attrs);

View File

@@ -66,7 +66,7 @@ public:
: update_statement(std::move(audit_info), statement_type::INSERT, bound_terms, s, std::move(attrs), stats)
, _value(std::move(v))
, _default_unset(default_unset) {
_restrictions = restrictions::statement_restrictions(s, false);
_restrictions = cql3::restrictions::make_trivial_statement_restrictions(s, false);
}
private:
virtual void execute_operations_for_key(mutation& m, const clustering_key_prefix& prefix, const update_parameters& params, const json_cache_opt& json_cache) const override;

View File

@@ -224,10 +224,12 @@ keyspace_metadata::keyspace_metadata(std::string_view name,
bool durable_writes,
std::vector<schema_ptr> cf_defs,
user_types_metadata user_types,
storage_options storage_opts)
storage_options storage_opts,
std::optional<locator::replication_strategy_config_options> next_options)
: _name{name}
, _strategy_name{locator::abstract_replication_strategy::to_qualified_class_name(strategy_name.empty() ? "NetworkTopologyStrategy" : strategy_name)}
, _strategy_options{std::move(strategy_options)}
, _next_strategy_options{std::move(next_options)}
, _initial_tablets(initial_tablets)
, _durable_writes{durable_writes}
, _user_types{std::move(user_types)}
@@ -273,14 +275,15 @@ keyspace_metadata::new_keyspace(std::string_view name,
std::optional<consistency_config_option> consistency_option,
bool durables_writes,
storage_options storage_opts,
std::vector<schema_ptr> cf_defs)
std::vector<schema_ptr> cf_defs,
std::optional<locator::replication_strategy_config_options> next_options)
{
return ::make_lw_shared<keyspace_metadata>(name, strategy_name, options, initial_tablets, consistency_option, durables_writes, cf_defs, user_types_metadata{}, storage_opts);
return ::make_lw_shared<keyspace_metadata>(name, strategy_name, options, initial_tablets, consistency_option, durables_writes, cf_defs, user_types_metadata{}, storage_opts, next_options);
}
lw_shared_ptr<keyspace_metadata>
keyspace_metadata::new_keyspace(const keyspace_metadata& ksm) {
return new_keyspace(ksm.name(), ksm.strategy_name(), ksm.strategy_options(), ksm.initial_tablets(), ksm.consistency_option(), ksm.durable_writes(), ksm.get_storage_options());
return new_keyspace(ksm.name(), ksm.strategy_name(), ksm.strategy_options(), ksm.initial_tablets(), ksm.consistency_option(), ksm.durable_writes(), ksm.get_storage_options(), {}, ksm.next_strategy_options_opt());
}
void keyspace_metadata::add_user_type(const user_type ut) {
@@ -336,7 +339,7 @@ static storage_options::object_storage object_storage_from_map(std::string_view
}
if (values.size() > allowed_options.size()) {
throw std::runtime_error(fmt::format("Extraneous options for {}: {}; allowed: {}",
fmt::join(values | std::views::keys, ","), type,
type, fmt::join(values | std::views::keys, ","),
fmt::join(allowed_options | std::views::keys, ",")));
}
options.type = std::string(type);
@@ -649,8 +652,8 @@ struct fmt::formatter<data_dictionary::user_types_metadata> {
};
auto fmt::formatter<data_dictionary::keyspace_metadata>::format(const data_dictionary::keyspace_metadata& m, fmt::format_context& ctx) const -> decltype(ctx.out()) {
fmt::format_to(ctx.out(), "KSMetaData{{name={}, strategyClass={}, strategyOptions={}, cfMetaData={}, durable_writes={}, tablets=",
m.name(), m.strategy_name(), m.strategy_options(), m.cf_meta_data(), m.durable_writes());
fmt::format_to(ctx.out(), "KSMetaData{{name={}, strategyClass={}, strategyOptions={}, nextStrategyOptions={}, cfMetaData={}, durable_writes={}, tablets=",
m.name(), m.strategy_name(), m.strategy_options(), m.next_strategy_options_opt(), m.cf_meta_data(), m.durable_writes());
if (m.initial_tablets()) {
if (auto initial_tablets = m.initial_tablets().value()) {
fmt::format_to(ctx.out(), "{{\"initial\":{}}}", initial_tablets);

View File

@@ -28,7 +28,9 @@ namespace data_dictionary {
class keyspace_metadata final {
sstring _name;
sstring _strategy_name;
// If _next_strategy_options has a value, there is an ongoing RF change of this keyspace.
locator::replication_strategy_config_options _strategy_options;
std::optional<locator::replication_strategy_config_options> _next_strategy_options;
std::optional<unsigned> _initial_tablets;
std::unordered_map<sstring, schema_ptr> _cf_meta_data;
bool _durable_writes;
@@ -44,7 +46,8 @@ public:
bool durable_writes,
std::vector<schema_ptr> cf_defs = std::vector<schema_ptr>{},
user_types_metadata user_types = user_types_metadata{},
storage_options storage_opts = storage_options{});
storage_options storage_opts = storage_options{},
std::optional<locator::replication_strategy_config_options> next_options = std::nullopt);
static lw_shared_ptr<keyspace_metadata>
new_keyspace(std::string_view name,
std::string_view strategy_name,
@@ -53,7 +56,8 @@ public:
std::optional<consistency_config_option> consistency_option,
bool durables_writes = true,
storage_options storage_opts = {},
std::vector<schema_ptr> cf_defs = {});
std::vector<schema_ptr> cf_defs = {},
std::optional<locator::replication_strategy_config_options> next_options = std::nullopt);
static lw_shared_ptr<keyspace_metadata>
new_keyspace(const keyspace_metadata& ksm);
void validate(const gms::feature_service&, const locator::topology&) const;
@@ -66,6 +70,18 @@ public:
const locator::replication_strategy_config_options& strategy_options() const {
return _strategy_options;
}
void set_strategy_options(const locator::replication_strategy_config_options& options) {
_strategy_options = options;
}
const std::optional<locator::replication_strategy_config_options>& next_strategy_options_opt() const {
return _next_strategy_options;
}
void set_next_strategy_options(const locator::replication_strategy_config_options& options) {
_next_strategy_options = options;
}
void clear_next_strategy_options() {
_next_strategy_options = std::nullopt;
}
locator::replication_strategy_config_options strategy_options_v1() const;
std::optional<unsigned> initial_tablets() const {
return _initial_tablets;

View File

@@ -776,7 +776,7 @@ class db::commitlog::segment : public enable_shared_from_this<segment>, public c
friend std::ostream& operator<<(std::ostream&, const segment&);
friend class segment_manager;
size_t sector_overhead(size_t size) const {
constexpr size_t sector_overhead(size_t size) const {
return (size / (_alignment - detail::sector_overhead_size)) * detail::sector_overhead_size;
}
@@ -1028,18 +1028,21 @@ public:
co_return me;
}
/**
* Allocate a new buffer
*/
void new_buffer(size_t s) {
SCYLLA_ASSERT(_buffer.empty());
std::tuple<size_t, size_t> buffer_usage_size(size_t s) const {
auto overhead = segment_overhead_size;
if (_file_pos == 0) {
overhead += descriptor_header_size;
}
s += overhead;
return {s + overhead, overhead};
}
/**
* Allocate a new buffer
*/
void new_buffer(size_t size_in) {
SCYLLA_ASSERT(_buffer.empty());
auto [s, overhead] = buffer_usage_size(size_in);
// add bookkeep data reqs.
auto a = align_up(s + sector_overhead(s), _alignment);
auto k = std::max(a, default_size);
@@ -1427,6 +1430,9 @@ public:
position_type next_position(size_t size) const {
auto used = _buffer_ostream_size - _buffer_ostream.size();
if (used == 0) { // new chunk/segment
std::tie(size, std::ignore) = buffer_usage_size(size);
}
used += size;
return _file_pos + used + sector_overhead(used);
}
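
The position arithmetic in the two hunks above can be sketched in isolation as follows. All four constants are assumed values chosen only for illustration (the real ones live in the commitlog code and are not shown in this diff), and the free functions take `file_pos` and `used` as parameters instead of reading segment members.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>

// Assumed values, for illustration only.
constexpr std::size_t alignment = 4096;
constexpr std::size_t sector_overhead_size = 12;
constexpr std::size_t segment_overhead_size = 8;
constexpr std::size_t descriptor_header_size = 32;

// Bookkeeping bytes needed for `size` payload bytes: one overhead unit per
// (alignment - sector_overhead_size) bytes of payload.
constexpr std::size_t sector_overhead(std::size_t size) {
    return (size / (alignment - sector_overhead_size)) * sector_overhead_size;
}

// Returns {size including chunk overhead, chunk overhead}. The descriptor
// header is only counted for the first chunk of a file (file_pos == 0).
constexpr std::tuple<std::size_t, std::size_t> buffer_usage_size(std::size_t s, std::size_t file_pos) {
    auto overhead = segment_overhead_size;
    if (file_pos == 0) {
        overhead += descriptor_header_size;
    }
    return {s + overhead, overhead};
}

// Mirrors the patched next_position(): when nothing has been written to the
// current buffer yet, the chunk overhead is added before computing the position.
constexpr std::size_t next_position(std::size_t file_pos, std::size_t used, std::size_t size) {
    if (used == 0) { // new chunk/segment
        size = std::get<0>(buffer_usage_size(size, file_pos));
    }
    used += size;
    return file_pos + used + sector_overhead(used);
}
```

Under these assumed constants, a 100-byte entry at the very start of a file is predicted to end at offset 140 (100 payload bytes plus 8 + 32 bytes of overhead), whereas the same entry appended to a buffer that already holds data only advances the position by its own size plus any sector overhead.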
@@ -1570,7 +1576,6 @@ future<> db::commitlog::segment_manager::oversized_allocation(entry_writer& writ
clogger.debug("Attempting oversized alloc of {} entry writer", writer.num_entries);
auto size = writer.size();
auto max_file_size = cfg.commitlog_segment_size_in_mb * 1024 * 1024;
// check if this cannot be written at all...
if (!cfg.allow_going_over_size_limit) {
@@ -1579,11 +1584,11 @@ future<> db::commitlog::segment_manager::oversized_allocation(entry_writer& writ
// more worst case
auto size_with_meta_overhead = size_with_sector_overhead
+ (1 + size_with_sector_overhead/max_mutation_size) * (segment::entry_overhead_size + segment::fragmented_entry_overhead_size + segment::segment_overhead_size)
* (1 + size_with_sector_overhead/max_file_size) * segment::descriptor_header_size
* (1 + size_with_sector_overhead/max_size) * segment::descriptor_header_size
;
// this is not really true. We could have some space in current segment,
// but again, let's be conservative.
auto max_file_size_avail = max_disk_size - max_file_size;
auto max_file_size_avail = max_disk_size - max_size;
if (size_with_meta_overhead > max_file_size_avail) {
throw std::invalid_argument(fmt::format("Mutation of {} bytes is too large for potentially available disk space of {}", size, max_file_size_avail));
@@ -1770,11 +1775,13 @@ future<> db::commitlog::segment_manager::oversized_allocation(entry_writer& writ
co_await s->close();
s = co_await get_segment();
}
// bytes not counting overhead
auto buf_rem = std::min(max_size - s->position(), s->_buffer_ostream.size());
// bytes not counting overhead
auto pos = s->position();
auto max = std::max<size_t>(pos, max_size);
auto buf_rem = std::min(max_size - max, s->_buffer_ostream.size());
size_t avail;
if (buf_rem > align) {
if (buf_rem >= align) {
auto rem2 = buf_rem - (1 + buf_rem/sector_size) * detail::sector_overhead_size;
avail = std::min(rem2, max_mutation_size)
- segment::entry_overhead_size
@@ -1784,7 +1791,7 @@ future<> db::commitlog::segment_manager::oversized_allocation(entry_writer& writ
} else {
co_await s->cycle();
auto pos = s->position();
auto max = std::max<size_t>(pos, max_file_size);
auto max = std::max<size_t>(pos, max_size);
auto file_rem = max - pos;
if (file_rem < align) {

View File

@@ -217,7 +217,7 @@ future<> db::commitlog_replayer::impl::process(stats* s, commitlog::buffer_and_r
if (cm_it == local_cm.end()) {
if (!cer.get_column_mapping()) {
rlogger.debug("replaying at {} v={} at {}", fm.column_family_id(), fm.schema_version(), rp);
throw std::runtime_error(format("unknown schema version {}, table=", fm.schema_version(), fm.column_family_id()));
throw std::runtime_error(format("unknown schema version {}, table={}", fm.schema_version(), fm.column_family_id()));
}
rlogger.debug("new schema version {} in entry {}", fm.schema_version(), rp);
cm_it = local_cm.emplace(fm.schema_version(), *cer.get_column_mapping()).first;

View File

@@ -1921,7 +1921,7 @@ std::map<sstring, db::experimental_features_t::feature> db::experimental_feature
{"lwt", feature::UNUSED},
{"udf", feature::UDF},
{"cdc", feature::UNUSED},
{"alternator-streams", feature::ALTERNATOR_STREAMS},
{"alternator-streams", feature::UNUSED},
{"alternator-ttl", feature::UNUSED },
{"consistent-topology-changes", feature::UNUSED},
{"broadcast-tables", feature::BROADCAST_TABLES},

View File

@@ -115,7 +115,6 @@ struct experimental_features_t {
enum class feature {
UNUSED,
UDF,
ALTERNATOR_STREAMS,
BROADCAST_TABLES,
KEYSPACE_STORAGE_OPTIONS,
STRONGLY_CONSISTENT_TABLES,

View File

@@ -277,7 +277,7 @@ filter_for_query(consistency_level cl,
host_id_vector_replica_set selected_endpoints;
// Pre-select endpoints based on client preference. If the endpoints
// Preselect endpoints based on client preference. If the endpoints
// selected this way aren't enough to satisfy CL requirements select the
// remaining ones according to the load-balancing strategy as before.
if (!preferred_endpoints.empty()) {

View File

@@ -327,7 +327,7 @@ redistribute(const std::vector<float>& p, unsigned me, unsigned k) {
}
}
hr_logger.trace(" pp after1=", pp);
hr_logger.trace(" pp after1={}", pp);
if (d.first == me) {
// We only care what "me" sends, and only the elements in
// the sorted list earlier than me could have forced it to

View File

@@ -33,6 +33,11 @@ enum class schema_feature {
// Per-table tablet options
TABLET_OPTIONS,
// When enabled, `system_schema.keyspaces` will keep three replication values:
// the initial, the current, and the target replication factor,
// which reflect the phases of the multi RF change.
KEYSPACE_MULTI_RF_CHANGE,
};
using schema_features = enum_set<super_enum<schema_feature,
@@ -43,7 +48,8 @@ using schema_features = enum_set<super_enum<schema_feature,
schema_feature::TABLE_DIGEST_INSENSITIVE_TO_EXPIRY,
schema_feature::GROUP0_SCHEMA_VERSIONING,
schema_feature::IN_MEMORY_TABLES,
schema_feature::TABLET_OPTIONS
schema_feature::TABLET_OPTIONS,
schema_feature::KEYSPACE_MULTI_RF_CHANGE
>>;
}

View File

@@ -216,6 +216,7 @@ schema_ptr keyspaces() {
{"durable_writes", boolean_type},
{"replication", map_type_impl::get_instance(utf8_type, utf8_type, false)},
{"replication_v2", map_type_impl::get_instance(utf8_type, utf8_type, false)}, // with rack list RF
{"next_replication", map_type_impl::get_instance(utf8_type, utf8_type, false)}, // target rack list RF for this RF change
},
// static columns
{},
@@ -1178,6 +1179,14 @@ utils::chunked_vector<mutation> make_create_keyspace_mutations(schema_features f
// If the maps are different, the upgrade must be already done.
store_map(m, ckey, "replication_v2", timestamp, cql3::statements::to_flattened_map(map));
}
if (features.contains<schema_feature::KEYSPACE_MULTI_RF_CHANGE>()) {
const auto& next_map_opt = keyspace->next_strategy_options_opt();
if (next_map_opt) {
auto next_map = *next_map_opt;
next_map["class"] = keyspace->strategy_name();
store_map(m, ckey, "next_replication", timestamp, cql3::statements::to_flattened_map(next_map));
}
}
if (features.contains<schema_feature::SCYLLA_KEYSPACES>()) {
schema_ptr scylla_keyspaces_s = scylla_keyspaces();
@@ -1251,6 +1260,7 @@ future<lw_shared_ptr<keyspace_metadata>> create_keyspace_metadata(
// (or screw up shared pointers)
const auto& replication = row.get_nonnull<map_type_impl::native_type>("replication");
const auto& replication_v2 = row.get<map_type_impl::native_type>("replication_v2");
const auto& next_replication = row.get<map_type_impl::native_type>("next_replication");
cql3::statements::property_definitions::map_type flat_strategy_options;
for (auto& p : replication_v2 ? *replication_v2 : replication) {
@@ -1259,6 +1269,17 @@ future<lw_shared_ptr<keyspace_metadata>> create_keyspace_metadata(
auto strategy_options = cql3::statements::from_flattened_map(flat_strategy_options);
auto strategy_name = std::get<sstring>(strategy_options["class"]);
strategy_options.erase("class");
std::optional<cql3::statements::property_definitions::extended_map_type> next_strategy_options = std::nullopt;
if (next_replication) {
cql3::statements::property_definitions::map_type flat_next_replication;
for (auto& p : *next_replication) {
flat_next_replication.emplace(value_cast<sstring>(p.first), value_cast<sstring>(p.second));
}
next_strategy_options = cql3::statements::from_flattened_map(flat_next_replication);
next_strategy_options->erase("class");
}
bool durable_writes = row.get_nonnull<bool>("durable_writes");
data_dictionary::storage_options storage_opts;
@@ -1284,7 +1305,7 @@ future<lw_shared_ptr<keyspace_metadata>> create_keyspace_metadata(
}
}
}
co_return keyspace_metadata::new_keyspace(keyspace_name, strategy_name, strategy_options, initial_tablets, consistency, durable_writes, storage_opts);
co_return keyspace_metadata::new_keyspace(keyspace_name, strategy_name, strategy_options, initial_tablets, consistency, durable_writes, storage_opts, {}, next_strategy_options);
}
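
The write and read paths for the `next_replication` column above are symmetric: the strategy class is added under the "class" key before storing and erased again after loading, and a missing column means no RF change is pending. A minimal sketch of that round trip, assuming a plain string-to-string map (the `options_map` alias and both function names are hypothetical and not part of the patch):

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>

// Hypothetical flattened form of the replication options.
using options_map = std::map<std::string, std::string>;

// Write side: copy the pending options and tag them with the strategy class.
options_map to_stored_next_replication(options_map next, const std::string& strategy_name) {
    next["class"] = strategy_name;
    return next;
}

// Read side: an absent column means no pending RF change; otherwise the
// "class" entry is stripped so that only the per-DC options remain.
std::optional<options_map> from_stored_next_replication(const std::optional<options_map>& column) {
    if (!column) {
        return std::nullopt;
    }
    options_map next = *column;
    next.erase("class");
    return next;
}
```

Keeping the column optional lets readers distinguish "no RF change in progress" from "an RF change whose target options happen to be empty".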
template<typename V>

View File

@@ -13,7 +13,6 @@
#include "replica/database.hh"
#include "db/consistency_level_type.hh"
#include "db/system_keyspace.hh"
#include "db/config.hh"
#include "schema/schema_builder.hh"
#include "timeout_config.hh"
#include "types/types.hh"
@@ -22,8 +21,6 @@
#include "cdc/generation.hh"
#include "cql3/query_processor.hh"
#include "service/storage_proxy.hh"
#include "gms/feature_service.hh"
#include "service/migration_manager.hh"
#include "locator/host_id.hh"
@@ -41,27 +38,10 @@ static logging::logger dlogger("system_distributed_keyspace");
extern logging::logger cdc_log;
namespace db {
namespace {
const auto set_wait_for_sync_to_commitlog = schema_builder::register_schema_initializer([](schema_builder& builder) {
if ((builder.ks_name() == system_distributed_keyspace::NAME_EVERYWHERE && builder.cf_name() == system_distributed_keyspace::CDC_GENERATIONS_V2) ||
(builder.ks_name() == system_distributed_keyspace::NAME && builder.cf_name() == system_distributed_keyspace::CDC_TOPOLOGY_DESCRIPTION))
{
builder.set_wait_for_sync_to_commitlog(true);
}
});
}
extern thread_local data_type cdc_streams_set_type;
thread_local data_type cdc_streams_set_type = set_type_impl::get_instance(bytes_type, false);
/* See `token_range_description` struct */
thread_local data_type cdc_streams_list_type = list_type_impl::get_instance(bytes_type, false);
thread_local data_type cdc_token_range_description_type = tuple_type_impl::get_instance(
{ long_type // dht::token token_range_end;
, cdc_streams_list_type // std::vector<stream_id> streams;
, byte_type // uint8_t sharding_ignore_msb;
});
thread_local data_type cdc_generation_description_type = list_type_impl::get_instance(cdc_token_range_description_type, false);
schema_ptr view_build_status() {
static thread_local auto schema = [] {
@@ -77,42 +57,6 @@ schema_ptr view_build_status() {
return schema;
}
/* An internal table used by nodes to exchange CDC generation data. */
schema_ptr cdc_generations_v2() {
thread_local auto schema = [] {
auto id = generate_legacy_id(system_distributed_keyspace::NAME_EVERYWHERE, system_distributed_keyspace::CDC_GENERATIONS_V2);
return schema_builder(system_distributed_keyspace::NAME_EVERYWHERE, system_distributed_keyspace::CDC_GENERATIONS_V2, {id})
/* The unique identifier of this generation. */
.with_column("id", uuid_type, column_kind::partition_key)
/* The generation describes a mapping from all tokens in the token ring to a set of stream IDs.
* This mapping is built from a bunch of smaller mappings, each describing how tokens in a subrange
* of the token ring are mapped to stream IDs; these subranges together cover the entire token ring.
* Each such range-local mapping is represented by a row of this table.
* The clustering key of the row is the end of the range being described by this row.
* The start of this range is the range_end of the previous row (in the clustering order, which is the integer order)
* or of the last row of this partition if this is the first row. */
.with_column("range_end", long_type, column_kind::clustering_key)
/* The set of streams mapped to in this range.
* The number of streams mapped to a single range in a CDC generation is bounded from above by the number
* of shards on the owner of that range in the token ring.
* In other words, the number of elements of this set is bounded by the maximum of the number of shards
* over all nodes. The serialized size is obtained by counting about 20B for each stream.
* For example, if all nodes in the cluster have at most 128 shards,
* the serialized size of this set will be bounded by ~2.5 KB. */
.with_column("streams", cdc_streams_set_type)
/* The value of the `ignore_msb` sharding parameter of the node which was the owner of this token range
* when the generation was first created. Together with the set of streams above it fully describes
* the mapping for this particular range. */
.with_column("ignore_msb", byte_type)
/* Column used for sanity checking.
* For a given generation it's equal to the number of ranges in this generation;
* thus, after the generation is fully inserted, it must be equal to the number of rows in the partition. */
.with_column("num_ranges", int32_type, column_kind::static_column)
.with_hash_version()
.build();
}();
return schema;
}
/* A user-facing table providing identifiers of the streams used in CDC generations. */
schema_ptr cdc_desc() {
@@ -152,23 +96,6 @@ schema_ptr cdc_timestamps() {
static const sstring CDC_TIMESTAMPS_KEY = "timestamps";
schema_ptr service_levels() {
static thread_local auto schema = [] {
auto id = generate_legacy_id(system_distributed_keyspace::NAME, system_distributed_keyspace::SERVICE_LEVELS);
auto builder = schema_builder(system_distributed_keyspace::NAME, system_distributed_keyspace::SERVICE_LEVELS, std::make_optional(id))
.with_column("service_level", utf8_type, column_kind::partition_key)
.with_column("shares", int32_type);
if (utils::get_local_injector().is_enabled("service_levels_v1_table_without_shares")) {
builder.remove_column("shares");
}
return builder
.with_hash_version()
.build();
}();
return schema;
}
// This is the set of tables which this node ensures to exist in the cluster.
// It does that by announcing the creation of these schemas on initialization
// of the `system_distributed_keyspace` service (see `start()`), unless it first
@@ -182,19 +109,13 @@ schema_ptr service_levels() {
static std::vector<schema_ptr> ensured_tables() {
return {
view_build_status(),
cdc_generations_v2(),
cdc_desc(),
cdc_timestamps(),
service_levels(),
};
}
std::vector<schema_ptr> system_distributed_keyspace::all_distributed_tables() {
return {view_build_status(), cdc_desc(), cdc_timestamps(), service_levels()};
}
std::vector<schema_ptr> system_distributed_keyspace::all_everywhere_tables() {
return {cdc_generations_v2()};
return {view_build_status(), cdc_desc(), cdc_timestamps()};
}
system_distributed_keyspace::system_distributed_keyspace(cql3::query_processor& qp, service::migration_manager& mm, service::storage_proxy& sp)
@@ -203,36 +124,6 @@ system_distributed_keyspace::system_distributed_keyspace(cql3::query_processor&
, _sp(sp) {
}
static std::vector<std::pair<std::string_view, data_type>> new_service_levels_columns(bool workload_prioritization_enabled) {
std::vector<std::pair<std::string_view, data_type>> new_columns {{"timeout", duration_type}, {"workload_type", utf8_type}};
if (workload_prioritization_enabled) {
new_columns.push_back({"shares", int32_type});
}
return new_columns;
};
static schema_ptr get_current_service_levels(data_dictionary::database db) {
return db.has_schema(system_distributed_keyspace::NAME, system_distributed_keyspace::SERVICE_LEVELS)
? db.find_schema(system_distributed_keyspace::NAME, system_distributed_keyspace::SERVICE_LEVELS)
: service_levels();
}
static schema_ptr get_updated_service_levels(data_dictionary::database db, bool workload_prioritization_enabled) {
SCYLLA_ASSERT(this_shard_id() == 0);
auto schema = get_current_service_levels(db);
schema_builder b(schema);
for (const auto& col : new_service_levels_columns(workload_prioritization_enabled)) {
auto& [col_name, col_type] = col;
bytes options_name = to_bytes(col_name.data());
if (schema->get_column_definition(options_name)) {
continue;
}
b.with_column(options_name, col_type, column_kind::regular_column);
}
b.with_hash_version();
return b.build();
}
future<> system_distributed_keyspace::create_tables(std::vector<schema_ptr> tables) {
if (this_shard_id() != 0) {
_started = true;
@@ -243,11 +134,9 @@ future<> system_distributed_keyspace::create_tables(std::vector<schema_ptr> tabl
while (true) {
// Check if there is any work to do before taking the group 0 guard.
bool workload_prioritization_enabled = _sp.features().workload_prioritization;
bool keyspaces_setup = db.has_keyspace(NAME) && db.has_keyspace(NAME_EVERYWHERE);
bool keyspaces_setup = db.has_keyspace(NAME);
bool tables_setup = std::all_of(tables.begin(), tables.end(), [db] (schema_ptr t) { return db.has_schema(t->ks_name(), t->cf_name()); } );
bool service_levels_up_to_date = get_current_service_levels(db)->equal_columns(*get_updated_service_levels(db, workload_prioritization_enabled));
if (keyspaces_setup && tables_setup && service_levels_up_to_date) {
if (keyspaces_setup && tables_setup) {
dlogger.info("system_distributed(_everywhere) keyspaces and tables are up-to-date. Not creating");
_started = true;
co_return;
@@ -258,51 +147,25 @@ future<> system_distributed_keyspace::create_tables(std::vector<schema_ptr> tabl
utils::chunked_vector<mutation> mutations;
sstring description;
auto sd_ksm = keyspace_metadata::new_keyspace(
auto ksm = keyspace_metadata::new_keyspace(
NAME,
"org.apache.cassandra.locator.SimpleStrategy",
{{"replication_factor", "3"}},
std::nullopt, std::nullopt);
if (!db.has_keyspace(NAME)) {
mutations = service::prepare_new_keyspace_announcement(db.real_database(), sd_ksm, ts);
mutations = service::prepare_new_keyspace_announcement(db.real_database(), ksm, ts);
description += format(" create {} keyspace;", NAME);
} else {
dlogger.info("{} keyspace is already present. Not creating", NAME);
}
auto sde_ksm = keyspace_metadata::new_keyspace(
NAME_EVERYWHERE,
"org.apache.cassandra.locator.EverywhereStrategy",
{},
std::nullopt, std::nullopt);
if (!db.has_keyspace(NAME_EVERYWHERE)) {
auto sde_mutations = service::prepare_new_keyspace_announcement(db.real_database(), sde_ksm, ts);
std::move(sde_mutations.begin(), sde_mutations.end(), std::back_inserter(mutations));
description += format(" create {} keyspace;", NAME_EVERYWHERE);
} else {
dlogger.info("{} keyspace is already present. Not creating", NAME_EVERYWHERE);
}
// Get mutations for creating and updating tables.
// Get mutations for creating tables.
auto num_keyspace_mutations = mutations.size();
co_await coroutine::parallel_for_each(ensured_tables(),
[this, &mutations, db, ts, sd_ksm, sde_ksm, workload_prioritization_enabled] (auto&& table) -> future<> {
auto ksm = table->ks_name() == NAME ? sd_ksm : sde_ksm;
// Ensure that the service_levels table contains new columns.
if (table->cf_name() == SERVICE_LEVELS) {
table = get_updated_service_levels(db, workload_prioritization_enabled);
}
[this, &mutations, db, ts, ksm] (auto&& table) -> future<> {
if (!db.has_schema(table->ks_name(), table->cf_name())) {
co_return co_await service::prepare_new_column_family_announcement(mutations, _sp, *ksm, std::move(table), ts);
}
// The service_levels table exists. Update it if it lacks new columns.
if (table->cf_name() == SERVICE_LEVELS && !get_current_service_levels(db)->equal_columns(*table)) {
auto update_mutations = co_await service::prepare_column_family_update_announcement(_sp, table, std::vector<view_ptr>(), ts);
std::move(update_mutations.begin(), update_mutations.end(), std::back_inserter(mutations));
}
});
if (mutations.size() > num_keyspace_mutations) {
description += " create and update system_distributed(_everywhere) tables";
@@ -324,15 +187,6 @@ future<> system_distributed_keyspace::create_tables(std::vector<schema_ptr> tabl
}
}
future<> system_distributed_keyspace::start_workload_prioritization() {
if (this_shard_id() != 0) {
co_return;
}
if (_qp.db().features().workload_prioritization) {
co_await create_tables({get_updated_service_levels(_qp.db(), true)});
}
}
future<> system_distributed_keyspace::start() {
if (this_shard_id() != 0) {
_started = true;
@@ -375,90 +229,6 @@ static db::consistency_level quorum_if_many(size_t num_token_owners) {
return num_token_owners > 1 ? db::consistency_level::QUORUM : db::consistency_level::ONE;
}
future<>
system_distributed_keyspace::insert_cdc_generation(
utils::UUID id,
const cdc::topology_description& desc,
context ctx) {
using namespace std::chrono_literals;
const size_t concurrency = 10;
const size_t num_replicas = ctx.num_token_owners;
// To insert the data quickly and efficiently we send it in batches of multiple rows
// (each batch represented by a single mutation). We also send multiple such batches concurrently.
// However, we need to limit the memory consumption of the operation.
// I assume that the memory consumption grows linearly with the number of replicas
// (we send to all replicas ``at the same time''), with the batch size (the data must
// be copied for each replica?) and with concurrency. These assumptions may be too conservative
// but that won't hurt in a significant way (it may hurt the efficiency of the operation a little).
// Thus, if we want to limit the memory consumption to L, it should be true that
// mutation_size * num_replicas * concurrency <= L, hence
// mutation_size <= L / (num_replicas * concurrency).
// For example, say L = 10MB, concurrency = 10, num_replicas = 100; we get
// mutation_size <= 10MB / 1000 = 10KB.
// On the other hand we must have mutation_size >= size of a single row,
// so we will use mutation_size <= max(size of single row, L/(num_replicas*concurrency)).
// It has been tested that sending 1MB batches to 3 replicas with concurrency 20 works OK,
// which would correspond to L ~= 60MB. Hence that's the limit we use here.
const size_t L = 60'000'000;
const auto mutation_size_threshold = std::max(size_t(1), L / (num_replicas * concurrency));
auto s = _qp.db().real_database().find_schema(
system_distributed_keyspace::NAME_EVERYWHERE, system_distributed_keyspace::CDC_GENERATIONS_V2);
auto ms = co_await cdc::get_cdc_generation_mutations_v2(s, id, desc, mutation_size_threshold, api::new_timestamp());
co_await max_concurrent_for_each(ms, concurrency, [&] (mutation& m) -> future<> {
co_await _sp.mutate(
{ std::move(m) },
db::consistency_level::ALL,
db::timeout_clock::now() + 60s,
nullptr, // trace_state
empty_service_permit(),
db::allow_per_partition_rate_limit::no,
false // raw_counters
);
});
}
future<std::optional<cdc::topology_description>>
system_distributed_keyspace::read_cdc_generation(utils::UUID id) {
utils::chunked_vector<cdc::token_range_description> entries;
size_t num_ranges = 0;
co_await _qp.query_internal(
// This should be a local read so 20s should be more than enough
format("SELECT range_end, streams, ignore_msb, num_ranges FROM {}.{} WHERE id = ? USING TIMEOUT 20s", NAME_EVERYWHERE, CDC_GENERATIONS_V2),
db::consistency_level::ONE, // we wrote the generation with ALL so ONE must see it (or there's something really wrong)
{ id },
1000, // for ~1KB rows, ~1MB page size
[&] (const cql3::untyped_result_set_row& row) {
std::vector<cdc::stream_id> streams;
row.get_list_data<bytes>("streams", std::back_inserter(streams));
entries.push_back(cdc::token_range_description{
dht::token::from_int64(row.get_as<int64_t>("range_end")),
std::move(streams),
uint8_t(row.get_as<int8_t>("ignore_msb"))});
num_ranges = row.get_as<int32_t>("num_ranges");
return make_ready_future<stop_iteration>(stop_iteration::no);
});
if (entries.empty()) {
co_return std::nullopt;
}
// Paranoid sanity check. Partial reads should not happen since generations should be retrieved only after they
// were written successfully with CL=ALL. But nobody uses EverywhereStrategy tables so they weren't ever properly
// tested, so just in case...
if (entries.size() != num_ranges) {
throw std::runtime_error(format(
"read_cdc_generation: wrong number of rows. The `num_ranges` column claimed {} rows,"
" but reading the partition returned {}.", num_ranges, entries.size()));
}
co_return std::optional{cdc::topology_description(std::move(entries))};
}
static future<utils::chunked_vector<mutation>> get_cdc_streams_descriptions_v2_mutation(
const replica::database& db,
db_clock::time_point time,
@@ -630,65 +400,4 @@ system_distributed_keyspace::cdc_current_generation_timestamp(context ctx) {
co_return timestamp_cql->one().get_as<db_clock::time_point>("time");
}
future<qos::service_levels_info> system_distributed_keyspace::get_service_levels(qos::query_context ctx) const {
return qos::get_service_levels(_qp, NAME, SERVICE_LEVELS, db::consistency_level::ONE, ctx);
}
future<qos::service_levels_info> system_distributed_keyspace::get_service_level(sstring service_level_name) const {
return qos::get_service_level(_qp, NAME, SERVICE_LEVELS, service_level_name, db::consistency_level::ONE);
}
future<> system_distributed_keyspace::set_service_level(sstring service_level_name, qos::service_level_options slo) const {
static sstring prepared_query = format("INSERT INTO {}.{} (service_level) VALUES (?);", NAME, SERVICE_LEVELS);
co_await _qp.execute_internal(prepared_query, db::consistency_level::ONE, internal_distributed_query_state(), {service_level_name}, cql3::query_processor::cache_internal::no);
auto to_data_value = [&] (const qos::service_level_options::timeout_type& tv) {
return std::visit(overloaded_functor {
[&] (const qos::service_level_options::unset_marker&) {
return data_value::make_null(duration_type);
},
[&] (const qos::service_level_options::delete_marker&) {
return data_value::make_null(duration_type);
},
[&] (const lowres_clock::duration& d) {
return data_value(cql_duration(months_counter{0},
days_counter{0},
nanoseconds_counter{std::chrono::duration_cast<std::chrono::nanoseconds>(d).count()}));
},
}, tv);
};
auto to_data_value_g = [&] <typename T> (const std::variant<qos::service_level_options::unset_marker, qos::service_level_options::delete_marker, T>& v) {
return std::visit(overloaded_functor {
[&] (const qos::service_level_options::unset_marker&) {
return data_value::make_null(data_type_for<T>());
},
[&] (const qos::service_level_options::delete_marker&) {
return data_value::make_null(data_type_for<T>());
},
[&] (const T& v) {
return data_value(v);
},
}, v);
};
data_value workload = slo.workload == qos::service_level_options::workload_type::unspecified
? data_value::make_null(utf8_type)
: data_value(qos::service_level_options::to_string(slo.workload));
co_await _qp.execute_internal(format("UPDATE {}.{} SET timeout = ?, workload_type = ? WHERE service_level = ?;", NAME, SERVICE_LEVELS),
db::consistency_level::ONE,
internal_distributed_query_state(),
{to_data_value(slo.timeout),
workload,
service_level_name},
cql3::query_processor::cache_internal::no);
co_await _qp.execute_internal(format("UPDATE {}.{} SET shares = ? WHERE service_level = ?;", NAME, SERVICE_LEVELS),
db::consistency_level::ONE,
internal_distributed_query_state(),
{to_data_value_g(slo.shares), service_level_name},
cql3::query_processor::cache_internal::no);
}
future<> system_distributed_keyspace::drop_service_level(sstring service_level_name) const {
static sstring prepared_query = format("DELETE FROM {}.{} WHERE service_level= ?;", NAME, SERVICE_LEVELS);
return _qp.execute_internal(prepared_query, db::consistency_level::ONE, internal_distributed_query_state(), {service_level_name}, cql3::query_processor::cache_internal::no).discard_result();
}
}


@@ -9,9 +9,6 @@
#pragma once
#include "schema/schema_fwd.hh"
#include "service/qos/qos_common.hh"
#include "utils/UUID.hh"
#include "cdc/generation_id.hh"
#include "locator/host_id.hh"
#include <seastar/core/future.hh>
@@ -24,7 +21,6 @@ class query_processor;
}
namespace cdc {
class stream_id;
class topology_description;
class streams_version;
} // namespace cdc
@@ -39,17 +35,8 @@ namespace db {
class system_distributed_keyspace {
public:
static constexpr auto NAME = "system_distributed";
static constexpr auto NAME_EVERYWHERE = "system_distributed_everywhere";
static constexpr auto VIEW_BUILD_STATUS = "view_build_status";
static constexpr auto SERVICE_LEVELS = "service_levels";
/* Nodes use this table to communicate new CDC stream generations to other nodes. */
static constexpr auto CDC_TOPOLOGY_DESCRIPTION = "cdc_generation_descriptions";
/* Nodes use this table to communicate new CDC stream generations to other nodes.
* Resides in system_distributed_everywhere. */
static constexpr auto CDC_GENERATIONS_V2 = "cdc_generation_descriptions_v2";
/* This table is used by CDC clients to learn about available CDC streams. */
static constexpr auto CDC_DESC_V2 = "cdc_streams_descriptions_v2";
@@ -77,19 +64,14 @@ private:
public:
static std::vector<schema_ptr> all_distributed_tables();
static std::vector<schema_ptr> all_everywhere_tables();
system_distributed_keyspace(cql3::query_processor&, service::migration_manager&, service::storage_proxy&);
future<> start();
future<> start_workload_prioritization();
future<> stop();
bool started() const { return _started; }
future<> insert_cdc_generation(utils::UUID, const cdc::topology_description&, context);
future<std::optional<cdc::topology_description>> read_cdc_generation(utils::UUID);
future<> create_cdc_desc(db_clock::time_point, const cdc::topology_description&, context);
future<bool> cdc_desc_exists(db_clock::time_point, context);
@@ -105,11 +87,6 @@ public:
// NOTE: currently used only by alternator
future<db_clock::time_point> cdc_current_generation_timestamp(context);
future<qos::service_levels_info> get_service_levels(qos::query_context ctx) const;
future<qos::service_levels_info> get_service_level(sstring service_level_name) const;
future<> set_service_level(sstring service_level_name, qos::service_level_options slo) const;
future<> drop_service_level(sstring service_level_name) const;
private:
future<> create_tables(std::vector<schema_ptr> tables);
};


@@ -300,6 +300,7 @@ schema_ptr system_keyspace::topology() {
.with_column("upgrade_state", utf8_type, column_kind::static_column)
.with_column("global_requests", set_type_impl::get_instance(timeuuid_type, true), column_kind::static_column)
.with_column("paused_rf_change_requests", set_type_impl::get_instance(timeuuid_type, true), column_kind::static_column)
.with_column("ongoing_rf_changes", set_type_impl::get_instance(timeuuid_type, true), column_kind::static_column)
.set_comment("Current state of topology change machine")
.with_hash_version()
.build();
@@ -3350,6 +3351,12 @@ future<service::topology> system_keyspace::load_topology_state(const std::unorde
}
}
if (some_row.has("ongoing_rf_changes")) {
for (auto&& v : deserialize_set_column(*topology(), some_row, "ongoing_rf_changes")) {
ret.ongoing_rf_changes.insert(value_cast<utils::UUID>(v));
}
}
if (some_row.has("enabled_features")) {
ret.enabled_features = decode_features(deserialize_set_column(*topology(), some_row, "enabled_features"));
}


@@ -10,6 +10,7 @@
#include "db/view/view_update_backlog.hh"
#include "utils/error_injection.hh"
#include "utils/updateable_value.hh"
#include <seastar/core/cacheline.hh>
#include <seastar/core/future.hh>
@@ -41,13 +42,16 @@ class node_update_backlog {
std::chrono::milliseconds _interval;
std::atomic<clock::time_point> _last_update;
std::atomic<update_backlog> _max;
utils::updateable_value<uint32_t> _view_flow_control_delay_limit_in_ms;
public:
explicit node_update_backlog(size_t shards, std::chrono::milliseconds interval)
explicit node_update_backlog(size_t shards, std::chrono::milliseconds interval,
utils::updateable_value<uint32_t> view_flow_control_delay_limit_in_ms = utils::updateable_value<uint32_t>(1000))
: _backlogs(shards)
, _interval(interval)
, _last_update(clock::now() - _interval)
, _max(update_backlog::no_backlog()) {
, _max(update_backlog::no_backlog())
, _view_flow_control_delay_limit_in_ms(std::move(view_flow_control_delay_limit_in_ms)) {
if (utils::get_local_injector().enter("update_backlog_immediately")) {
_interval = std::chrono::milliseconds(0);
_last_update = clock::now();
@@ -59,6 +63,9 @@ public:
update_backlog fetch_shard(unsigned shard);
seastar::future<std::optional<update_backlog>> fetch_if_changed();
std::chrono::microseconds calculate_throttling_delay(update_backlog backlog,
db::timeout_clock::time_point timeout) const;
// Exposed for testing only.
update_backlog load() const {
return _max.load(std::memory_order_relaxed);


@@ -150,14 +150,14 @@ row_locker::unlock(const dht::decorated_key* pk, bool partition_exclusive,
auto pli = _two_level_locks.find(*pk);
if (pli == _two_level_locks.end()) {
// This shouldn't happen... We can't unlock this lock if we can't find it...
mylog.error("column_family::local_base_lock_holder::~local_base_lock_holder() can't find lock for partition", *pk);
mylog.error("column_family::local_base_lock_holder::~local_base_lock_holder() can't find lock for partition {}", *pk);
return;
}
SCYLLA_ASSERT(&pli->first == pk);
if (cpk) {
auto rli = pli->second._row_locks.find(*cpk);
if (rli == pli->second._row_locks.end()) {
mylog.error("column_family::local_base_lock_holder::~local_base_lock_holder() can't find lock for row", *cpk);
mylog.error("column_family::local_base_lock_holder::~local_base_lock_holder() can't find lock for row {}", *cpk);
return;
}
SCYLLA_ASSERT(&rli->first == cpk);


@@ -45,6 +45,7 @@
#include "db/view/view_builder.hh"
#include "db/view/view_updating_consumer.hh"
#include "db/view/view_update_generator.hh"
#include "db/view/node_view_update_backlog.hh"
#include "db/view/regular_column_transformation.hh"
#include "db/system_keyspace_view_types.hh"
#include "db/system_keyspace.hh"
@@ -1584,9 +1585,11 @@ future<stop_iteration> view_update_builder::on_results() {
auto tombstone = std::max(_update_partition_tombstone, _update_current_tombstone);
if (tombstone && _existing && !_existing->is_end_of_partition()) {
// We don't care if it's a range tombstone, as we're only looking for existing entries that get deleted
if (_existing->is_clustering_row()) {
if (_existing->is_range_tombstone_change()) {
_existing_current_tombstone = _existing->as_range_tombstone_change().tombstone();
} else if (_existing->is_clustering_row()) {
auto existing = clustering_row(*_schema, _existing->as_clustering_row());
existing.apply(std::max(_existing_partition_tombstone, _existing_current_tombstone));
auto update = clustering_row(existing.key(), row_tombstone(std::move(tombstone)), row_marker(), ::row());
generate_update(std::move(update), { std::move(existing) });
} else if (_existing->is_static_row()) {
@@ -1597,9 +1600,10 @@ future<stop_iteration> view_update_builder::on_results() {
return should_stop_updates() ? stop() : advance_existings();
}
// If we have updates and it's a range tombstone, it removes nothing pre-existing, so we can ignore it
if (_update && !_update->is_end_of_partition()) {
if (_update->is_clustering_row()) {
if (_update->is_range_tombstone_change()) {
_update_current_tombstone = _update->as_range_tombstone_change().tombstone();
} else if (_update->is_clustering_row()) {
_update->mutate_as_clustering_row(*_schema, [&] (clustering_row& cr) mutable {
cr.apply(std::max(_update_partition_tombstone, _update_current_tombstone));
});
@@ -3489,18 +3493,27 @@ future<> delete_ghost_rows_visitor::do_accept_new_row(partition_key pk, clusteri
}
}
std::chrono::microseconds calculate_view_update_throttling_delay(db::view::update_backlog backlog,
db::timeout_clock::time_point timeout,
uint32_t view_flow_control_delay_limit_in_ms) {
// View updates are asynchronous, and because of this limiting their concurrency requires
// a special approach. The current algorithm places all of the pending view updates in the backlog
// and artificially slows down new responses to coordinator requests based on how full the backlog is.
// This function calculates how much a request should be slowed down based on the backlog's fullness.
// The equation is basically: delay(in seconds) = view_fullness_ratio^3
// The more full the backlog gets the more aggressively the requests are slowed down.
// The delay is limited to the amount of time left until timeout.
// After the timeout the request fails, so there's no point in waiting longer than that.
// The second argument defines this timeout point - we can't delay the request more than this time point.
// See: https://www.scylladb.com/2018/12/04/worry-free-ingestion-flow-control/
std::chrono::microseconds node_update_backlog::calculate_throttling_delay(update_backlog backlog,
db::timeout_clock::time_point timeout) const {
auto adjust = [] (float x) { return x * x * x; };
auto budget = std::max(service::storage_proxy::clock_type::duration(0),
timeout - service::storage_proxy::clock_type::now());
std::chrono::microseconds ret(uint32_t(adjust(backlog.relative_size()) * view_flow_control_delay_limit_in_ms * 1000));
auto budget = std::max(db::timeout_clock::duration(0),
timeout - db::timeout_clock::now());
std::chrono::microseconds ret(uint32_t(adjust(backlog.relative_size()) * _view_flow_control_delay_limit_in_ms() * 1000));
// "budget" has millisecond resolution and can potentially be long
// in the future so converting it to microseconds may overflow.
// So to compare budget and ret we need to convert both to the lower
// resolution.
if (std::chrono::duration_cast<service::storage_proxy::clock_type::duration>(ret) < budget) {
if (std::chrono::duration_cast<db::timeout_clock::duration>(ret) < budget) {
return ret;
} else {
// budget is small (< ret) so can be converted to microseconds
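The throttling rule described in the comment above (delay grows with the cube of the backlog fullness and is capped by the time left until the timeout) can be illustrated with a stand-alone sketch. This is not the code from the patch: `sketch_throttling_delay` and its parameters are hypothetical names, the clock is replaced by an explicit `budget` argument, and the fallback branch (returning the remaining budget) is an assumption based on the truncated comment at the end of the hunk.

```cpp
#include <algorithm>
#include <chrono>
#include <cstdint>

// Hypothetical stand-alone version of the computation, for illustration only.
//   relative_size  - backlog fullness, expected in [0, 1]
//   delay_limit_ms - delay applied when the backlog is completely full
//   budget         - time left until the request times out
std::chrono::microseconds sketch_throttling_delay(float relative_size,
        uint32_t delay_limit_ms,
        std::chrono::milliseconds budget) {
    // Cubic curve: small backlogs cause almost no delay, full ones the maximum.
    auto adjust = [] (float x) { return x * x * x; };
    // A timeout that already passed leaves no budget at all.
    budget = std::max(budget, std::chrono::milliseconds(0));
    std::chrono::microseconds ret(uint32_t(adjust(relative_size) * delay_limit_ms * 1000));
    // Compare at millisecond resolution so that a very large budget
    // is never converted to microseconds.
    if (std::chrono::duration_cast<std::chrono::milliseconds>(ret) < budget) {
        return ret;
    }
    // Assumption: when the budget is smaller than the computed delay,
    // the delay is clamped to the budget.
    return std::chrono::duration_cast<std::chrono::microseconds>(budget);
}
```

With a half-full backlog and a 1000 ms limit the sketch yields 0.5^3 * 1000 ms = 125 ms, provided at least that much time remains before the timeout; with a full backlog and only 200 ms left it yields 200 ms.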


@@ -715,7 +715,7 @@ future<> view_building_worker::do_build_range(table_id base_id, std::vector<tabl
vbw_logger.info("Building range {} for base table {} and views {} was aborted.", range, base_id, views_ids);
} catch (...) {
eptr = std::current_exception();
vbw_logger.warn("Error during processing range {} for base table {} and views {}: ", range, base_id, views_ids, eptr);
vbw_logger.warn("Error during processing range {} for base table {} and views {}: {}", range, base_id, views_ids, eptr);
}
reader.close().get();


@@ -43,7 +43,7 @@ public:
// Returns the number of bytes in the backlog divided by the maximum number of bytes
// that the backlog can hold before employing admission control. While the backlog
// is below the threshold, the coordinator will slow down the view updates up to
// calculate_view_update_throttling_delay()::delay_limit_us. Above the threshold,
// node_update_backlog::calculate_throttling_delay()::delay_limit_us. Above the threshold,
// the coordinator will reject the writes that would increase the backlog. On the
// replica, the writes will start failing only after reaching the hard limit '_max'.
float relative_size() const {
@@ -70,18 +70,4 @@ public:
}
};
// View updates are asynchronous, and because of this limiting their concurrency requires
// a special approach. The current algorithm places all of the pending view updates in the backlog
// and artificially slows down new responses to coordinator requests based on how full the backlog is.
// This function calculates how much a request should be slowed down based on the backlog's fullness.
// The equation is basically: delay(in seconds) = view_fullness_ratio^3
// The more full the backlog gets the more aggressively the requests are slowed down.
// The delay is limited to the amount of time left until timeout.
// After the timeout the request fails, so there's no point in waiting longer than that.
// The second argument defines this timeout point - we can't delay the request more than this time point.
// See: https://www.scylladb.com/2018/12/04/worry-free-ingestion-flow-control/
std::chrono::microseconds calculate_view_update_throttling_delay(
update_backlog backlog,
db::timeout_clock::time_point timeout,
uint32_t view_flow_control_delay_limit_in_ms);
}


@@ -7,6 +7,7 @@
*/
#include "db/view/view_update_backlog.hh"
#include "db/view/node_view_update_backlog.hh"
#include <seastar/core/timed_out_error.hh>
#include "gms/inet_address.hh"
#include <seastar/util/defer.hh>
@@ -95,9 +96,10 @@ public:
}
};
view_update_generator::view_update_generator(replica::database& db, sharded<service::storage_proxy>& proxy, abort_source& as)
view_update_generator::view_update_generator(replica::database& db, sharded<service::storage_proxy>& proxy, node_update_backlog& node_backlog, abort_source& as)
: _db(db)
, _proxy(proxy)
, _node_update_backlog(node_backlog)
, _progress_tracker(std::make_unique<progress_tracker>())
, _early_abort_subscription(as.subscribe([this] () noexcept { do_abort(); }))
{
@@ -112,7 +114,7 @@ future<> view_update_generator::start() {
_started = seastar::async([this]() mutable {
auto drop_sstable_references = defer([&] () noexcept {
// Clear sstable references so sstables_manager::stop() doesn't hang.
vug_logger.info("leaving {} unstaged sstables unprocessed",
vug_logger.info("leaving {} unstaged sstables and {} sstables with tables unprocessed",
_sstables_to_move.size(), _sstables_with_tables.size());
_sstables_to_move.clear();
_sstables_with_tables.clear();
@@ -240,6 +242,9 @@ future<> view_update_generator::process_staging_sstables(lw_shared_ptr<replica::
_progress_tracker->on_sstable_registration(sst);
}
utils::get_local_injector().inject("view_update_generator_pause_before_processing",
utils::wait_for_message(std::chrono::minutes(5))).get();
// Generate view updates from staging sstables
auto start_time = db_clock::now();
auto [result, input_size] = generate_updates_from_staging_sstables(table, sstables);
@@ -495,7 +500,7 @@ future<> view_update_generator::generate_and_propagate_view_updates(const replic
// the one which limits the number of incoming client requests by delaying the response to the client.
if (batch_num > 0) {
update_backlog local_backlog = _db.get_view_update_backlog();
std::chrono::microseconds throttle_delay = calculate_view_update_throttling_delay(local_backlog, timeout, _db.get_config().view_flow_control_delay_limit_in_ms());
std::chrono::microseconds throttle_delay = _node_update_backlog.calculate_throttling_delay(local_backlog, timeout);
co_await seastar::sleep(throttle_delay);


@@ -52,6 +52,7 @@ using allow_hints = bool_class<allow_hints_tag>;
namespace db::view {
class node_update_backlog;
class stats;
struct wait_for_all_updates_tag {};
using wait_for_all_updates = bool_class<wait_for_all_updates_tag>;
@@ -63,6 +64,7 @@ public:
private:
replica::database& _db;
sharded<service::storage_proxy>& _proxy;
node_update_backlog& _node_update_backlog;
seastar::abort_source _as;
future<> _started = make_ready_future<>();
seastar::condition_variable _pending_sstables;
@@ -75,7 +77,7 @@ private:
optimized_optional<abort_source::subscription> _early_abort_subscription;
void do_abort() noexcept;
public:
view_update_generator(replica::database& db, sharded<service::storage_proxy>& proxy, abort_source& as);
view_update_generator(replica::database& db, sharded<service::storage_proxy>& proxy, node_update_backlog& node_backlog, abort_source& as);
~view_update_generator();
future<> start();

dist/CMakeLists.txt vendored

@@ -141,4 +141,72 @@ add_dependencies(dist
dist-python3
dist-server)
set(dist_rpm_dir "${CMAKE_BINARY_DIR}/$<CONFIG>/dist/rpm")
set(dist_deb_dir "${CMAKE_BINARY_DIR}/$<CONFIG>/dist/deb")
# Map system processor to Debian architecture names
if(CMAKE_SYSTEM_PROCESSOR STREQUAL "x86_64")
set(deb_arch "amd64")
elseif(CMAKE_SYSTEM_PROCESSOR STREQUAL "aarch64")
set(deb_arch "arm64")
else()
message(FATAL_ERROR "Unsupported architecture: ${CMAKE_SYSTEM_PROCESSOR}")
endif()
set(rpm_ver "${Scylla_VERSION}-${Scylla_RELEASE}")
set(deb_ver "${Scylla_VERSION}-${Scylla_RELEASE}-1")
set(rpm_arch "${CMAKE_SYSTEM_PROCESSOR}")
set(server_rpms_dir "${CMAKE_CURRENT_BINARY_DIR}/$<CONFIG>/redhat/RPMS/${rpm_arch}")
set(server_rpms
"${server_rpms_dir}/${Scylla_PRODUCT}-${rpm_ver}.${rpm_arch}.rpm"
"${server_rpms_dir}/${Scylla_PRODUCT}-server-${rpm_ver}.${rpm_arch}.rpm"
"${server_rpms_dir}/${Scylla_PRODUCT}-server-debuginfo-${rpm_ver}.${rpm_arch}.rpm"
"${server_rpms_dir}/${Scylla_PRODUCT}-conf-${rpm_ver}.${rpm_arch}.rpm"
"${server_rpms_dir}/${Scylla_PRODUCT}-kernel-conf-${rpm_ver}.${rpm_arch}.rpm"
"${server_rpms_dir}/${Scylla_PRODUCT}-node-exporter-${rpm_ver}.${rpm_arch}.rpm")
set(cqlsh_rpms
"${CMAKE_SOURCE_DIR}/tools/cqlsh/build/redhat/RPMS/${rpm_arch}/${Scylla_PRODUCT}-cqlsh-${rpm_ver}.${rpm_arch}.rpm")
set(python3_rpms
"${CMAKE_SOURCE_DIR}/tools/python3/build/redhat/RPMS/${rpm_arch}/${Scylla_PRODUCT}-python3-${rpm_ver}.${rpm_arch}.rpm")
set(server_debs_dir "${CMAKE_CURRENT_BINARY_DIR}/$<CONFIG>/debian")
set(server_debs
"${server_debs_dir}/${Scylla_PRODUCT}_${deb_ver}_${deb_arch}.deb"
"${server_debs_dir}/${Scylla_PRODUCT}-server_${deb_ver}_${deb_arch}.deb"
"${server_debs_dir}/${Scylla_PRODUCT}-server-dbg_${deb_ver}_${deb_arch}.deb"
"${server_debs_dir}/${Scylla_PRODUCT}-conf_${deb_ver}_${deb_arch}.deb"
"${server_debs_dir}/${Scylla_PRODUCT}-kernel-conf_${deb_ver}_${deb_arch}.deb"
"${server_debs_dir}/${Scylla_PRODUCT}-node-exporter_${deb_ver}_${deb_arch}.deb"
"${server_debs_dir}/scylla-enterprise_${deb_ver}_all.deb"
"${server_debs_dir}/scylla-enterprise-server_${deb_ver}_all.deb"
"${server_debs_dir}/scylla-enterprise-conf_${deb_ver}_all.deb"
"${server_debs_dir}/scylla-enterprise-kernel-conf_${deb_ver}_all.deb"
"${server_debs_dir}/scylla-enterprise-node-exporter_${deb_ver}_all.deb")
set(cqlsh_debs
"${CMAKE_SOURCE_DIR}/tools/cqlsh/build/debian/${Scylla_PRODUCT}-cqlsh_${deb_ver}_${deb_arch}.deb"
"${CMAKE_SOURCE_DIR}/tools/cqlsh/build/debian/scylla-enterprise-cqlsh_${deb_ver}_all.deb")
set(python3_debs
"${CMAKE_SOURCE_DIR}/tools/python3/build/debian/${Scylla_PRODUCT}-python3_${deb_ver}_${deb_arch}.deb"
"${CMAKE_SOURCE_DIR}/tools/python3/build/debian/scylla-enterprise-python3_${deb_ver}_all.deb")
add_custom_target(collect-dist-rpm
COMMAND ${CMAKE_COMMAND} -E rm -rf ${dist_rpm_dir}
COMMAND ${CMAKE_COMMAND} -E make_directory ${dist_rpm_dir}
COMMAND ${CMAKE_COMMAND} -E copy ${server_rpms} ${cqlsh_rpms} ${python3_rpms} ${dist_rpm_dir}/
DEPENDS dist
WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
COMMENT "Collecting RPMs into ${dist_rpm_dir}")
add_custom_target(collect-dist-deb
COMMAND ${CMAKE_COMMAND} -E rm -rf ${dist_deb_dir}
COMMAND ${CMAKE_COMMAND} -E make_directory ${dist_deb_dir}
COMMAND ${CMAKE_COMMAND} -E copy ${server_debs} ${cqlsh_debs} ${python3_debs} ${dist_deb_dir}/
DEPENDS dist
WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
COMMENT "Collecting DEBs into ${dist_deb_dir}")
add_custom_target(collect-dist
DEPENDS collect-dist-rpm collect-dist-deb)
add_subdirectory(debuginfo)

View File

@@ -324,6 +324,13 @@ experimental:
stream events. Without this option, such no-op operations may still
generate spurious stream events.
<https://github.com/scylladb/scylladb/issues/28368>
* When a stream is disabled, no new records are written but the existing
stream data is preserved and remains readable through its original
StreamArn. The data expires via TTL after 24 hours. Re-enabling the
stream purges the old data immediately and produces a new StreamArn.
In contrast, DynamoDB keeps the old stream and its data readable for
24 hours through the old StreamArn even after re-enabling.
<https://scylladb.atlassian.net/browse/SCYLLADB-1873>
## Unimplemented API features

View File

@@ -415,7 +415,7 @@ An empty list is allowed, and it's equivalent to numeric replication factor of 0
.. code-block:: cql
ALTER KEYSPACE Excelsior
WITH replication = { 'class' : 'NetworkTopologyStrategy', dc2' : []};
WITH replication = { 'class' : 'NetworkTopologyStrategy', 'dc2' : []};
Altering from a rack list to a numeric replication factor is not supported.
@@ -1017,11 +1017,11 @@ For example:
CREATE TABLE customer_data (
cust_id uuid,
cust_first-name text,
cust_last-name text,
"cust_first-name" text,
"cust_last-name" text,
cust_phone text,
cust_get-sms text,
PRIMARY KEY (customer_id)
"cust_get-sms" text,
PRIMARY KEY (cust_id)
) WITH cdc = { 'enabled' : 'true', 'preimage' : 'true' };
.. _cql-caching-options:

View File

@@ -24,7 +24,8 @@ For example:
INSERT INTO NerdMovies (movie, director, main_actor, year)
VALUES ('Serenity', 'Joss Whedon', 'Nathan Fillion', 2005)
USING TTL 86400 IF NOT EXISTS;
IF NOT EXISTS
USING TTL 86400;
The ``INSERT`` statement writes one or more columns for a given row in a table. Note that since a row is identified by
its ``PRIMARY KEY``, at least the columns composing it must be specified. The list of columns to insert to must be

View File

@@ -507,7 +507,7 @@ For example::
CREATE TABLE superheroes (
name frozen<full_name> PRIMARY KEY,
home address
home frozen<address>
);
.. note::

View File

@@ -271,7 +271,7 @@ The json structure is as follows:
}
The `manifest` member contains the following attributes:
- `version` - respresenting the version of the manifest itself. It is incremented when members are added or removed from the manifest.
- `version` - representing the version of the manifest itself. It is incremented when members are added or removed from the manifest.
- `scope` - the scope of metadata stored in this manifest file. The following scopes are supported:
- `node` - the manifest describes all SSTables owned by this node in this snapshot.

View File

@@ -12,7 +12,9 @@ Schema:
CREATE TABLE system_schema.keyspaces (
keyspace_name text PRIMARY KEY,
durable_writes boolean,
replication frozen<map<text, text>>
replication frozen<map<text, text>>,
replication_v2 frozen<map<text, text>>,
next_replication frozen<map<text, text>>
)
```
@@ -31,6 +33,8 @@ Columns:
stored as a flattened map of the extended options map (see below).
For `SimpleStrategy` there is a single option `"replication_factor"` specifying the replication factor.
* `next_replication` - the target replication settings for the keyspace during an RF change.
If no RF change is in progress, `next_replication` is not set.
Extended options map used by NetworkTopologyStrategy is a map where values can be either strings or lists of strings.

View File

@@ -146,6 +146,25 @@ AWS Security Token Service (STS) or the EC2 Instance Metadata Service.
- When set, these values are used by the S3 client to sign requests.
- If not set, requests are sent unsigned, which may not be accepted by all servers.
.. _admin-oci-object-storage:
Using Oracle OCI Object Storage
=================================
Oracle Cloud Infrastructure (OCI) Object Storage is compatible with the Amazon
S3 API, so it works with ScyllaDB without additional configuration.
To use OCI Object Storage, follow the same configuration as for AWS S3, and
specify your OCI S3-compatible endpoint.
Example:
.. code:: yaml
object_storage_endpoints:
- name: https://idedxcgnkfkt.compat.objectstorage.us-ashburn-1.oci.customer-oci.com:443
aws_region: us-ashburn-1
.. _admin-compression:
Compression

View File

@@ -231,6 +231,46 @@ Add New DC
Consider :ref:`upgrading rf_rack_valid_keyspaces option to enforce_rack_list option <keyspace-rf-rack-valid-to-enforce-rack-list>` to ensure all tablet keyspaces use rack lists.
If the keyspace uses rack list replication, update the replication factor in one ``ALTER KEYSPACE`` statement, under the following rules:
* Existing datacenters must keep their current replication factor.
* A new datacenter can be assigned a replication factor (**0 to N**).
* An existing datacenter can be removed (**N to 0**).
.. warning::
While adding a new datacenter and altering keyspaces, do **not** perform any reads or writes that involve the new datacenter.
In particular, avoid using global consistency levels (such as ``ALL``, ``EACH_QUORUM``) that would include the new datacenter in the operation.
Use ``LOCAL_*`` consistency levels (e.g., ``LOCAL_QUORUM``, ``LOCAL_ONE``) until the new datacenter is fully operational.
Before
.. code-block:: cql
DESCRIBE KEYSPACE mykeyspace4;
CREATE KEYSPACE mykeyspace4 WITH replication = { 'class' : 'NetworkTopologyStrategy', '<existing_dc>' : ['<existing_rack1>', '<existing_rack2>', '<existing_rack3>']} AND tablets = { 'enabled': true };
The following is **not** allowed because it changes the replication factor of ``<existing_dc>`` (adds ``<existing_rack4>``) and adds ``<new_dc>`` in the same statement:
.. code-block:: cql
ALTER KEYSPACE mykeyspace4 WITH replication = { 'class' : 'NetworkTopologyStrategy', '<existing_dc>' : ['<existing_rack1>', '<existing_rack2>', '<existing_rack3>', '<existing_rack4>'], '<new_dc>' : ['<new_rack1>', '<new_rack2>', '<new_rack3>']} AND tablets = { 'enabled': true };
Add all the nodes to the new datacenter and then:
.. code-block:: cql
ALTER KEYSPACE mykeyspace4 WITH replication = { 'class' : 'NetworkTopologyStrategy', '<existing_dc>' : ['<existing_rack1>', '<existing_rack2>', '<existing_rack3>'], '<new_dc>' : ['<new_rack1>', '<new_rack2>', '<new_rack3>']} AND tablets = { 'enabled': true };
After
.. code-block:: cql
DESCRIBE KEYSPACE mykeyspace4;
CREATE KEYSPACE mykeyspace4 WITH REPLICATION = {'class': 'NetworkTopologyStrategy', '<existing_dc>' : ['<existing_rack1>', '<existing_rack2>', '<existing_rack3>'], '<new_dc>' : ['<new_rack1>', '<new_rack2>', '<new_rack3>']} AND tablets = { 'enabled': true };
You can abort the keyspace alteration using :doc:`Task manager </operating-scylla/admin-tools/task-manager>`.
#. If any vnode keyspace was altered, run ``nodetool rebuild`` on each node in the new datacenter, specifying the existing datacenter name in the rebuild command.
For example:

View File

@@ -102,6 +102,34 @@ Procedure
Consider :ref:`upgrading rf_rack_valid_keyspaces option to enforce_rack_list option <keyspace-rf-rack-valid-to-enforce-rack-list>` to ensure all tablet keyspaces use rack lists.
If the keyspace uses rack list replication, update the replication factor in one ``ALTER KEYSPACE`` statement, under the following rules:
* Existing datacenters must keep their current replication factor.
* An existing datacenter can be removed (**N to 0**).
* A new datacenter can be assigned a replication factor (**0 to N**).
.. warning::
While removing a datacenter and altering keyspaces, do **not** perform any reads or writes that involve the datacenter being removed.
In particular, avoid using global consistency levels (such as ``ALL``, ``EACH_QUORUM``) that would include the decommissioned datacenter in the operation.
Use ``LOCAL_*`` consistency levels (e.g., ``LOCAL_QUORUM``, ``LOCAL_ONE``) until the datacenter is fully decommissioned.
.. code-block:: shell
cqlsh> DESCRIBE nba4
cqlsh> CREATE KEYSPACE nba4 WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'US-DC' : ['RAC1', 'RAC2', 'RAC3'], 'ASIA-DC' : ['RAC4', 'RAC5'], 'EUROPE-DC' : ['RAC6', 'RAC7', 'RAC8']} AND tablets = { 'enabled': true };
The following is **not** allowed because it changes the replication factor of ``EUROPE-DC`` (adds ``RAC9``) and removes ``ASIA-DC`` in the same statement:
.. code-block:: shell
cqlsh> ALTER KEYSPACE nba4 WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'US-DC' : ['RAC1', 'RAC2', 'RAC3'], 'ASIA-DC' : [], 'EUROPE-DC' : ['RAC6', 'RAC7', 'RAC8', 'RAC9']} AND tablets = { 'enabled': true };
Remove all replicas from the decommissioned datacenter:
.. code-block:: shell
cqlsh> ALTER KEYSPACE nba4 WITH REPLICATION = {'class' : 'NetworkTopologyStrategy', 'US-DC' : ['RAC1', 'RAC2', 'RAC3'], 'ASIA-DC' : [], 'EUROPE-DC' : ['RAC6', 'RAC7', 'RAC8']} AND tablets = { 'enabled': true };
.. note::
If table audit is enabled, the ``audit`` keyspace is automatically created with ``NetworkTopologyStrategy``.
@@ -113,6 +141,10 @@ Procedure
Failure to do so will result in decommission errors such as "zero replica after the removal".
.. warning::
Removal of replicas from a datacenter cannot be aborted. To get back to the previous replication, wait until the ALTER KEYSPACE finishes and then add the replicas back by running another ALTER KEYSPACE statement.
#. Run :doc:`nodetool decommission </operating-scylla/nodetool-commands/decommission>` on every node in the data center that is to be removed.
Refer to :doc:`Remove a Node from a ScyllaDB Cluster - Down Scale </operating-scylla/procedures/cluster-management/remove-node>` for further information.

View File

@@ -4,7 +4,7 @@ Upgrade ScyllaDB
.. toctree::
ScyllaDB 2025.x to ScyllaDB 2026.1 <upgrade-guide-from-2025.x-to-2026.1/index>
ScyllaDB 2026.1 to ScyllaDB 2026.2 <upgrade-guide-from-2026.1-to-2026.2/index>
ScyllaDB 2026.x Patch Upgrades <upgrade-guide-from-2026.x.y-to-2026.x.z>
ScyllaDB Image <ami-upgrade>

View File

@@ -1,13 +0,0 @@
==========================================================
Upgrade - ScyllaDB 2025.x to ScyllaDB 2026.1
==========================================================
.. toctree::
:maxdepth: 2
:hidden:
Upgrade ScyllaDB <upgrade-guide-from-2025.x-to-2026.1>
Metrics Update <metric-update-2025.x-to-2026.1>
* :doc:`Upgrade from ScyllaDB 2025.x to ScyllaDB 2026.1 <upgrade-guide-from-2025.x-to-2026.1>`
* :doc:`Metrics Update Between 2025.x and 2026.1 <metric-update-2025.x-to-2026.1>`

View File

@@ -1,82 +0,0 @@
.. |SRC_VERSION| replace:: 2025.x
.. |NEW_VERSION| replace:: 2026.1
.. |PRECEDING_VERSION| replace:: 2025.4
================================================================
Metrics Update Between |SRC_VERSION| and |NEW_VERSION|
================================================================
.. toctree::
:maxdepth: 2
:hidden:
ScyllaDB |NEW_VERSION| Dashboards are available as part of the latest |mon_root|.
New Metrics in |NEW_VERSION|
--------------------------------------
The following metrics are new in ScyllaDB |NEW_VERSION| compared to |PRECEDING_VERSION|.
.. list-table::
:widths: 25 150
:header-rows: 1
* - Metric
- Description
* - scylla_alternator_operation_size_kb
- Histogram of item sizes involved in a request.
* - scylla_column_family_total_disk_space_before_compression
- Hypothetical total disk space used if data files weren't compressed
* - scylla_group_name_auto_repair_enabled_nr
- Number of tablets with auto repair enabled.
* - scylla_group_name_auto_repair_needs_repair_nr
- Number of tablets with auto repair enabled that currently need repair.
* - scylla_lsa_compact_time_ms
- Total time spent on segment compaction that was not accounted under ``reclaim_time_ms``.
* - scylla_lsa_evict_time_ms
- Total time spent on evicting objects that was not accounted under ``reclaim_time_ms``,
* - scylla_lsa_reclaim_time_ms
- Total time spent in reclaiming LSA memory back to std allocator.
* - scylla_object_storage_memory_usage
- Total number of bytes consumed by the object storage client.
* - scylla_tablet_ops_failed
- Number of failed tablet auto repair attempts.
* - scylla_tablet_ops_succeeded
- Number of successful tablet auto repair attempts.
Renamed Metrics in |NEW_VERSION|
--------------------------------------
The following metrics are renamed in ScyllaDB |NEW_VERSION| compared to |PRECEDING_VERSION|.
.. list-table::
:widths: 25 150
:header-rows: 1
* - Metric Name in |PRECEDING_VERSION|
- Metric Name in |NEW_VERSION|
* - scylla_s3_memory_usage
- scylla_object_storage_memory_usage
Removed Metrics in |NEW_VERSION|
--------------------------------------
The following metrics are removed in ScyllaDB |NEW_VERSION|.
* scylla_redis_current_connections
* scylla_redis_op_latency
* scylla_redis_operation
* scylla_redis_operation
* scylla_redis_requests_latency
* scylla_redis_requests_served
* scylla_redis_requests_serving
New and Updated Metrics in Previous Releases
-------------------------------------------------------
* `Metrics Update Between 2025.3 and 2025.4 <https://docs.scylladb.com/manual/branch-2025.4/upgrade/upgrade-guides/upgrade-guide-from-2025.x-to-2025.4/metric-update-2025.x-to-2025.4.html>`_
* `Metrics Update Between 2025.2 and 2025.3 <https://docs.scylladb.com/manual/branch-2025.3/upgrade/upgrade-guides/upgrade-guide-from-2025.2-to-2025.3/metric-update-2025.2-to-2025.3.html>`_
* `Metrics Update Between 2025.1 and 2025.2 <https://docs.scylladb.com/manual/branch-2025.2/upgrade/upgrade-guides/upgrade-guide-from-2025.1-to-2025.2/metric-update-2025.1-to-2025.2.html>`_

View File

@@ -0,0 +1,13 @@
==========================================================
Upgrade - ScyllaDB 2026.1 to ScyllaDB 2026.2
==========================================================
.. toctree::
:maxdepth: 2
:hidden:
Upgrade ScyllaDB <upgrade-guide-from-2026.1-to-2026.2>
Metrics Update <metric-update-2026.1-to-2026.2>
* :doc:`Upgrade from ScyllaDB 2026.1 to ScyllaDB 2026.2 <upgrade-guide-from-2026.1-to-2026.2>`
* :doc:`Metrics Update Between 2026.1 and 2026.2 <metric-update-2026.1-to-2026.2>`

View File

@@ -0,0 +1,126 @@
.. |SRC_VERSION| replace:: 2026.1
.. |NEW_VERSION| replace:: 2026.2
.. |PRECEDING_VERSION| replace:: 2026.1
================================================================
Metrics Update Between |SRC_VERSION| and |NEW_VERSION|
================================================================
.. toctree::
:maxdepth: 2
:hidden:
ScyllaDB |NEW_VERSION| Dashboards are available as part of the latest |mon_root|.
New Metrics in |NEW_VERSION|
--------------------------------------
The following metrics are new in ScyllaDB |NEW_VERSION| compared to |PRECEDING_VERSION|.
.. list-table::
:widths: 25 150
:header-rows: 1
* - Metric
- Description
* - scylla_auth_cache_permissions
- Total number of permission sets currently cached across all roles.
* - scylla_auth_cache_roles
- Number of roles currently cached.
* - scylla_cql_forwarded_requests
- Counts the total number of attempts to forward CQL requests to other nodes.
One request may be forwarded multiple times, particularly when a write is
handled by a non-replica node.
* - scylla_cql_write_consistency_levels_disallowed_violations
- Counts the number of write_consistency_levels_disallowed guardrail violations,
i.e. attempts to write with a forbidden consistency level.
* - scylla_cql_write_consistency_levels_warned_violations
- Counts the number of write_consistency_levels_warned guardrail violations,
i.e. attempts to write with a discouraged consistency level.
* - scylla_cql_writes_per_consistency_level
- Counts the number of writes for each consistency level.
* - scylla_io_queue_integrated_disk_queue_length
- Length of the integrated disk queue.
* - scylla_io_queue_integrated_queue_length
- Length of the integrated queue.
* - scylla_logstor_sm_bytes_freed
- Counts the number of data bytes freed.
* - scylla_logstor_sm_bytes_read
- Counts the number of bytes read from the disk.
* - scylla_logstor_sm_bytes_written
- Counts the number of bytes written to the disk.
* - scylla_logstor_sm_compaction_bytes_written
- Counts the number of bytes written to the disk by compaction.
* - scylla_logstor_sm_compaction_data_bytes_written
- Counts the number of data bytes written to the disk by compaction.
* - scylla_logstor_sm_compaction_records_rewritten
- Counts the number of records rewritten during compaction.
* - scylla_logstor_sm_compaction_records_skipped
- Counts the number of records skipped during compaction.
* - scylla_logstor_sm_compaction_segments_freed
- Counts the number of segments freed by compaction.
* - scylla_logstor_sm_disk_usage
- Total disk usage.
* - scylla_logstor_sm_free_segments
- Counts the number of free segments currently available.
* - scylla_logstor_sm_segment_pool_compaction_segments_get
- Counts the number of segments taken from the segment pool for compaction.
* - scylla_logstor_sm_segment_pool_normal_segments_get
- Counts the number of segments taken from the segment pool for normal writes.
* - scylla_logstor_sm_segment_pool_normal_segments_wait
- Counts the number of times normal writes had to wait for a segment to become
available in the segment pool.
* - scylla_logstor_sm_segment_pool_segments_put
- Counts the number of segments returned to the segment pool.
* - scylla_logstor_sm_segment_pool_separator_segments_get
- Counts the number of segments taken from the segment pool for separator writes.
* - scylla_logstor_sm_segment_pool_size
- Counts the number of segments in the segment pool.
* - scylla_logstor_sm_segments_allocated
- Counts the number of segments allocated.
* - scylla_logstor_sm_segments_compacted
- Counts the number of segments compacted.
* - scylla_logstor_sm_segments_freed
- Counts the number of segments freed.
* - scylla_logstor_sm_segments_in_use
- Counts the number of segments currently in use.
* - scylla_logstor_sm_separator_buffer_flushed
- Counts the number of times the separator buffer has been flushed.
* - scylla_logstor_sm_separator_bytes_written
- Counts the number of bytes written to the separator.
* - scylla_logstor_sm_separator_data_bytes_written
- Counts the number of data bytes written to the separator.
* - scylla_logstor_sm_separator_flow_control_delay
- Current delay, in microseconds, applied to writes to control separator debt.
* - scylla_logstor_sm_separator_segments_freed
- Counts the number of segments freed by the separator.
* - scylla_transport_cql_pending_response_memory
- Holds the total memory in bytes consumed by responses waiting to be sent.
* - scylla_transport_cql_request_histogram_bytes
- A histogram of received bytes in CQL messages of a specific kind and
specific scheduling group.
* - scylla_transport_cql_requests_serving
- Holds the number of requests that are being processed right now.
* - scylla_transport_cql_response_histogram_bytes
- A histogram of sent bytes in CQL messages of a specific kind and
specific scheduling group.
* - scylla_transport_requests_forwarded_failed
- Counts the number of requests that were forwarded to another replica
but failed to execute there.
* - scylla_transport_requests_forwarded_prepared_not_found
- Counts the number of requests that were forwarded to another replica
but failed there because the statement was not prepared on the target.
When this happens, the coordinator performs an additional remote call
to prepare the statement on the replica and retries the EXECUTE request
afterwards.
* - scylla_transport_requests_forwarded_redirected
- Counts the number of requests that were forwarded to another replica
but that replica responded with a redirect to another node. This can
happen when the replica has stale information about the cluster topology or
when the request is handled by a node that is not a replica for the data
being accessed by the request.
* - scylla_transport_requests_forwarded_successfully
- Counts the number of requests that were forwarded to another replica
and executed successfully there.

View File

@@ -1,13 +1,13 @@
.. |SCYLLA_NAME| replace:: ScyllaDB
.. |SRC_VERSION| replace:: 2025.x
.. |NEW_VERSION| replace:: 2026.1
.. |SRC_VERSION| replace:: 2026.1
.. |NEW_VERSION| replace:: 2026.2
.. |ROLLBACK| replace:: rollback
.. _ROLLBACK: ./#rollback-procedure
.. |SCYLLA_METRICS| replace:: ScyllaDB Metrics Update - ScyllaDB 2025.x to 2026.1
.. _SCYLLA_METRICS: ../metric-update-2025.x-to-2026.1
.. |SCYLLA_METRICS| replace:: ScyllaDB Metrics Update - ScyllaDB 2026.1 to 2026.2
.. _SCYLLA_METRICS: ../metric-update-2026.1-to-2026.2
=======================================================================================
Upgrade from |SCYLLA_NAME| |SRC_VERSION| to |SCYLLA_NAME| |NEW_VERSION|

View File

@@ -598,7 +598,7 @@ future<int> kmip_host::impl::do_cmd(KMIP_CMD* cmd, con_ptr cp, Func& f, bool ret
template<typename Func>
future<kmip_host::impl::kmip_cmd> kmip_host::impl::do_cmd(kmip_cmd cmd_in, Func && f) {
kmip_log.trace("{}: begin do_cmd", *this, cmd_in);
kmip_log.trace("{}: begin do_cmd {}", *this, cmd_in);
KMIP_CMD* cmd = cmd_in;
// #998 Need to do retry loop, because we can have either timed out connection,

View File

@@ -616,7 +616,7 @@ future<rjson::value> encryption::kms_host::impl::do_post(std::string_view target
static auto get_xml_node = [](node_type* node, const char* what) {
auto res = node->first_node(what);
if (!res) {
throw malformed_response_error(fmt::format("XML parse error", what));
throw malformed_response_error(fmt::format("XML parse error: {}", what));
}
return res;
};

View File

@@ -7,6 +7,7 @@
#include <seastar/core/sstring.hh>
#include <seastar/core/seastar.hh>
#include <seastar/core/smp.hh>
#include "db/schema_features.hh"
#include "utils/log.hh"
#include "gms/feature.hh"
#include "gms/feature_service.hh"
@@ -108,6 +109,7 @@ std::set<std::string_view> feature_service::supported_feature_set() const {
"UUID_SSTABLE_IDENTIFIERS"sv,
"GROUP0_SCHEMA_VERSIONING"sv,
"VIEW_BUILD_STATUS_ON_GROUP0"sv,
"CDC_GENERATIONS_V2"sv,
};
if (is_test_only_feature_deprecated()) {
@@ -179,6 +181,7 @@ db::schema_features feature_service::cluster_schema_features() const {
f.set<db::schema_feature::GROUP0_SCHEMA_VERSIONING>();
f.set_if<db::schema_feature::IN_MEMORY_TABLES>(bool(in_memory_tables));
f.set_if<db::schema_feature::TABLET_OPTIONS>(bool(tablet_options));
f.set_if<db::schema_feature::KEYSPACE_MULTI_RF_CHANGE>(bool(keyspace_multi_rf_change));
return f;
}

View File

@@ -83,7 +83,6 @@ public:
gms::feature alternator_ttl { *this, "ALTERNATOR_TTL"sv };
gms::feature cql_row_ttl { *this, "CQL_ROW_TTL"sv };
gms::feature range_scan_data_variant { *this, "RANGE_SCAN_DATA_VARIANT"sv };
gms::feature cdc_generations_v2 { *this, "CDC_GENERATIONS_V2"sv };
gms::feature user_defined_aggregates { *this, "UDA"sv };
// Historically max_result_size contained only two fields: soft_limit and
// hard_limit. It was somehow obscure because for normal paged queries both
@@ -182,6 +181,7 @@ public:
gms::feature writetime_ttl_individual_element { *this, "WRITETIME_TTL_INDIVIDUAL_ELEMENT"sv };
gms::feature arbitrary_tablet_boundaries { *this, "ARBITRARY_TABLET_BOUNDARIES"sv };
gms::feature large_data_virtual_tables { *this, "LARGE_DATA_VIRTUAL_TABLES"sv };
gms::feature keyspace_multi_rf_change { *this, "KEYSPACE_MULTI_RF_CHANGE"sv };
public:
const std::unordered_map<sstring, std::reference_wrapper<feature>>& registered_features() const;

View File

@@ -399,9 +399,10 @@ future<> gossiper::do_send_ack2_msg(locator::host_id from, utils::chunked_vector
}
}
gms::gossip_digest_ack2 ack2_msg(std::move(delta_ep_state_map));
logger.debug("Calling do_send_ack2_msg to node {}, ack_msg_digest={}, ack2_msg={}", from, ack_msg_digest, ack2_msg);
auto ack2_msg_str = fmt::format("{}", ack2_msg);
logger.debug("Calling do_send_ack2_msg to node {}, ack_msg_digest={}, ack2_msg={}", from, ack_msg_digest, ack2_msg_str);
co_await ser::gossip_rpc_verbs::send_gossip_digest_ack2(&_messaging, from, std::move(ack2_msg));
logger.debug("finished do_send_ack2_msg to node {}, ack_msg_digest={}, ack2_msg={}", from, ack_msg_digest, ack2_msg);
logger.debug("finished do_send_ack2_msg to node {}, ack_msg_digest={}, ack2_msg={}", from, ack_msg_digest, ack2_msg_str);
}
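The hunk above appears to avoid formatting `ack2_msg` after it has been handed to `std::move`, by taking a string snapshot first and logging that snapshot both before and after the send. A minimal sketch of the same pattern follows. The `message` type and the `describe`, `send`, and `send_with_logging` helpers are hypothetical stand-ins, not gossiper code.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical message type; stands in for gms::gossip_digest_ack2.
struct message {
    std::vector<std::string> entries;
};

// Hypothetical formatter, playing the role of fmt::format("{}", msg).
inline std::string describe(const message& m) {
    std::string out = "[";
    for (std::size_t i = 0; i < m.entries.size(); ++i) {
        if (i != 0) {
            out += ", ";
        }
        out += m.entries[i];
    }
    return out + "]";
}

// Hypothetical sink that takes ownership of the message.
inline std::size_t send(message m) {
    return m.entries.size();
}

struct log_lines {
    std::string before;
    std::string after;
};

inline log_lines send_with_logging(message msg) {
    // Capture the text while msg is still intact.
    const std::string msg_str = describe(msg);
    log_lines lines;
    lines.before = "sending " + msg_str;
    send(std::move(msg));                 // msg is in a moved-from state after this
    lines.after = "finished " + msg_str;  // uses the snapshot, never msg itself
    return lines;
}
```

The snapshot costs one extra string per call, but both log lines are then guaranteed to show the same content regardless of what the move leaves behind.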
// Depends on
@@ -964,8 +965,7 @@ future<> gossiper::failure_detector_loop_for_node(locator::host_id host_id, gene
diff = now - last;
if (!failed) {
last = now;
}
if (diff > max_duration) {
} else if (diff > max_duration) {
logger.info("failure_detector_loop: Mark node {}/{} as DOWN", host_id, node);
co_await container().invoke_on(0, [host_id] (gms::gossiper& g) {
return g.convict(host_id);
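The `else if` in the hunk above changes when a node can be convicted: the timeout check now runs only when the ping failed, so a successful ping that arrives late resets `last` and no longer marks the node DOWN. The decision can be sketched in isolation as below. `detector_state` and `should_convict` are hypothetical names for illustration; only `last`, `failed`, `diff`, and `max_duration` come from the diff.

```cpp
#include <cassert>
#include <chrono>

using clk = std::chrono::steady_clock;

// Hypothetical holder for the loop-local state.
struct detector_state {
    clk::time_point last;  // time of the last successful ping
};

// Returns true when the node should be marked DOWN.
inline bool should_convict(detector_state& st, clk::time_point now,
                           bool failed, clk::duration max_duration) {
    const auto diff = now - st.last;
    if (!failed) {
        st.last = now;   // success: refresh the timestamp, never convict
        return false;
    } else if (diff > max_duration) {
        return true;     // failure and the grace period has elapsed
    }
    return false;        // failure, but still within the grace period
}
```

With the earlier two independent `if` statements, the first case would also have fallen through to the timeout check using the `diff` computed before `last` was refreshed.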

View File

@@ -87,9 +87,6 @@ std::set<sstring> get_disabled_features_from_db_config(const db::config& cfg, st
}
}
if (!cfg.check_experimental(db::experimental_features_t::feature::ALTERNATOR_STREAMS)) {
disabled.insert("ALTERNATOR_STREAMS"s);
}
if (!cfg.check_experimental(db::experimental_features_t::feature::KEYSPACE_STORAGE_OPTIONS)) {
disabled.insert("KEYSPACE_STORAGE_OPTIONS"s);
}

View File

@@ -381,6 +381,10 @@ public:
return _nodes.at(node)._du.capacity;
}
bool has_node(host_id node) const {
return _nodes.contains(node);
}
shard_id get_shard_count(host_id node) const {
if (!_nodes.contains(node)) {
return 0;

View File

@@ -153,19 +153,27 @@ struct hash<locator::range_based_tablet_id> {
namespace locator {
/// Creates a new replica set with old_replica replaced by new_replica.
/// If there is no old_replica, the set is returned unchanged.
/// Returns a copy of the replica set with the following modifications:
/// - If both old_replica and new_replica are set, old_replica is substituted
/// with new_replica. If old_replica is not found in rs, the set is returned as-is.
/// - If only old_replica is set, it is removed from the result.
/// - If only new_replica is set, it is appended to the result.
inline
tablet_replica_set replace_replica(const tablet_replica_set& rs, tablet_replica old_replica, tablet_replica new_replica) {
tablet_replica_set replace_replica(const tablet_replica_set& rs, std::optional<tablet_replica> old_replica, std::optional<tablet_replica> new_replica) {
tablet_replica_set result;
result.reserve(rs.size());
for (auto&& r : rs) {
if (r == old_replica) {
result.push_back(new_replica);
if (old_replica.has_value() && r == old_replica.value()) {
if (new_replica.has_value()) {
result.push_back(new_replica.value());
}
} else {
result.push_back(r);
}
}
if (!old_replica.has_value() && new_replica.has_value()) {
result.push_back(new_replica.value());
}
return result;
}
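The four cases listed in the new doc comment can be exercised with a small standalone sketch. It mirrors the control flow of the new `replace_replica()` but uses `int` and `std::vector<int>` as hypothetical stand-ins for `tablet_replica` and `tablet_replica_set`, so it is an illustration rather than the real types.

```cpp
#include <cassert>
#include <optional>
#include <vector>

// Hypothetical stand-ins for locator::tablet_replica / tablet_replica_set.
using replica = int;
using replica_set = std::vector<replica>;

// Same control flow as the new replace_replica() in the hunk above.
inline replica_set replace_replica(const replica_set& rs,
                                   std::optional<replica> old_replica,
                                   std::optional<replica> new_replica) {
    replica_set result;
    result.reserve(rs.size());
    for (const auto& r : rs) {
        if (old_replica.has_value() && r == *old_replica) {
            if (new_replica.has_value()) {
                result.push_back(*new_replica);  // substitute
            }                                    // else: drop old_replica
        } else {
            result.push_back(r);
        }
    }
    if (!old_replica.has_value() && new_replica.has_value()) {
        result.push_back(*new_replica);          // pure addition
    }
    return result;
}

inline void check_replace_replica() {
    const replica_set rs{1, 2, 3};
    assert((replace_replica(rs, 2, 9) == replica_set{1, 9, 3}));                // substitute
    assert((replace_replica(rs, 7, 9) == rs));                                  // old not found
    assert((replace_replica(rs, 2, std::nullopt) == replica_set{1, 3}));        // remove only
    assert((replace_replica(rs, std::nullopt, 9) == replica_set{1, 2, 3, 9})); // append only
    assert((replace_replica(rs, std::nullopt, std::nullopt) == rs));            // nothing to do
}
```

Making both arguments optional lets one helper describe a migration, a replica removal, and a replica addition, which matches `tablet_migration_info::src` and `dst` becoming optional in the same file.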
@@ -383,8 +391,8 @@ bool is_post_cleanup(tablet_replica replica, const tablet_info& tinfo, const tab
struct tablet_migration_info {
locator::tablet_transition_kind kind;
locator::global_tablet_id tablet;
locator::tablet_replica src;
locator::tablet_replica dst;
std::optional<locator::tablet_replica> src;
std::optional<locator::tablet_replica> dst;
};
class tablet_map;

61
main.cc
View File

@@ -942,7 +942,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
auto background_reclaim_scheduling_group = create_scheduling_group("background_reclaim", "bgre", 50).get();
// Maintenance supergroup -- the collection of background low-prio activites
// Maintenance supergroup -- the collection of background low-prio activities
auto maintenance_supergroup = create_scheduling_supergroup(200).get();
auto bandwidth_updater = io_throughput_updater("maintenance supergroup", maintenance_supergroup,
cfg->maintenance_io_throughput_mb_per_sec.is_set() ? cfg->maintenance_io_throughput_mb_per_sec : cfg->stream_io_throughput_mb_per_sec);
@@ -1358,6 +1358,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
};
spcfg.hinted_handoff_enabled = hinted_handoff_enabled;
spcfg.available_memory = memory::stats().total_memory();
spcfg.maintenance_mode = maintenance_mode_enabled{cfg->maintenance_mode()};
smp_service_group_config storage_proxy_smp_service_group_config;
// Assuming less than 1kB per queued request, this limits storage_proxy submit_to() queues to 5MB or less
storage_proxy_smp_service_group_config.max_nonlocal_requests = 5000;
@@ -1366,7 +1367,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
spcfg.write_mv_smp_service_group = create_smp_service_group(storage_proxy_smp_service_group_config).get();
spcfg.hints_write_smp_service_group = create_smp_service_group(storage_proxy_smp_service_group_config).get();
spcfg.write_ack_smp_service_group = create_smp_service_group(storage_proxy_smp_service_group_config).get();
static db::view::node_update_backlog node_backlog(smp::count, 10ms);
static db::view::node_update_backlog node_backlog(smp::count, 10ms, cfg->view_flow_control_delay_limit_in_ms);
scheduling_group_key_config storage_proxy_stats_cfg =
make_scheduling_group_key_config<service::storage_proxy_stats::stats>();
storage_proxy_stats_cfg.constructor = [plain_constructor = storage_proxy_stats_cfg.constructor] (void* ptr) {
@@ -1810,6 +1811,18 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
utils::get_local_injector().inject("stop_after_starting_migration_manager",
[] { std::raise(SIGSTOP); });
// Audit must be constructed before the maintenance socket so
// that on shutdown (reverse destruction order) the audit service
// outlives the maintenance socket and in-flight queries can
// still reach audit::inspect() safely.
checkpoint(stop_signal, "starting audit service");
audit::audit::start_audit(*cfg, token_metadata, qp, mm).handle_exception([&] (auto&& e) {
startlog.error("audit start failed: {}", e);
}).get();
auto audit_stop = defer([] {
audit::audit::stop_audit().get();
});
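The comment above relies on deferred actions running in reverse order of construction: whatever is set up first is torn down last. The sketch below shows that ordering with a minimal scope guard. `scope_guard` is a hypothetical stand-in for Seastar's `defer()`, not its implementation, and the event strings are illustrative only.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Minimal scope guard: runs the stored callable on destruction.
class scope_guard {
    std::function<void()> _fn;
public:
    explicit scope_guard(std::function<void()> fn) : _fn(std::move(fn)) {}
    scope_guard(const scope_guard&) = delete;
    scope_guard& operator=(const scope_guard&) = delete;
    ~scope_guard() {
        if (_fn) {
            _fn();
        }
    }
};

// Records startup and shutdown events in the order they happen.
inline std::vector<std::string> run_startup_and_shutdown() {
    std::vector<std::string> events;
    {
        events.push_back("start audit");
        scope_guard audit_stop([&] { events.push_back("stop audit"); });

        events.push_back("start maintenance socket");
        scope_guard socket_stop([&] { events.push_back("stop maintenance socket"); });
    }  // guards fire here: socket_stop first, then audit_stop
    return events;
}
```

Because the audit guard is constructed first, its stop action runs last, so the audit service is still alive while the later-constructed service shuts down.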
// XXX: stop_raft has to happen before query_processor and migration_manager
// is stopped, since some groups keep using the query
// processor until are stopped inside stop_raft.
@@ -1841,7 +1854,7 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
});
checkpoint(stop_signal, "starting view update generator");
view_update_generator.start(std::ref(db), std::ref(proxy), std::ref(stop_signal.as_sharded_abort_source())).get();
view_update_generator.start(std::ref(db), std::ref(proxy), std::ref(node_backlog), std::ref(stop_signal.as_sharded_abort_source())).get();
auto stop_view_update_generator = defer_verbose_shutdown("view update generator", [] {
view_update_generator.stop().get();
});
@@ -2287,10 +2300,12 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
ss.local().wait_for_group0_stop().get();
});
// Setup group0 early in case the node is bootstrapped already and the group exists.
// Need to do it before allowing incoming messaging service connections since
// storage proxy's and migration manager's verbs may access group0.
group0_service.setup_group0_if_exist(sys_ks.local(), ss.local(), qp.local(), mm.local()).get();
if (!group0_service.maintenance_mode() && sys_ks.local().bootstrap_complete()) {
// Setup group0 early in case the node is bootstrapped already and the group exists.
// Need to do it before allowing incoming messaging service connections since
// storage proxy's and migration manager's verbs may access group0.
group0_service.setup_group0_if_exist(sys_ks.local(), ss.local(), qp.local(), mm.local()).get();
}
// The call to setup_group0_if_exist() above guarantees that, if group0 is
// created and started, the locally persisted group0 state has been applied
@@ -2340,15 +2355,6 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
}).get();
stop_signal.ready(false);
if (cfg->maintenance_socket() != "ignore") {
// Enable role operations now that node joined the cluster
maintenance_auth_service.invoke_on_all([](auth::service& svc) {
return auth::ensure_role_operations_are_enabled(svc);
}).get();
start_cql(*cql_maintenance_server_ctl, stop_maintenance_cql, "maintenance native server");
}
// At this point, `locator::topology` should be stable, i.e. we should have complete information
// about the layout of the cluster (= list of nodes along with the racks/DCs).
startlog.info("Verifying that all of the keyspaces are RF-rack-valid");
@@ -2357,16 +2363,23 @@ To start the scylla server proper, simply invoke as: scylla server (or just scyl
startlog.info("Verifying that all of the tablet keyspaces use rack list replication factors");
db.local().check_rack_list_everywhere(cfg->enforce_rack_list());
// Start audit service after join_cluster so that the table-based audit backend
// can properly create its keyspace and table.
checkpoint(stop_signal, "starting audit service");
audit::audit::start_audit(*cfg, token_metadata, qp, mm).handle_exception([&] (auto&& e) {
startlog.error("audit start failed: {}", e);
}).get();
auto audit_stop = defer([] {
audit::audit::stop_audit().get();
// The table-based audit backend needs Raft (via join_cluster)
// to create its keyspace and table.
checkpoint(stop_signal, "starting audit storage");
audit::audit::start_storage(*cfg).get();
auto audit_storage_stop = defer([] {
audit::audit::stop_storage().get();
});
if (cfg->maintenance_socket() != "ignore") {
// Enable role operations now that node joined the cluster
maintenance_auth_service.invoke_on_all([](auth::service& svc) {
return auth::ensure_role_operations_are_enabled(svc);
}).get();
start_cql(*cql_maintenance_server_ctl, stop_maintenance_cql, "maintenance native server");
}
// Semantic validation of sstable compression parameters from config.
// Adding here (i.e., after `join_cluster`) to ensure that the
// required SSTABLE_COMPRESSION_DICTS cluster feature has been negotiated.

@@ -48,8 +48,8 @@ static void set_field(atomic_cell_value& out, unsigned offset, T val) {
}
template <FragmentRange Buffer>
static void set_value(managed_bytes& b, unsigned value_offset, const Buffer& value) {
auto v = managed_bytes_mutable_view(b).substr(value_offset, value.size_bytes());
static void set_value(atomic_cell_value_mutable_view b, unsigned value_offset, const Buffer& value) {
auto v = b.substr(value_offset, value.size_bytes());
for (auto frag : value) {
write_fragmented(v, single_fragmented_view(frag));
}
@@ -141,20 +141,36 @@ public:
SCYLLA_ASSERT(is_live_and_has_ttl(cell));
return gc_clock::duration(get_field<int32_t>(cell, ttl_offset));
}
static managed_bytes make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time) {
managed_bytes b(managed_bytes::initialized_later(), flags_size + timestamp_size + deletion_time_size);
static size_t dead_serialized_size() {
return flags_size + timestamp_size + deletion_time_size;
}
static size_t live_serialized_size(size_t value_size) {
return flags_size + timestamp_size + value_size;
}
static size_t live_expiring_serialized_size(size_t value_size) {
return flags_size + timestamp_size + expiry_size + ttl_size + value_size;
}
static void write_dead(atomic_cell_value_mutable_view b, api::timestamp_type timestamp, gc_clock::time_point deletion_time) {
b[0] = 0;
set_field(b, timestamp_offset, timestamp);
set_field(b, deletion_time_offset, static_cast<int64_t>(deletion_time.time_since_epoch().count()));
}
static managed_bytes make_dead(api::timestamp_type timestamp, gc_clock::time_point deletion_time) {
managed_bytes b(managed_bytes::initialized_later(), dead_serialized_size());
write_dead(b, timestamp, deletion_time);
return b;
}
template <FragmentRange Buffer>
static managed_bytes make_live(api::timestamp_type timestamp, const Buffer& value) {
static void write_live(atomic_cell_value_mutable_view b, api::timestamp_type timestamp, const Buffer& value) {
auto value_offset = flags_size + timestamp_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + value.size_bytes());
b[0] = LIVE_FLAG;
set_field(b, timestamp_offset, timestamp);
set_value(b, value_offset, value);
}
template <FragmentRange Buffer>
static managed_bytes make_live(api::timestamp_type timestamp, const Buffer& value) {
managed_bytes b(managed_bytes::initialized_later(), live_serialized_size(value.size_bytes()));
write_live(b, timestamp, value);
return b;
}
static managed_bytes make_live_counter_update(api::timestamp_type timestamp, int64_t value) {
@@ -166,14 +182,18 @@ public:
return b;
}
template <FragmentRange Buffer>
static managed_bytes make_live(api::timestamp_type timestamp, const Buffer& value, gc_clock::time_point expiry, gc_clock::duration ttl) {
static void write_live(atomic_cell_value_mutable_view b, api::timestamp_type timestamp, const Buffer& value, gc_clock::time_point expiry, gc_clock::duration ttl) {
auto value_offset = flags_size + timestamp_size + expiry_size + ttl_size;
managed_bytes b(managed_bytes::initialized_later(), value_offset + value.size_bytes());
b[0] = EXPIRY_FLAG | LIVE_FLAG;
set_field(b, timestamp_offset, timestamp);
set_field(b, expiry_offset, static_cast<int64_t>(expiry.time_since_epoch().count()));
set_field(b, ttl_offset, static_cast<int32_t>(ttl.count()));
set_value(b, value_offset, value);
}
template <FragmentRange Buffer>
static managed_bytes make_live(api::timestamp_type timestamp, const Buffer& value, gc_clock::time_point expiry, gc_clock::duration ttl) {
managed_bytes b(managed_bytes::initialized_later(), live_expiring_serialized_size(value.size_bytes()));
write_live(b, timestamp, value, expiry, ttl);
return b;
}
static managed_bytes make_live_uninitialized(api::timestamp_type timestamp, size_t size) {

@@ -113,10 +113,10 @@ auto fmt::formatter<canonical_mutation>::format(const canonical_mutation& cm, fm
auto&& entry = _cm.static_column_at(id);
_os = fmt::format_to(_os, "static column {} {}", bytes_to_text(entry.name()), atomic_cell::printer(*entry.type(), ac));
}
virtual void accept_static_cell(column_id id, collection_mutation_view cmv) override {
virtual void accept_static_cell(column_id id, collection_mutation cm) override {
print_separator();
auto&& entry = _cm.static_column_at(id);
_os = fmt::format_to(_os, "static column {} {}", bytes_to_text(entry.name()), collection_mutation_view::printer(*entry.type(), cmv));
_os = fmt::format_to(_os, "static column {} {}", bytes_to_text(entry.name()), collection_mutation_view::printer(*entry.type(), cm));
}
virtual stop_iteration accept_row_tombstone(range_tombstone rt) override {
print_separator();
@@ -137,10 +137,10 @@ auto fmt::formatter<canonical_mutation>::format(const canonical_mutation& cm, fm
auto&& entry = _cm.regular_column_at(id);
_os = fmt::format_to(_os, "column {} {}", bytes_to_text(entry.name()), atomic_cell::printer(*entry.type(), ac));
}
virtual void accept_row_cell(column_id id, collection_mutation_view cmv) override {
virtual void accept_row_cell(column_id id, collection_mutation cm) override {
print_separator();
auto&& entry = _cm.regular_column_at(id);
_os = fmt::format_to(_os, "column {} {}", bytes_to_text(entry.name()), collection_mutation_view::printer(*entry.type(), cmv));
_os = fmt::format_to(_os, "column {} {}", bytes_to_text(entry.name()), collection_mutation_view::printer(*entry.type(), cm));
}
out_t finalize() {
if (_in_row) {

@@ -7,12 +7,14 @@
*/
#include "utils/assert.hh"
#include "utils/on_internal_error.hh"
#include "types/collection.hh"
#include "types/user.hh"
#include "types/concrete_types.hh"
#include "mutation/mutation_partition.hh"
#include "compaction/compaction_garbage_collector.hh"
#include "combine.hh"
#include "idl/mutation.dist.impl.hh"
#include "collection_mutation.hh"
@@ -224,13 +226,26 @@ compact_and_expire_result collection_mutation_description::compact_and_expire(co
return res;
}
template <typename Iterator>
/// A CollectionMutationAdaptor is a static interface that adapts a collection
/// element (an iterator value type) to the serialization requirements of
/// serialize_collection_mutation(). It provides static methods to measure the
/// serialized sizes and to write the key and value of each element into a buffer.
template <typename Adaptor, typename Element>
concept CollectionMutationAdaptor = requires(const Element& e, managed_bytes_mutable_view& out) {
{ Adaptor::key_size(e) } -> std::convertible_to<size_t>;
{ Adaptor::value_size(e) } -> std::convertible_to<size_t>;
{ Adaptor::write_key(e, out) };
{ Adaptor::write_value(e, out) };
};
template <typename Adaptor, typename Iterator>
requires CollectionMutationAdaptor<Adaptor, std::iter_value_t<Iterator>>
static collection_mutation serialize_collection_mutation(
const abstract_type& type,
const tombstone& tomb,
std::ranges::subrange<Iterator> cells) {
auto element_size = [] (size_t c, auto&& e) -> size_t {
return c + 8 + e.first.size() + e.second.serialize().size();
return c + 8 + Adaptor::key_size(e) + Adaptor::value_size(e);
};
auto size = std::ranges::fold_left(cells, (size_t)4, element_size);
size += 1;
@@ -244,32 +259,112 @@ static collection_mutation serialize_collection_mutation(
write<int64_t>(out, tomb.timestamp);
write<int64_t>(out, tomb.deletion_time.time_since_epoch().count());
}
auto writek = [&out] (bytes_view v) {
write<int32_t>(out, v.size());
write_fragmented(out, single_fragmented_view(v));
auto writek = [&out] (auto& kv) {
write<int32_t>(out, Adaptor::key_size(kv));
Adaptor::write_key(kv, out);
};
auto writev = [&out] (managed_bytes_view v) {
write<int32_t>(out, v.size());
write_fragmented(out, v);
auto writev = [&out] (auto& kv) {
write<int32_t>(out, Adaptor::value_size(kv));
Adaptor::write_value(kv, out);
};
// FIXME: overflow?
write<int32_t>(out, std::ranges::distance(cells));
for (auto&& kv : cells) {
auto&& k = kv.first;
auto&& v = kv.second;
writek(k);
writev(v.serialize());
writek(kv);
writev(kv);
}
return collection_mutation(type, std::move(ret));
}
namespace {
/// A key-value pair where the key is bytes-like and the value is an atomic_cell-like type
/// with a serialize() method returning managed_bytes_view.
template <typename T>
concept AtomicCellKV = requires(const T& kv) {
{ kv.first.size() } -> std::convertible_to<size_t>;
{ kv.second.serialize() } -> std::convertible_to<managed_bytes_view>;
};
struct atomic_cell_adaptor {
static size_t key_size(const AtomicCellKV auto& v) { return v.first.size(); }
static size_t value_size(const AtomicCellKV auto& v) { return v.second.serialize().size(); }
static void write_key(const AtomicCellKV auto& v, managed_bytes_mutable_view& out) {
write_fragmented(out, single_fragmented_view(v.first));
}
static void write_value(const AtomicCellKV auto& v, managed_bytes_mutable_view& out) {
write_fragmented(out, v.second.serialize());
}
};
}
collection_mutation collection_mutation_description::serialize(const abstract_type& type) const {
return serialize_collection_mutation(type, tomb, std::ranges::subrange(cells.begin(), cells.end()));
return serialize_collection_mutation<atomic_cell_adaptor>(type, tomb, std::ranges::subrange(cells.begin(), cells.end()));
}
collection_mutation collection_mutation_view_description::serialize(const abstract_type& type) const {
return serialize_collection_mutation(type, tomb, std::ranges::subrange(cells.begin(), cells.end()));
return serialize_collection_mutation<atomic_cell_adaptor>(type, tomb, std::ranges::subrange(cells.begin(), cells.end()));
}
namespace {
struct serialized_cell_adaptor {
static size_t key_size(const ser::collection_element_view& v) {
return v.key().view().size_bytes();
}
static size_t value_size(const ser::collection_element_view& v) {
struct collection_cell_visitor {
size_t operator()(const ser::live_cell_view& lcv) const { return atomic_cell_type::live_serialized_size(lcv.value().view().size_bytes()); }
size_t operator()(const ser::expiring_cell_view& ecv) const { return atomic_cell_type::live_expiring_serialized_size(ecv.c().value().view().size_bytes()); }
size_t operator()(const ser::dead_cell_view& dcv) const { return atomic_cell_type::dead_serialized_size(); }
size_t operator()(const ser::counter_cell_view& ccv) const { utils::on_internal_error("Trying to deserialize counter cell from collection"); }
size_t operator()(const ser::unknown_variant_type&) const { utils::on_internal_error("Trying to deserialize cell in unknown state"); };
};
return boost::apply_visitor(collection_cell_visitor{}, v.value());
}
static void write_key(const ser::collection_element_view& v, managed_bytes_mutable_view& out) {
write_fragmented(out, v.key().view());
}
static void write_value(const ser::collection_element_view& v, managed_bytes_mutable_view& out) {
struct collection_cell_visitor {
managed_bytes_mutable_view& out;
void operator()(const ser::live_cell_view& lcv) const {
const auto v = lcv.value().view();
atomic_cell_type::write_live(out, lcv.created_at(), v);
out.remove_prefix(atomic_cell_type::live_serialized_size(v.size_bytes()));
}
void operator()(const ser::expiring_cell_view& ecv) const {
const auto v = ecv.c().value().view();
atomic_cell_type::write_live(out, ecv.c().created_at(), v, ecv.expiry(), ecv.ttl());
out.remove_prefix(atomic_cell_type::live_expiring_serialized_size(v.size_bytes()));
}
void operator()(const ser::dead_cell_view& dcv) const {
atomic_cell_type::write_dead(out, dcv.tomb().timestamp(), dcv.tomb().deletion_time());
out.remove_prefix(atomic_cell_type::dead_serialized_size());
}
void operator()(const ser::counter_cell_view& ccv) const {
utils::on_internal_error("Trying to deserialize counter cell from collection");
}
void operator()(const ser::unknown_variant_type&) const {
utils::on_internal_error("Trying to deserialize cell in unknown state");
}
};
boost::apply_visitor(collection_cell_visitor{out}, v.value());
}
};
}
collection_mutation read_from_collection_cell_view(const abstract_type& type, const ser::collection_cell_view& collection) {
auto tomb = collection.tomb();
auto cells = collection.elements();
return serialize_collection_mutation<serialized_cell_adaptor>(type, tomb, std::ranges::subrange(cells.begin(), cells.end()));
}
template <typename C>

@@ -23,6 +23,10 @@ class row_tombstone;
class collection_mutation;
namespace ser {
class collection_cell_view;
}
// An auxiliary struct used to (de)construct collection_mutations.
// Unlike collection_mutation, which is a serialized blob, this struct makes it easy to inspect
// the logical units of information (tombstone and cells) inside the mutation.
@@ -130,6 +134,12 @@ collection_mutation merge(const abstract_type&, collection_mutation_view, collec
collection_mutation difference(const abstract_type&, collection_mutation_view, collection_mutation_view);
// Transcode a collection from the IDL representation directly into the
// collection_mutation serialization format, without using any intermediary representation.
// Only the final collection-mutation blob is allocated, no intermediate allocations needed.
// Safe to use in LSA, it won't produce garbage.
collection_mutation read_from_collection_cell_view(const abstract_type&, const ser::collection_cell_view&);
// Serializes the given collection of cells to a sequence of bytes ready to be sent over the CQL protocol.
bytes_ostream serialize_for_cql(const abstract_type&, collection_mutation_view);

@@ -97,9 +97,9 @@ public:
r.append_cell(id, atomic_cell_or_collection(std::move(cell)));
}
virtual void accept_static_cell(column_id id, collection_mutation_view collection) override {
virtual void accept_static_cell(column_id id, collection_mutation collection) override {
row& r = _static_row.maybe_create();
r.append_cell(id, collection_mutation(*_schema.static_column_at(id).type, std::move(collection)));
r.append_cell(id, std::move(collection));
}
virtual stop_iteration accept_row_tombstone(range_tombstone rt) override {
@@ -125,9 +125,9 @@ public:
r.append_cell(id, std::move(cell));
}
virtual void accept_row_cell(column_id id, collection_mutation_view collection) override {
virtual void accept_row_cell(column_id id, collection_mutation collection) override {
row& r = _current_row->cells();
r.append_cell(id, collection_mutation(*_schema.regular_column_at(id).type, std::move(collection)));
r.append_cell(id, std::move(collection));
}
auto on_end_of_partition() {

@@ -707,9 +707,10 @@ struct fmt::formatter<shadowable_tombstone> : fmt::formatter<string_view> {
template <typename FormatContext>
auto format(const shadowable_tombstone& t, FormatContext& ctx) const {
if (t) {
auto& tomb = t.tomb();
return fmt::format_to(ctx.out(),
"{{shadowable tombstone: timestamp={}, deletion_time={}}}",
t.tomb().timestamp, t.tomb(), t.tomb().deletion_time.time_since_epoch().count());
tomb.timestamp, tomb.deletion_time.time_since_epoch().count());
} else {
return fmt::format_to(ctx.out(),
"{{shadowable tombstone: none}}");

@@ -86,37 +86,6 @@ atomic_cell read_atomic_cell(const abstract_type& type, atomic_cell_variant cv,
return boost::apply_visitor(atomic_cell_visitor(type, cm), cv);
}
collection_mutation read_collection_cell(const abstract_type& type, ser::collection_cell_view cv)
{
collection_mutation_description mut;
mut.tomb = cv.tomb();
auto&& elements = cv.elements();
mut.cells.reserve(elements.size());
visit(type, make_visitor(
[&] (const collection_type_impl& ctype) {
auto& value_type = *ctype.value_comparator();
for (auto&& e : elements) {
mut.cells.emplace_back(e.key(), read_atomic_cell(value_type, e.value(), atomic_cell::collection_member::yes));
}
},
[&] (const user_type_impl& utype) {
for (auto&& e : elements) {
bytes key = e.key();
auto idx = deserialize_field_index(key);
SCYLLA_ASSERT(idx < utype.size());
mut.cells.emplace_back(key, read_atomic_cell(*utype.type(idx), e.value(), atomic_cell::collection_member::yes));
}
},
[&] (const abstract_type& o) {
throw std::runtime_error(format("attempted to read a collection cell with type: {}", o.name()));
}
));
return mut.serialize(type);
}
template<typename Visitor>
void read_and_visit_row(ser::row_view rv, const column_mapping& cm, column_kind kind, Visitor&& visitor)
{
@@ -142,14 +111,7 @@ void read_and_visit_row(ser::row_view rv, const column_mapping& cm, column_kind
if (_col.is_atomic()) {
throw std::runtime_error("An atomic cell expected, got a collection");
}
// FIXME: Pass view to cell to avoid copy
auto&& outer = current_allocator();
with_allocator(standard_allocator(), [&] {
auto cell = read_collection_cell(*_col.type(), ccv);
with_allocator(outer, [&] {
_visitor.accept_collection(_id, cell);
});
});
_visitor.accept_collection(_id, read_from_collection_cell_view(*_col.type(), ccv));
}
void operator()(ser::unknown_variant_type&) const {
throw std::runtime_error("Trying to deserialize unknown cell type");
@@ -198,8 +160,8 @@ void mutation_partition_view::do_accept(const column_mapping& cm, Visitor& visit
void accept_atomic_cell(column_id id, atomic_cell ac) const {
_visitor.accept_static_cell(id, std::move(ac));
}
void accept_collection(column_id id, const collection_mutation& cm) const {
_visitor.accept_static_cell(id, cm);
void accept_collection(column_id id, collection_mutation cm) const {
_visitor.accept_static_cell(id, std::move(cm));
}
};
read_and_visit_row(mpv.static_row(), cm, column_kind::static_column, static_row_cell_visitor{visitor});
@@ -218,8 +180,8 @@ void mutation_partition_view::do_accept(const column_mapping& cm, Visitor& visit
void accept_atomic_cell(column_id id, atomic_cell ac) const {
_visitor.accept_row_cell(id, std::move(ac));
}
void accept_collection(column_id id, const collection_mutation& cm) const {
_visitor.accept_row_cell(id, cm);
void accept_collection(column_id id, collection_mutation cm) const {
_visitor.accept_row_cell(id, std::move(cm));
}
};
read_and_visit_row(cr.cells(), cm, column_kind::regular_column, cell_visitor{visitor});
@@ -240,8 +202,8 @@ future<> mutation_partition_view::do_accept_gently(const column_mapping& cm, Vis
void accept_atomic_cell(column_id id, atomic_cell ac) const {
_visitor.accept_static_cell(id, std::move(ac));
}
void accept_collection(column_id id, const collection_mutation& cm) const {
_visitor.accept_static_cell(id, cm);
void accept_collection(column_id id, collection_mutation cm) const {
_visitor.accept_static_cell(id, std::move(cm));
}
};
read_and_visit_row(mpv.static_row(), cm, column_kind::static_column, static_row_cell_visitor{visitor});
@@ -263,8 +225,8 @@ future<> mutation_partition_view::do_accept_gently(const column_mapping& cm, Vis
void accept_atomic_cell(column_id id, atomic_cell ac) const {
_visitor.accept_row_cell(id, std::move(ac));
}
void accept_collection(column_id id, const collection_mutation& cm) const {
_visitor.accept_row_cell(id, cm);
void accept_collection(column_id id, collection_mutation cm) const {
_visitor.accept_row_cell(id, std::move(cm));
}
};
read_and_visit_row(cr.cells(), cm, column_kind::regular_column, cell_visitor{visitor});
@@ -286,8 +248,8 @@ future<> mutation_partition_view::do_accept_gently(const column_mapping& cm, Asy
void accept_atomic_cell(column_id id, atomic_cell ac) const {
_visitor.accept_static_cell(id, std::move(ac));
}
void accept_collection(column_id id, const collection_mutation& cm) const {
_visitor.accept_static_cell(id, cm);
void accept_collection(column_id id, collection_mutation cm) const {
_visitor.accept_static_cell(id, std::move(cm));
}
};
read_and_visit_row(mpv.static_row(), cm, column_kind::static_column, static_row_cell_visitor{visitor});
@@ -308,8 +270,8 @@ future<> mutation_partition_view::do_accept_gently(const column_mapping& cm, Asy
void accept_atomic_cell(column_id id, atomic_cell ac) const {
_visitor.accept_row_cell(id, std::move(ac));
}
void accept_collection(column_id id, const collection_mutation& cm) const {
_visitor.accept_row_cell(id, cm);
void accept_collection(column_id id, collection_mutation cm) const {
_visitor.accept_row_cell(id, std::move(cm));
}
};
read_and_visit_row(cr.cells(), cm, column_kind::regular_column, cell_visitor{visitor});
@@ -337,8 +299,8 @@ mutation_partition_view::accept_ordered_result mutation_partition_view::do_accep
void accept_atomic_cell(column_id id, atomic_cell ac) const {
_visitor.accept_static_cell(id, std::move(ac));
}
void accept_collection(column_id id, const collection_mutation& cm) const {
_visitor.accept_static_cell(id, cm);
void accept_collection(column_id id, collection_mutation cm) const {
_visitor.accept_static_cell(id, std::move(cm));
}
};
read_and_visit_row(mpv.static_row(), cm, column_kind::static_column, static_row_cell_visitor{visitor});
@@ -376,8 +338,8 @@ mutation_partition_view::accept_ordered_result mutation_partition_view::do_accep
void accept_atomic_cell(column_id id, atomic_cell ac) const {
_visitor.accept_row_cell(id, std::move(ac));
}
void accept_collection(column_id id, const collection_mutation& cm) const {
_visitor.accept_row_cell(id, cm);
void accept_collection(column_id id, collection_mutation cm) const {
_visitor.accept_row_cell(id, std::move(cm));
}
};
read_and_visit_row(cr.cells(), cm, column_kind::regular_column, cell_visitor{visitor});
@@ -501,44 +463,40 @@ mutation_partition_view mutation_partition_view::from_view(ser::mutation_partiti
clustering_row read_clustered_row(const schema& s, ser::clustering_row_view crv) {
class clustering_row_builder {
const schema& _s;
clustering_row _row;
public:
clustering_row_builder(const schema& s, clustering_key key, row_tombstone t, row_marker m)
: _s(s), _row(std::move(key), std::move(t), std::move(m), row()) { }
clustering_row_builder(clustering_key key, row_tombstone t, row_marker m)
: _row(std::move(key), std::move(t), std::move(m), row()) { }
void accept_atomic_cell(column_id id, atomic_cell ac) {
_row.cells().append_cell(id, std::move(ac));
}
void accept_collection(column_id id, const collection_mutation& cm) {
_row.cells().append_cell(id, collection_mutation(*_s.regular_column_at(id).type, cm));
void accept_collection(column_id id, collection_mutation cm) {
_row.cells().append_cell(id, std::move(cm));
}
clustering_row get() && { return std::move(_row); }
};
auto cr = crv.row();
auto t = row_tombstone(cr.deleted_at(), shadowable_tombstone(cr.shadowable_deleted_at()));
clustering_row_builder builder(s, cr.key(), std::move(t), read_row_marker(cr.marker()));
clustering_row_builder builder(cr.key(), std::move(t), read_row_marker(cr.marker()));
read_and_visit_row(cr.cells(), s.get_column_mapping(), column_kind::regular_column, builder);
return std::move(builder).get();
}
static_row read_static_row(const schema& s, ser::static_row_view sr) {
class static_row_builder {
const schema& _s;
static_row _row;
public:
explicit static_row_builder(const schema& s)
: _s(s) { }
void accept_atomic_cell(column_id id, atomic_cell ac) {
_row.cells().append_cell(id, std::move(ac));
}
void accept_collection(column_id id, const collection_mutation& cm) {
_row.cells().append_cell(id, collection_mutation(*_s.static_column_at(id).type, cm));
void accept_collection(column_id id, collection_mutation cm) {
_row.cells().append_cell(id, std::move(cm));
}
static_row get() && { return std::move(_row); }
};
static_row_builder builder(s);
static_row_builder builder;
read_and_visit_row(sr.cells(), s.get_column_mapping(), column_kind::static_column, builder);
return std::move(builder).get();
}

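The visitor changes above replace `const collection_mutation&` parameters with by-value parameters that are moved along the call chain, so an owned object produced by the reader can reach its final storage without being copied. The sketch below illustrates that sink-parameter idiom in isolation. `payload`, `row_builder`, `forwarding_visitor`, `make_payload`, and `example` are hypothetical names invented for this illustration; the copy counter exists only to make the effect observable.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical owned blob that counts copy constructions.
struct payload {
    std::string bytes;
    static inline int copies = 0;

    explicit payload(std::string b) : bytes(std::move(b)) {}
    payload(const payload& o) : bytes(o.bytes) { ++copies; }
    payload(payload&&) noexcept = default;
    payload& operator=(const payload&) = delete;
    payload& operator=(payload&&) noexcept = default;
};

// Final consumer: takes the value and moves it into storage.
struct row_builder {
    std::vector<payload> cells;
    void accept_collection(payload p) { cells.push_back(std::move(p)); }
};

// Intermediate visitor: takes by value and forwards with std::move,
// mirroring the accept_collection wrappers in the diff.
struct forwarding_visitor {
    row_builder& target;
    void accept_collection(payload p) const { target.accept_collection(std::move(p)); }
};

// Hypothetical producer returning a freshly built owned object.
inline payload make_payload() { return payload(std::string(64, 'x')); }

inline void example() {
    row_builder rb;
    forwarding_visitor v{rb};
    v.accept_collection(make_payload()); // temporary is moved, never copied
    assert(rb.cells.size() == 1);
    assert(payload::copies == 0);
}
```

With a `const payload&` parameter at any step of the chain, the consumer would have to copy in order to take ownership. The by-value signature lets callers that already hold a temporary hand it over for the cost of moves only.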
@@ -23,31 +23,31 @@ class converting_mutation_partition_applier;
template<typename T>
concept MutationViewVisitor = requires (T& visitor, tombstone t, atomic_cell ac,
collection_mutation_view cmv, range_tombstone rt,
collection_mutation cm, range_tombstone rt,
position_in_partition_view pipv, row_tombstone row_tomb,
row_marker rm) {
visitor.accept_partition_tombstone(t);
visitor.accept_static_cell(column_id(), std::move(ac));
visitor.accept_static_cell(column_id(), cmv);
visitor.accept_static_cell(column_id(), std::move(cm));
visitor.accept_row_tombstone(rt);
visitor.accept_row(pipv, row_tomb, rm,
is_dummy::no, is_continuous::yes);
visitor.accept_row_cell(column_id(), std::move(ac));
visitor.accept_row_cell(column_id(), cmv);
visitor.accept_row_cell(column_id(), std::move(cm));
};
template<typename T>
concept AsyncMutationViewVisitor = requires (T& visitor, tombstone t, atomic_cell ac,
collection_mutation_view cmv, range_tombstone rt,
collection_mutation cm, range_tombstone rt,
position_in_partition_view pipv, row_tombstone row_tomb,
row_marker rm) {
{ visitor.accept_partition_tombstone(t) } -> std::same_as<void>;
{ visitor.accept_static_cell(column_id(), std::move(ac)) } -> std::same_as<void>;
{ visitor.accept_static_cell(column_id(), cmv) } -> std::same_as<void>;
{ visitor.accept_static_cell(column_id(), std::move(cm)) } -> std::same_as<void>;
{ visitor.accept_row_tombstone(rt) } -> std::same_as<future<>>;
{ visitor.accept_row(pipv, row_tomb, rm, is_dummy::no, is_continuous::yes) } -> std::same_as<future<>>;
{ visitor.accept_row_cell(column_id(), std::move(ac)) } -> std::same_as<void>;
{ visitor.accept_row_cell(column_id(), cmv) } -> std::same_as<void>;
{ visitor.accept_row_cell(column_id(), std::move(cm)) } -> std::same_as<void>;
{ visitor.accept_end_of_partition() } -> std::same_as<future<>>;
};
@@ -56,11 +56,11 @@ public:
virtual ~mutation_partition_view_virtual_visitor();
virtual void accept_partition_tombstone(tombstone t) = 0;
virtual void accept_static_cell(column_id, atomic_cell ac) = 0;
virtual void accept_static_cell(column_id, collection_mutation_view cmv) = 0;
virtual void accept_static_cell(column_id, collection_mutation cm) = 0;
virtual stop_iteration accept_row_tombstone(range_tombstone rt) = 0;
virtual stop_iteration accept_row(position_in_partition_view pipv, row_tombstone rt, row_marker rm, is_dummy, is_continuous) = 0;
virtual void accept_row_cell(column_id, atomic_cell ac) = 0;
virtual void accept_row_cell(column_id, collection_mutation_view cmv) = 0;
virtual void accept_row_cell(column_id, collection_mutation cm) = 0;
};
// View on serialized mutation partition. See mutation_partition_serializer.

@@ -46,8 +46,12 @@ public:
}
virtual void accept_static_cell(column_id id, collection_mutation_view collection) override {
accept_static_cell(id, collection_mutation(*_schema.static_column_at(id).type, std::move(collection)));
}
void accept_static_cell(column_id id, collection_mutation&& collection) {
row& r = _partition.static_row().maybe_create();
r.append_cell(id, collection_mutation(*_schema.static_column_at(id).type, std::move(collection)));
r.append_cell(id, std::move(collection));
}
virtual void accept_row_tombstone(const range_tombstone& rt) override {
@@ -72,8 +76,12 @@ public:
}
virtual void accept_row_cell(column_id id, collection_mutation_view collection) override {
accept_row_cell(id, collection_mutation(*_schema.regular_column_at(id).type, std::move(collection)));
}
void accept_row_cell(column_id id, collection_mutation collection) {
row& r = _current_row->cells();
r.append_cell(id, collection_mutation(*_schema.regular_column_at(id).type, std::move(collection)));
r.append_cell(id, std::move(collection));
}
};

@@ -16,6 +16,7 @@ Usage:
import argparse, os, sys
from typing import Sequence
def read_statements(path: str) -> list[tuple[int, str]]:
stms: list[tuple[int, str]] = []
with open(path, 'r', encoding='utf-8') as f:

@@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:524c54493b72c5e1b783f14dfa49d733e21b24cc2ec776e9c6e578095073162d
size 6646304
oid sha256:8b22f9a548a03c88250d31e97ea3e8f77b4d90c502bcf74336c24056557f947f
size 6698412

@@ -1,3 +1,3 @@
version https://git-lfs.github.com/spec/v1
oid sha256:fec2bb253d43139da954cee3441fc8bc74824246b080f23bf1f824714d0adc45
size 6646576
oid sha256:31e515a62f006649b0dc4671b51b2643fba9a70884c09b90fbc2237044954254
size 6707108

@@ -239,7 +239,10 @@ private:
// Drop waiters that we lost track of; this can happen due to a snapshot transfer,
// or when a leader is removed from the cluster while some entries added on it are uncommitted.
void drop_waiters(std::optional<index_t> idx = {});
// When `snp` is provided (snapshot transfer case), waiters whose term matches
// the snapshot term are resolved successfully, since the snapshot-term match proves
// they were committed and included in the snapshot (by the Log Matching Property).
void drop_waiters(const snapshot_descriptor* snp = nullptr);
// Wake up all waiters that wait for entries with idx smaller than or equal to the one provided
// to be applied.
@@ -556,12 +559,10 @@ future<> server_impl::wait_for_entry(entry_id eid, wait_type type, seastar::abor
auto snap_term = _fsm->log_term_for(snap_idx);
SCYLLA_ASSERT(snap_term);
SCYLLA_ASSERT(snap_idx >= eid.idx);
if (type == wait_type::committed && snap_term == eid.term) {
if (snap_term == eid.term) {
logger.trace("[{}] wait_for_entry {}.{}: entry got truncated away, but has the snapshot's term"
" (snapshot index: {})", id(), eid.term, eid.idx, snap_idx);
co_return;
// We don't do this for `wait_type::applied` - see below why.
}
logger.trace("[{}] wait_for_entry {}.{}: entry got truncated away", id(), eid.term, eid.idx);
@@ -572,20 +573,6 @@ future<> server_impl::wait_for_entry(entry_id eid, wait_type type, seastar::abor
throw dropped_entry();
}
if (type == wait_type::applied && _fsm->log_last_snapshot_idx() >= eid.idx) {
// We know the entry was committed but the wait type is `applied`
// and we don't know if the entry was applied with `state_machine::apply`
// (we may've loaded a snapshot before we managed to apply the entry).
// As specified by `add_entry`, throw `commit_status_unknown` in this case.
//
// FIXME: replace this with a different exception type - `commit_status_unknown`
// gives too much uncertainty while we know that the entry was committed
// and had to be applied on at least one server. Some callers of `add_entry`
// need to know only that the current state includes that entry, whether it was done
// through `apply` on this server or through receiving a snapshot.
throw commit_status_unknown();
}
co_return;
}
}
@@ -760,6 +747,8 @@ future<> server_impl::add_entry(command command, wait_type type, seastar::abort_
throw not_a_leader{leader};
}
auto eid = co_await add_entry_on_leader(std::move(command), as);
co_await utils::get_local_injector().inject("block_raft_add_entry_before_wait_for_entry",
utils::wait_for_message(std::chrono::minutes(5)));
co_return co_await wait_for_entry(eid, type, as);
}
@@ -995,17 +984,24 @@ void server_impl::notify_waiters(std::map<index_t, op_status>& waiters,
}
}
void server_impl::drop_waiters(std::optional<index_t> idx) {
void server_impl::drop_waiters(const snapshot_descriptor* snp) {
auto drop = [&] (std::map<index_t, op_status>& waiters) {
while (waiters.size() != 0) {
auto it = waiters.begin();
if (idx && it->first > *idx) {
if (snp && it->first > snp->idx) {
break;
}
auto [entry_idx, status] = std::move(*it);
waiters.erase(it);
status.done.set_exception(commit_status_unknown());
_stats.waiters_dropped++;
if (snp && status.term == snp->term) {
// entry_idx <= snapshot index and the entry's term matches the snapshot term.
// By the Log Matching Property the entry was committed and included in the snapshot.
status.done.set_value();
_stats.waiters_awoken++;
} else {
status.done.set_exception(commit_status_unknown());
_stats.waiters_dropped++;
}
}
};
drop(_awaited_commits);
@@ -1431,7 +1427,7 @@ future<> server_impl::applier_fiber() {
// Apply the snapshot to the state machine
logger.trace("[{}] apply_fiber applying snapshot {}", _id, snp.id);
co_await _state_machine->load_snapshot(snp.id);
drop_waiters(snp.idx);
drop_waiters(&snp);
_applied_idx = snp.idx;
_applied_index_changed.broadcast();
_stats.sm_load_snapshot++;
@@ -1940,7 +1936,7 @@ std::unique_ptr<server> create_server(server_id uuid, std::unique_ptr<rpc> rpc,
}
std::ostream& operator<<(std::ostream& os, const server_impl& s) {
fmt::print(os, "[id: {}, fsm ()]\n", s._id, *s._fsm);
fmt::print(os, "[id: {}, fsm ({})]\n", s._id, *s._fsm);
return os;
}
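
The snapshot-aware `drop_waiters` change above can be read in isolation as the following minimal sketch. It is not the ScyllaDB implementation: `index_t`, `term_t`, `outcome`, and `resolution` are hypothetical stand-ins, the waiter map stores only the entry's term instead of an `op_status` with a promise, and the function returns a list of outcomes rather than completing futures. It only illustrates the rule stated in the hunk: a waiter at or below the snapshot index whose term equals the snapshot term is resolved successfully, and every other dropped waiter gets `commit_status_unknown`.

```cpp
#include <cstdint>
#include <map>
#include <vector>

// Hypothetical stand-ins for the raft index and term types.
using index_t = std::uint64_t;
using term_t = std::uint64_t;

// Simplified: only the two fields the rule depends on.
struct snapshot_descriptor {
    index_t idx;
    term_t term;
};

// How a dropped waiter was completed (hypothetical, replaces the promise).
enum class outcome { resolved, commit_status_unknown };

struct resolution {
    index_t idx;
    outcome result;
};

// Removes waiters covered by the snapshot (or all waiters when snp is null)
// and reports how each one was completed.
std::vector<resolution> drop_waiters(std::map<index_t, term_t>& waiters,
                                     const snapshot_descriptor* snp = nullptr) {
    std::vector<resolution> out;
    while (!waiters.empty()) {
        auto it = waiters.begin();
        if (snp && it->first > snp->idx) {
            break; // Entries past the snapshot index keep waiting.
        }
        auto [entry_idx, entry_term] = *it;
        waiters.erase(it);
        if (snp && entry_term == snp->term) {
            // Same term as the snapshot and index within it: treated as committed.
            out.push_back({entry_idx, outcome::resolved});
        } else {
            out.push_back({entry_idx, outcome::commit_status_unknown});
        }
    }
    return out;
}
```

With waiters at indexes 3 (term 2), 5 (term 1), and 9 (term 2) and a snapshot at index 6, term 2, this sketch resolves index 3, reports `commit_status_unknown` for index 5, and leaves index 9 waiting.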

View File

@@ -79,18 +79,18 @@ public:
// The caller may pass a pointer to an abort_source to make the operation abortable.
// If it passes nullptr, the operation is unabortable.
//
// Successful `add_entry` with `wait_type::committed` does not guarantee that `state_machine::apply` will be called
// locally for this entry. Between the commit and the application we may receive a snapshot containing this entry,
// so the state machine's state 'jumps' forward in time, skipping the entry application.
// However, for `wait_type::applied`, we guarantee that the entry will be applied locally with `state_machine::apply`.
// If a snapshot causes the state machine to jump over the entry, `add_entry` will return `commit_status_unknown`
// (even if the snapshot included that entry).
// Successful `add_entry` does not guarantee that `state_machine::apply` will be called
// locally for this entry. Between the commit and the application we may load a snapshot
// containing this entry, so the state machine's state 'jumps' forward in time, skipping
// the local entry application. For `wait_type::applied` this should be fine, because
// state machine implementations shouldn't care whether an entry was applied via
// `state_machine::apply` or via a snapshot load.
//
// Exceptions:
// raft::commit_status_unknown
// Thrown if the leader has changed and the log entry has either
// been replaced by the new leader or the server has lost track of it.
// It may also be thrown in case of a transport error while forwarding add_entry to the leader.L
// It may also be thrown in case of a transport error while forwarding add_entry to the leader.
// raft::dropped_entry
// Thrown if the entry was replaced because of a leader change.
// raft::request_aborted

View File

@@ -269,6 +269,10 @@ public:
// Gets the view a sstable currently belongs to.
compaction::compaction_group_view& view_for_sstable(const sstables::shared_sstable& sst) const;
utils::small_vector<compaction::compaction_group_view*, 3> all_views() const;
// Returns true iff v is the repaired view of this compaction group.
bool is_repaired_view(const compaction::compaction_group_view* v) const noexcept;
// Returns an sstable set containing only repaired sstables (those classified as repaired).
lw_shared_ptr<sstables::sstable_set> make_repaired_sstable_set() const;
seastar::condition_variable& get_staging_done_condition() noexcept {
return _staging_done_condition;
@@ -404,6 +408,8 @@ public:
// Make an sstable set spanning all sstables in the storage_group
lw_shared_ptr<const sstables::sstable_set> make_sstable_set() const;
// Like make_sstable_set(), but restricted to repaired sstables only across all compaction groups.
lw_shared_ptr<const sstables::sstable_set> make_repaired_sstable_set() const;
future<utils::chunked_vector<logstor::segment_snapshot>> take_logstor_snapshot() const;

View File

@@ -1006,7 +1006,7 @@ future<database::keyspace_change_per_shard> database::prepare_update_keyspace_on
co_await modify_keyspace_on_all_shards(sharded_db, [&] (replica::database& db) -> future<> {
auto& ks = db.find_keyspace(ksm.name());
auto new_ksm = ::make_lw_shared<keyspace_metadata>(ksm.name(), ksm.strategy_name(), ksm.strategy_options(), ksm.initial_tablets(), ksm.consistency_option(), ksm.durable_writes(),
ks.metadata()->cf_meta_data() | std::views::values | std::ranges::to<std::vector>(), ks.metadata()->user_types(), ksm.get_storage_options());
ks.metadata()->cf_meta_data() | std::views::values | std::ranges::to<std::vector>(), ks.metadata()->user_types(), ksm.get_storage_options(), ksm.next_strategy_options_opt());
auto change = co_await db.prepare_update_keyspace(ks, new_ksm, pending_token_metadata.local());
changes[this_shard_id()] = make_foreign(std::make_unique<keyspace_change>(std::move(change)));
@@ -1022,8 +1022,7 @@ void database::drop_keyspace(const sstring& name) {
static bool is_system_table(const schema& s) {
auto& k = s.ks_name();
return k == db::system_keyspace::NAME ||
k == db::system_distributed_keyspace::NAME ||
k == db::system_distributed_keyspace::NAME_EVERYWHERE;
k == db::system_distributed_keyspace::NAME;
}
sstables::sstables_manager& database::get_sstables_manager(const schema& s) const {
@@ -1142,7 +1141,7 @@ future<> database::create_local_system_table(
cfg.memtable_scheduling_group = default_scheduling_group();
cfg.memtable_to_cache_scheduling_group = default_scheduling_group();
}
auto lock = get_tables_metadata().hold_write_lock();
auto lock = co_await get_tables_metadata().hold_write_lock();
std::exception_ptr ex;
try {
add_column_family(ks, table, std::move(cfg), replica::database::is_new_cf::no);
@@ -1328,9 +1327,27 @@ future<global_table_ptr> get_table_on_all_shards(sharded<database>& sharded_db,
future<tables_metadata_lock_on_all_shards> database::lock_tables_metadata(sharded<database>& sharded_db) {
tables_metadata_lock_on_all_shards locks;
co_await sharded_db.invoke_on_all([&] (auto& db) -> future<> {
// Acquire write lock on shard 0 first, and then on the remaining shards.
//
// Parallel acquisition on all shards could deadlock when two
// fibers call lock_tables_metadata() concurrently: parallel_for_each
// sends SMP messages to all shards even when the local shard's lock
// attempt blocks. If task reordering (SEASTAR_SHUFFLE_TASK_QUEUE in
// debug/sanitize builds) causes fiber A to win on shard X while
// fiber B wins on shard Y, neither can make progress — classic
// cross-shard lock-ordering deadlock.
//
// Acquiring the write lock on shard 0 first, and then on the remaining
// shards, eliminates this: whichever fiber acquires shard 0 first is
// guaranteed to acquire locks on all other shards before the other fiber
// can acquire the lock on shard 0.
co_await sharded_db.invoke_on(0, [&locks, &sharded_db] (auto& db) -> future<> {
locks.assign_lock(co_await db.get_tables_metadata().hold_write_lock());
co_await sharded_db.invoke_on_others([&locks] (auto& db) -> future<> {
locks.assign_lock(co_await db.get_tables_metadata().hold_write_lock());
});
});
co_return locks;
}
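
The lock-ordering argument in the comment above can be illustrated with a small single-threaded model. This is a sketch under stated assumptions, not Seastar code: `shard_locks`, `try_lock_all`, and `unlock_all` are hypothetical names, and contention is modelled with a try-lock that fails instead of a fiber that waits. The point it shows is the invariant the comment relies on: a fiber that loses the race for shard 0 holds no lock on any other shard, so two fibers can never each hold a lock the other needs.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical model: each shard's write lock records which fiber holds it.
struct shard_locks {
    std::vector<std::optional<int>> owner;
    explicit shard_locks(std::size_t shards) : owner(shards) {}
};

// Tries to take every shard's lock for `fiber`, always starting with shard 0.
// Returns false without touching any other shard if shard 0 is already held.
bool try_lock_all(shard_locks& locks, int fiber) {
    if (locks.owner[0].has_value()) {
        return false; // The loser holds nothing, so no cross-shard cycle can form.
    }
    locks.owner[0] = fiber;
    for (std::size_t shard = 1; shard < locks.owner.size(); ++shard) {
        // Only the holder of shard 0 reaches this loop, so these are uncontended.
        locks.owner[shard] = fiber;
    }
    return true;
}

// Releases every shard lock currently held by `fiber`.
void unlock_all(shard_locks& locks, int fiber) {
    for (auto& o : locks.owner) {
        if (o == fiber) {
            o.reset();
        }
    }
}
```

In the real code the second fiber would wait on shard 0 rather than fail, but the ordering guarantee is the same: all remaining shards are only ever locked by whoever already holds shard 0.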

View File

@@ -757,6 +757,10 @@ private:
// groups during tablet split with overlapping token range, and we need to include them all in a single
// sstable set to allow safe tombstone gc.
lw_shared_ptr<const sstables::sstable_set> sstable_set_for_tombstone_gc(const compaction_group&) const;
// Like sstable_set_for_tombstone_gc(), but restricted to repaired sstables only across all compaction
// groups of the same tablet (storage group). Used by the tombstone_gc=repair optimization to avoid
// scanning unrepaired sstables when looking for GC-blocking shadows.
lw_shared_ptr<const sstables::sstable_set> make_repaired_sstable_set_for_tombstone_gc(const compaction_group&) const;
bool cache_enabled() const {
return _config.enable_cache && _schema->caching_options().enabled();

Some files were not shown because too many files have changed in this diff