Compare commits

...

231 Commits

Author SHA1 Message Date
Avi Kivity
a15294d601 Revert "Update seastar submodule"
This reverts commit 2943d30b0c. It
introduces a regression where --unsafe-bypass-fsync is not honored.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1496
2026-04-19 15:14:48 +03:00
Avi Kivity
9fb67e3e96 Revert "alternator: optional stripping of http response headers"
This reverts commit 73f0deef6d. It
prevents 2943d30b0c, which causes high flakiness, from being
reverted.
2026-04-19 15:14:48 +03:00
Szymon Malewski
73f0deef6d alternator: optional stripping of http response headers
In Alternator's HTTP API, response headers can dominate bandwidth for
small payloads. The Server, Date, and Content-Type headers were sent on
every response but many clients never use them.

This patch introduces three Alternator config options:
  - alternator_http_response_server_header,
  - alternator_http_response_disable_date_header,
  - alternator_http_response_disable_content_type_header,
which allow customizing or suppressing the respective HTTP response
headers. All three options support live update (no restart needed).
The Server header is no longer sent by default; the Date and
Content-Type defaults preserve the existing behavior.

The Server and Date header suppression uses Seastar's
set_server_header() and set_generate_date_header() APIs added in
https://github.com/scylladb/seastar/pull/3217. This patch also
fixes deprecation warnings from older Seastar HTTP APIs.

Tests are in test/alternator/test_http_headers.py.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-70

Closes scylladb/scylladb#28288
2026-04-19 09:22:04 +03:00
Nadav Har'El
f83270df12 Merge 'alternator/streams: Block tablet merges for Alternator Streams on tablet tables' from Piotr Szymaniak
DynamoDB Streams API can only convey a single parent per stream shard.
Tablet merges produce two parents, making them incompatible with
Alternator Streams. This series blocks tablet merges when streams are
active on a tablet table.

For CreateTable, a freshly created table has no pending merges, so
streams are enabled immediately with tablet merges blocked.

For UpdateTable on an existing table, stream enablement is deferred:
the user's intent is stored via `enable_requested`, tablet merges are
blocked (new merge decisions are suppressed and any active merge
decision is revoked), and the topology coordinator finalizes enablement
once no in-flight merges remain.

The topology coordinator is woken promptly on error injection release
and tablet split completion, reducing finalization latency from ~60s
to seconds.

`test_parent_children_merge` is marked xfail (merges are now blocked),
and downward (merge) steps are removed from `test_parent_filtering` and
`test_get_records_with_alternating_tablets_count`.

Not addressed here: using a topology request to preempt long-running
operations like repair (tracked in SCYLLADB-1304).

Refs SCYLLADB-461

Closes scylladb/scylladb#29224

* github.com:scylladb/scylladb:
  topology: Wake coordinator promptly for stream enablement lifecycle
  test/cluster: Test deferred stream enablement on tablet tables
  alternator/streams: Block tablet merges when Alternator Streams are enabled
2026-04-19 09:15:13 +03:00
Piotr Szymaniak
a2a0868c7d topology: Wake coordinator promptly for stream enablement lifecycle
The topology coordinator sleeps on a condition variable between
iterations. Several events relevant to Alternator stream enablement
did not wake it, causing delays of up to 60s (the periodic load
stats refresh interval) at each step:

1. Error injection release: when a test disables the
   delay_cdc_stream_finalization injection, the coordinator was
   not notified. Add an on_disable callback mechanism to the error
   injection framework (register_on_disable / unregister_on_disable)
   so subsystems can react when an injection is released. The
   topology coordinator uses this to broadcast its event.

2. Tablet split completion: after all local storage groups for a
   table finish splitting, split_ready_seq_number is set but the
   coordinator only discovered this via the periodic stats refresh.
   Add an on_tablet_split_ready callback to topology_state_machine
   that the coordinator sets to trigger_load_stats_refresh(). The
   split monitor in storage_service calls it when all compaction
   groups are split-ready, giving the coordinator fresh stats
   immediately so it can finalize the resize.
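
The on_disable mechanism from item 1 can be pictured with a minimal
Python sketch. Only the names register_on_disable and
unregister_on_disable come from this commit message; the class, the
handle-based API and the enable/disable methods are illustrative
assumptions, not the actual C++ error injection framework.

```python
from collections import defaultdict


class ErrorInjections:
    """Toy stand-in for an error injection registry (hypothetical)."""

    def __init__(self):
        self._enabled = set()
        # injection name -> {handle: callback}
        self._on_disable = defaultdict(dict)
        self._next_handle = 0

    def enable(self, name):
        self._enabled.add(name)

    def register_on_disable(self, name, callback):
        """Register a callback to run when `name` is disabled."""
        handle = self._next_handle
        self._next_handle += 1
        self._on_disable[name][handle] = callback
        return handle

    def unregister_on_disable(self, name, handle):
        self._on_disable[name].pop(handle, None)

    def disable(self, name):
        if name not in self._enabled:
            return
        self._enabled.discard(name)
        # Notify subscribers so they can wake up instead of waiting
        # for their next periodic iteration.
        for callback in list(self._on_disable[name].values()):
            callback()


# Usage: a "coordinator" records a wake-up when the injection is released.
wakeups = []
injections = ErrorInjections()
handle = injections.register_on_disable(
    "delay_cdc_stream_finalization", lambda: wakeups.append("wake"))
injections.enable("delay_cdc_stream_finalization")
injections.disable("delay_cdc_stream_finalization")
```

In this sketch the callback plays the role of the coordinator
broadcasting its event; a real subscriber would unregister its handle
on shutdown.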

These changes reduce test_deferred_stream_enablement_on_tablets
from ~120s to ~13s and fix a production issue where Alternator
stream enablement could be delayed by up to 60s at each step of
the lifecycle (error injection release, split completion).
2026-04-19 03:54:33 +02:00
Piotr Szymaniak
a5d35d2b4c test/cluster: Test deferred stream enablement on tablet tables
Async cluster test exercising the deferred enablement lifecycle:
ENABLING -> ENABLED -> disabled, verifying tablet merge blocking
and unblocking at each stage. Uses delay_cdc_stream_finalization
error injection and CQL ALTER TABLE with tablet count constraints.

Also adds tablet scheduler config to test_config.yaml (fast refresh
interval, scale factor 1) for reliable tablet count changes.
2026-04-19 03:54:33 +02:00
Piotr Szymaniak
4b6937b570 alternator/streams: Block tablet merges when Alternator Streams are enabled
DynamoDB Streams API can only convey a single parent per stream shard.
Tablet merges produce 2 parents, which is incompatible. When streams
are requested on a tablet table, block tablet merges via
tablet_merge_blocked (the allocator suppresses new merge decisions and
revokes any active merge decision).

add_stream_options() sets tablet_merge_blocked=true alongside
enabled=true, so CreateTable needs no special handling — the flag
is inert on vnode tables and immediately effective on tablet tables.

For UpdateTable, CDC enablement is deferred: store the user's intent
via enable_requested, and let the topology coordinator finalize
enablement once no in-progress merges remain. A new helper,
defer_enabling_streams_block_tablet_merges(), amends the CDC options
to this deferred state.

Disabling streams clears all flags, immediately re-allowing merges.

The tablet allocator accesses the merge-blocked flag through a
schema::tablet_merges_forbidden() accessor rather than reaching into
CDC options directly.

Mark test_parent_children_merge as xfail and remove downward
(merge) steps from tablet_multipliers in test_parent_filtering and
test_get_records_with_alternating_tablets_count.
2026-04-19 03:54:33 +02:00
Avi Kivity
f5886b4fdd Merge 'Add virtual task for vnodes-to-tablets migrations' from Nikos Dragazis
This PR exposes vnodes-to-tablets migrations through the task manager API via a virtual task. This allows users to list, query the status of, and wait on ongoing migrations through a standard interface, consistent with how other global operations, such as tablet operations and topology requests, are already exposed.

The virtual task exposes all migrations that are currently in progress. Each migrating keyspace appears as a separate task, identified by a deterministic name-based (v3) UUID derived from the keyspace name. Progress is reported as the number of nodes that have switched to tablets vs. the total. The number increases on the forward path and decreases on rollback.

The task is not abortable - rolling back a migration requires a manual procedure.

The `wait` API blocks until the migration either completes (returning `done`) or is rolled back (returning `suspended`).

Example output:
```
$ scylla nodetool tasks list vnodes_to_tablets_migration
task_id                              type                        kind    scope    state   sequence_number keyspace table entity shard start_time end_time
1747b573-6cd6-312d-abb1-9b66c1c2d81f vnodes_to_tablets_migration cluster keyspace running 0               ks                    0

$ scylla nodetool tasks status 1747b573-6cd6-312d-abb1-9b66c1c2d81f
id: 1747b573-6cd6-312d-abb1-9b66c1c2d81f
type: vnodes_to_tablets_migration
kind: cluster
scope: keyspace
state: running
is_abortable: false
start_time:
end_time:
error:
parent_id: none
sequence_number: 0
shard: 0
keyspace: ks
table:
entity:
progress_units: nodes
progress_total: 3
progress_completed: 0
```

Fixes SCYLLADB-1150.

New feature, no backport needed.

Closes scylladb/scylladb#29256

* github.com:scylladb/scylladb:
  test: cluster: Verify vnodes-to-tablets migration virtual task
  distributed_loader: Link resharding tasks to migration virtual task
  distributed_loader: Make table_populator aware of migration rollbacks
  service: Add virtual task for vnodes-to-tablets migrations
  storage_service: Guard migration status against uninitialized group0
  compaction: Add parent_id to table_resharding_compaction_task_impl
  storage_service: Add keyspace-level migration status function
  storage_service: Replace migration status string with enum
  utils: Add UUID::is_name_based()
2026-04-19 00:56:33 +03:00
Nadav Har'El
2943d30b0c Update seastar submodule
* seastar 4d268e0e...22a5aa13 (36):
  > apps/httpd: replace deprecated reply::done() with write_body()
  > missing header(s)
  > net: Fix missing throw for runtime_error in create_native_net_device
  > tests/io_queue: account for token bucket refill granularity in bandwidth checks
  > Merge 'iovec: fix iovec_trim_front infinite loop on zero-length iovecs' from Travis Downs
    tests: add regression tests for zero-length iovec handling
    iovec: fix iovec_trim_front infinite loop on zero-length iovecs
  > util/process: graduate process management API from experimental
  > cooking: don't register ready.txt as a build output
  > sstring: make make_sstring not static
  > Add SparkyLinux to debian list in install-dependencies.sh
  > http: allow control over default response headers
  > Merge 'chunked_fifo: make cached chunk retention configurable' from Brandon Allard
    tests/perf: add chunked_fifo microbenchmarks
    chunked_fifo: set the default free chunk retention to 0
    chunked_fifo: make free chunk retention configurable
  > Merge 'reactor_backend: fix pollable_fd_state_completion reuse in io_uring' from Kefu Chai
    tests: add regression test for pollable_fd_state_completion reuse
    reactor_backend: use reset() in AIO and epoll poll paths
    reactor_backend: fix pollable_fd_state_completion reuse after co_await in io_uring
  > Merge 'coroutine: Generator cleanups' from Kefu Chai
    coroutine/generator: extract schedule_or_resume helper
    coroutine/generator: remove unused next_awaiter classes
    coroutine/generator: remove write-only _started field
    coroutine/generator: assert on unreachable path in buffered await_resume
    coroutine/generator: add elements_of tag and #include <ranges>
    coroutine/generator: add empty() to bounded_container concept
  > cmake: bump minimum Boost version to 1.79.0
  > seastar_test: remove unnecessary headers
  > cmake: bump minimum GnuTLS version to 3.7.4
  > Merge 'reactor: add get_all_io_queues() method' from Travis Downs
    tests: add unit test for reactor::get_all_io_queues()
    reactor: add get_all_io_queues() method
    reactor: move get_io_queue and try_get_io_queue to .cc file
  > http: deprecate reply::done(), remove _response_line dead field
  > core: Deprecate scattered_message
  > ci: add workflow dispatch to tests workflow
  > perf_tests: exit non-zero when -t pattern matches no tests
  > Replace duplicate SEGV_MAPERR check in sigsegv_action() with SEGV_ACCERR.
  > perf_tests: add total runtime to json output
  > Merge 'Relax large allocation error originating from json_list_template' from Robert Bindar
    implement move assignment operator for json_list_template
    json_list_template copy assignment operator reserves capacity upfront
  > perf_tests: add --no-perf-counters option
  > Merge 'Fix to_human_readable_value() ability to work with large values' from Pavel Emelyanov
    memory: Add compile-time test for value-to-human-readable conversion
    memory: Extend list of suffixes to have peta-s
    memory: Fix off-by-one in suffix calculation
    memory: Mark to_human_readable_value() and others constexpr
  > http: Improve writing of response_line() into the output
  > Merge 'websocket: add template parameter for text/binary frame mode and implement client-side WebSocket' from wangyuwei
    websocket: add template parameter for text/binary frame mode
    websocket: impl client side websocket function
  > file: Fix checks for file being read-only
  > reactor: Make do_dump_task_queue a task_queue method
  > Merge 'Implement fully mixed mode for output_stream-s' from Pavel Emelyanov
    tests/output_stream: sample type patterns in sanitizer builds
    tests/output_stream: extend invariant test to cover mixed write modes
    iostream: allow unrestricted mixing of buffered and zero-copy writes
    tests/output_stream: remove obsolete ad-hoc splitting tests
    tests/output_stream: add invariant-based splitting tests
    iostream: rename output_stream::_size to ::_buffer_size
  > reactor_backend: replace virtual bool methods with const bool_class members
  > resource: Avoid copying CPU vector to break it into groups
  > perf_tests: increase overhead column precision to 3 decimal places
  > Merge 'Move reactor::fdatasync() into posix_file_impl' from Pavel Emelyanov
    reactor: Deprecate fdatasync() method
    file: Do fdatasync() right in the posix_file_impl::flush()
    file: Propagate aio_fdatasync to posix_file_impl
    reactor: Move reactor::fdatasync() code to file.cc
    reactor,file: Make full use of file_open_options::durable bit
    file: Add file_open_options::durable boolean
    file: Account io_stats::fsyncs in posix_file_impl::flush()
    reactor: Move _fsyncs counter onto io_stats
  > http: Remove connection::write_body()
2026-04-18 11:52:33 +03:00
Nadav Har'El
31e0315710 Merge 'alternator: fix unnecessary cdc log entries' from Radosław Cybulski
Fix cdc writing unnecessary entries to its log, for example when Alternator deletes an item that does not actually exist.

Originally @wps0 tackled this issue. This patch is an extension of his work. His work involved adding a `should_skip` function to cdc, which processes a `mutation` object and decides whether the changes in the object should be added to the cdc log.

The issue with his approach is that a `mutation` object might contain changes for more than one row. If, for example, the `mutation` object contains two changes, a delete of a non-existing row and a create of a non-existing row, the `should_skip` function will detect changes in the second item and allow the whole `mutation` (BOTH items) to be added. For example, running this (using Python's boto3) on an empty table:
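```
# `table` is assumed to be a boto3 Table resource for an empty table
# with partition key 'p' and clustering (sort) key 'c'.
with table.batch_writer() as batch:
    batch.put_item(Item={'p': 'p', 'c': 'c0'})
    batch.delete_item(Key={'p': 'p', 'c': 'c1'})
```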
will emit two events (a "put" event and a "delete" event), even though the item with `c` set to `c1` does not exist (and thus can't be deleted). Note that both entries in the batch write must use the same partition key; otherwise the upper layer will split them into separate `mutation` objects and the issue will not happen.

The solution is to do similar processing, but to consider each change separately from the others. This is tricky to implement due to the way cdc works. When cdc processes a `mutation` object (containing X changes), it emits cdc entries in phases. Phase 1 - emit the `preimage` (old state) for each change (if requested). Phase 2 - for each change, emit the actual "diff" (update, delete and so on). Phase 3 - emit the `postimage` (new state).

We only know whether a change needs to be skipped during phase 2. By that time phase 1 is completed and the preimage for the change has been emitted. At that moment we set a flag that the change (identified by its clustering key value) needs to be skipped - we add the clustering key to an `ignore-rows` set (the `_alternator_clustering_keys_to_ignore` variable) and continue normally. Once all phases finish, we run an additional `postprocess` phase (the `clean_up_noop_rows` function). It goes through the generated cdc mutations and skips all modifications whose clustering key is in the `ignore-rows` set. After skipping we need to do a "cleanup" operation - each generated cdc mutation contains an index (incremented by one); if we skipped some parts, the index is no longer consecutive, so we reindex the final changes.
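
The postprocess step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: cdc log rows are modeled as plain dicts with hypothetical `ck`, `op` and `batch_seq_no` fields, and the function reuses the `clean_up_noop_rows` name for readability. The real implementation operates on C++ `mutation` objects, and the no-clustering-key special case is not modeled here.

```python
def clean_up_noop_rows(log_rows, keys_to_ignore):
    """Drop log rows whose clustering key was marked as a no-op,
    then renumber the survivors so the index is consecutive again.

    log_rows: list of dicts with (hypothetical) 'ck', 'op' and
              'batch_seq_no' fields, in emission order.
    keys_to_ignore: set of clustering keys collected during phase 2.
    """
    kept = [row for row in log_rows if row["ck"] not in keys_to_ignore]
    # Reindex: build new rows rather than mutating the input.
    return [{**row, "batch_seq_no": i} for i, row in enumerate(kept)]


# Usage: the delete of the non-existing item c1 is dropped and the
# remaining put is renumbered from 1 to 0.
rows = [
    {"batch_seq_no": 0, "ck": b"c1", "op": "delete"},
    {"batch_seq_no": 1, "ck": b"c0", "op": "put"},
]
cleaned = clean_up_noop_rows(rows, {b"c1"})
```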

There's a special case worth mentioning - Alternator tables without clustering keys. In that case the `mutation` object passed to cdc can contain exactly one change (since different partition keys are split by upper layers and Alternator will never emit a `mutation` object containing two or more changes with the same primary key). Here, when we decide the change is to be skipped, we add an empty `bytes` object to the `ignore-rows` set. When checking the `ignore-rows` set, we only check whether it's empty or not (we don't check for the presence of the empty `bytes` object).

Note: there might be some confusion between this patch and the #28452 patch. Both started from the same error observation and use similar tests for validation, as both are easily triggered by BatchWrite commands (both need the `mutation` object passed to cdc to contain more than one change). This issue, though, is about wrong data written to the cdc log and is fixed in cdc, whereas #28452 is about incorrect parsing of correct cdc data and is fixed on the Alternator side. Note that we need #28452 to truly verify this fix (otherwise we will emit correct cdc entries, but Alternator will parse them incorrectly).

Note: to benefit from / notice this patch you need the `alternator_streams_increased_compatibility` flag turned on.

Note: the rework is quite "broad" and covers a lot of ground - every operation that might result in no change to the database state should be tested. Additional tests were added: trying to remove a column from a non-existing item, as well as trying to remove a non-existing column from an existing item.

Fixes: #28368
Fixes: SCYLLADB-1528
Fixes: SCYLLADB-538

Closes scylladb/scylladb#28544

* github.com:scylladb/scylladb:
  alternator: remove unnecessary code
  alternator: fix Alternator writing unnecessary cdc entries
  alternator: add failing tests for Streams
2026-04-18 00:07:51 +03:00
Nadav Har'El
32060d73df Merge 'alternator: Add stream support for tablets' from Radosław Cybulski
Implements the necessary changes for Streams to work with tablet-based tables.

- add utility functions to `system_keyspace` that help read cdc content from cdc log tables for tablet-based base tables (similar API to the ones for vnodes)
- remove anti-tablet `if` checks, update tests that fail / skip if tablets are selected
- add two tests to extensively test the tablet-based version, especially while manipulating the stream count

Fixes #23838
Fixes SCYLLADB-463

Closes scylladb/scylladb#28500

* github.com:scylladb/scylladb:
  alternator: add streams with tablets tests
  alternator: remove antitablet guards when using Streams
  alternator: implement streams for tablets
  treewide: add cdc helper functions to system_keyspace
  alternator: add system_keyspace reference
2026-04-17 23:48:31 +03:00
Radosław Cybulski
586bb1d345 alternator: fix issues with stream_arn copy / move
A `stream_arn` object holds the full ARN as a `std::string` plus two
`std::string_view` fields (`table_name_` and `keyspace_name_`) pointing
into the ARN itself. This prevents the object from being copied safely
(after a copy, both `table_name_` and `keyspace_name_` still point into
the original object's ARN). A similar issue can occur on move when the
ARN is short enough for the small string optimization to kick in
(although in practice this is not possible, as the ARN format
requirements make its minimum length more than 15 characters, the
current limit for the small string optimization in the most popular
string libraries).

The patch drops the `std::string_view` members in favor of integer
offsets and sizes. An offset of 0 means the beginning of the ARN string.
The API is preserved: both the `table_name` and `keyspace_name`
functions return a `std::string_view` reconstructed on the fly.
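
The offsets-and-sizes layout can be illustrated with a short Python
sketch. Python strings cannot dangle the way a `std::string_view` can,
so this only shows the data layout and the accessors that rebuild the
names on demand. The `StreamArn` class, `from_parts()` and the
`<prefix><keyspace>/<table>` ARN layout are hypothetical and are not
the format Alternator uses.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class StreamArn:
    arn: str
    keyspace_offset: int  # offset 0 would mean the start of `arn`
    keyspace_size: int
    table_offset: int
    table_size: int

    @classmethod
    def from_parts(cls, prefix, keyspace, table):
        # Hypothetical layout, used only for this illustration.
        arn = f"{prefix}{keyspace}/{table}"
        keyspace_offset = len(prefix)
        table_offset = keyspace_offset + len(keyspace) + 1
        return cls(arn, keyspace_offset, len(keyspace),
                   table_offset, len(table))

    def keyspace_name(self):
        # Reconstructed on the fly from offset and size.
        end = self.keyspace_offset + self.keyspace_size
        return self.arn[self.keyspace_offset:end]

    def table_name(self):
        end = self.table_offset + self.table_size
        return self.arn[self.table_offset:end]


original = StreamArn.from_parts("arn:example:", "ks", "tbl")
# A field-by-field copy stays valid: the offsets are relative to the
# copy's own `arn`, not to memory owned by `original`.
duplicate = replace(original)
```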

Closes scylladb/scylladb#29507
2026-04-17 23:13:17 +03:00
Piotr Szymaniak
caaef45b7a audit: restore static_cast for batch inspect
Closes scylladb/scylladb#29545
2026-04-17 23:11:18 +03:00
Nikos Dragazis
d361a0dd83 test: cluster: Verify vnodes-to-tablets migration virtual task
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-04-17 21:13:52 +03:00
Nikos Dragazis
295e434781 distributed_loader: Link resharding tasks to migration virtual task
When a table is loaded on startup during a vnodes-to-tablets migration
(forward or rollback), the `table_populator` runs a resharding
compaction.

Set the migration virtual task as parent of the resharding task. This
enables users to easily find all node-local resharding tasks related to
a particular migration.

Make `migration_virtual_task::make_task_id()` public so that the
`distributed_loader` can compute the migration's task ID.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-04-17 20:59:05 +03:00
Nikos Dragazis
a3aa4f6cb4 distributed_loader: Make table_populator aware of migration rollbacks
The `table_populator` uses a `migrate_to_tablets` flag to distinguish
normal tables from tables under vnodes-to-tablets migration (forward
path), since the two require different resharding.

The next patch will set the parent info of migration-related resharding
compaction tasks so they appear as children of the migration virtual
task. For that, the table populator needs to recognize not only
migrations in the forward path, but rollbacks as well.

Replace the flag with a tri-state `migration_direction` enum (none,
forward, rollback).

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-04-17 20:59:05 +03:00
Nikos Dragazis
696f9f8954 service: Add virtual task for vnodes-to-tablets migrations
Add a virtual task that exposes in-progress vnodes-to-tablets migrations
through the task manager API.

The task is synthesized from the current migration state, so completed
migrations are not shown. Progress is reported as the number of nodes
that currently use tablets: it increases on the forward path and
decreases on rollback. For simplicity, per-node storage modes are not
exposed in the task status; callers that need them should use the
migration status REST endpoint.

Unlike regular tasks that use time-based UUIDs, this task uses
deterministic named UUIDs derived from the keyspace names. This keeps
the implementation simple (no need to persist them) and gives each
keyspace a stable task ID. The downside is that the start time of each
task is unknown and repeated migrations of the same keyspace
(migration -> rollback -> new migration) cannot be distinguished.

Introduce a new task manager module to keep them separate from other
tasks.

Add support for `wait()`. While its practical value is debatable
(migration is a manual procedure, rolling restart will interrupt it), it
keeps the task consistent with the task manager interface.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-04-17 20:59:05 +03:00
Nikos Dragazis
d1ca01b25d storage_service: Guard migration status against uninitialized group0
`storage_service::get_tablets_migration_status()` reads a group0 virtual
table, so it requires group0 to be initialized.

When invoked via the migration REST API, this condition is satisfied
since the API is only available after joining group0. However, once this
function is integrated into the task API later in this series, the
assumption will no longer hold, as the task API is exposed earlier in
the startup process.

Add a guard to detect this condition and return a clear error message.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-04-17 20:59:05 +03:00
Nikos Dragazis
ca830c7bce compaction: Add parent_id to table_resharding_compaction_task_impl
Required to link it with the migration task in the next patches.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-04-17 20:59:05 +03:00
Nikos Dragazis
46e3902daa storage_service: Add keyspace-level migration status function
`storage_service::get_tablets_migration_status()` returns the
keyspace-level migration status, indicating whether migration has not
started, is in progress, or has completed, and for migrating keyspaces
also returns per-node migration statuses. Rename it to
`get_tablets_migration_status_with_node_details()` and introduce a new
`get_tablets_migration_status()` that returns only the keyspace-level
status.

This prepares the function for reuse in the next patches, which will add
a virtual task for vnodes-to-tablets migrations. Several task-manager
paths will only need the keyspace-level migration state, not per-node
information.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-04-17 20:59:05 +03:00
Nikos Dragazis
3096ba0577 storage_service: Replace migration status string with enum
Using a string was sufficient while this status was only exposed through
the REST API, but the next patches will also consume it internally.
Use an enum for the internal representation and convert it back to the
existing string values in the REST API.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-04-17 20:59:05 +03:00
Nikos Dragazis
a00056381f utils: Add UUID::is_name_based()
The UUID class already provides `is_timestamp()` for identifying
time-based (version 1) UUIDs. Add the analogous `is_name_based()`
predicate for version 3 (name-based) UUIDs, along with a test.
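
For illustration, the same predicate can be written in Python. This is
a sketch of the version check only, not the C++ implementation; it
reads the version from the top four bits of the time_hi_and_version
field and is exercised with the task ID shown in the example output of
the merge commit above.

```python
import uuid


def is_name_based(u: uuid.UUID) -> bool:
    """Return True for version 3 (name-based, MD5) UUIDs."""
    # The version is stored in bits 76..79 of the 128-bit value.
    return ((u.int >> 76) & 0xF) == 3


task_id = uuid.UUID("1747b573-6cd6-312d-abb1-9b66c1c2d81f")
print(is_name_based(task_id))       # version nibble is the '3' in '312d'
print(is_name_based(uuid.uuid4()))  # random (version 4) UUID
```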

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-04-17 20:58:39 +03:00
Radosław Cybulski
9a6aed721b alternator: add streams with tablets tests
Add tests for Streams when the table uses tablets underneath.
One test verifies filtering using the CHILD_SHARDS feature.
The other makes sure we read all data while the table
undergoes a tablet count change.

Add `--tablet-load-stats-refresh-interval-in-seconds=1`
to the `alternator/run` script, as otherwise the newly added tests will fail.
The setting changes how often scylla refreshes tablet metadata.
This can't be done using `scylla_config_temporary`, because
1) the default is 60 seconds
2) scylla will wait the full timeout (60s) before reading the configuration variable again.
2026-04-17 18:58:27 +02:00
Radosław Cybulski
6be16cf224 alternator: remove antitablet guards when using Streams
Remove the `if` condition that prevented tables with tablets
from working with Streams.
Remove the test that verifies that Alternator rejects
enabling the Streams feature on tables with tablets
underneath.
Update a few tests that were expected to fail on tablets so that they
run normally.
2026-04-17 18:58:26 +02:00
Radosław Cybulski
d5df3ec07c alternator: implement streams for tablets
Add code that handles reading Streams when the table
uses tablets underneath.

Fixes #23838
2026-04-17 18:57:44 +02:00
Radosław Cybulski
eb35a7b6ce treewide: add cdc helper functions to system_keyspace
Add helper functions to the `system_keyspace` object that deal
with reading cdc content for tablet-based tables.
`read_cdc_for_tablets_current_generation_timestamp` reads the current
generation's timestamp.
`read_cdc_for_tablets_versioned_streams` builds a
timestamp -> `cdc::streams_version` map, similar to how
`system_distributed_keyspace::cdc_get_versioned_streams` works.
We're adding these helper functions because their siblings in
`system_distributed_keyspace` work only when the base table is backed
by vnodes. The new additions work only when the base table is backed
by tablets.
2026-04-17 18:57:44 +02:00
Radosław Cybulski
d93299b605 alternator: add system_keyspace reference
Add a reference to the `system_keyspace` object to the `executor` object
in alternator. The reference is needed because a future commit
will add (and use) helper functions there that read `cdc_log` tables
for tablet-based tables, similar to their existing siblings
for vnodes living in `system_distributed_keyspace`.
2026-04-17 18:57:43 +02:00
Radosław Cybulski
04b9d3875f alternator: remove unnecessary code
After our fix, which prevents no-op changes from being written into
the cdc log, remove Piotr Wieczorek's previous attempt, which is now
unnecessary.
2026-04-17 18:02:00 +02:00
Radosław Cybulski
6e5aaa85b6 alternator: fix Alternator writing unnecessary cdc entries
The work in this patch is the result of two bugs: a spurious MODIFY
event when remove column is used in `update_item` on a non-existing
item, and spurious events when a batch write mixes no-op operations with
operations involving actual changes (the former would still emit
cdc log entries).
The latter issue required a rework of Piotr Wieczorek's algorithm,
which fixed the former issue as well.

Piotr Wieczorek previously wrote checks that should
prevent unnecessary cdc events from being written. His implementation
missed the fact that a single `mutation` object passed to the cdc code
to be analysed for cdc log entries can contain modifications for
multiple rows (with the same timestamp, for example as a result
of a BatchWriteItem call). His code tries to skip the whole `mutation`,
which is not possible in such a case, because BatchWriteItem might have
one item that does nothing and a second item that does a modification
(this is the reason for the second bug).

His algorithm was extended and moved. Originally it worked
as follows: the user would send a `mutation` object with some changes to
be "augmented". The cdc would process those changes and build from them
a set of cdc log changes to be added to the cdc log table.
Piotr added a `should_skip` function, which processed the user changes
and tried to determine whether they should all be dropped or not.
The new version, instead of trying to skip adding rows to
the cdc log `mutation` object, builds a rows-to-ignore set.
After the whole cdc log `mutation` object is completed, the new code
goes through it row by row. Any row that was previously added to
the `rows_to_ignore` set is removed. The remaining rows are written to
a new cdc log `mutation` with a new clustering key
(the `cdc$batch_seq_no` index value should probably be consecutive -
we just want to be safe here), and this new `mutation` object is
returned to be sent to the cdc log table.

The first bug is fixed as a side effect of the new algorithm,
which contains more precise checks detecting whether a given
mutation actually made a difference.

Fixes: #28368
Fixes: SCYLLADB-538
Fixes: SCYLLADB-1528
Refs: #28452
2026-04-17 18:00:25 +02:00
Botond Dénes
6ce0968960 compaction: release GC'ed sstables incrementally during compaction
Garbage collected sstables created during incremental compaction are
deleted only at the end of the compaction, which increases the memory
footprint. This is inefficient, especially considering that the related
input sstables are released regularly during compaction.

This commit implements incremental release of GC sstables after each
output sstable is sealed. Unlike regular input sstables, GC sstables
use a different exhaustion predicate: a GC sstable is only released
when its token range no longer overlaps with any remaining input
sstable. This is because GC sstables hold tombstones that may shadow
data in still-alive overlapping input sstables; releasing them
prematurely would cause data resurrection.
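
The exhaustion predicate for GC sstables can be sketched as a simple
range-overlap check. This is a minimal illustration, assuming token
ranges are modeled as inclusive (first_token, last_token) integer pairs
and using hypothetical function names; the actual compaction code works
on sstable objects and their token ranges.

```python
def overlaps(a, b):
    """True if two inclusive (first, last) ranges share any token."""
    return a[0] <= b[1] and b[0] <= a[1]


def releasable_gc_sstables(gc_ranges, remaining_input_ranges):
    """Return the GC sstable ranges that are safe to release.

    A GC sstable may be released only when its token range overlaps
    none of the input sstables still being compacted; otherwise its
    tombstones may still be needed to shadow data in those inputs.
    """
    return [
        gc for gc in gc_ranges
        if not any(overlaps(gc, inp) for inp in remaining_input_ranges)
    ]


# Usage: only the GC sstable covering tokens 0..10 can be released,
# because 20..30 still overlaps the remaining input 25..40.
safe = releasable_gc_sstables([(0, 10), (20, 30)], [(25, 40)])
```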

Fixes #5563

Closes scylladb/scylladb#28984
2026-04-17 18:20:47 +03:00
Radosław Cybulski
2894542e57 alternator: add failing tests for Streams
Add failing tests for Streams functionality.
Trying to remove a column from a non-existing item produces
a MODIFY event (while it should produce none).
Doing a batch write with operations working on the same partition,
where one operation has no side effects and the second one does,
produces events for both operations, even though the first changes nothing.

The first test has two versions - with and without a clustering key.
The second has only a version with a clustering key, as we can't produce
a batch write with two items for the same partition otherwise -
a batch write can't use the same primary key more than once in a single call.
We also add a test for a batch write where one of three operations
has no observable side effects and should not show up in the Streams
output, but in the current version of scylla it does.
2026-04-17 16:28:14 +02:00
Botond Dénes
6eb2d15f39 Merge 'Replace CAS estimated histogram with estimated_histogram_with_max' from Amnon Heiman
ScyllaDB uses estimated_histogram in many places.
We already have a more efficient alternative: estimated_histogram_with_max. It is both CPU- and
memory-efficient, and it can be exported as Prometheus native histograms.

Its main limitation (which also has benefits) is that the bucket layout is fixed at compile time, so
histograms with different configurations cannot be mixed.

The end goal is to replace all uses of estimated_histogram in the codebase.
That migration requires a few small API adjustments, so it is done in steps.

This PR replaces estimated_histogram for CAS contention.
The PR includes a patch that adds functionality to the base approx_exponential_histogram, which will be used by the API.

The specific histograms are defined in a single place and cover the range 1-100; this makes future changes easy.

**New feature, no need to backport**

Closes scylladb/scylladb#29017

* github.com:scylladb/scylladb:
  storage_proxy: migrate CAS contention histograms to estimated_histogram_with_max
  estimated_histogram.hh: Add bucket offset and count to approx_exponential_histogram
2026-04-17 13:12:59 +03:00
Andrzej Jackowski
e256d9f69d test: retry get_coordinator_host() after topology coordinator stop
After stopping the topology coordinator, a new topology coordinator
may not yet be started when get_coordinator_host() is called.  Make
the function always retry via wait_for so that every caller is
protected against this race.
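The retry pattern can be sketched like this. It is a simplified, synchronous stand-in for illustration only: `retry_until` and `make_flaky_lookup` are hypothetical names, not the `wait_for` helper or `get_coordinator_host()` from the test library.

```python
import time


def retry_until(pred, timeout_s, period_s=0.01):
    """Call pred() until it returns a non-None value or the timeout expires."""
    deadline = time.monotonic() + timeout_s
    while True:
        result = pred()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("no result before the deadline")
        time.sleep(period_s)


def make_flaky_lookup(failures, host):
    """Build a lookup that returns None `failures` times, then `host`.

    This imitates a coordinator that has not been elected yet.
    """
    remaining = [failures]

    def lookup():
        if remaining[0] > 0:
            remaining[0] -= 1
            return None
        return host

    return lookup
```

Putting the retry inside the lookup function itself, rather than at each call site, is what protects every caller against the race.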

Fixes SCYLLADB-1553

Closes scylladb/scylladb#29489
2026-04-17 12:08:26 +02:00
Botond Dénes
fbcfe3f88f test: use uuid4 for DockerizedServer container names to avoid collisions
Container names were generated as {name}-{pid}-{counter}, where the
counter is a per-process itertools.count. This scheme breaks across CI
runs on the same host: if a prior job was killed abruptly (SIGKILL,
cancellation) its containers are left running since --rm only removes
containers on exit. A subsequent run whose worker inherits the same PID
(common in containerized CI with small PID namespaces) and reaches the
same counter value will collide with the orphaned container.

Replace pid+counter with uuid.uuid4(), which generates a random UUID,
making names unique across processes, hosts, and time without any shared
state or leaking host identifiers.
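A minimal sketch of the naming scheme, assuming the name is simply a prefix joined with the UUID (the exact format used by DockerizedServer is not stated here, and `container_name` is a hypothetical helper):

```python
import uuid


def container_name(name: str) -> str:
    """Build a container name from a prefix and a random (version 4) UUID.

    No PID, counter, or host identifier is involved, so no shared state
    is needed to keep names unique.
    """
    return f"{name}-{uuid.uuid4()}"
```

Two calls with the same prefix return different names, regardless of which process or host makes them.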

Fixes: SCYLLADB-1540

Closes scylladb/scylladb#29509
2026-04-17 11:56:51 +02:00
Botond Dénes
57f8be49e9 Merge 'Move ignore_component_digest_mismatch flag on sstables_manager' from Pavel Emelyanov
The PR serves two purposes.

First, it makes the flag usage consistent across the multiple ways to load sstable components. For example, sstable::load_metadata() doesn't set it (as .load() does), and thus may refuse to load "corrupted" components that the flag is meant to allow.

Second, it removes the fanout of db.get_config().ignore_component_digest_mismatch() over the code. It is called pretty much everywhere to initialize the sstable_open_config, while the option in question is a "scylla state" parameter, not an "sstable opening" one.

Code cleanup, not backporting

Closes scylladb/scylladb#29513

* github.com:scylladb/scylladb:
  sstables: Remove ignore_component_digest_mismatch from sstable_open_config
  sstables: Move ignore_component_digest_mismatch initialization to constructor
  sstables: Add ignore_component_digest_mismatch to sstables_manager config
2026-04-17 12:54:17 +03:00
Avi Kivity
cad3c0de94 test: write minio log to testlog dir for Jenkins artifact collection
Write the MinIO server log directly to tempdir_base (testlog/<arch>/)
instead of the per-server temp directory that gets destroyed on
shutdown. This preserves the log for Jenkins artifact collection,
helping debug S3-related flaky test failures like the
stcs_reshape_overlapping_s3_test hang (SCYLLADB-1481).

Closes scylladb/scylladb#29458
2026-04-17 12:51:55 +03:00
Botond Dénes
facb50cbf9 Merge 'test.py: refactor test.py' from Andrei Chekun
With the latest changes, there is a lot of redundant code in test.py. This PR just cleans up this code.
It also narrows the use of dynamic scope for fixtures to test/alternator and test/cqlpy. All the rest will have module scope by default.
test.py will be a wrapper for pytest, mostly for CI use. For now, test.py still has the important job of calculating the number of threads to start pytest with. This is not possible to do in pytest itself.

No backport needed, framework enhancement only.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-666

Closes scylladb/scylladb#28852

* github.com:scylladb/scylladb:
  test.py: remove testpy_test_fixture_scope
  test.py: add logger for 3rd party service
  test.py: delete dead code in test.py
2026-04-17 12:51:14 +03:00
Pawel Pery
7883f161bb vector-store: fix creating local vector search indexes with a part of the partition key
Users should be able to create a local index for Vector Search
based on only a part of the partition key. This commit provides this by
removing the 'full partition key only' requirement for custom local indexes.

The commit updates the docs to explain that a local vector index can use only
a part of the partition key.

The commit adds a cqlpy test to check the fixed functionality.

Fixes: SCYLLADB-953

Needs to be backported to 2026.1 as it is a fix for local vector indexes.

Closes scylladb/scylladb#28931
2026-04-17 11:44:15 +02:00
Karol Nowacki
c643f321af vector_search: decrease default connection timeout to 3s
Decrease the default connection timeout to 3s to better align with the
default CQL query timeout of 10s.

The previous timeout allowed only one failover request in a high availability
scenario before hitting the CQL query timeout.
By decreasing the timeout to 3s, we can perform up to three failover requests
within the CQL query timeout, which significantly improves the chances of
successfully completing the query in high availability scenarios.
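The arithmetic behind this can be checked with a small sketch. It assumes each attempt may run to its full connection timeout before the next one starts; `full_timeout_attempts` is a hypothetical helper, not part of the vector search client.

```python
def full_timeout_attempts(query_timeout_s: float, connect_timeout_s: float) -> int:
    """Number of connection attempts that can each time out fully
    before the overall query timeout is reached."""
    if connect_timeout_s <= 0:
        raise ValueError("connect timeout must be positive")
    return int(query_timeout_s // connect_timeout_s)
```

With the 10s query timeout and the new 3s connection timeout, `full_timeout_attempts(10, 3)` gives 3.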

Fixes: SCYLLADB-95
2026-04-17 12:26:39 +03:00
Karol Nowacki
9269ca9cf7 vector_search: add unreachable node detection time config
Add option `vector_store_unreachable_node_detection_time_in_ms` to
control parameters related to detecting unreachable vector store nodes.
This parameter is used to set the TCP connect timeout, keepalive
parameters, and TCP_USER_TIMEOUT. By configuring these parameters,
we can detect unreachable vector store nodes faster and trigger
failover mechanisms in a timely manner.
2026-04-17 12:26:38 +03:00
Piotr Smaron
686029f52c audit: disable caching for the audit log table
The audit table had caching enabled by default, which provides no
value since audit data is write-heavy and rarely read back through
the cache. This wastes cache space that could be used for more
important user data.

Disable caching by setting keys and rows_per_partition to NONE and
enabled to false, consistent with get_disabled_caching_options()
and other system tables such as system.batchlog,
system.large_partitions, and CDC log tables.

Closes scylladb/scylladb#29506
2026-04-17 11:17:10 +02:00
Piotr Dulikowski
37fc1507f0 Merge 'Alternator: Add vector search support' from Nadav Har'El
This series adds support for vector search in Alternator based on the existing implementation in CQL.

The series adds APIs for `CreateTable` and `UpdateTable` to add or remove vector indexes to Alternator tables, `DescribeTable` to list them and check the indexing status, and `Query` to perform a vector search - which contacts the vector store for the actual ANN (approximate nearest neighbor) search.

Correct functionality of these features depends on some features of the vector store that were already implemented (see https://github.com/scylladb/vector-store/pull/394).

This initial implementation is fully functional, and can already be useful, but we do not yet support all the features we hope to eventually support. Here are things that we have **not** done yet, and plan to do later in follow-up pull requests:

1. Support a new optimized vector type ("V") - in addition to the "list of numbers" type supported in this version.
2. Allow choosing a different similarity function when creating an index, by SimilarityFunction in VectorIndex definition.
3. Allow choosing quantization (f32/f16/bf16/i8/b1) to ask the vector index to compress stored vectors.
4. Support oversampling and rescoring, defined per-index and per-query.
5. Support HNSW tuning parameters — maximum_node_connections, construction_beam_width, search_beam_width.
6. Support pre-filtering over key columns, which are available at the vector store, by sending the filter to the vector store (translated from DynamoDB filter syntax to the vector store's filter syntax). A decision still needs to be made on whether this will use KeyConditionExpression or FilterExpression. This version supports only post-filtering (with `FilterExpression`).
7. Support projecting non-key attributes into the index (Projection=INCLUDE and Projection=ALL), and then 1. pre-filtering using these attributes, and 2. efficiently return these attributes (using Select=ALL_PROJECTED_ATTRIBUTES, which today returns just the key columns).
8. Optimize the performance of `Query`, which today is inefficient for Select=ALL_ATTRIBUTES because it serially retrieves the matching items one at a time.
9. Returning the similarity scores with the items (the design proposes ReturnVectorSearchSimilarity).
10. Add more vector-search-specific metrics, beyond the metric we already have counting Query requests. For example separate latency and request-count metrics for vector-search Queries (distinct from GSI/LSI queries), and a metric accumulating the total Limit (K) across all vector search queries.
11. Consider how (and if at all) we want to run the tests in test/alternator/test_vector.py that need the vector store in the CI. Currently they are skipped in CI and only run manually (with `test/alternator/run --vs test_vector`).
12. UpdateTable 'Update' operation to modify index parameters. Only some can be modified, e.g., Oversampling.
13. Support for "local index" (separate index for each partition).
14. Make sure that vector search and Streams can be enabled concurrently on the same table - both need CDC but we need to verify that one doesn't confuse the other or disables options that the other needs. We can only do this after we have Alternator Streams running on tablets (since vector store requires tablets).

Testing the new Alternator vector search end-to-end requires running both Scylla and the vector store together. We will have such end-to-end tests in the vector store repository (see https://github.com/scylladb/vector-store/pull/392), but we also add in this pull request many end-to-end tests written in Python, that can be run with the command "test/alternator/run --vs test_vector.py". The "--vs" option tells the run script to run both Scylla and the vector store (currently assumed to be in `.../vector-store/target/release/vector-store`). About 65% of the tests in this pull request check supported syntax and error paths so can run without the vector store, while about 35% of the tests do perform actual Query operations and require the vector store to be running. Currently, the tests that do require the vector store will not get run by CI, but can be easily re-run manually with `test/alternator/run --vs test_vector.py`.

In total, this series includes 78 functional tests in 2200 lines of Python code.

This series also includes documentation for the new Alternator feature and the new APIs introduced. You can see a more detailed design document here: https://docs.google.com/document/d/1cxLI7n-AgV5hhH1DTyU_Es8_f-t8Acql-1f58eQjZLY/edit

Two patches in this series split the huge alternator/executor.cc, after this series continued to grow it and it reached a whopping 7,000 lines. These patches are just a reorganization of code, with no functional changes. But it's time that we finally do this (Refs #5783), we can't just continue to grow executor.cc with no end...

Closes scylladb/scylladb#29046

* github.com:scylladb/scylladb:
  test/alternator: add option to "run" script to run with vector search
  alternator: document vector search
  test/alternator: fix retries in new_dynamodb_session
  test/alternator: test for allowed characters in attribute names
  test/alternator: tests for vector index support
  alternator, vector: add validation of non-finite numbers in Query
  alternator: Query: improve error message when VectorSearch is missing
  alternator: add per-table metrics for vector query
  alternator: clean up duplicated code
  alternator: fix default Select of Query
  alternator: split executor.cc even more
  alternator: split alternator/executor.cc
  alternator: validate vector index attribute values on write
  alternator: DescribeTable for vector index: add IndexStatus and Backfilling
  alternator: implement Query with a vector index
  alternator: fix bug in describe_multi_item()
  alternator: prevent adding GSI conflicting with a vector index
  alternator: implement UpdateTable with a vector index
  alternator: implement DescribeTable with a vector index
  alternator: implement CreateTable with a vector index
  alternator: reject empty attribute names
  cdc: fix on_pre_create_column_families to create CDC log for vector search
2026-04-17 10:25:45 +02:00
Avi Kivity
04b54f363b Merge 'Enable vnodes-to-tablets migrations with arbitrary tokens' from Nikos Dragazis
This PR removes the power-of-two token constraint from vnodes-to-tablets migrations, allowing clusters with randomly generated tokens to migrate without manual token reassignment.

Previously, migrations required vnode tokens to be a power of two and aligned. In practice, these conditions are not met with Scylla's default random token assignment, so the constraint is a blocker for real-world use. With the introduction of arbitrary tablet boundaries in PR #28459, the tablet layer can now support arbitrary tablet boundaries. This PR builds on that capability to allow arbitrary vnode tokens during migration.

When the highest vnode token does not coincide with the end of the token ring, the vnode wraps around, but tablets do not support that. This is handled by splitting it into two tablets: one covering the tail end of the ring and one covering the beginning.
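The wrap-around handling can be sketched as follows. This is an illustration under assumptions, not the storage_service code: it assumes each tablet is identified by its last token, that the ring ends at the largest signed 64-bit token, and `tablet_last_tokens` is a hypothetical name.

```python
MAX_TOKEN = 2**63 - 1  # assumed end of the token ring


def tablet_last_tokens(vnode_tokens):
    """Derive tablet boundaries (last tokens) from vnode tokens.

    Each vnode token closes one tablet. If the highest vnode token is not
    the end of the ring, the wrapping vnode is split: one extra tablet covers
    the tail of the ring, and the first tablet covers the beginning of the
    ring up to the lowest vnode token.
    """
    tokens = sorted(set(vnode_tokens))
    if not tokens:
        raise ValueError("at least one vnode token is required")
    if tokens[-1] != MAX_TOKEN:
        tokens.append(MAX_TOKEN)
    return tokens
```

For vnode tokens `[100, -50, 7]` the sketch yields four tablets, because the vnode that wraps from 100 back to -50 is split into two.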

Testing has been updated accordingly: existing cluster tests now use randomly generated tokens instead of precomputed power-of-two values, and a new Boost test validates the wrap-around tablet boundary logic.

Fixes SCYLLADB-724.

New feature, no backport is needed.

Closes scylladb/scylladb#29319

* github.com:scylladb/scylladb:
  test: Use arbitrary tokens in vnodes->tablets migration tests
  test: boost: Add test for wrap-around vnodes
  storage_service: Support vnodes->tablets migrations w/ arbitrary tokens
  storage_service: Hoist migration precondition
2026-04-17 00:46:35 +03:00
Andrei Chekun
745debe9ec test.py: remove testpy_test_fixture_scope
With the migration to pytest this fixture is useless. Remove it and set
the scope to module for most of the tests.
Add a dynamic_scope function to support running alternator fixtures in
session scope while Test and TestSuite are not yet deleted. This is for
the migration period; later on this function should be deleted.
2026-04-16 22:08:33 +02:00
Andrei Chekun
21addb2173 test.py: add logger for 3rd party service
With the migration of environment preparation and the starting of 3rd party
services to pytest, these services output their logs to the terminal. So this PR
binds them to their own log file to avoid polluting the terminal.
2026-04-16 22:08:33 +02:00
Andrei Chekun
13770ab394 test.py: delete dead code in test.py
With the latest changes, there is a lot of code that is redundant in
test.py. This PR just cleans up this code.
Changes in other files are related to cleaning code from test.py,
especially the redundant parameter --test-py-init and moving
prepare_environment to pytest itself.
2026-04-16 22:08:31 +02:00
Avi Kivity
999e108139 Merge 'test: lib: fix broken retry in start_docker_service' from Dario Mirovic
The retry loop in `start_docker_service` passes the parse callbacks via `std::move` into `create_handler` on each iteration. After the first iteration, the moved-from `std::function` objects are empty. All subsequent retries skip output parsing entirely and immediately treat the service as successfully started. This defeats the entire purpose of the retry mechanism.

Fix by passing the callbacks by copy instead of move, so the original callbacks remain valid across retries.

Fixes SCYLLADB-1542

This is a CI stability issue and should be backported.

Closes scylladb/scylladb#29504

* github.com:scylladb/scylladb:
  test/lib: fix typos in proc_utils, gcs_fixture, and dockerized_service
  test: gcs_fixture: rename container from "local-kms" to "fake-gcs-server"
  test: fix proc_utils.cc formatting from previous commit
  test: lib: use unique container name per retry attempt
  test: lib: fix broken retry in start_docker_service
2026-04-16 21:48:25 +03:00
Radosław Cybulski
c5ed6b22ae alternator: add CHILD_SHARDS filtering
Add a `CHILD_SHARDS` filter to the `DescribeStream` command.
When it is used, the user needs to pass a parent stream shard id in the
JSON ShardFilter.ShardId field. DescribeStream will
then return only the list of stream shards that are direct
descendants of the passed parent stream shard.

Each stream shard covers a consecutive part of the token space.
A stream shard Q is considered to be a child of stream shard W
when at least one token belongs to the token spaces of both stream shards.
The filtering algorithm itself is somewhat complicated - more details
are in the comments in streams.cc.

CHILD_SHARDS is an Amazon feature and is required by KCL.

Add unit tests.

Fixes: #25160

Closes scylladb/scylladb#28189
2026-04-16 18:27:55 +03:00
Andrei Chekun
ba04e1e2c3 codeowners: add owner for the test framework
Add @xtrey as a codeowner of the test framework

Closes scylladb/scylladb#29518
2026-04-16 17:57:21 +03:00
Piotr Szymaniak
d0c3f78d76 test/alternator: extend local TTL streams timeout
Increase the non-AWS wait in the TTL streams test to reduce vnode CI flakes caused by delayed expiration visibility.

Fixes SCYLLADB-1556

Closes scylladb/scylladb#29516
2026-04-16 15:53:35 +03:00
copilot-swe-agent[bot]
ec7450bff8 topology_coordinator, tablets: Log active tablet transitions when going idle
This will make debugging of stalled tablet transitions easier. We saw
several issues when topology state machine was blocked by active
tablet migrations, which was not obvious at first glance of the
logs. Now it will be easy to tell if tablet transitions are blocking
progress and which transitions are stuck.

Closes scylladb/scylladb#28616
2026-04-16 14:34:37 +03:00
Benny Halevy
05a00fe140 compaction_manager: fix use-after-free in postponed_compactions_reevaluation()
drain() signals the postponed_reevaluation condition variable to terminate
the postponed_compactions_reevaluation() coroutine but does not await its
completion. When enable() is called afterwards, it overwrites
_waiting_reevalution with a new coroutine, orphaning the old one. During
shutdown, really_do_stop() only awaits the latest coroutine via
_waiting_reevalution, leaving the orphaned coroutine still alive. After
sharded::stop() destroys the compaction_manager, the orphaned coroutine
resumes and reads freed memory (is_disabled() accesses _state).

Fix by introducing stop_postponed_compactions(), awaiting the reevaluation coroutine in
both drain() and stop() after signaling it, if postponed_compactions_reevaluation() is running.
It uses an std::optional<future<>> for _waiting_reevalution and std::exchange to leave
_waiting_reevalution disengaged when postponed_compactions_reevaluation() is not running.
This prevents a race between drain() and stop().

While at it, fix typo in _waiting_reevalution -> _waiting_reevaluation.

Fixes: SCYLLADB-1463
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#29443
2026-04-16 14:33:31 +03:00
Nadav Har'El
d3d5db37d7 test/alternator: add option to "run" script to run with vector search
Add to test/alternator/run the option "--vs", which runs a vector store
alongside Scylla, to allow running Alternator tests with vector
indexing.

To get the vector store, do

	git clone git@github.com:scylladb/vector-store.git
	cargo build --release

"run --vs" looks for an executable in ../vector-store/target/*/vector-store
but can also be overridden by the VECTOR_STORE environment variable.

test/alternator/run runs the vector store exactly like it runs Scylla -
in a temporary directory, on a temporary IP address in the localhost
subnet (127.0.0.0/8), killing it when the test ends, and showing the output
of both programs (Scylla and vector store). These transient runs of
Scylla and vector store are configured to be able to communicate with
each other.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-16 14:30:18 +03:00
Nadav Har'El
3d8463ccd2 alternator: document vector search
This patch adds a new document, docs/alternator/vector-search.md, on the
new vector search feature in Alternator. It introduces this feature, and
the DynamoDB APIs that we extended to support it.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-16 14:30:17 +03:00
Nadav Har'El
164b0e37e1 test/alternator: fix retries in new_dynamodb_session
The new_dynamodb_session() function had a bug which we never noticed
because we hardly used it, but it became more noticeable when the new
test/alternator/test_vector.py started to use it:

By default, boto3 retries a request up to 9 times when it encounters
a retriable error (such as an Internal Server Error). We don't want such
retries in our tests - it makes failures slower, but more importantly
it can hide "flaky" bugs by retrying 9 times until it happens to succeed.

The new_dynamodb_session() had code (copied from the dynamodb fixture)
to set boto3's "max_attempts" configuration to 0, to disable this retry.
But this code had an incorrect "if" that applied it only when testing on
"localhost". This is wrong: We almost never use "localhost" as the
target of the test; both test/cqlpy/run and test.py pick an IP address
in the localhost subnet (127/8) and use that IP address - not the string
"localhost".

This bug only existed in new_dynamodb_session() - the more commonly used
"dynamodb" fixture didn't have this bug.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-16 14:30:17 +03:00
Nadav Har'El
858dee0b30 test/alternator: test for allowed characters in attribute names
One of the tests in the previous patch checked that strange characters
are allowed in attribute names used for vector indexing. It turns out
we never had a test that verifies that regardless of vector indexes -
any character whatsoever is allowed in attribute names. This is
different from table names which are much more limited.

So this patch adds the missing test.

As usual, the new test also passes on DynamoDB, showing that these
strange characters in attribute names are also allowed by DynamoDB.
2026-04-16 14:30:17 +03:00
Nadav Har'El
58538e18e8 test/alternator: tests for vector index support
In this patch we add a large collection of basic functional tests for the
vector index support, covering the CreateTable, UpdateTable, DescribeTable
and Query operations and the various ways in which those are allowed to
work - or expected to fail. These tests were written in parallel with
writing the code so they (hopefully) cover all the corner cases considered
during development, and make sure these corner cases are all handled
correctly and will not regress in the future.

Some of these tests do not involve querying of the index and focus on
the structure of requests and the kind of syntax allowed. But other tests
are end-to-end, requiring the vector store to be running and trying to
index Alternator data and query it. These tests are marked
"needs_vector_store", and are immediately skipped if Scylla is not
configured to connect to a vector store. In a later patch we'll add
an option to test/alternator/run to be able to run these end-to-end
tests by automatically running both Scylla and the Vector Store.

We'll have additional end-to-end tests in the vector-store repository.

Note that vector search is a new API feature that doesn't exist in DynamoDB,
so we are adding new parameters and outputs to existing operations. The AWS
SDKs don't normally allow doing that, so the test added here begins by
teaching the Python SDK to use the new APIs we added. This piece of code
can also be used by end-users to use vector search (at least in Python...)
before we officially add this support to ScyllaDB's SDK wrappers.
2026-04-16 14:30:17 +03:00
Nadav Har'El
fe5a5a813f alternator, vector: add validation of non-finite numbers in Query
Non-finite numbers (Inf, NaN) don't make sense in vector search, and
are also not allowed in the DynamoDB API as numbers. But the parsing code
in Query's QueryVector accepted "Inf" and "NaN" and then failed to
send the request to the vector store, resulting in a strange error
message. Let's fix it in the parsing code.
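The kind of check involved can be sketched in Python. This only illustrates the idea and is not the Alternator parsing code; `parse_vector_component` is a hypothetical name.

```python
import math


def parse_vector_component(text: str) -> float:
    """Parse one query-vector component, rejecting non-finite values.

    Python's float() accepts strings such as "Inf" and "NaN", so the
    finiteness check has to be explicit.
    """
    value = float(text)
    if not math.isfinite(value):
        raise ValueError(f"non-finite number not allowed in query vector: {text!r}")
    return value
```

Rejecting the value at parse time gives the caller a clear validation error instead of a failure later, when the request is sent on.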

We have a test (test_query_vectorsearch_queryvector_bad_number_string)
that verifies this fix.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-16 14:30:17 +03:00
Nadav Har'El
aa070fae5b alternator: Query: improve error message when VectorSearch is missing
Before this patch, if we attempt a Query where IndexName is a vector index
but forget a "VectorSearch" parameter, the error is misleading: The code
expects a GSI or LSI, and when it can't find a GSI or LSI with that name,
it reports that the index is missing. But this is not helpful. So in this
patch we produce a more helpful message: That the index does exist, and
is a vector index, so a "VectorSearch" parameter is mandatory and is
missing.
2026-04-16 14:30:16 +03:00
Nadav Har'El
f932f94422 alternator: add per-table metrics for vector query
The per-table metrics for Query were not incremented for the
vector variant of the Query operations, only the global metrics were
incremented. This patch fixes this oversight, and adds a test that
reproduces it (the new test fails before this patch, and passes after).
2026-04-16 14:30:16 +03:00
Nadav Har'El
8cf510e06c alternator: clean up duplicated code
De-duplicate some code introduced in earlier patches, such as two
nearly-identical loops over the indexes (one to check if there is a
vector index, the second to get its dimensions), and two nearly-
identical chunks of code to get the item contents when there is or
there isn't a clustering key.

There should be no functional changes in this patch.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-16 14:30:16 +03:00
Nadav Har'El
f15c6634a7 alternator: fix default Select of Query
In earlier patches, when Query'ing a vector index, we set the default
Select to ALL_ATTRIBUTES. However, according to the DynamoDB documentation
for Query,

   "If neither Select nor ProjectionExpression are specified, DynamoDB
    defaults to ALL_ATTRIBUTES when accessing a table, and
    ALL_PROJECTED_ATTRIBUTES when accessing an index."

This default should also apply to vector index, so this patch fixes this.
The new behavior is not only more compatible with DynamoDB, it is also
much more efficient by default, as ALL_PROJECTED_ATTRIBUTES does not need
to read from the base table - it returns the results that the vector store
returned. Of course, if the user needs the less efficient ALL_ATTRIBUTES
this option is still available - it's just no longer the default.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-16 14:30:16 +03:00
Nadav Har'El
2e274bbdba alternator: split executor.cc even more
This patch continues the effort to split the huge executor.cc (5000
lines before this patch) even more.

In this patch we introduce a new source file, executor_util.cc, for
various utility functions that are used for many different operations
and therefore are useful to have in a header file. These utility
functions will now be in executor_util.cc and executor_util.hh -
instead of executor.cc and executor.hh.

Various source files, including executor.cc, the executor_read.cc
introduced in the previous patch, as well as older source files such
as streams.cc, ttl.cc and serialization.cc, use the new header file.

This patch removes over 700 lines of code from executor.cc, and
also removes a large number of utility function declarations from
executor.hh. Originally, executor.hh was meant to be about the
interface that the Alternator server needs to *execute* the different
DynamoDB API operations - and after this patch it returns closer to
this original goal.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-16 14:30:16 +03:00
Nadav Har'El
751da00692 alternator: split alternator/executor.cc
Already six years ago, in #5783, we noticed that alternator/executor.cc
has grown too large. The previous patches added hundreds more lines
to it to implement vector search, and it reached a whopping 7,000 lines
of code. This is too much.

This patch splits from executor.cc two major chunks:

1. The implementation of **read** requests - GetItem, BatchGetItem,
   Query (base table, GSI/LSI, and vector-search), and Scan - was
   moved to a new source file alternator/executor_read.cc.
   The new file has 2,000 lines.

2. Moved 250 lines of template functions dealing with attribute paths
   and maps of them to a new header file, attribute_path.hh.
   These utilities are used for many different operations - various
   read operations use them for ProjectionExpression, and UpdateItem
   uses them for modifications to nested attributes, so we need the
   new header file from both executor.cc and executor_read.cc

The remaining executor.cc is still pretty big, 5,000 lines, and
contains write operations (PutItem, UpdateItem, DeleteItem,
BatchWriteItem) as well as various table and other operations, and
also many utility functions used by many types of operations, so
we can later continue this refactoring effort.

Refs #5783

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-16 14:30:10 +03:00
Emil Maskovsky
91df3795fc encryption: cover system.raft table in system_info_encryption
Extend system_info_encryption to encrypt system.raft SSTables.
system.raft contains the Raft log, which may hold sensitive user data
(e.g. batched mutations), so it warrants the same treatment as
system.batchlog and system.paxos.

During upgrade, existing unencrypted system.raft SSTables remain
readable. Existing data is rewritten encrypted via compaction, or
immediately via nodetool upgradesstables -a.

Update the operator-facing system_info_encryption description to
mention system.raft and add a focused test that verifies the schema
extension is present on system.raft.

Fixes: CUSTOMER-268

Backport: 2026.1 - closes an encryption-at-rest coverage gap: system.raft may persist sensitive user-originated data unencrypted; backport to the current LTS.

Closes scylladb/scylladb#29242
2026-04-16 13:22:10 +02:00
Botond Dénes
d006c4c476 Merge 'Untie (partially) cql3/statements from db::config' from Pavel Emelyanov
There's a bunch of db::config options that are used by cql3/statements/ code. For that they use data_dictionary/database as a proxy to get a db::config reference. This PR moves most of these accessed options onto cql_config.

Options migrated to cql_config:

   1. select_internal_page_size
   2. strict_allow_filtering
   3. enable_parallelized_aggregation
   4. batch_size_warn_threshold_in_kb
   5. batch_size_fail_threshold_in_kb
   6. 7 keyspace replication restriction options
   7. 2 TWCS restriction options
   8. restrict_future_timestamp
   9. strict_is_not_null_in_views (with view_restrictions struct)
   10. enable_create_table_with_compact_storage

Some options need special treatment and are still abused via database, namely:

  1. enable_logstor
  2. cluster_name
  3. partitioner
  4. endpoint_snitch

Fixing components inter-dependencies, not backporting

Closes scylladb/scylladb#29424

* github.com:scylladb/scylladb:
  cql3: Move enable_create_table_with_compact_storage to cql_config
  cql3: Move strict_is_not_null_in_views to cql_config
  cql3: Move restrict_future_timestamp to cql_config
  cql3: Move TWCS restriction options to cql_config
  cql3: Move keyspace restriction options to cql_config
  cql3: Move batch_size_fail_threshold_in_kb to cql_config
  cql3: Move batch_size_warn_threshold_in_kb to cql_config
  cql3: Move enable_parallelized_aggregation to cql_config
  cql3: Move strict_allow_filtering to cql_config
  cql3: Move select_internal_page_size to cql_config
  test: Fix cql_test_env to use updateable cql_config from db::config
  cql3: Add cql_config parameter to parsed_statement::prepare()
2026-04-16 14:04:43 +03:00
Botond Dénes
88a8324e68 Merge 'db: store large data records in SSTable metadata and serve via virtual tables' from Benny Halevy
`system.large_partitions`, `system.large_rows`, and `system.large_cells` store records keyed by SSTable name. When SSTables are migrated between shards or nodes (resharding, streaming, decommission), the records are lost because the destination never writes entries for the migrated SSTables.

This patch series moves the source of truth for large data records into the SSTable's scylla metadata component (new `LargeDataRecords` tag 13) and reimplements the three `system.large_*` tables as virtual tables that query live SSTables on demand. A cluster feature flag (`LARGE_DATA_VIRTUAL_TABLES`) gates the transition for safe rolling upgrades.

When the cluster feature is enabled, each node drops the old system large_* tables and starts serving the corresponding tables using virtual tables that represent the large data records now stored on the sstables.
Note that the virtual tables will be empty after upgrade until the sstables that contained large data are rewritten, therefore it is recommended to run upgrade sstables compaction or major compaction to repopulate the sstables scylla-metadata with large data records.

1. **keys: move key_to_str() to keys/keys.hh** — make the helper reusable across large_data_handler, virtual tables, and scylla-sstable
2. **sstables: add LargeDataRecords metadata type (tag 13)** — new struct with binary-serialized key fields, scylla-sstable JSON support, format documentation
3. **large_data_handler: rename partition_above_threshold to above_threshold_result** — generalize the struct for reuse
4. **large_data_handler: return above_threshold_result from maybe_record_large_cells** — separate booleans for cell size vs collection elements thresholds
5. **sstables: populate LargeDataRecords from writer** — bounded min-heaps (one per large_data_type), configurable top-N via `compaction_large_data_records_per_sstable`
6. **test: add LargeDataRecords round-trip unit tests** — verify write/read, top-N bounding, below-threshold behavior
7. **db: call initialize_virtual_tables from shard 0 only** — preparatory refactoring to enable cross-shard coordination
8. **db: implement large_data virtual tables with feature flag gating** — three virtual table classes, feature flag activation, legacy SSTable fallback, dual-threshold dedup, cross-shard collection

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1276

* Although this fixes a bug where large data entries are effectively lost when sstables are renamed or migrated, the changes are intrusive and do not warrant a backport

Closes scylladb/scylladb#29257

* github.com:scylladb/scylladb:
  db: implement large_data virtual tables with feature flag gating
  db: call initialize_virtual_tables from shard 0 only
  test: add LargeDataRecords round-trip unit tests
  sstables: populate LargeDataRecords from writer
  large_data_handler: return above_threshold_result from maybe_record_large_cells
  large_data_handler: rename partition_above_threshold to above_threshold_result
  sstables: add LargeDataRecords metadata type (tag 13)
  sstables: add fmt::formatter for large_data_type
  keys: move key_to_str() to keys/keys.hh
2026-04-16 14:03:31 +03:00
Pavel Emelyanov
4d352c7cf5 sstables: Remove ignore_component_digest_mismatch from sstable_open_config
The ignore_component_digest_mismatch flag is now initialized at sstable construction
time from sstables_manager::config (which is populated from db::config at boot time).
Remove the flag from sstable_open_config struct and all call sites that were setting
it explicitly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 13:49:14 +03:00
Pavel Emelyanov
9107e055b3 sstables: Move ignore_component_digest_mismatch initialization to constructor
Initialize the ignore_component_digest_mismatch flag from sstables_manager::config
in the sstable constructor initializer list instead of in load(). This ensures the
flag value is set at construction time when the manager config is available, rather
than at load time. Mark the member const to reflect its immutability after construction.

Fixes the bootstrap path which now correctly reads the flag from manager config
initialized from db::config at boot time, instead of using the default value.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 13:49:00 +03:00
Pavel Emelyanov
8abfd9af00 sstables: Add ignore_component_digest_mismatch to sstables_manager config
Copy the ignore_component_digest_mismatch flag from db::config to sstables_manager::config
during database initialization. This makes the flag available early in the boot process,
before SSTables are loaded, enabling later commits to move the flag initialization from
load-time to construction-time.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 13:48:49 +03:00
Nadav Har'El
83670d2493 alternator: validate vector index attribute values on write
When a table has a vector index, writes to the indexed attribute
(via PutItem, UpdateItem, or BatchWriteItem) must supply a value that
is a vector of the appropriate length: It must be a list of exactly the
declared number of elements, where each element is a numeric type ("N")
representable as a 32-bit float. Before this patch, invalid values were
silently accepted and the item was simply not indexed (it was skipped
by the vector store when it read this item). Now these writes are
rejected with a ValidationException.
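
The rule above can be sketched as follows. This is only an illustration under stated assumptions, not Alternator's code: `list_element`, `representable_as_float()` and `is_valid_vector()` are hypothetical names, and the real check works on Alternator's JSON values rather than on strings.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdlib>
#include <string>
#include <vector>

// Hypothetical stand-in for one element of a DynamoDB list value:
// a type tag ("N", "S", ...) and the value's textual form.
struct list_element {
    std::string type;
    std::string value;
};

// True if `text` parses completely as a finite 32-bit float.
bool representable_as_float(const std::string& text) {
    if (text.empty()) {
        return false;
    }
    char* end = nullptr;
    const float f = std::strtof(text.c_str(), &end);
    // Reject trailing garbage, overflow (parsed as infinity) and NaN.
    return end == text.c_str() + text.size() && std::isfinite(f);
}

// True if `value` is a list of exactly `dimensions` numeric elements,
// each representable as a 32-bit float.
bool is_valid_vector(const std::vector<list_element>& value, std::size_t dimensions) {
    if (value.size() != dimensions) {
        return false;
    }
    for (const auto& e : value) {
        if (e.type != "N" || !representable_as_float(e.value)) {
            return false;
        }
    }
    return true;
}
```

A write whose indexed attribute fails such a check is the kind of request that is now rejected instead of being silently left unindexed.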

This is analogous to the existing validation of GSI/LSI key attribute
values - in DynamoDB after a certain attribute becomes the key of a
GSI or LSI, the user is no longer allowed to write the same type.

The implementation we add here is also analogous to the implementation
of the GSI/LSI key validation. The GSI/LSI key validation is done
by validate_value_if_index_key / si_key_attributes, and in this
patch we add the vector-index parallels: vector_index_attributes()
collects the attribute name and declared dimensions for every vector
index in the schema, and validate_value_if_vector_index_attribute()
enforces the type limitations.

For efficiency in the common case where a table has no vector indexes
and no GSIs/LSIs, both validation functions are out-of-line and each
call site guards the call with an explicit empty() check, so no
function-call overhead is incurred when there is nothing to validate.
For UpdateItem, the map of vector index attributes is cached in
update_item_operation (alongside the existing _key_attributes cache)
to avoid recomputing it on every call to update_attribute().
2026-04-16 13:31:49 +03:00
Nadav Har'El
aea7b6a66b alternator: DescribeTable for vector index: add IndexStatus and Backfilling
Add to DescribeTable's output for VectorIndexes two fields - IndexStatus
and Backfilling - which are intended to exactly mirror these two fields
that exist for GlobalSecondaryIndexes:

When a vector index is added, IndexStatus is "CREATING" before the index
is usable, and "ACTIVE" when it is finally usable for a Query. During
"CREATING" phase, "Backfilling" may be set to true when the index is
currently being backfilled (the table is scanned and an index is built).

A user is expected to call DescribeTable in a loop after creating a
vector index (via either CreateTable or UpdateTable) and only call
Query on the index after the IndexStatus is finally ACTIVE. Calling
Query earlier, while IndexStatus is still CREATING, will result in an
error.

In the current implementation, Alternator does not track the state of the
vector index, so it needs to contact the vector store to inquire about
the state of the index - using a new function introduced in this patch
that uses an existing vector-store API. This makes DescribeTable slower
on tables that have vector indexes, because the vector store is contacted
on every DescribeTable call.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-16 13:31:49 +03:00
Nadav Har'El
e43a2e5086 alternator: implement Query with a vector index
We introduce to the Query request a new "VectorSearch" parameter, which
takes a mandatory "QueryVector" (a value which must be a numeric vector
of the right length) and "Limit".

The "Limit" of a vector search (Query with VectorSearch) determines the
number of nearest neighbors to return, and does not allow pagination
(ExclusiveKeyStart is not allowed). ConsistentRead=True is also not
allowed on a vector search query.

The "Select"/"ProjectionExpression"/"AttributesToGet" parameters are
also supported, requesting which attributes to fetch. Using Select=
ALL_PROJECTED_ATTRIBUTES means read only the attributes found in the
vector index - currently only the key columns - so it is significantly
faster than ALL_ATTRIBUTES because it doesn't require reading the items
from the base table.

The "FilterExpression" parameter is also supported. Like in DynamoDB's
traditional Query, this does post-filtering, i.e., removing some of the
results returned by the vector index that don't match the filter, and
as a result fewer than Limit results may be returned.

Pre-filtering (done on the vector store, and always returns Limit
results) is not yet implemented.
2026-04-16 13:31:47 +03:00
Nadav Har'El
68e34c57e1 alternator: fix bug in describe_multi_item()
In commit a55c5e9ec7, the function
describe_multi_item() got a new item_callback parameter, that can
be used to calculate the size of the item. This new parameter
has a default, an empty noncopyable_function. But an empty
noncopyable_function shouldn't be called - exactly like std::function,
it throws std::bad_function_call if called when empty.

So describe_multi_item() should only call this item_callback if
it's not empty.

This became a problem in the next patch, implementing vector search
query, which called describe_multi_item with the default item_callback.
But in general, the function should be usable with the default parameter
(or we shouldn't have defined a default value for this parameter!).
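
A minimal sketch of the guard, using std::function as a stand-in for Seastar's noncopyable_function (both can be tested with `if (cb)`). The names `item_callback` and `describe_items()` are hypothetical and only illustrate the pattern:

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

// Optional per-item hook; an empty function means "no callback".
using item_callback = std::function<void(std::size_t)>;

// Visits every item and reports its size through `cb` when one was given.
// Returns the number of items visited.
std::size_t describe_items(const std::vector<std::string>& items, item_callback cb = {}) {
    std::size_t described = 0;
    for (const auto& item : items) {
        if (cb) {  // calling an empty std::function throws std::bad_function_call
            cb(item.size());
        }
        ++described;
    }
    return described;
}
```

With the check in place, a caller can rely on the default argument, which is the case the vector search query exercises.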

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-16 13:30:02 +03:00
Nadav Har'El
ffe1029b7c alternator: prevent adding GSI conflicting with a vector index
All the "indexes" we implement in Alternator - GSI, LSI and the new
vector index - share the same IndexName namespace, which we'll use in
Query to refer to the index. In the previous patch we already prevented
adding a vector index with the same name as an existing GSI or LSI.
In this patch we also prevent the reverse - adding a GSI with the name
of an existing vector index.

Additionally, one cannot add a GSI on a key that is already the key of
a vector index, because the types conflict: the key of a vector index must be a
vector column, while the key of a GSI must have a standard key type
(string, binary or number).

We have tests for this later, in the big test patch.
2026-04-16 13:30:02 +03:00
Nadav Har'El
82de16f92c alternator: implement UpdateTable with a vector index
After an earlier patch allowed CreateTable to create vector indexes
together with a table, in this patch we add to UpdateTable the ability
to add a new vector index to an existing table, as well as the ability
to delete a vector index from an existing table.

The implementation is inspired by DynamoDB's syntax for GSI - just like
GSI has GlobalSecondaryIndexUpdates with "Create" and "Delete" operations,
for vector indexes we have VectorIndexUpdates supporting Create and
Delete. "Update" is not yet supported - we haven't yet implemented any
parameter that can be updated - but we can easily implement it in the
future.
2026-04-16 13:30:02 +03:00
Nadav Har'El
217090a996 alternator: implement DescribeTable with a vector index
In this patch we add to DescribeTable the ability to list the vector indexes
enabled on an Alternator table.
2026-04-16 13:30:02 +03:00
Nadav Har'El
e156d67177 alternator: implement CreateTable with a vector index
ScyllaDB supports the "vector search" feature in CQL.
In this patch we start the path to adding vector search support also to
Alternator.

In this patch, we implement CreateTable support - allowing the user to enable
vector search in a new table. The following patches will enable additional
operations like UpdateTable (adding a vector index to an existing table or
deleting a vector index from an existing table) and DescribeTable.

Extensive tests for all these features will come at the end of the series.
Those tests were written in parallel with writing this implementation so they cover
(hopefully) every nook and cranny of the implementation.
2026-04-16 13:29:58 +03:00
Nadav Har'El
0afc730b7b alternator: reject empty attribute names
Alternator has a function validate_attr_name_length() used to validate an
attribute name passed in different operations like PutItem, UpdateItem,
GetItem, etc. It fails the request if the attribute name is longer than
65535 characters.

It turns out that we forgot to check if the attribute name length isn’t 0 -
which should be forbidden as well!

This patch fixes the validation code, and also adds a test that confirms
that after this patch empty attribute names are rejected - just like DynamoDB
does - whereas before this patch they were silently accepted.
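
The corrected check amounts to a two-sided bound. A minimal sketch, assuming the limit is counted in bytes of the name; `attr_name_length_ok()` is a hypothetical helper, not the actual validate_attr_name_length():

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// Upper bound quoted in the commit message.
constexpr std::size_t max_attr_name_length = 65535;

// True if the name is non-empty and within the length limit.
bool attr_name_length_ok(std::string_view name) {
    return !name.empty() && name.size() <= max_attr_name_length;
}
```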

We want to fix this issue now, because in a later patch we intend to use
the same validation function also for vector indexes - and want it to be
accurate.

Fixes SCYLLADB-1069.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-16 13:28:15 +03:00
Nadav Har'El
8948a50f3b cdc: fix on_pre_create_column_families to create CDC log for vector search
The vector-search feature, which is already supported in CQL, introduced
the somewhat confusing feature of enabling CDC without explicitly enabling
CDC: When a vector index is enabled on a table, CDC is "enabled" for it
even if the user didn't ask to enable CDC.

For this, some code in cdc/log.cc began to use cdc_enabled() instead of
checking schema.cdc_options.enabled() directly. This cdc_enabled()
function checks if either this enabled() is true, or has_vector_index()
is true.

But there's another twist to this story: To write with CDC, we also need
to create the CDC log table:

1. In CQL, a vector index can only be added on an existing table (with
   CREATE INDEX), so the hook on_before_update_column_family() is the
   one that noticed that a vector index was added, and created the CDC
   log table.

2. But in Alternator, a vector index can be created up-front with a
   brand-new table (in CreateTable), so the hook for a new table -
   on_pre_create_column_families(), also needs to create the CDC log
   table. It already did, but incorrectly checked just the explicit
   CDC-enabled flag instead of the new cdc_enabled() function that
   also allows vector index.

So this patch just fixes on_pre_create_column_families to use cdc_enabled().

Before this patch, when a vector index was created in Alternator with
CreateTable, an attempt to write to the table (PutItem) would fail
because it tried to write to the CDC log, which wasn't created.
After this patch, it works. The reproducing test is
test_putitem_vectorindex_createtable (introduced in a later patch).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-16 13:28:15 +03:00
Roy Dahan
d2d7604188 ci: pin GitHub Actions to commit SHAs and migrate to Node.js 24
Pin all external GitHub Actions to full commit SHAs and upgrade to
their latest major versions to reduce supply chain attack surface:

- actions/checkout: v3/v4/v5 -> v6.0.2
- actions/github-script: v7 -> v8.0.0
- actions/setup-python: v5 -> v6.2.0
- actions/upload-artifact: v4 -> v7.0.0
- astral-sh/setup-uv: v6 -> v8.0.0
- mheap/github-action-required-labels: v5.5.2 (pinned)
- redhat-plumbers-in-action/differential-shellcheck: v5.5.6 (pinned)
- codespell-project/actions-codespell: v2.2 (pinned, was @master)

Set FORCE_JAVASCRIPT_ACTIONS_TO_NODE24=true in all 21 workflows that
use JavaScript-based actions to opt into the Node.js 24 runtime now.
This resolves the deprecation warning:

  "Node.js 20 actions are deprecated. Please check if updated versions
   of these actions are available that support Node.js 24. Actions will
   be forced to run with Node.js 24 by default starting June 2nd,
   2026. Node.js 20 will be removed from the runner on September 16th,
   2026."

See: https://github.blog/changelog/2025-09-19-deprecation-of-node-20-on-github-actions-runners/

scylladb/github-automation references are intentionally left at @main
as they are org-internal reusable workflows.

Fixes: SCYLLADB-1410

Backport: Backport is required for live branches that run GH actions:
2026.1, 2025.4, 2025.1 and 2024.1

Closes scylladb/scylladb#29421
2026-04-16 13:03:33 +03:00
Pavel Emelyanov
207d3b4a68 test_backup: Remove create_schema() helper Test
Remove the create_schema() helper function and inline its logic directly into the four call sites. This simplifies the code by eliminating a trivial wrapper.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29406
2026-04-16 12:57:26 +03:00
Botond Dénes
830d28a889 Merge 'Use standard helpers to create ks:cf and populate it in test_backup.py' from Pavel Emelyanov
This PR removes the create_ks_and_cf() helper from the backup test and patches all callers to create the keyspace and table and populate them with standard explicit facilities. While patching, it turned out that one test doesn't need to populate the table, so it even becomes a tiny bit shorter and faster.

Enhancing test, not backporting

Closes scylladb/scylladb#29417

* github.com:scylladb/scylladb:
  test_backup: Remove create_ks_and_cf helper Test
  test_backup: Replace create_ks_and_cf with async patterns Test
  test_backup: Add if-True blocks for indentation Test
2026-04-16 12:54:21 +03:00
Nikos Dragazis
7abcf94823 test: Use arbitrary tokens in vnodes->tablets migration tests
The migration tests used to start nodes with pre-computed power-of-two
tokens. This was required because the migration itself only supported
power-of-two aligned tokens. Now that arbitrary tokens are supported,
switch the tests to use Scylla's default random token assignment.

Switching to arbitrary tokens makes the tests non-deterministic, but the
migration aspects that are affected by the token distribution
(resharding, wrap-around vnode split) are out of scope for these tests
and covered by dedicated tests.

Add a `get_all_vnode_tokens()` helper that queries system.topology at
runtime to discover the actual token layout, and derive expected tablet
counts from that.

Also account for the possible extra wrap-around tablet when the last
vnode token does not coincide with MAX_TOKEN.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-04-16 12:47:27 +03:00
Nikos Dragazis
26f0c038af test: boost: Add test for wrap-around vnodes
Add a Boost test to verify that `prepare_for_tablets_migration()`
produces the correct tablet boundaries when a wrap-around vnode exists.

Tablets cannot wrap around the token ring as vnodes do; the last token
of the last tablet must always be MAX_TOKEN. When the last vnode token
does not coincide with MAX_TOKEN, the wrap-around vnode must be split
into two tablets.

The test is parameterized over both cases: unaligned (split expected)
and aligned (no split expected).

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-04-16 12:47:16 +03:00
Botond Dénes
c355df4461 Merge 'test: Lower default log level from DEBUG to INFO' from Artsiom Mishuta
1. test.py — Removed --log-level=DEBUG flag from pytest args
2. test/pytest.ini — Changed log_level to INFO (that was set DEBUG in test.py), changed log_file_level from DEBUG to INFO, added clarifying comments

+minor fix

[test/pylib: save logs on success only during teardown phase](0ede308a04)
Previously, when --save-log-on-success was enabled, logs were saved
for every test phase (setup, call, teardown) in 3 files. Restrict it to only
the teardown phase, that contains all 3 in case of test success,
to avoid redundant log entries.

Closes scylladb/scylladb#29086

* github.com:scylladb/scylladb:
  test/pylib: save logs on success only during teardown phase
  test: Lower default log level from DEBUG to INFO
2026-04-16 12:46:11 +03:00
Nikos Dragazis
098732ff76 storage_service: Support vnodes->tablets migrations w/ arbitrary tokens
The vnodes-to-tablets migration creates tablet maps that mirror the
vnode layout: one tablet per vnode, preserving token boundaries and
replica placement. However, due to tablet restrictions, the migration
requires vnode tokens to be a power of two and uniformly distributed
across the token ring.

In practice, this restriction is too limiting. Real clusters use
randomly generated tokens and a node's token assignment is immutable.

To solve this problem, prior work (01fb97ee78) has been done to relax
the tablet constraints by allowing arbitrary tablet boundaries, removing
the requirement for power-of-two sizing and uniform distribution.

This patch leverages the relaxed tablet constraints to enable tablet map
creation from arbitrary vnode tokens:

* Removes all token-related constraints.
* Handles wrap-around vnodes. If a vnode wraps (i.e., the highest vnode
  token is not `dht::token::last()`), it is split into two tablets:
  - (last_vnode_token, dht::token::last()]
  - [dht::token::first(), first_vnode_token]
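
The boundary rule can be sketched as below. This is an illustration only: tokens are modeled as plain 64-bit integers, `last_token` stands in for `dht::token::last()`, and `tablet_last_tokens()` is a hypothetical helper rather than the code in storage_service.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

using token = std::int64_t;

// Stand-in for dht::token::last(); the real value is an assumption here.
constexpr token last_token = std::numeric_limits<token>::max();

// Given vnode tokens sorted in ascending order, returns the last token of
// each tablet. Every vnode token closes one tablet. If the highest vnode
// token is not the last token of the ring, the wrap-around vnode is split:
// its low part is the tablet ending at the first vnode token, and its high
// part becomes one extra tablet ending at `last_token`.
std::vector<token> tablet_last_tokens(const std::vector<token>& sorted_vnode_tokens) {
    std::vector<token> result = sorted_vnode_tokens;
    if (!result.empty() && result.back() != last_token) {
        result.push_back(last_token);
    }
    return result;
}
```

So N vnode tokens yield N tablets when the highest token is aligned with the end of the ring, and N + 1 tablets otherwise.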

The migration ops guide has been updated to remove the power-of-two
constraint.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-04-16 12:39:23 +03:00
Nikos Dragazis
8ea8c05120 storage_service: Hoist migration precondition
`prepare_for_tablets_migration()` is idempotent; it filters out tables
that already have tablet maps and returns early if no tablet maps need
to be created. However, this precondition is currently misplaced. Move
it higher to skip extra work.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2026-04-16 12:19:34 +03:00
Botond Dénes
9bfcc25cf7 Merge 'streaming: stream_blob: hold table for streaming' from Michael Litvak
When initializing streaming sources in tablet_stream_files_handler we
use a reference to the table. We should hold the table while doing so,
because otherwise the table may be dropped and destroyed when we yield.
Use the table.stream_in_progress() phaser to hold the table while we
access it.

For sstable file streaming we can release the table after the snapshot
is initialized, and the table may be dropped safely because the files
are held by the snapshot and we don't access the table anymore. There
was a single access to the table for logging but it is replaced by a
pre-calculated variable.

For logstor segment streaming, currently it doesn't support discarding
the segments while they are streamed - when the table is dropped it
discards the segments by overwriting and freeing them, so they shouldn't
be accessed after that. Therefore, in that case continue to hold the
table until streaming is completed.

Fixes [SCYLLADB-1533](https://scylladb.atlassian.net/browse/SCYLLADB-1533)

It's a pre-existing use-after-free issue in sstable file streaming so should be backported to all releases.
It was also made worse by the recent logstor changes, and it also affects non-logstor tables, so the logstor fixes should be in the same release (2026.2).

[SCYLLADB-1533]: https://scylladb.atlassian.net/browse/SCYLLADB-1533?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

Closes scylladb/scylladb#29488

* github.com:scylladb/scylladb:
  test: test drop table during streaming
  streaming: stream_blob: hold table for streaming
2026-04-16 12:12:42 +03:00
Dario Mirovic
50e498ac0d test/lib: fix typos in proc_utils, gcs_fixture, and dockerized_service
Fix assorted typos in comments, strings, and identifiers:
- path_preprend -> path_prepend (proc_utils.hh, proc_utils.cc)
- laúnch -> launch (proc_utils.cc)
- hand/fail -> hang/fail (dockerized_service.py)
- inconvinient -> inconvenient (dockerized_service.py)
- priviledges -> privileges (gcs_fixture.hh)
- remove double semicolon (gcs_fixture.cc)

Refs SCYLLADB-1542
2026-04-16 10:58:55 +02:00
Dario Mirovic
11b5997eaf test: gcs_fixture: rename container from "local-kms" to "fake-gcs-server"
The GCS fixture's fake-gcs-server container was named "local-kms",
copy-pasted from the AWS KMS fixture. It happened when both were
refactored to use the shared start_docker_service helper (bc544eb08e).

Rename to "fake-gcs-server" to match the Python-side naming and avoid
confusion in logs.

Refs SCYLLADB-1542
2026-04-16 10:58:52 +02:00
Dario Mirovic
dc7f848bf8 test: fix proc_utils.cc formatting from previous commit
Fix indentation of lines moved inside the for-loop in
start_docker_service (lines 208-225).

Refs SCYLLADB-1542
2026-04-16 10:55:48 +02:00
Dario Mirovic
be4d32c474 test: lib: use unique container name per retry attempt
The container name is generated once before the retry loop, so
all retry attempts reuse the same name. Move the name generation
inside the loop so each attempt gets a fresh name via the
incrementing counter, consistent with the comment "publish port
ephemeral, allows parallel instances".
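
The shape of the fix, as a hedged sketch: `start_with_retries()` and its `try_start` callback are hypothetical and do not mirror the signature of start_docker_service. The point is only where the name is generated.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Tries to start a service up to `max_attempts` times. `try_start` is a
// hypothetical callback that returns true on success.
// Returns the name that succeeded, or an empty string if all attempts failed.
std::string start_with_retries(const std::string& base, int max_attempts, int& counter,
                               const std::function<bool(const std::string&)>& try_start) {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        // Generated inside the loop, so every attempt gets a fresh name.
        const std::string name = base + "-" + std::to_string(counter++);
        if (try_start(name)) {
            return name;
        }
    }
    return {};
}
```

Had the name been computed once before the loop, a retry could collide with a container left over from the failed attempt.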

Formatting changes (indentation) of lines 208-225 in test/lib/proc_utils.cc
will be fixed in the next commit.

Refs SCYLLADB-1542
2026-04-16 10:55:04 +02:00
Botond Dénes
33682fd14e Merge 'sstables/storage_manager: fix race between object storage config update and keyspace creation' from Dimitrios Symonidis
Previously, config_updater used a serialized_action to trigger update_config() when object_storage_endpoints changed. Because serialized_action::trigger() always schedules the action as a new reactor task (via semaphore::wait().then()), there was a window between the config value becoming visible to the REST API and update_config() actually running. This allowed a concurrent CREATE KEYSPACE to see the new endpoint via is_known_endpoint() before storage_manager had registered it in _object_storage_endpoints.

Now config observers run synchronously in a reactor turn and must not suspend. Split the previous monolithic async update_config() coroutine into two phases:

- Sync (in the observer, never suspends): storage_manager::_object_storage_endpoints is updated in place; for already-instantiated clients, update_config_sync swaps the new config atomically
- Async (per-client gate): background fibers finish the work that can't run in the observer — S3 refreshes credentials under _creds_sem; GCS drains and closes the replaced client.

Config reloads triggered by SIGHUP are applied on shard 0 and then broadcast to all other shards. An rwlock has also been introduced to make sure that the configuration has been propagated to all cores. This guarantees that a client requesting a config via the REST API will see a consistent snapshot.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-757
Fixes: [28141](https://github.com/scylladb/scylladb/issues/28141)

Closes scylladb/scylladb#28950

* github.com:scylladb/scylladb:
  test/object_store: verify object storage client creation and live reconfiguration
  sstables/utils/s3: split config update into sync and async parts
  test_config: improve logging for wait_for_config API
  db: introduce read-write lock to synchronize config updates with REST API
2026-04-16 10:20:43 +03:00
Michael Litvak
43c76aaf2b logstor: split log record to header and data
Split the `log_record` to `log_record_header` type that has the record
metadata fields and the mutation as a separate field which is the actual
record data:

struct log_record {
    log_record_header header;
    canonical_mutation mut;
};

Both the header and mutation have variable serialized size. When a
record is serialized in a write_buffer, we first put a small
`record_header` that has the header size and data size, then the
serialized header and data follow. The `log_location` of a record points
to the beginning of the `record_header`, and the size includes the
`record_header`.

This allows us to read a record header without reading the data when
it's not needed and avoid deserializing it:
* on recovery, when scanning all segments, we read only the record
  headers.
* on compaction, we read the record header first to determine if the
  record is alive, if yes then we read the data.
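
A simplified sketch of this layout, under loud assumptions: the prefix fields are two 32-bit sizes in host byte order, and the header and data are opaque byte strings. The real logstor serialization is not shown in this message, so the field widths and the helper names (`append_record`, `read_header_only`, `record_size`) are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <string>
#include <vector>

// Small fixed-size prefix written before every record (assumed layout).
struct record_header {
    std::uint32_t header_size;
    std::uint32_t data_size;
};

// Appends one record and returns the offset of its prefix, which plays the
// role of the record's location.
std::size_t append_record(std::vector<char>& buf, const std::string& header,
                          const std::string& data) {
    const std::size_t offset = buf.size();
    const record_header rh{static_cast<std::uint32_t>(header.size()),
                           static_cast<std::uint32_t>(data.size())};
    const char* p = reinterpret_cast<const char*>(&rh);
    buf.insert(buf.end(), p, p + sizeof(rh));
    buf.insert(buf.end(), header.begin(), header.end());
    buf.insert(buf.end(), data.begin(), data.end());
    return offset;
}

// Reads only the serialized header; the data bytes are never touched.
std::string read_header_only(const std::vector<char>& buf, std::size_t offset) {
    record_header rh;
    std::memcpy(&rh, buf.data() + offset, sizeof(rh));
    return std::string(buf.data() + offset + sizeof(rh), rh.header_size);
}

// Total size of the record, including the fixed-size prefix.
std::size_t record_size(const std::vector<char>& buf, std::size_t offset) {
    record_header rh;
    std::memcpy(&rh, buf.data() + offset, sizeof(rh));
    return sizeof(rh) + rh.header_size + rh.data_size;
}
```

A scan over a segment can then hop from record to record with `record_size()` while deserializing only headers.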

Closes scylladb/scylladb#29457
2026-04-16 10:00:35 +03:00
Botond Dénes
8e7ba7efe2 Merge 'commitlog: fix segment replay order by using ordered map per shard' from Sergey Zolotukhin
The commitlog replayer groups segments by shard using a
std::unordered_multimap, then iterates per-shard segments via
equal_range(). However, equal_range() does not guarantee iteration
order for elements with the same key, so segments could be replayed
out of order within a shard.

Correct segment ordering is required for:
- Fragmented entry reconstruction, which accumulates fragments across
  segments and depends on ascending order for efficient processing.
- Commitlog-based storage used by the strongly consistent tables
  feature, which relies on replayed raft items being stored in order.

Fix by changing the data structure from
  std::unordered_multimap<unsigned, commitlog::descriptor>
to
  std::unordered_map<unsigned, utils::chunked_vector<commitlog::descriptor>>

Since the descriptors are inserted from a std::set ordered by ID, the
vector preserves insertion (and thus ID) order. The per-shard iteration
now simply iterates the vector, guaranteeing correct replay order.
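
The idea can be illustrated with standard containers. In this sketch std::vector stands in for utils::chunked_vector, and `descriptor`, `by_id` and `group_by_shard()` are simplified hypothetical versions of the real types:

```cpp
#include <cassert>
#include <cstdint>
#include <set>
#include <unordered_map>
#include <vector>

// Simplified stand-in for commitlog::descriptor.
struct descriptor {
    unsigned shard;
    std::uint64_t id;
};

// Orders descriptors by ascending segment ID.
struct by_id {
    bool operator()(const descriptor& a, const descriptor& b) const {
        return a.id < b.id;
    }
};

// Groups segments by shard. Because the input set is iterated in ID order
// and each per-shard vector keeps insertion order, every vector ends up
// sorted by ascending ID.
std::unordered_map<unsigned, std::vector<descriptor>>
group_by_shard(const std::set<descriptor, by_id>& segments) {
    std::unordered_map<unsigned, std::vector<descriptor>> per_shard;
    for (const auto& d : segments) {
        per_shard[d.shard].push_back(d);
    }
    return per_shard;
}
```

The order in which shards are visited is still unspecified, but within one shard the segments are always seen in ID order.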

Fixes: SCYLLADB-1411

Backport: It looks like this issue doesn't cause any trouble, and the fix is required only by the strongly consistent tables feature, so no backport is required.

Closes scylladb/scylladb#29372

* github.com:scylladb/scylladb:
  commitlog: add test to verify segment replay order
  commitlog: fix replay order by using ordered map per shard
2026-04-16 09:55:27 +03:00
Pavel Emelyanov
335261f351 cql3: Move enable_create_table_with_compact_storage to cql_config
Move enable_create_table_with_compact_storage option from db::config to
cql_config. This improves separation of concerns by consolidating CQL-specific
table creation policies in the cql_config structure. Update the CREATE TABLE
statement prepare() function to use the new location for the configuration check.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 08:52:20 +03:00
Pavel Emelyanov
f20ede79f9 cql3: Move strict_is_not_null_in_views to cql_config
Move strict_is_not_null_in_views option from db::config to cql_config via
new view_restrictions sub-struct. This improves separation of concerns by
keeping view-specific validation policies with other CQL configuration.
Update prepare_view() to take view_restrictions reference instead of reaching
into db::config, and update all callsites to pass the sub-struct.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 08:52:19 +03:00
Pavel Emelyanov
027c91f45e cql3: Move restrict_future_timestamp to cql_config
Move restrict_future_timestamp option from db::config to cql_config. This
improves separation of concerns as timestamp validation is part of CQL query
execution behavior. Update validate_timestamp() function signature to take
cql_config reference instead of db::config, and update all callsites in
modification_statement and batch_statement to pass cql_config.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 08:51:53 +03:00
Pavel Emelyanov
7264581881 cql3: Move TWCS restriction options to cql_config
Move twcs_max_window_count and restrict_twcs_without_default_ttl options
from db::config to cql_config via new twcs_restrictions sub-struct. This
improves separation of concerns by keeping TWCS-specific validation policies
with other CQL configuration. Update check_restricted_table_properties()
to remove unused db parameter and take twcs_restrictions reference instead.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 08:51:52 +03:00
Pavel Emelyanov
8b853505cd cql3: Move keyspace restriction options to cql_config
Introduce replication_restrictions, a sub-struct of cql_config, to hold
the seven keyspace-level policy options that govern how CREATE/ALTER
KEYSPACE statements are validated:
  - restrict_replication_simplestrategy
  - replication_strategy_warn_list / replication_strategy_fail_list
  - minimum/maximum_replication_factor_warn/fail_threshold

Pass replication_restrictions into check_against_restricted_replication_strategies()
instead of having it reach into db::config directly (via both
qp.db().get_config() and qp.proxy().data_dictionary().get_config()).

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 08:51:24 +03:00
Benny Halevy
ce00d61917 db: implement large_data virtual tables with feature flag gating
Replace the physical system.large_partitions, system.large_rows, and
system.large_cells CQL tables with virtual tables that read from
LargeDataRecords stored in SSTable scylla metadata (tag 13).

The transition is gated by a new LARGE_DATA_VIRTUAL_TABLES cluster
feature flag:

- Before the feature is enabled: the old physical tables remain in
  all_tables(), CQL writes are active, no virtual tables are registered.
  This ensures safe rollback during rolling upgrades.

- After the feature is enabled: old physical tables are dropped from
  disk via legacy_drop_table_on_all_shards(), virtual tables are
  registered on all shards, and CQL writes are skipped via
  skip_cql_writes() in cql_table_large_data_handler.

Key implementation details:

- Three virtual table classes (large_partitions_virtual_table,
  large_rows_virtual_table, large_cells_virtual_table) extend
  streaming_virtual_table with cross-shard record collection.

- generate_legacy_id() gains a version parameter; virtual tables
  use version 1 to get different UUIDs than the old physical tables.

- compaction_time is derived from SSTable generation UUID at display
  time via UUID_gen::unix_timestamp().

- Legacy SSTables without LargeDataRecords emit synthetic summary
  rows based on above_threshold > 0 in LargeDataStats.

- The activation logic uses two paths: when the feature is already
  enabled (test env, restart), it runs as a coroutine; when not yet
  enabled, it registers a when_enabled callback that runs inside
  seastar::async from feature_service::enable().

- sstable_3_x_test updated to use a simplified large_data_test_handler
  and validate LargeDataRecords in SSTable metadata directly.
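
The compaction_time derivation mentioned above can be illustrated with a
small Python sketch. It assumes the SSTable generation is a version-1
(time-based) UUID; the helper names are hypothetical and this is not the
UUID_gen implementation.

```python
import uuid

# 100-ns intervals between the UUID epoch (1582-10-15) and the Unix epoch.
GREGORIAN_OFFSET_100NS = 0x01B21DD213814000

def unix_timestamp_ms(u):
    """Extract the embedded timestamp of a version-1 UUID as Unix milliseconds."""
    return (u.time - GREGORIAN_OFFSET_100NS) // 10_000

def timeuuid_from_unix_ms(ms):
    """Build a version-1 UUID carrying the given timestamp (sketch helper only)."""
    t = ms * 10_000 + GREGORIAN_OFFSET_100NS
    time_low = t & 0xFFFFFFFF
    time_mid = (t >> 32) & 0xFFFF
    time_hi_version = ((t >> 48) & 0x0FFF) | 0x1000  # version 1
    # Clock sequence and node are zeroed; they do not affect the timestamp.
    return uuid.UUID(fields=(time_low, time_mid, time_hi_version, 0x80, 0, 0))

generation = timeuuid_from_unix_ms(1_776_318_542_000)
compaction_time_ms = unix_timestamp_ms(generation)
```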
2026-04-16 08:49:02 +03:00
Benny Halevy
cb6004b625 db: call initialize_virtual_tables from shard 0 only
Move the smp::invoke_on_all dispatch from the callers into
initialize_virtual_tables() itself, so the function is called
once from shard 0 and internally distributes the per-shard
virtual table setup to all shards.

This simplifies the callers and allows a single place to add
cross-shard coordination logic (e.g. feature-gated table
registration) in future commits.
2026-04-16 08:49:02 +03:00
Benny Halevy
90d4ff34fb test: add LargeDataRecords round-trip unit tests
Add three new test cases to sstable_3_x_test.cc that verify the
LargeDataRecords metadata written by the SSTable writer can be read
back after open_data():

- test_large_data_records_round_trip: verifies partition_size, row_size,
  and cell_size records are written with correct field semantics when
  thresholds are exceeded
- test_large_data_records_top_n_bounded: verifies the bounded min-heap
  keeps only the top-N largest entries per type
- test_large_data_records_none_when_below_threshold: verifies no records
  are written when data is below all thresholds

Also wire large_data_records_per_sstable from db_config into the test
env's sstables_manager::config so that config changes propagate through
the updateable_value chain to configure_writer().
2026-04-16 08:49:02 +03:00
Benny Halevy
1f7faeef57 sstables: populate LargeDataRecords from writer
During compaction (SSTable writing), maintain bounded min-heaps (one per
large_data_type) that collect the top-N above-threshold records.  On
stream end, drain all five heaps into a single LargeDataRecords array
and write it into the SSTable's scylla metadata component.

Five separate heaps are used:
- partition_size, row_size, cell_size: ordered by value (size bytes)
- rows_in_partition, elements_in_collection: ordered by elements_count

A new config option 'compaction_large_data_records_per_sstable' (default
10) controls the maximum number of records kept per type.
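
A minimal sketch of the bounded min-heap technique, in Python rather than
the writer's C++. The function name and the strict "greater than
threshold" comparison are assumptions made for illustration.

```python
import heapq

def top_n_above_threshold(values, threshold, n):
    """Keep the n largest values that exceed threshold, using a bounded min-heap."""
    heap = []  # heap[0] is always the smallest retained value
    for v in values:
        if v <= threshold:
            continue  # below-threshold entries are never recorded
        if len(heap) < n:
            heapq.heappush(heap, v)
        elif v > heap[0]:
            # Evict the current smallest to make room for a larger entry.
            heapq.heapreplace(heap, v)
    return sorted(heap, reverse=True)

largest = top_n_above_threshold([5, 120, 80, 300, 95, 210], threshold=90, n=3)
```

The heap never holds more than n entries, so memory stays bounded no
matter how many above-threshold items the writer sees.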
2026-04-16 08:49:02 +03:00
Benny Halevy
8f4976f65d large_data_handler: return above_threshold_result from maybe_record_large_cells
Change maybe_record_large_cells to return above_threshold_result with
separate booleans for cell size (.size) and collection elements
(.elements) thresholds.  This allows the writer to track above_threshold
counts for cell_size and elements_in_collection independently.
2026-04-16 08:49:02 +03:00
Benny Halevy
c1b797f288 large_data_handler: rename partition_above_threshold to above_threshold_result
Rename partition_above_threshold to above_threshold_result and its
'rows' field to 'elements', making it a generic struct that can be
reused for other large data types (e.g., cells with collection
elements).

Use designated initializers for clarity.
2026-04-16 08:49:02 +03:00
Benny Halevy
d92cd42fe6 sstables: add LargeDataRecords metadata type (tag 13)
Add a new scylla metadata component LargeDataRecords (tag 13) that
stores per-SSTable top-N large data records.  Each record carries:
  - large_data_type (partition_size, row_size, cell_size, etc.)
  - binary serialized partition key and clustering key
  - column name (for cell records)
  - value (size in bytes)
  - element count (rows or collection elements, type-dependent)
  - range tombstones and dead rows (partition records only)

The struct uses disk_string<uint32_t> for key/name fields and is
serialized via the existing describe_type framework into the SSTable
Scylla metadata component.

Add JSON support in scylla-sstable and format documentation.
2026-04-16 08:49:01 +03:00
Benny Halevy
85e2c6f2a7 sstables: add fmt::formatter for large_data_type
Add a fmt::formatter specialization for sstables::large_data_type and
use it in scylla-sstable.cc instead of the local to_string() overload,
which is removed.
2026-04-16 08:42:54 +03:00
Benny Halevy
d4283d0ffc keys: move key_to_str() to keys/keys.hh
Move the key_to_str() template function from a file-local static in
db/large_data_handler.cc to keys/keys.hh so it can be reused by:
- large_data_handler.cc for log messages
- virtual tables (db/virtual_tables.cc) for converting binary keys
  to human-readable CQL display
- scylla-sstable for JSON output of LargeDataRecords

No functional change.
2026-04-16 08:42:54 +03:00
Pavel Emelyanov
1af26a1dd6 cql3: Move batch_size_fail_threshold_in_kb to cql_config
The batch_size_fail_threshold_in_kb option controls the batch size at
which an oversized batch error is returned to the client. It belongs in
cql_config rather than db::config as it directly governs CQL batch
statement behavior.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 07:57:27 +03:00
Pavel Emelyanov
4d255cf533 cql3: Move batch_size_warn_threshold_in_kb to cql_config
The batch_size_warn_threshold_in_kb option controls the batch size at
which a client warning is emitted during batch execution. It belongs in
cql_config rather than db::config as it directly governs CQL batch
statement behavior.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 07:57:27 +03:00
Pavel Emelyanov
a3f097f100 cql3: Move enable_parallelized_aggregation to cql_config
The enable_parallelized_aggregation option controls whether aggregation
queries are fanned out across shards for parallel execution. It belongs
in cql_config rather than db::config as it directly governs CQL query
behavior at prepare time.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 07:57:27 +03:00
Pavel Emelyanov
4314fc0642 cql3: Move strict_allow_filtering to cql_config
The strict_allow_filtering option controls whether queries that require
ALLOW FILTERING are silently accepted, warned about, or rejected. It
belongs in cql_config rather than db::config as it directly governs CQL
query behavior at prepare time.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 07:57:26 +03:00
Pavel Emelyanov
3411ed8bcc cql3: Move select_internal_page_size to cql_config
The select_internal_page_size option controls CQL query execution
behavior (internal paging for aggregate/filtered SELECTs) and belongs
in cql_config rather than being read directly from db::config at
execution time.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 07:57:26 +03:00
Pavel Emelyanov
728eb20b42 test: Fix cql_test_env to use updateable cql_config from db::config
The test environment was creating cql_config with hardcoded default values that
were never updated when system.config was modified via CQL. This broke tests
that dynamically change configuration values (e.g., TWCS tests).

Fix by creating cql_config from db::config using sharded_parameter, which
ensures updateable_value fields track the actual db::config sources and reflect
changes made during test execution.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2026-04-16 07:57:26 +03:00
Pavel Emelyanov
60a834d9fa cql3: Add cql_config parameter to parsed_statement::prepare()
Pass cql_config to prepare() so that statement preparation can use
CQL-specific configuration rather than reaching into db::config
directly.

Callers that use default_cql_config:
- db/view/view.cc: builds a SELECT statement internally to compute view
  restrictions, not in response to a user query
- cql3/statements/create_view_statement.cc: same -- parses the view's
  WHERE clause as a synthetic SELECT to extract restrictions
- tools/schema_loader.cc: offline schema loading tool, no runtime
  config available
- tools/scylla-sstable.cc: offline sstable inspection tool, no runtime
  config available

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-16 07:57:25 +03:00
Nadav Har'El
f0e9177130 Merge 'audit/alternator: Make Alternator requests audited' from Piotr Szymaniak
Each Alternator API call results in the request being audited, provided auditing is enabled.
Both successful and failed requests are audited, with a few exceptions.

The chosen audit types for the operations:
- CreateTable - DDL
- DescribeTable - QUERY
- DeleteTable - DDL
- UpdateTable - DDL
- PutItem - DML
- UpdateItem - DML
- GetItem - QUERY
- DeleteItem - DML
- ListTables - QUERY
- Scan - QUERY
- DescribeEndpoints - QUERY
- BatchWriteItem - DML
- BatchGetItem - QUERY
- Query - QUERY
- TagResource - DDL
- UntagResource - DDL
- ListTagsOfResource - QUERY
- UpdateTimeToLive - DDL
- DescribeTimeToLive - QUERY
- ListStreams - QUERY
- DescribeStream - QUERY
- GetShardIterator - QUERY
- GetRecords - QUERY
- DescribeContinuousBackups - QUERY

FIXME: The tests currently cover the new functionality only partially.

Fixes: scylladb/scylla-enterprise#3796
Fixes: SCYLLADB-467

No need to backport, new functionality.

Closes scylladb/scylladb#27953

* github.com:scylladb/scylladb:
  audit/alternator: support audit_tables=alternator.<table> shorthand
  audit/alternator: Add negative audit tests
  audit/alternator: Add testing of auditing
  audit/alternator: Audit requests
  audit/alternator: Refactor in preparation for auditing Alternator
2026-04-15 22:17:57 +03:00
Nikos Dragazis
d38f44208a test/cqlpy: Harden mutation_fragments tests against background flushes
Several tests in test_select_from_mutation_fragments.py assume that all
mutations end up in a single SSTable. This assumption can be violated
by background memtable flushes triggered by commitlog disk pressure.

Since the Scylla node is taken from a pool, it may carry unflushed data
from prior tests that prevents closed segments from being recycled,
thereby increasing the commitlog disk usage. A main source of such
pressure is keyspace-level flushes from earlier tests in this module,
which rotate commitlog segments without flushing system tables (e.g.,
`system.compaction_history`), leaving closed segments dirty.
Additionally, prior tests in the same module may have left unflushed
data on the shared test table (`test_table` fixture), keeping commitlog
segments dirty on its behalf as well. When commitlog disk usage exceeds
its threshold, the system flushes the test table to reclaim those
segments, potentially splitting a running test's mutations across
multiple SSTables.

This was observed in CI, where test_paging failed because its data was
split across two SSTables, resulting in more mutation fragments than the
hardcoded expected count.

This patch fixes the affected tests in two ways:

1. Where possible, tests are reworked to not assume a single SSTable:
   - test_paging
   - test_slicing_rows
   - test_many_partition_scan

2. Where rework is impractical, major compaction is added after writes
   and before validation to ensure that only one SSTable will exist:
   - test_smoke
   - test_count
   - test_metadata_and_value
   - test_slicing_range_tombstone_changes

Fixes SCYLLADB-1375.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#29389
2026-04-15 21:46:00 +03:00
Michael Litvak
cc94467097 test: test drop table during streaming
Add a test that drops a table while tablet streaming is running for the
table. The table is dropped after taking the storage snapshot and
initializing streaming sources - after that streaming should be able
to complete or abort correctly if the table is dropped. We want to
verify there is no incorrect access to the destroyed table.

The test covers both types of streaming in stream_blob - sstables
and logstor segments.
2026-04-15 19:23:00 +02:00
Michael Litvak
69d2a90106 streaming: stream_blob: hold table for streaming
When initializing streaming sources in tablet_stream_files_handler we
use a reference to the table. We should hold the table while doing so,
because otherwise the table may be dropped and destroyed when we yield.
Use the table.stream_in_progress() phaser to hold the table while we
access it.

For sstable file streaming we can release the table after the snapshot
is initialized, and the table may be dropped safely because the files
are held by the snapshot and we don't access the table anymore. There
was a single access to the table for logging but it is replaced by a
pre-calculated variable.

Logstor segment streaming currently doesn't support discarding the
segments while they are streamed - when the table is dropped it
discards the segments by overwriting and freeing them, so they shouldn't
be accessed after that. Therefore, in that case continue to hold the
table until streaming is completed.

Fixes SCYLLADB-1533
2026-04-15 19:22:42 +02:00
Avi Kivity
59ec93b86b Merge 'Allow arbitrary tablet boundaries and count' from Tomasz Grabiec
There are several reasons we want to do that.

One is that it will give us more flexibility in distributing the
load. We can subdivide tablets at any token, and achieve more
evenly-sized tablets. In particular, we can isolate large partitions
into separate tablets.

We can also split and merge individual tablets incrementally.
Currently, we do it for the whole table or nothing, which makes
splits and merges take longer and cause wide swings of the count.
This is not implemented in this PR yet; we still split/merge the whole table.

Another reason is vnode-to-tablets migration. We can now construct a
tablet map which exactly matches the vnode boundaries, so migration
can happen transparently from the CQL coordinator's point of view.

Tablet count is still a power-of-two by default for newly created tables.
It may be different if tablet map is created by non-standard means,
or if per-table tablet option "pow2_count" is set to "false".

build/release/scylla perf-tablets:

Memory footprint for 131k tablets increased from 56 MiB to 58.1 MiB (+3.5%)

Before:
```
Generating tablet metadata
Total tablet count: 131072
Size of tablet_metadata in memory: 57456 KiB
Copied in 0.014346 [ms]
Cleared in 0.002698 [ms]
Saved in 1234.685303 [ms]
Read in 445.577881 [ms]
Read mutations in 299.596313 [ms] 128 mutations
Read required hosts in 247.482742 [ms]
Size of canonical mutations: 33.945053 [MiB]
Disk space used by system.tablets: 1.456761 [MiB]
Tablet metadata reload:
full      407.69ms
partial     2.65ms
```

After:
```
Generating tablet metadata
Total tablet count: 131072
Size of tablet_metadata in memory: 59504 KiB
Copied in 0.032475 [ms]
Cleared in 0.002965 [ms]
Saved in 1093.877441 [ms]
Read in 387.027100 [ms]
Read mutations in 255.752121 [ms] 128 mutations
Read required hosts in 211.202805 [ms]
Size of canonical mutations: 33.954453 [MiB]
Disk space used by system.tablets: 1.450162 [MiB]
Tablet metadata reload:
full      354.50ms
partial     2.19ms
```

Closes scylladb/scylladb#28459

* github.com:scylladb/scylladb:
  test: boost: tablets: Add test for merge with arbitrary tablet count
  tablets, database: Advertise 'arbitrary' layout in snapshot manifest
  tablets: Introduce pow2_count per-table tablet option
  tablets: Prepare for non-power-of-two tablet count
  tablets: Implement merged tablet_map constructor on top of for_each_sibling_tablets()
  tablets: Prepare resize_decision to hold data in decisions
  tablets: table: Make storage_group handle arbitrary merge boundaries
  tablets: Make stats update post-merge work with arbitrary merge boundaries
  locator: tablets: Support arbitrary tablet boundaries
  locator: tablets: Introduce tablet_map::get_split_token()
  dht: Introduce get_uniform_tokens()
2026-04-15 18:57:22 +03:00
Andrzej Jackowski
78926d9c96 test/random_failures: remove gossip shadow round injection
Commit c17c4806a1 removed check_for_endpoint_collision() from
the fresh bootstrap path, which was the only code path that
called do_shadow_round() for new nodes. Since the gossip shadow
round is no longer executed during bootstrap, remove the
stop_during_gossip_shadow_round error injection from the test.

The entry is marked as REMOVED_ rather than deleted to preserve
the shuffle order for seed-based test reproducibility.
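
A small Python sketch of why a placeholder keeps seed-based runs
reproducible. The list contents other than the removed injection name are
hypothetical, and how the real test shuffles its list is an assumption.

```python
import random

def shuffled(entries, seed):
    """Return a copy of entries shuffled deterministically by seed."""
    result = list(entries)
    random.Random(seed).shuffle(result)
    return result

# Hypothetical injection names; only the third one appears in the commit message.
original = ["inject_a", "inject_b", "stop_during_gossip_shadow_round", "inject_c"]
renamed = ["inject_a", "inject_b", "REMOVED_stop_during_gossip_shadow_round", "inject_c"]

seed = 1234
before = shuffled(original, seed)
after_rename = shuffled(renamed, seed)
# Every surviving entry keeps the position it had before the rename.
same_positions = all(
    a == b or b.startswith("REMOVED_") for a, b in zip(before, after_rename)
)
```

The permutation depends only on the seed and the list length, so renaming
an entry leaves it unchanged, while deleting the entry would shorten the
list and could reorder the remaining ones.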

The injection point in gms/gossiper.cc is also removed since it
is no longer used by any test.

Fixes: SCYLLADB-1466

Closes scylladb/scylladb#29460
2026-04-15 16:30:55 +02:00
Dario Mirovic
336dab1eec test: lib: fix broken retry in start_docker_service
The retry loop in start_docker_service passes the parse callbacks
via std::move into create_handler on each iteration. After the
first iteration, the moved-from std::function objects are empty.
All subsequent retries skip output parsing entirely and
immediately treat the service as successfully started. This
defeats the entire purpose of the retry mechanism.

Fix by passing the callbacks by copy instead of move, so the
original callbacks remain valid across retries.

Fixes SCYLLADB-1542
2026-04-15 15:25:52 +02:00
Asias He
4137a4229c test: Stabilize tablet incremental repair error test
Use async tablet repair task flow to avoid a race where client timeout
returns while server-side repair continues after injections are
disabled.

Start repair with await_completion=false, assert it does not complete
within timeout under injection, abort/wait the task, then verify
sstables_repaired_at is unchanged.

Fixes SCYLLADB-1184

Closes scylladb/scylladb#29452
2026-04-15 16:24:43 +03:00
Dimitrios Symonidis
ca003680a7 test/object_store: verify object storage client creation and live reconfiguration
2026-04-15 14:28:39 +02:00
Dimitrios Symonidis
24a7b146fa sstables/utils/s3: split config update into sync and async parts
Config observers run synchronously in a reactor turn and must not
suspend. Split the previous monolithic async update_config() coroutine
into two phases:

Sync (runs in the observer, never suspends):
  - S3: atomically swap _cfg (lw_shared_ptr) and set a credentials
    refresh flag.
  - GCS: install a freshly constructed client; stash the old one for
    async cleanup.
  - storage_manager: update _object_storage_endpoints and fire the
    async cleanup via a gate-guarded background fiber.

Async (gate-guarded background fiber):
  - S3: acquire _creds_sem, invalidate and rearm credentials only if
    the refresh flag is set.
  - GCS: drain and close stashed old clients.
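
A sketch of the two-phase pattern for the S3 case, using Python's asyncio
in place of Seastar. The class and method names are hypothetical, an
asyncio.Lock stands in for _creds_sem, a counter stands in for
"invalidate and rearm", and the gate is not modeled.

```python
import asyncio

class S3ClientSketch:
    def __init__(self, cfg):
        self._cfg = cfg
        self._refresh_credentials = False
        self._creds_lock = asyncio.Lock()  # stands in for _creds_sem
        self.credentials_generation = 0

    def update_config_sync(self, new_cfg):
        """Observer phase: no awaits, only a reference swap and a flag."""
        self._cfg = new_cfg
        self._refresh_credentials = True

    async def update_config_async(self):
        """Background phase: may suspend while waiting for the lock."""
        async with self._creds_lock:
            if self._refresh_credentials:
                self._refresh_credentials = False
                self.credentials_generation += 1  # stands in for invalidate + rearm

async def demo():
    client = S3ClientSketch({"endpoint": "old"})
    client.update_config_sync({"endpoint": "new"})
    await client.update_config_async()
    await client.update_config_async()  # flag already cleared: no second refresh
    return client._cfg["endpoint"], client.credentials_generation

result = asyncio.run(demo())
```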
2026-04-15 14:28:31 +02:00
Dimitrios Symonidis
a958da0ab9 test_config: improve logging for wait_for_config API
2026-04-15 14:28:31 +02:00
Dimitrios Symonidis
71714fdc0e db: introduce read-write lock to synchronize config updates with REST API
Config is reloaded on SIGHUP on shard 0 and broadcast to all shards
under a write lock. REST API callers reading find_config_id acquire a
read lock via value_as_json_string_for_name() and are guaranteed a
consistent snapshot even when a reload is in progress.
2026-04-15 14:28:31 +02:00
dependabot[bot]
d584e8e321 build(deps): bump sphinx-scylladb-theme from 1.9.1 to 1.9.2 in /docs
Bumps [sphinx-scylladb-theme](https://github.com/scylladb/sphinx-scylladb-theme) from 1.9.1 to 1.9.2.
- [Release notes](https://github.com/scylladb/sphinx-scylladb-theme/releases)
- [Commits](https://github.com/scylladb/sphinx-scylladb-theme/compare/1.9.1...1.9.2)

---
updated-dependencies:
- dependency-name: sphinx-scylladb-theme
  dependency-version: 1.9.2
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Closes scylladb/scylladb#29476
2026-04-15 14:57:37 +03:00
Gleb Natapov
ca24dd4a5f topology coordinator: log request cancellation only when requests are really canceled
Currently cancellation is logged in get_next_task, but the function is
also called by tablets code, where we do not act upon its result and only
yield to the topology coordinator. The topology coordinator will not
necessarily do the cancellation either, since it can be busy with tablet
migration. As a result cancellation is logged but not done, which is
confusing. Fix it by logging cancellation when it actually happens.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1409

Closes scylladb/scylladb#29471
2026-04-15 14:46:59 +03:00
Botond Dénes
280fe7cfb7 Merge 'Make inclusion of config.hh cheaper' from Nadav Har'El
This is an attempt (mostly suggested and implemented by AI, but with a few hours of human babysitting...) to somewhat reduce compilation time by picking one template, named_value<T>, which is used in more than a hundred source files through the config.hh header, and making it use external instantiation: the different methods of named_value<T> for various T are instantiated only once (in config.cc), and the individual translation units don't need to compile them a hundred times.

The resulting saving is a little underwhelming: The total object-file size goes down about 1% (from 346,200 before the patch to 343,488 after the patch), and previous experience shows that this object-file size is proportional to the compilation time, most of which involves code generation. But I haven't been able to measure speedup of the build itself.

1% is not nothing, but not a huge saving either. Though arguably, with 50 more of these patches, we can make the build twice as fast :-)

Refs #1.

Closes scylladb/scylladb#28992

* github.com:scylladb/scylladb:
  config: move named_value<T> method bodies out-of-line
  config: suppress named_value<T> instantiation in every source file
2026-04-15 14:40:15 +03:00
Botond Dénes
00d8470554 Merge 'test: filter benign shutdown errors in tests that grep logs directly' from Marcin Maliszkiewicz
Tests that call grep_for_errors() directly and assert no errors
can fail spuriously due to benign RPC errors during graceful
shutdown (e.g. "connection dropped: Semaphore broken"), which
are already filtered by the after_test hook via filter_errors().
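
A sketch of filtering that accepts either plain lines or groups of lines,
in Python like the test suite. `filter_benign` and its pattern list are
hypothetical and are not the signature of the real filter_errors(). The
rule that a group is dropped when its first line is benign is an
assumption.

```python
import re

# Benign pattern taken from the example in the commit message.
BENIGN_PATTERNS = [re.compile(r"connection dropped: Semaphore broken")]

def is_benign(line):
    return any(p.search(line) for p in BENIGN_PATTERNS)

def filter_benign(errors):
    """Drop benign entries from a list of lines or a list of line groups."""
    kept = []
    for entry in errors:
        # A group is an error line followed by continuation lines;
        # its first line decides whether the whole group is benign (assumption).
        head = entry if isinstance(entry, str) else entry[0]
        if not is_benign(head):
            kept.append(entry)
    return kept

groups = [
    ["ERROR rpc - connection dropped: Semaphore broken", "  at frame 1"],
    ["ERROR storage - disk failure"],
]
remaining = filter_benign(groups)
```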

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1464
Backport: no, tests fix (we may decide to backport later if it occurs on release branches)

Closes scylladb/scylladb#29463

* github.com:scylladb/scylladb:
  test: filter benign errors in tests that grep logs during shutdown
  test: filter_errors: support list[list[str]] error groups
2026-04-15 14:40:15 +03:00
Piotr Szymaniak
5b00675bf0 storage_proxy: expedite speculative retry on replica disconnect
When a replica disconnects during a digest read (e.g., during
decommission), the speculating_read_executor now immediately fires
the pending speculative retry instead of waiting for the timer.

On DISCONNECT, the digest_read_resolver invokes an _on_disconnect
callback set by the executor. The callback cancels the speculate
timer and rearms it to clock_type::now() (lowres_clock::now() =
thread-local memory read, no syscall). The existing timer callback
fires on the next reactor poll with all its logic intact — checking
is_completed(), calling add_wait_targets(1), sending the request,
and incrementing speculative_digest_reads/speculative_data_reads.

The notification is fire-and-forget: on_error() does NOT absorb the
DISCONNECT. The existing error arithmetic in digest_read_resolver
already handles this correctly because _target_count_for_cl accounts
for the speculative target.

For never_speculating_read_executor (no spare target) and
always_speculating_read_executor (all requests sent upfront),
_on_disconnect is never set — no behavior change.

Fixes scylladb/scylladb#26307

Closes scylladb/scylladb#29428
2026-04-15 14:40:15 +03:00
Raphael S. Carvalho
a2eed4bb45 service: Use optimistic replicas in all_sibling_tablet_replicas_colocated
all_sibling_tablet_replicas_colocated was using committed ti.replicas to
decide whether sibling tablets are co-located and merge can be finalized.
This caused a false non-co-located window when a co-located pair was moved
by the load balancer: as both tablets migrate together, their del_transition
commits may land in different Raft rounds. After the first commit, ti.replicas
diverge temporarily (one tablet shows the new position, the other the old),
causing all_sibling_tablet_replicas_colocated to return false. This clears
finalize_resize, allowing the load balancer to start new cascading migrations
that delay merge finalization by tens of seconds.

Fix this by using the optimistic replica view (trinfo->next when transitioning,
ti.replicas otherwise) — the same view the load balancer uses for load
accounting — so finalize_resize stays populated throughout an in-flight
migration and no spurious cascades are triggered.

Steps that lead to the problem:

1. Merge is triggered. The load balancer generates co-location migrations
   for all sibling pairs that are not yet on the same shard. Some pairs
   finish co-location before others.

2. Once all pairs are co-located in committed state,
   all_sibling_tablet_replicas_colocated returns true and finalize_resize
   is set. Meanwhile the load balancer may have already started a regular
   LB migration on one co-located pair (both tablets are stable and the
   load balancer is free to move them).

3. The LB migration moves both tablets together (colocated_tablets). Their
   two del_transition commits land in separate Raft rounds. After the first
   commit, ti.replicas[t1] = new position but ti.replicas[t2] = old position.

4. In this window, all_sibling_tablet_replicas_colocated sees the pair as
   NOT co-located, clears finalize_resize, and the load balancer generates
   new migrations for other tablets to rebalance the load that the pair
   move created.

5. Those new migrations can take tens of seconds to stream, keeping the
   coordinator in handle_tablet_migration mode and preventing
   maybe_start_tablet_resize_finalization from being called. The merge
   finalization is delayed until all those cascaded migrations complete.
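
The window described in steps 3 and 4 can be reproduced with a small
Python sketch. The classes below are simplified stand-ins for the tablet
metadata, not the actual C++ types. Replicas are modeled as sets of
(host, shard) pairs.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class TabletInfo:
    replicas: frozenset  # committed replica set

@dataclass
class TransitionInfo:
    next: frozenset  # replica set once the in-flight migration commits

def optimistic_replicas(ti: TabletInfo, trinfo: Optional[TransitionInfo]):
    """Use the migration target while transitioning, else the committed set."""
    return trinfo.next if trinfo is not None else ti.replicas

old_pos = frozenset({("host1", 0)})
new_pos = frozenset({("host2", 3)})

# After the first del_transition commit: t1 has landed, t2 is still in transition.
t1, tr1 = TabletInfo(new_pos), None
t2, tr2 = TabletInfo(old_pos), TransitionInfo(new_pos)

committed_view_colocated = t1.replicas == t2.replicas
optimistic_view_colocated = optimistic_replicas(t1, tr1) == optimistic_replicas(t2, tr2)
```

With the committed view the pair looks split for one Raft round; with the
optimistic view it stays co-located throughout the migration.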

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-821.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1459.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29465
2026-04-15 14:40:15 +03:00
Marcin Maliszkiewicz
53b6e9fda5 Merge 'Make DESCRIBE CLUSTER get cluster information from storage_service' from Pavel Emelyanov
Currently the statement returns cluster, partitioner and snitch names by accessing the global db::config via database. As part of an effort to detach components from the global db::config, this PR tweaks the statement handler to get the cluster information from another source. The needed cluster information is currently stored in different components, but they are all under the storage_service umbrella, which seems to be a good central source of this truth. Unit test included.

Cleaning components inter-dependencies, not backporting

Closes scylladb/scylladb#29429

* github.com:scylladb/scylladb:
  test: Add test_describe_cluster_sanity for DESCRIBE CLUSTER validation
  describe_statement: Get cluster info from storage_service
  storage_service: Add describe_cluster() method
  query_processor: Expose storage_service accessor
2026-04-15 14:40:15 +03:00
Botond Dénes
d0e99e018b reader_concurrency_semaphore: drop unused stop_ext_{pre,post}()
Left over from primordial times, when reader_concurrency_semaphore was
a base class for extensions in the separate enterprise repository.
Also remove the now unneeded virtual marker from the destructor.

Closes scylladb/scylladb#29399
2026-04-15 14:40:15 +03:00
Botond Dénes
4a2d032c6f Merge 'query: result_set: change row member to a chunked vector' from Benny Halevy
To prevent large memory allocations.

This series shows over 3% improvement in perf-simple-query throughput.
```
$ build/release/scylla perf-simple-query --default-log-level=error --smp=1 --random-seed=1855519715
random-seed=1855519715
enable-cache=1
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...

Before:
random-seed=1775976514
enable-cache=1
enable-index-cache=1
sstable-summary-ratio=0.0005
sstable-format=me
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
336345.11 tps ( 58.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   32788 insns/op,   12430 cycles/op,        0 errors)
348748.14 tps ( 58.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   32794 insns/op,   12335 cycles/op,        0 errors)
349012.63 tps ( 58.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   32800 insns/op,   12326 cycles/op,        0 errors)
350629.97 tps ( 58.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   32770 insns/op,   12270 cycles/op,        0 errors)
348585.00 tps ( 58.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   32804 insns/op,   12338 cycles/op,        0 errors)
throughput:
        mean=   346664.17 standard-deviation=5825.77
        median= 348748.14 median-absolute-deviation=2348.46
        maximum=350629.97 minimum=336345.11
instructions_per_op:
        mean=   32791.35 standard-deviation=13.60
        median= 32794.47 median-absolute-deviation=8.65
        maximum=32804.45 minimum=32769.57
cpu_cycles_per_op:
        mean=   12340.05 standard-deviation=57.57
        median= 12335.05 median-absolute-deviation=13.94
        maximum=12430.42 minimum=12270.28

After:
random-seed=1775976514
enable-cache=1
enable-index-cache=1
sstable-summary-ratio=0.0005
sstable-format=me
Running test with config: {partitions=10000, concurrency=100, mode=read, query_single_key=no, counters=no}
Disabling auto compaction
Creating 10000 partitions...
353770.85 tps ( 58.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   32762 insns/op,   11893 cycles/op,        0 errors)
364447.98 tps ( 58.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   32738 insns/op,   11818 cycles/op,        0 errors)
365268.97 tps ( 58.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   32734 insns/op,   11788 cycles/op,        0 errors)
344304.87 tps ( 58.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   32746 insns/op,   12506 cycles/op,        0 errors)
362263.57 tps ( 58.1 allocs/op,   0.0 logallocs/op,  14.1 tasks/op,   32756 insns/op,   11888 cycles/op,        0 errors)
throughput:
        mean=   358011.25 standard-deviation=8916.76
        median= 362263.57 median-absolute-deviation=6436.74
        maximum=365268.97 minimum=344304.87
instructions_per_op:
        mean=   32747.06 standard-deviation=11.85
        median= 32745.80 median-absolute-deviation=9.36
        maximum=32762.18 minimum=32734.01
cpu_cycles_per_op:
        mean=   11978.65 standard-deviation=298.06
        median= 11887.96 median-absolute-deviation=160.96
        maximum=12505.72 minimum=11788.49
```
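
The relative gain can be recomputed from the per-run tps figures quoted above. A minimal sketch; the helper name `relative_gain` is illustrative and not part of perf-simple-query:

```python
from statistics import mean, median

# Per-run throughput (tps) copied from the "Before" and "After" runs above.
before = [336345.11, 348748.14, 349012.63, 350629.97, 348585.00]
after = [353770.85, 364447.98, 365268.97, 344304.87, 362263.57]


def relative_gain(old, new):
    """Return the relative change from old to new, as a percentage."""
    return (new - old) / old * 100.0


mean_gain = relative_gain(mean(before), mean(after))
median_gain = relative_gain(median(before), median(after))
print(f"mean: {mean_gain:.2f}%  median: {median_gain:.2f}%")
```

Both the mean and the median come out above 3%, which is consistent with the claim, although five runs with this spread are a small sample.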

Refs #28511
(Refs rather than Fixes for the lack of a reproducer unit test)

* No backport needed as the issue is rare and not severe

Closes scylladb/scylladb#28631

* github.com:scylladb/scylladb:
  query: result_set: change row member to a chunked vector
  query: result_set_row: make noexcept
  query: non_null_data_value: assert is_nothrow_move_constructible and assignable
  types: data_value: assert is_nothrow_move_constructible and assignable
2026-04-15 14:40:15 +03:00
Nadav Har'El
1eb8d170dd Merge 'vector_index: allow recreating vector indexes on the same column' from Dawid Pawlik
This series allows creating multiple vector indexes on the same column so users can rebuild an index without losing query availability.

The intended flow is:
1. Create a new vector index on a column that already has one.
2. Keep serving ANN queries from the old index while the new one is being built.
3. Verify the new index is ready.
4. Automatically switch to the remaining index.
5. Drop the old index.

To make that deterministic, `index_version` is changed from the base table schema version to a real creation timeuuid. When multiple vector indexes exist on the same column, ANN query planning now picks the index according to the routing implemented in Vector Store (newest serving index). This keeps queries on the old index until the new one is up and ready.
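
The "newest serving index" rule can be sketched as follows. This is only an illustration under stated assumptions: `VectorIndex`, `pick_index` and the integer `created_at` (standing in for the timestamp carried by the creation timeuuid) are hypothetical names, not the actual planner or Vector Store code.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class VectorIndex:
    name: str
    created_at: int   # stand-in for the timestamp of the creation timeuuid
    serving: bool     # True once the index is built and can answer ANN queries


def pick_index(indexes: Iterable[VectorIndex]) -> Optional[VectorIndex]:
    """Pick the newest index that is already serving, or None if there is none."""
    candidates = [ix for ix in indexes if ix.serving]
    return max(candidates, key=lambda ix: ix.created_at, default=None)


old = VectorIndex("idx_old", created_at=100, serving=True)
building = VectorIndex("idx_new", created_at=200, serving=False)
ready = VectorIndex("idx_new", created_at=200, serving=True)

# While the new index is still being built, queries stay on the old one.
print(pick_index([old, building]).name)
# Once the new index is serving, it wins because it is newer.
print(pick_index([old, ready]).name)
```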

This patch also removes the create-time restriction that rejected a second vector index on the same column. Name collisions are still rejected as before.

Test coverage is updated accordingly:
- Scylla now verifies that two vector indexes can coexist on the same column.
- Cassandra/SAI behavior is still covered and is still expected to reject duplicate indexes on the same column.

Fixes: VECTOR-610

Closes scylladb/scylladb#29407

* github.com:scylladb/scylladb:
  docs: document vector index metadata and duplicate handling
  test/cqlpy: cover vector index duplicate creation rules
  vector_index: allow multiple named indexes on one column
  vector_index: store `index_version` as creation timeuuid
2026-04-15 14:40:15 +03:00
Botond Dénes
a9c86fc2e4 docs: document schema subcomponent in sstable-scylla-format.md
Commit 234f905 (sstables: scylla_metadata: add schema member) added a
new Schema subcomponent (tag 11) to scylla_metadata. Document it in the
sstable Scylla format reference:

- Add schema to the subcomponent grammar enumeration
- Add a summary entry describing the subcomponent (tag 11) and its purpose
- Add a detailed ## schema subcomponent section with the binary grammar,
  covering table_id, table_schema_version, keyspace_name, table_name and
  the column_description array (column_kind, column_name, column_type)

Fixes https://github.com/scylladb/scylladb/issues/27960

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#28983
2026-04-15 14:40:15 +03:00
Botond Dénes
5891efc2ca Merge 'service: add missing replicas if tablet rebuild was rolled back' from Aleksandra Martyniuk
An RF change of a tablet keyspace starts tablet rebuilds. Even if any of the rebuilds is rolled back (because the pending replica was excluded), the RF change request finishes successfully. In this case we end up with a replica state that isn't compatible with the expected keyspace replication.

Modify the topology coordinator so that, when it would otherwise be idle, it checks whether there are any missing replicas. If there are, it moves to transition_state::tablet_migration and runs the required rebuilds.

If a new RF change request encounters an invalid replica state, it fails. The state will be fixed later, after which the analogous ALTER KEYSPACE statement will be allowed.

Fixes: SCYLLADB-109.

Requires backport to all versions with tablet keyspace rf change.

Closes scylladb/scylladb#28709

* github.com:scylladb/scylladb:
  test: add test_failed_tablet_rebuild_is_retried_on_alter
  test: add a test to ensure that failed rebuilds are retried
  service: fail ALTER KEYSPACE if replicas do not satisfy the replication
  service: retry failed tablet rebuilds
  service: maybe_start_tablet_migration returns std::optional<group0_guard>
2026-04-15 14:40:15 +03:00
David Garcia
0eaa42c846 docs: Makefile: drop redundant -t $(FLAG) from sphinx options
Related: scylladb/scylladb-docs-homepage#153.

make multiversion failed under Sphinx 8+ with:

```
sphinx-build: error: argument --tag/-t: expected one argument
subprocess.CalledProcessError: Command '(..., '-m', 'sphinx', '-t', '-D', 'smv_metadata_path=...', ..., 'manual')' returned non-zero exit status 2.
make: *** [multiversion] Error 1
```

sphinx-multiversion's arg forwarding splits `-t manual`, sending `-t` into the options slot and `manual` to the trailing FILENAMES positional.

Sphinx 7 silently tolerated the dangling `-t`; Sphinx 8+'s stricter
argparse CLI rejects it. Instead, it now reads FLAGS from an env variable.

How to test:

````
make multiversion
make FLAG=opensource multiversion
````

Both complete and switch variants correctly.

chore: rm empty lines

Closes scylladb/scylladb#29472
2026-04-15 14:40:15 +03:00
dependabot[bot]
280ffe107f build(deps): bump sphinx-multiversion-scylla in /docs
Bumps [sphinx-multiversion-scylla](https://holzhaus.github.io/sphinx-multiversion/) from 0.3.7 to 0.3.8.

---
updated-dependencies:
- dependency-name: sphinx-multiversion-scylla
  dependency-version: 0.3.8
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>

Closes scylladb/scylladb#29466
2026-04-15 14:40:15 +03:00
Raphael S. Carvalho
1529605b32 logstor: Fix dangling reference captures and shadowed loc variable
Three bugs fixed in segment_manager.cc:

1. write_to_separator(): captured [&index] where index was a local
   coroutine-frame reference. The future is stored in
   buf.pending_updates and resolved later in flush_separator_buffer(),
   by which time the enclosing coroutine frame is destroyed, making
   &index a dangling pointer. This is a use-after-free that manifests
   as a segfault. Fix: capture index_ptr (raw pointer by value) instead.

2. add_segment_to_compaction_group(): same dangling [&index] pattern
   inside the for_each_live_record lambda during recovery. Same fix
   applied.

3. write(): local 'auto loc = seg->allocate(...)' shadowed the outer
   'log_location loc', causing the function to always return a
   zero-initialized log_location{}. Fix: remove 'auto' so the
   assignment targets the outer variable.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29451
2026-04-15 14:40:15 +03:00
Tomasz Grabiec
266a225416 utils: avoid exceptions in disk_space_monitor polling loop
The poll loop used condition_variable::wait(timeout) to sleep between
iterations. On every normal timeout expiry, this threw a
condition_variable_timed_out exception, which incremented the C++
exception metric and triggered false alerts for support.

Replace the timed wait with a seastar::timer that broadcasts the
condition variable on expiry, combined with an untimed wait(). The
timer is cancelled automatically on scope exit when the wait is woken
early by trigger_poll() or abort.
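
The same idea can be sketched with Python's asyncio as an analogue (this is not Seastar code, and `poll_loop` and `demo` are hypothetical names): instead of a timed wait that signals normal expiry through an exception, arm a timer that wakes the waiter and then wait without a timeout.

```python
import asyncio


async def poll_loop(trigger: asyncio.Event, interval: float, iterations: int, poll):
    """Call poll() `iterations` times, sleeping between calls without exceptions."""
    loop = asyncio.get_running_loop()
    for _ in range(iterations):
        poll()
        # Arm a timer that wakes the waiter on expiry, then wait untimed.
        # A normal expiry is an ordinary wake-up, not an exception.
        timer = loop.call_later(interval, trigger.set)
        try:
            await trigger.wait()
        finally:
            # If something else set the trigger early, the timer is cancelled
            # here; cancelling an already fired timer is harmless.
            timer.cancel()
            trigger.clear()


async def demo() -> int:
    trigger = asyncio.Event()
    polls = []
    await poll_loop(trigger, 0.01, 3, lambda: polls.append(1))
    return len(polls)


print(asyncio.run(demo()))
```

Setting the event from another task plays the role of trigger_poll() here: the wait returns early and the pending timer is cancelled in the `finally` block.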

Fixes SCYLLADB-1477

Closes scylladb/scylladb#29438
2026-04-15 14:40:15 +03:00
Pavel Emelyanov
a428472e50 db: Remove redundant enable_logstor config option
The enable_logstor configuration option is redundant with the 'logstor'
experimental feature flag. Consolidate to a single gate: use the
experimental feature to control both whether logstor is available for
table creation and whether it is initialized at database startup.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29427
2026-04-15 14:40:15 +03:00
Botond Dénes
87eb20ba33 Merge 'cql: Include parallelized queries in the scylla_cql_select_partition_range_scan_no_bypass_cache metric' from Tomasz Grabiec
This metric is used to catch execution of scans that go through the row
cache, which can hurt performance.

Since f344bd0aaa, aggregate queries go
via a new statement class: parallelized_select_statement. This class
inherits from select_statement directly rather than from
primary_key_select_statement. The range scan detection logic
(_range_scan, _range_scan_no_bypass_cache) was only in
primary_key_select_statement's constructor, so parallelized queries
were not counted in select_partition_range_scan and
select_partition_range_scan_no_bypass_cache metrics.

Fix by moving the range scan detection into select_statement's
constructor, so that all subclasses get it.
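
A minimal sketch of why moving the detection into the base constructor fixes the counting. The Python classes below only mirror the C++ class names for illustration; the constructor arguments and the `record` helper are assumptions.

```python
from collections import Counter


class SelectStatement:
    def __init__(self, is_range_scan: bool, bypass_cache: bool):
        # Detection lives in the base constructor, so every subclass gets it.
        self.range_scan = is_range_scan
        self.range_scan_no_bypass_cache = is_range_scan and not bypass_cache


class PrimaryKeySelectStatement(SelectStatement):
    pass


class ParallelizedSelectStatement(SelectStatement):
    # Inherits directly from SelectStatement, as described above.
    pass


def record(stats: Counter, stmt: SelectStatement) -> None:
    """Update the two range-scan counters for an executed statement."""
    if stmt.range_scan:
        stats["select_partition_range_scan"] += 1
    if stmt.range_scan_no_bypass_cache:
        stats["select_partition_range_scan_no_bypass_cache"] += 1


stats = Counter()
record(stats, ParallelizedSelectStatement(is_range_scan=True, bypass_cache=False))
record(stats, PrimaryKeySelectStatement(is_range_scan=True, bypass_cache=True))
print(dict(stats))
```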

No backport: enhancement

Closes scylladb/scylladb#29422

* github.com:scylladb/scylladb:
  cql: Include parallelized queries in the scylla_cql_select_partition_range_scan_no_bypass_cache metric
  test: cluster: dtest: Fix double-counting of metrics
2026-04-15 14:40:15 +03:00
Botond Dénes
aecb6b1d76 Merge 'auth: sanitize {USER} substitution in LDAP URL template' from Piotr Smaron
`LDAPRoleManager` interpolated usernames directly into `ldap_url_template`,
allowing LDAP filter injection and URL structure manipulation via crafted
usernames.

This PR adds two layers of encoding when substituting `{USER}`:
1. **RFC 4515 filter escaping** — neutralises `*`, `(`, `)`, `\`, NUL
2. **URL percent-encoding** — prevents `%`, `?`, `#` from breaking
   `ldap_url_parse`'s component splitting or undoing the filter escaping

It also adds `validate_query_template()` at startup to reject templates
that place `{USER}` outside the filter component (e.g. in the host or
base DN), where filter escaping would be the wrong defense.

Compatibility note:
Templates with `{USER}` in the host, base DN, attributes, or extensions
were previously silently accepted. They are now rejected at startup with
a descriptive error. Only templates with `{USER}` in the filter component
(after the third `?`) are valid.
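
The two encoding layers and the template check can be sketched as follows. This is a Python illustration under stated assumptions, not the LDAPRoleManager code: `escape_filter_value`, `substitute_user` and `check_template` are hypothetical helpers, and the check is a simplification that only counts `?` separators.

```python
from urllib.parse import quote

# RFC 4515: these characters are written as a backslash followed by two
# hex digits when they appear inside a filter assertion value.
_FILTER_ESCAPES = {
    "\\": r"\5c",
    "*": r"\2a",
    "(": r"\28",
    ")": r"\29",
    "\0": r"\00",
}


def escape_filter_value(value: str) -> str:
    """Layer 1: neutralise LDAP filter metacharacters."""
    return "".join(_FILTER_ESCAPES.get(ch, ch) for ch in value)


def substitute_user(template: str, username: str) -> str:
    """Filter-escape first, then percent-encode, then substitute {USER}."""
    escaped = escape_filter_value(username)
    # safe="" makes quote() encode every reserved character, including
    # '%', '?', '#' and the backslashes introduced by layer 1.
    return template.replace("{USER}", quote(escaped, safe=""))


def check_template(template: str) -> None:
    """Reject templates that use {USER} outside the filter component."""
    # scheme://host/dn?attributes?scope?filter?extensions: the filter is
    # the component after the third '?'.
    for position, part in enumerate(template.split("?")):
        if "{USER}" in part and position != 3:
            raise ValueError("{USER} is only allowed in the filter component")


template = "ldap://localhost/dc=example?cn?sub?(uid={USER})"
check_template(template)
print(substitute_user(template, "bob*)(uid=*"))
```

The order matters: percent-encoding runs last so that the URL parser hands the filter-escaped value, backslashes included, to the filter parser unchanged.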

Fixes: SCYLLADB-1309

Due to its severity, this should be backported to all maintained versions.

Closes scylladb/scylladb#29388

* github.com:scylladb/scylladb:
  auth: sanitize {USER} substitution in LDAP URL templates
  test/ldap: add LDAP filter-injection reproducers
2026-04-15 14:40:15 +03:00
Artsiom Mishuta
146a67cf6f test: explicitly wait for schema agreement in create_new_test_keyspace
Add an explicit wait_for_schema_agreement() call after CREATE KEYSPACE
in create_new_test_keyspace to ensure all nodes have applied the schema
before proceeding.

Closes scylladb/scylladb#29371
2026-04-15 14:40:15 +03:00
Pavel Emelyanov
54e3c648a5 test/cluster/dtest: improve diagnostics in test_update_schema_while_node_is_killed
The alter_table case has a known failure where point lookups at QUORUM
return 0 rows after node2 restarts, even though:
- the schema was correctly synced (ALTER TABLE received from cluster)
- the data commitlog was replayed (21 mutations, 0 skipped)
- all 3 nodes were alive, so QUORUM (2/3) should be satisfiable by
  node1+node3 regardless of node2's state

The LIMIT 1 table scan succeeds (data is present somewhere), but
specific key lookups return empty. This points to a bug in how node2,
acting as coordinator after restart, routes single-partition reads —
most likely stale tablet routing metadata.

Add diagnostics to help distinguish data loss from a coordinator/routing
bug on the next failure:
- log which key is missing
- dump all rows visible at QUORUM
- query each node individually at ONE consistency for the missing key

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Closes scylladb/scylladb#29350
2026-04-15 14:40:15 +03:00
Piotr Szymaniak
4c93c2af62 audit/alternator: support audit_tables=alternator.<table> shorthand
The real keyspace name of an Alternator table T is "alternator_T".
Expand the "alternator.T" format used in the audit_tables config flag
to the real keyspace name at parse time, so users don't need to spell
out the internal "alternator_T.T" form.
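
A sketch of the expansion, assuming audit_tables is a comma-separated list of `keyspace.table` entries; `expand_audit_table` and `parse_audit_tables` are illustrative names, not the actual parser:

```python
def expand_audit_table(entry: str) -> str:
    """Expand 'alternator.T' to 'alternator_T.T'; leave other entries as-is."""
    keyspace, sep, table = entry.partition(".")
    if sep and keyspace == "alternator" and table:
        return f"alternator_{table}.{table}"
    return entry


def parse_audit_tables(value: str) -> list:
    """Split the config value and expand the shorthand at parse time."""
    return [expand_audit_table(e.strip()) for e in value.split(",") if e.strip()]


print(parse_audit_tables("alternator.users, ks1.events"))
```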
2026-04-15 12:29:15 +02:00
Piotr Szymaniak
0714d8aded audit/alternator: Add negative audit tests
Add tests for the unhappy path of Alternator audit logging:
- Category filtering: operations are not logged when their category
  (DML, QUERY, DDL) is excluded from audit_categories.
- Keyspace filtering: operations on a keyspace not listed in
  audit_keyspaces are not logged.
- Error entries: a failed operation (thrown exception after audit_info
  is set) produces an audit entry with error=true.
- Empty-keyspace bypass: global operations like ListTables and
  DescribeEndpoints are logged regardless of audit_keyspaces because
  should_log() short-circuits on an empty keyspace.
2026-04-15 12:29:15 +02:00
Piotr Szymaniak
ad05b44931 audit/alternator: Add testing of auditing
There is a new test file created, `test/alternator/test_audit.py`.
The file contains a suite of tests of all auditing operations.
2026-04-15 12:29:15 +02:00
Piotr Szymaniak
6913efab5c audit/alternator: Audit requests
Both successful and failed requests are audited.

Each Alternator operation sets up audit metadata via an
executor::maybe_audit() helper, which checks will_log() and only
heap-allocates audit_info_alternator when auditing is enabled. DDL
and metadata operations pass no consistency level; data read/write
operations pass the actual CL used.

BatchWriteItem and BatchGetItem guard table name collection with
will_log() to avoid unnecessary work when auditing is disabled.
ListStreams audits the input table name rather than collecting
output table names during iteration. UntagResource sets up
auditing after parameter validation. Exception re-throw in
server.cc uses co_return coroutine::exception().

The chosen audit types for the operations:
- CreateTable - DDL
- DescribeTable - QUERY
- DeleteTable - DDL
- UpdateTable - DDL
- PutItem - DML
- UpdateItem - DML
- GetItem - QUERY
- DeleteItem - DML
- ListTables - QUERY
- Scan - QUERY
- DescribeEndpoints - QUERY
- BatchWriteItem - DML
- BatchGetItem - QUERY
- Query - QUERY
- TagResource - DDL
- UntagResource - DDL
- ListTagsOfResource - QUERY
- UpdateTimeToLive - DDL
- DescribeTimeToLive - QUERY
- ListStreams - QUERY
- DescribeStream - QUERY
- GetShardIterator - QUERY
- GetRecords - QUERY
- DescribeContinuousBackups - QUERY
2026-04-15 11:55:42 +02:00
Piotr Szymaniak
9646ee05bd audit/alternator: Refactor in preparation for auditing Alternator
Prepare API in audit for auditing Alternator.
The API provides externally callable `inspect()` functions
for both CQL and Alternator.
Both variants unpack their parameters and then call a common
`maybe_log()`, which in turn calls `log()` when the conditions are met.
Also, while at it, (const) references are now favoured over raw
pointers.

The Alternator audit_info subclass (audit_info_alternator) carries an
optional consistency level — only data read/write operations have a
meaningful CL, while DDL and metadata queries store an empty string
in the audit table and syslog (matching the existing write_login
behavior). The storage helpers are updated accordingly.

Add a will_log(category, keyspace, table) method that checks whether
an operation should be audited (category check AND keyspace/table
filtering) without requiring a constructed audit_info object.
should_log() delegates to will_log().
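
The check can be sketched like this. `AuditFilter` is a hypothetical stand-in for the audit configuration; the order of the checks and the empty-keyspace shortcut are assumptions based on the commit descriptions, not the actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class AuditFilter:
    categories: set = field(default_factory=set)  # enabled audit categories
    keyspaces: set = field(default_factory=set)   # audited keyspaces
    tables: set = field(default_factory=set)      # audited "keyspace.table" names

    def will_log(self, category: str, keyspace: str = "", table: str = "") -> bool:
        """Decide whether to audit, without needing an audit_info object."""
        # The category must be enabled before anything else matters.
        if category not in self.categories:
            return False
        # Assumption: operations with no keyspace skip the keyspace/table
        # filter, as described for global Alternator operations.
        if not keyspace:
            return True
        return keyspace in self.keyspaces or f"{keyspace}.{table}" in self.tables


audit = AuditFilter(
    categories={"DML", "QUERY"},
    keyspaces={"ks1"},
    tables={"alternator_t.t"},
)
print(audit.will_log("DML", "ks1", "anything"))
print(audit.will_log("DDL", "ks1", "anything"))
```

Because the check needs only three plain values, a caller can skip building any audit metadata when it returns False.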
2026-04-15 11:46:44 +02:00
Tomasz Grabiec
84361194c2 test: boost: tablets: Add test for merge with arbitrary tablet count 2026-04-15 10:40:56 +02:00
Tomasz Grabiec
7af9f5366d tablets, database: Advertise 'arbitrary' layout in snapshot manifest
Currently, the manifest advertises "powof2", which is wrong for
arbitrary count and boundaries.

Introduce a new kind of layout called "arbitrary", and produce it if
the tablet map doesn't conform to "powof2" layout.

We should also produce tablet boundaries in this case, but that's
worked on in a different PR: https://github.com/scylladb/scylladb/pull/28525
2026-04-15 10:40:56 +02:00
Tomasz Grabiec
50fbac6ea6 tablets: Introduce pow2_count per-table tablet option
By default it's true, in which case tablet count of the table is
rounded up to a power of two. This option allows lifting this, in
which case the count can be arbitrary. This will allow testing the
logic of arbitrary tablet count.
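
The effect of the option on the tablet count can be sketched as below; `effective_tablet_count` is an illustrative helper, not the allocator's API:

```python
def effective_tablet_count(desired: int, pow2_count: bool = True) -> int:
    """Round the desired count up to a power of two unless pow2_count is off."""
    if desired < 1:
        raise ValueError("tablet count must be positive")
    if not pow2_count:
        return desired
    # (desired - 1).bit_length() is the exponent of the smallest power of
    # two that is greater than or equal to desired.
    return 1 << (desired - 1).bit_length()


print(effective_tablet_count(5))
print(effective_tablet_count(5, pow2_count=False))
```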
2026-04-15 10:40:56 +02:00
Tomasz Grabiec
b6a7023f68 tablets: Prepare for non-power-of-two tablet count
This is a step towards more flexibility in managing tablets.  A
prerequisite before we can split individual tablets, isolating hot
partitions, and evening-out tablet sizes by shifting boundaries.

After this patch, the system can handle tables with arbitrary tablet
count. Tablet allocator is still rounding up desired tablet count to
the nearest power of two when allocating tablets for a new table, so
unless the tablet map is allocated in some other way, the counts will
still be a power of two.

We plan to utilize arbitrary count when migrating from vnodes to
tablets, by creating a tablet map which matches vnode boundaries.

One of the reasons we don't give up on power-of-two by default yet is
that it creates an issue with merges. If tablet count is odd, one of
the tablets doesn't have a sibling and will not be merged. That can
obviously cause imbalance of token space and tablet sizes between
tablets. To limit the impact, this patch dynamically chooses which
tablet to isolate when initiating a merge. The largest tablet is
chosen, as that will minimize imbalance. Otherwise, if we always chose
the last tablet to isolate, its size would remain the same while other
tablets double in size with each odd-count merge, leading to
imbalance. The imbalance will still be there, but the difference in
tablet sizes is limited to 2x.

Example (3 tablets):
  [0] owns 1/3 of tokens
  [1] owns 1/3 of tokens
  [2] owns 1/3 of tokens

After merge:
  [0] owns 2/3 of tokens
  [1] owns 1/3 of tokens

What we would like instead:

Step 1 (split [1]):
  [0] owns 1/3 of tokens
  [1] old 1.left, owns 1/6 of tokens
  [2] old 1.right, owns 1/6 of tokens
  [3] owns 1/3 of tokens

Step 2 (merge):
  [0] owns 1/2 of tokens
  [1] owns 1/2 of tokens

To do that, we need to be able to split individual tablets, but we're
not there yet.
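
The split-then-merge arithmetic from the example can be checked with exact fractions. `split_tablet` and `merge_pairs` are illustrative helpers that model only token ownership, not the real tablet map:

```python
from fractions import Fraction


def split_tablet(shares, index):
    """Split the tablet at `index` into two halves of equal token ownership."""
    half = shares[index] / 2
    return shares[:index] + [half, half] + shares[index + 1:]


def merge_pairs(shares):
    """Merge adjacent tablets pairwise; requires an even tablet count."""
    if len(shares) % 2:
        raise ValueError("pairwise merge needs an even tablet count")
    return [shares[i] + shares[i + 1] for i in range(0, len(shares), 2)]


third = Fraction(1, 3)
shares = [third, third, third]
after_split = split_tablet(shares, 1)   # step 1: split [1]
after_merge = merge_pairs(after_split)  # step 2: merge
print([str(s) for s in after_split])
print([str(s) for s in after_merge])
```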
2026-04-15 10:40:55 +02:00
Tomasz Grabiec
f54daef4ec tablets: Implement merged tablet_map constructor on top of for_each_sibling_tablets()
This way it doesn't need to know how the scheduler chose to merge tablets.
We'll have less duplication of logic.
2026-04-15 10:40:55 +02:00
Tomasz Grabiec
66fc7967b8 tablets: Prepare resize_decision to hold data in decisions
merge decision will carry a plan - which replica to isolate.
So construction from a string will no longer do.
2026-04-15 10:40:55 +02:00
Tomasz Grabiec
d543f260bd tablets: table: Make storage_group handle arbitrary merge boundaries
We only assume that new tablets have boundaries which are equal
to some boundaries of old tablets.

In preparation for supporting arbitrary merge plan, where any replica
can be isolated (not merged with siblings) by the merge plan.
2026-04-15 10:40:55 +02:00
Nadav Har'El
022add117e test/cluster: fix flaky test test_row_ttl_scheduling_group
The test test/cluster/test_ttl_row.py::test_row_ttl_scheduling_group wants to
verify that the new CQL per-row TTL feature does all its work (expiration
scanning, deletion of expired items) on all nodes in the "streaming"
scheduling group, not in the statement scheduling group.

As originally written, the test couldn't require that it uses exactly zero
time in the statement scheduling group - because some things do happen
there - specifically the ALTER TABLE request we use to enable TTL.
So the test checked that the time in the "wrong" group is less than 0.2
of the total time, not zero.

But in one CI run, we got to exactly 0.2 and the test failed. Running
this test locally, I see the margin is pretty narrow: The test almost
always fails if I set the threshold ratio to 0.1.

The solution in this patch is to move the ALTER TABLE work to a different
scheduling group (by using an additional service level). After doing that
the CPU usage in sl:default goes down to exactly zero - not close to zero
but exactly zero.

However, it seems that there is always some rare background work in
sl:default, and in debug builds it can come out to more than 0ms (e.g., in
one test we saw 1ms), so we keep checking that sl:default is much
lower than sl:stream - not exactly zero.

Incidentally, I converted the serial loop adding the 200 rows in the
test's setup to a parallel loop, to make the test setup slightly faster.

I also added to the test a sanity check that sl:default, the scheduling
group in which we verify that TTL does zero work, is actually the scheduling
group that normal writes work in (to avoid the risk of having a test that
verifies that some irrelevant scheduling group is unsurprisingly getting
zero usage...).

Fixes SCYLLADB-1495.

Closes scylladb/scylladb#29447
2026-04-15 08:42:29 +03:00
Jenkins Promoter
3d0582d51e Update pgo profiles - aarch64 2026-04-15 05:26:22 +03:00
Jenkins Promoter
a4d3ab9f0e Update pgo profiles - x86_64 2026-04-15 04:26:28 +03:00
Tomasz Grabiec
6d510bcd1c tablets: Make stats update post-merge work with arbitrary merge boundaries
We only assume that new tablets share boundaries with some old tablets.

In preparation for supporting arbitrary merge plan, where any replica
can be isolated (not merged with siblings) by the merge plan.
2026-04-15 01:25:16 +02:00
Tomasz Grabiec
01fb97ee78 locator: tablets: Support arbitrary tablet boundaries
There are several reasons we want to do that.

One is that it will give us more flexibility in distributing the
load. We can subdivide tablets at any points, and achieve more
evenly-sized tablets. In particular, we can isolate large partitions
into separate tablets.

Another reason is vnode-to-tablet migration. We could construct a
tablet map which matches exactly the vnode boundaries, so migration
can happen transparently from the CQL-coordinator's point of view.

Implementation details:

We store a vector of tokens which represent tablet boundaries in the
tablet_id_map. tablet_id keeps its meaning, it's an index into vector
of tablets. To avoid logarithmic lookup of tablet_id from the token,
we introduce a lookup structure with power-of-two aligned buckets, and
store the tablet_id of the tablet which owns the first token in the
bucket. This way, lookup needs to consider tablet id range which
overlaps with one bucket. If boundaries are more or less aligned,
there are around 1-2 tablets overlapping with a bucket, and the lookup
is still O(1).
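
A simplified sketch of the bucketed lookup. Assumptions: tokens are unsigned integers in [0, 2^token_bits) rather than real signed 64-bit tokens, each tablet is described by its last owned token, and `TabletIdMap` here is an illustrative class, not the real tablet_id_map.

```python
from bisect import bisect_left


class TabletIdMap:
    def __init__(self, last_tokens, token_bits=16, bucket_bits=4):
        # last_tokens[i] is the last token owned by tablet i (inclusive).
        self.last_tokens = list(last_tokens)
        assert self.last_tokens == sorted(set(self.last_tokens))
        assert self.last_tokens[-1] == (1 << token_bits) - 1
        # Each bucket covers 2**shift consecutive tokens.
        self.shift = token_bits - bucket_bits
        # For every bucket, remember which tablet owns its first token.
        # The logarithmic search is paid only once, at construction time.
        self.first_tablet = [
            bisect_left(self.last_tokens, bucket << self.shift)
            for bucket in range(1 << bucket_bits)
        ]

    def get_tablet_id(self, token: int) -> int:
        # Start from the tablet owning the bucket's first token and walk
        # forward over the few tablets that overlap this bucket.
        tablet = self.first_tablet[token >> self.shift]
        while self.last_tokens[tablet] < token:
            tablet += 1
        return tablet


tablets = TabletIdMap([999, 30000, 30001, 65535])
print([tablets.get_tablet_id(t) for t in (0, 1000, 30001, 30002)])
```

The walk is short when boundaries are roughly aligned with buckets; it degrades toward a linear scan only if many tablets fall inside a single bucket.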

Amount of memory used increased, but not significantly relative to old
size (because tablet_info is currently fat):

For 131'072 tablets:

Before:

  Size of tablet_metadata in memory: 57456 KiB

After:

  Size of tablet_metadata in memory: 59504 KiB
2026-04-15 01:25:14 +02:00
Tomasz Grabiec
82acdae74b locator: tablets: Introduce tablet_map::get_split_token()
And reimplement existing split-related methods around it.
This way we avoid calling dht::compaction_group_of(), and
assuming anything about tablet boundaries or tablet count
being a power of two.

This will make later refactoring easier.
2026-04-15 01:24:48 +02:00
Tomasz Grabiec
2e1d41c206 dht: Introduce get_uniform_tokens() 2026-04-15 01:24:48 +02:00
Tomasz Grabiec
a58243bc1e Merge 'hint_sender: send hints to all tablet replicas if the tablet leaving due to RF--' from Ferenc Szili
Currently, hints that are sent to tablet replicas which are leaving due to RF-- can be lost, because `hint_sender` only checks if the destination host is leaving. To avoid this, we add a new method `effective_replication_map::is_leaving(host, token)` which checks if the tablet identified by the given token is leaving the host. This method is called by the `hint_sender` to check if the hint should be sent only to the destination host, or to all the replicas. This way, we increase consistency. For vnode-based ERMs, `is_leaving()` calls `token_metadata::is_leaving(host)`.

Fixes: SCYLLADB-287

This is an improvement, and backport is not needed.

Closes scylladb/scylladb#28770

* github.com:scylladb/scylladb:
  test: verify hints are delivered during tablet RF reduction
  hint_sender: use per-tablet is_leaving() to avoid losing hints on RF reduction
  erm: add is_leaving() to effective_replication_map
2026-04-14 22:51:34 +02:00
Tomasz Grabiec
7fe4ae16f0 Merge 'table: don't create new split compaction groups if main compaction group is disabled' from Ferenc Szili
Fixes a race condition where tablet split can crash the server during truncation.

`truncate_table_on_all_shards()` disables compaction on all existing compaction groups, then later calls `discard_sstables()` which asserts that compaction is disabled. Between these two points, tablet split can call `set_split_mode()`, which creates new compaction groups via `make_empty_group()` — these start with `compaction_disabled_counter == 0`. When `discard_sstables()` checks its assertion, it finds these new groups and fires `on_internal_error`, aborting the server.

In `storage_group::set_split_mode()`, before creating new compaction groups, check whether the main compaction group has compaction disabled. If it does, bail out early and return `false` (not ready). This is safe because the split will be retried once truncation completes and re-enables compaction.

A new regression test `test_split_emitted_during_truncate` reproduces the
exact interleaving using two error injection points:

- **`database_truncate_wait`** — pauses truncation after compaction is disabled but before `discard_sstables()` runs.
- **`tablet_split_monitor_wait`** (new, in `service/storage_service.cc`) — pauses the split monitor at the start of `process_tablet_split_candidate()`.

The test creates a single-tablet table, triggers both operations, uses the injection points to force the problematic ordering, then verifies that truncation completes successfully and the split finishes afterward.

Fixes: SCYLLADB-1035

This needs to be backported to all currently supported versions.

Closes scylladb/scylladb#29250

* github.com:scylladb/scylladb:
  test: add test_split_emitted_during_truncate
  table: fix race between tablet split and truncate
2026-04-14 22:00:40 +02:00
Avi Kivity
21d9f54a9a partition_snapshot_row_cursor: fix reversed maybe_refresh() losing latest version entry
In partition_snapshot_row_cursor::maybe_refresh(), the !is_in_latest_version()
path calls lower_bound(_position) on the latest version's rows to find the
cursor's position in that version. When lower_bound returns null (the cursor
is positioned above all entries in the latest version in table order), the code
unconditionally sets _background_continuity = true and allows the subsequent
if(!it) block to erase the latest version's entry from the heap.

This is correct for forward traversal: null means there are no more entries
ahead, so removing the version from the heap is safe.

However, in reversed mode, null from lower_bound means the cursor is above
all entries in table order -- those entries are BELOW the cursor in query
order and will be visited LATER during reversed traversal. Erasing the heap
entry permanently loses them, causing live rows to be skipped.

The fix mirrors what prepare_heap() already does correctly: when lower_bound
returns null in reversed mode, use std::prev(rows.end()) to keep the last
entry in the heap instead of erasing it.

Add test_reversed_maybe_refresh_keeps_latest_version_entry to mvcc_test,
alongside the existing reversed cursor tests. The test creates a two-version
partition snapshot (v0 with range tombstones, v1 with a live row positioned
below all v0 entries in table order), and
traverses in reverse calling maybe_refresh() at each step -- directly
exercising the buggy code path. The test fails without the fix.

The bug was introduced by 6b7473be53 ("Handle non-evictable snapshots",
2022-11-21), which added null-iterator handling for non-evictable snapshots
(memtable snapshots lack the trailing dummy entry that evictable snapshots
have). prepare_heap() got correct reversed-mode handling at that time, but
maybe_refresh() received only forward-mode logic.

The bug is intermittent because multiple mechanisms cause iterators_valid()
to return false, forcing maybe_refresh() to take the full rebuild path via
prepare_heap() (which handles reversed mode correctly):
  - Mutation cleaner merging versions in the background (changes change_mark)
  - LSA segment compaction during reserve() (invalidates references)
  - B-tree rebalancing on partition insertion (invalidates references)
  - Debug mode's always-true need_preempt() creating many multi-version
    partitions via preempted apply_monotonically()

A dtest reproducer confirmed the same root cause: with 100K overlapping range
tombstones creating a massively multi-version memtable partition (287K preemption
events), the reversed scan's latest_iterator was observed jumping discontinuously
during a version transition -- the latest version's heap entry was erased --
causing the query to walk the entire partition without finding the live row.

Fixes: SCYLLADB-1253

Closes scylladb/scylladb#29368
2026-04-14 21:50:25 +02:00
Nadav Har'El
986167a416 Merge 'cql3: fix authorization bypass via BATCH prepared cache poisoning' from Marcin Maliszkiewicz
execute_batch_without_checking_exception_message() inserted entries
into the authorized prepared cache before verifying that
check_access() succeeded. A failed BATCH therefore left behind
cached 'authorized' entries that later let a direct EXECUTE of the
same prepared statement skip the authorization check entirely.

Move the cache insertion after the access check so that entries are
only cached on success. This matches the pattern already used by
do_execute_prepared() for individual EXECUTE requests.
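
The ordering described above can be sketched as follows. This is a
simplified Python model, not the actual C++ code: the function names,
the `check_access` callback and the set-based cache are hypothetical
stand-ins chosen for illustration.

```python
class Unauthorized(Exception):
    """Raised when the access check rejects a statement."""


def execute_batch(user, statements, check_access, authorized_cache):
    # Hypothetical model of the fixed BATCH path: verify access for
    # every statement first; any failure aborts before caching.
    for stmt in statements:
        if not check_access(user, stmt):
            raise Unauthorized(stmt)
    # Only reached on success, so the cache never holds entries
    # that did not pass the access check.
    for stmt in statements:
        authorized_cache.add((user, stmt))


def execute_prepared(user, stmt, check_access, authorized_cache):
    # A cache hit skips the access check, which is why a poisoned
    # cache entry would have been an authorization bypass.
    if (user, stmt) in authorized_cache:
        return "ok (cached)"
    if not check_access(user, stmt):
        raise Unauthorized(stmt)
    authorized_cache.add((user, stmt))
    return "ok"
```

With this ordering, a rejected BATCH leaves the cache empty, so a later
direct EXECUTE of the same statement still goes through the access check.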

Introduced in 98f5e49ea8

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1221

Backport: all supported versions

Closes scylladb/scylladb#29432

* github.com:scylladb/scylladb:
  test/cqlpy: add reproducer for BATCH prepared auth cache bypass
  cql3: fix authorization bypass via BATCH prepared cache poisoning
2026-04-14 22:31:54 +03:00
Pavel Emelyanov
cec44dc68d test: Add test_describe_cluster_sanity for DESCRIBE CLUSTER validation
Add a parametrized integration test that verifies DESCRIBE CLUSTER returns correct
values in both normal and maintenance modes.

The parametrization keeps the validation logic (CQL queries and assertions)
identical for both modes, while the setup phase is mode-specific. This ensures
the same assertions apply to both cluster states:
- partitioner is org.apache.cassandra.dht.Murmur3Partitioner
- snitch is org.apache.cassandra.locator.SimpleSnitch
- cluster name matches system.local cluster_name

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-14 19:33:21 +03:00
Pavel Emelyanov
debfb147f5 describe_statement: Get cluster info from storage_service
Update cluster_describe_statement::describe() to retrieve cluster metadata
from storage_service::describe_cluster() instead of directly from db::config
or gossiper.

The storage_service provides a centralized API for accessing cluster metadata
(cluster_name, partitioner, snitch_name) that works in both normal and
maintenance modes, improving separation of concerns.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-14 19:33:06 +03:00
Pavel Emelyanov
53361358ef storage_service: Add describe_cluster() method
Add cluster_info struct containing cluster_name, partitioner, and snitch_name.
Implement describe_cluster() method to provide cluster metadata by combining
data from gossiper (cluster_name, partitioner) and snitch (snitch_name).

It will be used by the next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-14 19:29:24 +03:00
Pavel Emelyanov
0d4a8a04ec query_processor: Expose storage_service accessor
Add storage_service() method to expose the sharded storage service to callers.
To be used by the next patch.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-14 19:29:11 +03:00
Radosław Cybulski
4b984212ba alternator: improve parsing / generating of StreamArn parameter
Previously, when Alternator emitted an Amazon ARN, it did not stick to the
standard. After our attempt to run KCL with Scylla we discovered a few
issues.

Amazon's ARN looks like this:

arn:partition:service:region:account-id:resource-type/resource-id

for example:

arn:aws:dynamodb:us-west-2:111122223333:table/TestTable/stream/2015-05-11T21:21:33.291

KCL checks for:
- the ARN returned by Alternator calls must fit the basic Amazon ARN
  pattern shown above,
- region consisting only of lowercase letters and `-`, with no
  underscore character,
- account-id being only digits (exactly 12),
- service being `dynamodb`,
- partition starting with `aws`.

The patch updates our code handling ARNs to match those findings.
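
A rough sketch of the checks listed above, written as a Python regular
expression. This is not KCL's actual validation code; the pattern and the
function name are assumptions made for illustration. Digits are also
accepted in the region here, since the example ARN above uses `us-west-2`.

```python
import re

# Hypothetical approximation of the KCL checks described above.
KCL_STYLE_ARN = re.compile(
    r"^arn:"
    r"(?P<partition>aws[^:]*):"      # partition starts with "aws"
    r"(?P<service>dynamodb):"        # service must be "dynamodb"
    r"(?P<region>[a-z0-9-]+):"       # lowercase, digits, "-"; no underscore
    r"(?P<account>[0-9]{12}):"       # exactly 12 digits
    r"(?P<resource>.+)$"             # resource-type/resource-id
)


def looks_like_valid_stream_arn(arn: str) -> bool:
    """Return True if `arn` fits the basic Amazon ARN pattern."""
    return KCL_STYLE_ARN.match(arn) is not None
```

An underscore in the region, a short account-id or a different service
name all make the match fail.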

1. Split the `stream_arn` object into `stream_arn` (ARN for streams only) and
`stream_shard_id` (id value for stream shards). The latter receives the original
implementation. The former emits and parses ARNs in Amazon style.
2. Update new `stream_arn` class to encode keyspace and table together
separating them by `@`. New ARN looks like this:

arn:aws:dynamodb:us-east-1:000000000000:table/TestKeyspace@TestTable/stream/2015-05-11T21:21:33.291

3. Hardcode `dynamodb` as service, `aws` as partition, `us-east-1` as
   region and `000000000000` as account-id (must have 12 digits).
4. Update code handling ARNs for tags manipulation to be able to parse
   Amazon-style ARNs. The emitting code is left intact; the parser is now
   capable of parsing both styles.
5. Added unit tests.
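
To illustrate points 2 and 3, here is a minimal Python sketch of emitting
and parsing the new ARN shape. The helper names are hypothetical and the
real implementation is the C++ `stream_arn` class; only the hardcoded
prefix and the `@` separator are taken from the description above.

```python
# Hardcoded partition, service, region and account-id, as in point 3.
PREFIX = "arn:aws:dynamodb:us-east-1:000000000000:table/"
STREAM_MARKER = "/stream/"


def make_stream_arn(keyspace: str, table: str, label: str) -> str:
    """Emit an ARN with keyspace and table joined by '@' (hypothetical helper)."""
    return f"{PREFIX}{keyspace}@{table}{STREAM_MARKER}{label}"


def parse_stream_arn(arn: str):
    """Split an ARN produced by make_stream_arn() back into its parts."""
    if not arn.startswith(PREFIX):
        raise ValueError("unexpected ARN prefix")
    name, marker, label = arn[len(PREFIX):].partition(STREAM_MARKER)
    keyspace, at, table = name.partition("@")
    if not marker or not at or not keyspace or not table:
        raise ValueError("malformed stream ARN")
    return keyspace, table, label
```

A round trip through both helpers should return the original keyspace,
table and stream label.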

Fixes #28350
Fixes: SCYLLADB-539
Fixes: #28142

Closes scylladb/scylladb#28187
2026-04-14 18:07:05 +03:00
Marcin Maliszkiewicz
de19714763 Merge 'cql3: prepare list statements metadata_id during prepare statement, send the correct metadata_id directly to the client' from Alex Dathskovsky
This series makes result metadata handling for auth LIST statements consistent and adds coverage for the driver-visible behavior.

The first patch makes the result-column metadata construction shared across the affected statements, so the metadata shape used for PREPARE and EXECUTE stays uniform and easier to reason about.

The second patch adds regression coverage for both sides of the metadata-id flow:

- a Python auth-cluster test verifies that prepared LIST ROLES OF returns a non-empty result metadata id and that a later EXECUTE reuses it without METADATA_CHANGED
- a Boost transport test covers the recovery path where the client sends an empty request metadata id and the server responds with METADATA_CHANGED and the full metadata

Together these patches tighten the implementation and protect the prepared-metadata-id behavior exposed to drivers.

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-1218

backport: this change should be backported to all active branches to help the driver operation

Closes scylladb/scylladb#29347
2026-04-14 16:09:49 +02:00
bitpathfinder
c1315f9f1e commitlog: add test to verify segment replay order
Add a boost test that verifies commitlog segments are replayed in
ascending segment ID order within each shard. The test creates
multiple segments, triggers replay via commitlog_replayer, and
captures the "Replaying" debug log messages to verify the order.

Correct segment ordering is required by the strongly consistent
tables feature, particularly commitlog-based storage that relies
on replayed raft items being stored in order.

Ref SCYLLADB-1411.
2026-04-14 16:06:13 +02:00
bitpathfinder
c06adffd6a commitlog: fix replay order by using ordered map per shard
The commitlog replayer groups segments by shard using a
std::unordered_multimap, then iterates per-shard segments via
equal_range(). However, equal_range() does not guarantee iteration
order for elements with the same key, so segments could be replayed
out of order within a shard. This can increase memory and disk
consumption during fragmented entry reconstruction, which accumulates
fragments across segments and benefits from ascending ID order.

This is also required by the strongly consistent tables feature,
particularly commitlog-based storage that relies on replayed raft
items being stored in order.

Fix by changing the data structure from
  std::unordered_multimap<unsigned, commitlog::descriptor>
to
  std::unordered_map<unsigned, utils::chunked_vector<commitlog::descriptor>>

Since the descriptors are inserted from a std::set ordered by ID, the
vector preserves insertion (and thus ID) order. The per-shard iteration
now simply iterates the vector, guaranteeing correct replay order.
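
The idea can be modelled in a few lines of Python. This is only an
analogy for the C++ containers named above: the function name and the
(segment_id, shard) tuples are hypothetical, and `sorted()` stands in
for iterating the ID-ordered std::set.

```python
from collections import defaultdict


def group_by_shard(descriptors):
    """Group (segment_id, shard) pairs into per-shard lists in ID order."""
    per_shard = defaultdict(list)
    # Iterating in ascending segment ID mirrors inserting from an
    # ordered set; appending to a list preserves that order per shard.
    for segment_id, shard in sorted(descriptors):
        per_shard[shard].append(segment_id)
    return dict(per_shard)
```

Because each per-shard list only ever grows by appending, replaying a
shard is a plain in-order walk over its list.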

Fixes SCYLLADB-1411.
2026-04-14 16:05:17 +02:00
Anna Stuchlik
633297b15d doc: remove an outdated troubleshooting page
Fixes https://github.com/scylladb/scylladb/issues/29405

Closes scylladb/scylladb#29431
2026-04-14 15:14:32 +03:00
Ernest Zaslavsky
0eb6270c82 ci: add build system comparison workflow
Add a GitHub Actions workflow that runs scripts/compare_build_systems.py
on PRs touching build system files (configure.py, **/CMakeLists.txt,
cmake/**).
This prevents future deviations between the two build systems by
catching mismatches early in the CI pipeline.

Closes scylladb/scylladb#29426
2026-04-14 14:53:12 +03:00
Avi Kivity
4a9fdb17f0 build: cmake: fix -fno-sanitize-address-use-after-scope for CQL parser
The CMake build had -fsanitize-address-use-after-scope (enable) when
it should have been -fno-sanitize-address-use-after-scope (disable).

The comment on lines 24-25 of cql3/CMakeLists.txt explains the intent:
the use-after-scope sanitizer uses too much stack space on CqlParser
and overflows the stack. The Python-ninja path in configure.py:2801-2802
correctly had -fno-sanitize-address-use-after-scope.

Found by black-box comparison of compiler flags between the Python-ninja
and CMake build paths (ninja -nv output, debug mode, CqlParser.o):

  Python-ninja: -fno-sanitize-address-use-after-scope  (correct: disable)
  CMake:        -fsanitize-address-use-after-scope      (wrong: enable)

Closes scylladb/scylladb#29439
2026-04-14 14:48:52 +03:00
Avi Kivity
ebdfa10c8f test: fix flaky test_incremental_repair_race_window_promotes_unrepaired_data
The test waited for two "Finished tablet repair" log messages on the
coordinator, expecting one per tablet.  But there are two log sources
that emit messages matching this pattern:

  repair module (repair/repair.cc:2329):
    "Finished tablet repair for table=..."
  topology coordinator (topology_coordinator.cc:2083):
    "Finished tablet repair host=..."

When the coordinator is also a repair replica (always the case with
RF=3 and 3 nodes), both messages appear in the coordinator log for the
same tablet within 1ms of each other.  The test consumed both, thinking
both tablets were done, while the second tablet repair was still running.

From the CI failure logs:

  04:08:09.658 Found: repair[...]: Finished tablet repair for table=...
    global_tablet_id=e42fd650-3542-11f1-9756-85403784a622:0
  04:08:09.660 Found: raft_topology - Finished tablet repair host=...
    tablet=e42fd650-3542-11f1-9756-85403784a622:0

Both messages are for tablet :0.  Tablet :1 repair had not finished yet.

The test then wrote keys 20-29 while the second tablet repair was still
in progress.  That repair flushed the memtable (via
prepare_sstables_for_incremental_repair), including keys 20-29 in the
repair scan, and mark_sstable_as_repaired set repaired_at=2 on the
resulting sstable.  This caused the assertion failure on servers[0]:
  "should not have post-repair keys in repaired sstables, got:
   {20, 21, 22, 23, 24, 25, 26, 27, 28, 29}"

Fix by matching "Finished tablet repair host=" which is unique to the
topology coordinator message and avoids the ambiguity.
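
The ambiguity can be reproduced with a small Python sketch. The log
lines below are abbreviated, illustrative versions of the two message
shapes quoted above, and the helper is hypothetical (it is not the test
framework's log-waiting API).

```python
def lines_until_n_matches(lines, needle, n):
    """Return how many log lines are consumed before `needle` is seen n times."""
    seen = 0
    for consumed, line in enumerate(lines, start=1):
        if needle in line:
            seen += 1
            if seen == n:
                return consumed
    return None  # fewer than n matches in the log


# Illustrative coordinator log: two messages for tablet :0, one for :1.
COORDINATOR_LOG = [
    "repair - Finished tablet repair for table=ks.t global_tablet_id=T:0",
    "raft_topology - Finished tablet repair host=h1 tablet=T:0",
    "raft_topology - Finished tablet repair host=h1 tablet=T:1",
]
```

Waiting for two matches of "Finished tablet repair" is satisfied after
the first two lines, both for tablet :0, while the narrower
"Finished tablet repair host=" only completes once tablet :1 is logged.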

Also fix an incorrect comment that said being_repaired=null when at that
point in the test being_repaired is still set to the session_id (the
delay_end_repair_update injection prevents end_repair from running).

Fixes: SCYLLADB-1478

Closes scylladb/scylladb#29444
2026-04-14 13:32:51 +02:00
Piotr Dulikowski
9fc2c65d18 Merge 'cql3: implement WRITETIME() and TTL() of individual elements of map, set, and UDT' from Nadav Har'El
In commit 727f68e0f5 we added the ability to SELECT:

* Individual elements of a map: `SELECT map_col[key]`.
* Individual elements of a set: `SELECT set_col[key]` returns key if the key exists in the set, or null if it doesn't, which makes it possible to check whether the element exists in the set.
* Individual pieces of a UDT: `SELECT udt_col.field`.

But at the time, we didn't provide any way to retrieve the **meta-data** for this value, namely its timestamp and TTL. We did not support `SELECT WRITETIME(collection[key])`, or `SELECT WRITETIME(udt.field)`.

Users requested support for such SELECTs in the past (see issue #15427), and Cassandra 5.0 added support for this feature for maps, sets and UDTs, so we also need this feature for compatibility. This feature was also requested recently by vector-search developers, who wanted to read Alternator columns (stored as map elements, not individual columns) with their WRITETIME information.

The first four patches in this series add the feature (as four smaller patches instead of one big one), the fifth and sixth patches add tests (cqlpy and boost tests, respectively). The seventh patch adds documentation.

All the new tests pass on Cassandra 5, failed on Scylla before the present fix, and pass with it.

The fix was surprisingly difficult. Our existing implementation (from 727f68e0f5 building on earlier machinery) doesn't just "read" `map_col[key]` and allow us to return just its timestamp. Rather, the implementation reads the entire map, serializes it in some temporary format that does **not** include the timestamps and TTLs, and then takes the subscript key, at which point we no longer have the timestamp or TTL of the element. So the fix had to cross all these layers of the implementation.

While adding support for UDT fields in a pre-existing grammar nonterminal "subscriptExpr", we unintentionally added support for UDT fields also in LWT expressions (which used this nonterminal). LWT missing support for UDT fields was a long-time known compatibility issue (#13624) so we unintentionally fixed it :-) Actually, to completely fix it we needed another small change in the expression implementation, so the eighth patch in this series does this.

Fixes #15427
Fixes #13624

Closes scylladb/scylladb#29134

* github.com:scylladb/scylladb:
  cql3: support UDT fields in LWT expressions
  cql3: document WRITETIME() and TTL() for elements of map, set or UDT
  test/boost: test WRITETIME() and TTL() on map collection elements
  test/cqlpy: test WRITETIME() and TTL() on element of map, set or UDT
  cql3: prepare and evaluate WRITETIME/TTL on collection elements and UDT fields
  cql3: parse per-element timestamps/TTLs in the selection layer
  cql3: add extended wire format for per-element timestamps and TTLs
  cql3: extend WRITETIME/TTL grammar to accept collection and UDT elements
2026-04-14 12:35:46 +02:00
Dawid Pawlik
f40ab83d02 docs: document vector index metadata and duplicate handling
Document the new vector index behavior in the user-facing and developer
docs.

Describe `index_version` as a creation timeuuid stored in
`system_schema.indexes`, clarify that recreating an index changes it
while ALTER TABLE does not, and document that Scylla allows multiple
named vector indexes on the same column while still rejecting unnamed
duplicates.
2026-04-14 12:21:38 +02:00
Dawid Pawlik
800dec2180 test/cqlpy: cover vector index duplicate creation rules
Add cqlpy tests for the current CREATE INDEX behavior of vector indexes.

Cover named and unnamed duplicates, IF NOT EXISTS, coexistence of
multiple named vector indexes on the same column, interactions between
named and unnamed indexes, and the same-name-on-different-table case.
2026-04-14 12:21:38 +02:00
Marcin Maliszkiewicz
db5e4f2cb8 test/cqlpy: add reproducer for BATCH prepared auth cache bypass
An unprivileged user could bypass authorization checks by exploiting
the BATCH prepared statement cache:

1. Prepare an INSERT on a table the user has no access to
2. Execute it inside a BATCH — gets Unauthorized
3. Execute the same prepared INSERT directly — succeeds
2026-04-14 10:37:42 +02:00
Marcin Maliszkiewicz
8401e9cbbd test: filter benign errors in tests that grep logs during shutdown
Apply filter_errors() to grep_for_errors() results in
test_split_stopped_on_shutdown and
test_group0_apply_while_node_is_being_shutdown. Without filtering,
benign RPC errors like 'connection dropped: Semaphore broken' that
occur during graceful shutdown cause spurious test failures.
2026-04-13 18:33:41 +02:00
Marcin Maliszkiewicz
e78e6cd584 test: filter_errors: support list[list[str]] error groups
Accept both list[str] (from distinct_errors=True) and
list[list[str]] (from distinct_errors=False) in filter_errors(),
matching against the first line of each error group. This allows
tests that call grep_for_errors() with default arguments to
pipe results directly through filter_errors().
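
A minimal sketch of the matching rule described above. This is not the
actual filter_errors() code: the function name, the `benign_patterns`
parameter and the use of regular expressions are assumptions made for
illustration.

```python
import re


def filter_errors_sketch(errors, benign_patterns):
    """Drop errors whose first line matches any benign pattern.

    `errors` may be a list[str] or a list[list[str]]; for a group,
    only the first line is matched and the whole group is kept or dropped.
    """
    kept = []
    for err in errors:
        if isinstance(err, str):
            first_line = err
        else:
            first_line = err[0] if err else ""
        if not any(re.search(p, first_line) for p in benign_patterns):
            kept.append(err)
    return kept
```

Both input shapes go through the same loop, so callers do not need to
flatten error groups before filtering.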
2026-04-13 18:33:29 +02:00
Alex
fdce8824a5 test/cluster: cover prepared LIST metadata ids in one setup
Precompute the expected metadata-id hashes for the prepared LIST auth and
service-level statements and verify that PREPARE returns them while EXECUTE
reuses the prepared metadata without METADATA_CHANGED. Run all cases in a
single auth-cluster test after preparing the cluster, role, and service level
once through the regular manager fixture.
2026-04-13 19:13:12 +03:00
Marcin Maliszkiewicz
4d3ca041bb cql3: fix authorization bypass via BATCH prepared cache poisoning
execute_batch_without_checking_exception_message() inserted entries
into the authorized prepared cache before verifying that
check_access() succeeded. A failed BATCH therefore left behind
cached 'authorized' entries that later let a direct EXECUTE of the
same prepared statement skip the authorization check entirely.

Move the cache insertion after the access check so that entries are
only cached on success. This matches the pattern already used by
do_execute_prepared() for individual EXECUTE requests.

Introduced in 98f5e49ea8

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1221
2026-04-13 17:57:22 +02:00
Alex
0f6d9ffd22 cql: expose stable result metadata for prepared LIST statements
Prepared LIST statements did not calculate result metadata in the PREPARE path and sent an empty-string hash to the client, so the metadata_id was not recalculated correctly.
This patch moves metadata construction into get_result_metadata() for the affected LIST statements and reuses that metadata when building the result set.
This gives PREPARE a stable metadata id for LIST ROLES, LIST USERS, LIST PERMISSIONS and the service-level variants.
This patch also adds a new boost test that verifies that when an EXECUTE request carries an empty result metadata id while the server has a real metadata id for the result set, the response is marked METADATA_CHANGED and includes the full result metadata plus the server metadata id.
This covers the recovery path for clients that send an empty or otherwise unusable metadata id instead of a matching cached one.
2026-04-13 17:49:27 +03:00
Dawid Pawlik
63b782451e vector_index: allow multiple named indexes on one column
Allow creating multiple named vector indexes on the same column while
still rejecting duplicate unnamed ones.

`index_metadata::equals_noname()` now ignores `index_version`,
which is unique for every vector index creation, so duplicate detection
keeps working for unnamed vector indexes.

CREATE INDEX keeps using structural duplicate detection for regular
indexes and unnamed vector indexes, but named vector indexes are checked
by name only.

The explicit name check is also needed for IF NOT EXISTS when the same
index name already exists on a different table in the same keyspace,
because vector indexes have no backing view table to catch that case.
2026-04-13 15:04:59 +02:00
Ferenc Szili
e904e7a715 test: add test_split_emitted_during_truncate
Add a regression test that reproduces the race between tablet split and
truncation. The test:

1. Creates a single-tablet table and inserts data.
2. Triggers truncation and pauses it (via database_truncate_wait) after
   compaction is disabled but before discard_sstables() runs.
3. Triggers tablet split and pauses it (via tablet_split_monitor_wait)
   at the start of process_tablet_split_candidate().
4. Releases split so set_split_mode() creates new compaction groups.
5. Waits for the set_split_mode log confirming the groups exist.
6. Releases truncation so discard_sstables() encounters the new groups.
7. Verifies truncation completes and split finishes.

Adds a tablet_split_monitor_wait error injection point in
process_tablet_split_candidate() to allow pausing the split monitor
before it enters the split loop.
2026-04-13 11:05:03 +02:00
Ferenc Szili
13d9561398 table: fix race between tablet split and truncate
Tablet split can call set_split_mode() between the point where
truncate_table_on_all_shards() disables compaction on all existing
compaction groups and the point where discard_sstables() checks that
compaction is disabled. The new split-ready compaction groups created
by set_split_mode() won't have compaction disabled, causing
discard_sstables() to fire on_internal_error.

Fix by preventing set_split_mode() from creating new compaction groups
when compaction is disabled on the main group. If truncation has
already disabled compaction, split will simply report not-ready rather
than creating groups which have compaction enabled.

This is safe because split will be retried once truncation completes
and re-enables compaction.
2026-04-13 11:04:38 +02:00
Nadav Har'El
33dbb63aef cql3: support UDT fields in LWT expressions
In an earlier patch, we used the CQL grammar's "subscriptExpr" in
the rule for WRITETIME() and TTL(). But since we also wanted these
to support UDT fields (x.a), not just collection subscripts (x[3]),
we expanded subscriptExpr to also support the field syntax.

But LWT expressions already used this subscriptExpr, which meant
that LWT expressions unintentionally gained support for UDT fields.
Missing support for UDT fields in LWT is a long-standing known
Cassandra-compatibility bug (#13624), and now our grammar finally
supports the missing syntax.

But supporting the syntax is not enough for correct implementation
of this feature - we also need to fix the expression handling:

Two bugs prevented expressions like `v.a = 0` from working in LWT IF
clauses, where `v` is a column of user-defined type.

The first bug was in get_lhs_receiver() in prepare_expr.cc: it lacked
a handler for field_selection nodes, causing an "unexpected expression"
internal error when preparing a condition like `IF v.a = 0`. The fix
adds a handler that returns a column_specification whose type is taken
from the prepared field_selection's type field.

The second bug was in search_and_replace() in expression.cc: when
recursing into a field_selection node it reconstructed it with only
`structure` and `field`, silently dropping the `field_idx` and `type`
fields that are set during preparation. As a result, any transformation
that uses search_and_replace() on a prepared expression containing a
field_selection — such as adjust_for_collection_as_maps() called from
column_condition_prepare() — would zero out those fields. At evaluation
time, type_of() on the field_selection returned a null data_type
pointer, causing a segmentation fault when the comparison operator tried
to call ->equal() through it. The fix preserves field_idx and type when
reconstructing the node.
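
The second bug is easy to reproduce in miniature. The Python dataclass
below is a hypothetical stand-in for the C++ field_selection node; only
the four field names come from the description above.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class FieldSelection:
    """Toy model of a field_selection expression node."""
    structure: object
    field: str
    field_idx: Optional[int] = None   # set during preparation
    type: Optional[str] = None        # set during preparation


def rebuild_lossy(node, new_structure):
    # Mirrors the bug: only `structure` and `field` are carried over,
    # so the prepared fields silently fall back to None.
    return FieldSelection(structure=new_structure, field=node.field)


def rebuild_preserving(node, new_structure):
    # Mirrors the fix: copy the node and change only `structure`.
    return replace(node, structure=new_structure)
```

In the lossy variant the rebuilt node has no type, which corresponds to
the null data_type pointer seen at evaluation time.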

Fixes #13624.
2026-04-12 14:28:01 +03:00
Nadav Har'El
bb2fb810bb cql3: document WRITETIME() and TTL() for elements of map, set or UDT
Add to the SELECT documentation (docs/cql/dml/select.rst) documentation
of the new ability to select WRITETIME() and TTL() of a single element
of map, set or UDT.

Also in the TTL documentation (docs/cql/time-to-live.rst), which already
had a section on "TTL for a collection", add a mention of the ability
to read a single element's TTL(), and an example.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-12 14:28:01 +03:00
Nadav Har'El
a544dae047 test/boost: test WRITETIME() and TTL() on map collection elements
Add tests in test/boost/expr_test.cc for the low-level implementation
of writetime() and ttl() on a map element.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-12 14:28:01 +03:00
Nadav Har'El
ccb94618cc test/cqlpy: test WRITETIME() and TTL() on element of map, set or UDT
This patch adds many tests verifying the behavior of WRITETIME() and
TTL() on individual elements of maps, sets and UDTs, serving as a
regression test for issue #15427. We also add tests verifying our
understanding of related issues like WRITETIME() and TTL() of entire
collections and of individual elements of *frozen* collections.

All new tests pass on Cassandra 5.0, helping to verify that our
implementation is compatible with Cassandra. They also pass on
ScyllaDB after the previous patch (most didn't before that patch).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-04-12 14:27:40 +03:00
Nadav Har'El
35e807a36c cql3: prepare and evaluate WRITETIME/TTL on collection elements and UDT fields
Complete the implementation of SELECT WRITETIME(col[key])/TTL(col[key])
and WRITETIME(col.field)/TTL(col.field), building on the grammar (commit 1),
wire format (commit 2), and selection-layer (commit 3) changes in the
preceding patches.

* prepare_column_mutation_attribute() (prepare_expr.cc) now handles the
  subscript and field_selection nodes that the grammar produces:
  - For subscripts, it validates that the inner column is a non-frozen
    map or set and checks the 'writetime_ttl_individual_element' feature
    flag so the feature is rejected during rolling upgrades.
  - For field selections, it validates that the inner column is a
    non-frozen UDT, with the same feature-flag check.

* do_evaluate(column_mutation_attribute) (expression.cc) handles the
  same two cases. For a field selection it serializes the field index as
  a key and looks it up in collection_element_metadata; for a subscript
  it evaluates the subscript key and looks it up in the same map.
  A missing key (element not found or expired) returns NULL, matching
  Cassandra behavior.

Together with the preceding three patches, this finally fixes #15427.

The next three patches will add tests and documentation for the new
feature, and the final eighth patch will fix the implementation of
UDT fields in LWT expressions - which the first patch made the grammar
allow but is still not implemented correctly.
2026-04-12 13:28:28 +03:00
Nadav Har'El
4ac63de063 cql3: parse per-element timestamps/TTLs in the selection layer
Wire up the selection and result-set infrastructure to consume the
extended collection wire format introduced in the previous patch and
expose per-element timestamps and TTLs to the expression evaluator.

* Add collection_cell_metadata: maps from raw element-key bytes to
  timestamp and remaining TTL, one entry per collection or UDT cell.
  Add a corresponding collection_element_metadata span to
  evaluation_inputs so that evaluators can access it.

* Add a flag _collect_collection_timestamps to selection (selection.hh/cc).
  When any selected expression contains a WRITETIME(col[key])/TTL(col[key])
  or WRITETIME(col.field)/TTL(col.field) attribute, the flag is set and
  the send_collection_timestamps partition-slice option is enabled,
  causing storage nodes to use the extended wire format from the
  previous patch.

* Implement result_set_builder::add_collection() (selection.cc): when
  _collect_collection_timestamps is set, parse the extended format,
  decode per-element timestamps and remaining TTLs (computed from the
  stored expiry time and the query time), and store them in
  _collection_element_metadata indexed by column position.  When the
  flag is not set, the existing plain-bytes path is unchanged.

After this patch, the new selection feature is still not available to
the end-user because the prepare step still forbids it. The next patch
will finally complete the expression preparation and evaluation.
It will read the new collection_element_metadata and return the correct
timestamp or TTL value.
2026-04-12 12:51:06 +03:00
Nadav Har'El
bb63db34e5 cql3: add extended wire format for per-element timestamps and TTLs
Introduce the infrastructure needed to transport per-element timestamps
and TTL expiry times from replicas to coordinators, required for
WRITETIME(col[key]) / TTL(col[key]) and WRITETIME(col.field) /
TTL(col.field).

* Add a 'writetime_ttl_individual_element' cluster feature flag that
  guards usage of the new wire format during rolling upgrades: the
  extended format is only emitted and consumed when every node in the
  cluster supports it.

* Implement serialize_for_cql_with_timestamps() (types/types.cc), a
  variant of serialize_for_cql() that appends a per-element section to
  the regular CQL bytes, listing each live element's serialized key,
  timestamp, and expiry.  The format is:
    [uint32 cql_len][cql bytes]
    [int32  entry_count]
    [per entry: (int32 key_len)(key bytes)(int64 timestamp)(int64 expiry)]
  expiry is -1 when the element has no TTL.

* Add partition_slice::option::send_collection_timestamps and modify
  write_cell() (mutation_partition.cc) to use the new function
  serialize_for_cql_with_timestamps() when this option is available.
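
The layout above can be sketched with Python's struct module. This is an
illustration of the described format only, not the C++ implementation:
the function names are hypothetical and big-endian byte order is an
assumption, since the description does not state the byte order.

```python
import struct


def pack_with_timestamps(cql_bytes, entries):
    """Append a per-element section to the regular CQL bytes.

    `entries` is a list of (key_bytes, timestamp, expiry) tuples;
    expiry is -1 when the element has no TTL.
    """
    out = [struct.pack(">I", len(cql_bytes)), cql_bytes,
           struct.pack(">i", len(entries))]
    for key, timestamp, expiry in entries:
        out.append(struct.pack(">i", len(key)))
        out.append(key)
        out.append(struct.pack(">qq", timestamp, expiry))
    return b"".join(out)


def unpack_with_timestamps(buf):
    """Parse the extended format back into (cql_bytes, entries)."""
    (cql_len,) = struct.unpack_from(">I", buf, 0)
    pos = 4
    cql_bytes = buf[pos:pos + cql_len]
    pos += cql_len
    (count,) = struct.unpack_from(">i", buf, pos)
    pos += 4
    entries = []
    for _ in range(count):
        (key_len,) = struct.unpack_from(">i", buf, pos)
        pos += 4
        key = buf[pos:pos + key_len]
        pos += key_len
        timestamp, expiry = struct.unpack_from(">qq", buf, pos)
        pos += 16
        entries.append((key, timestamp, expiry))
    return cql_bytes, entries
```

Keeping the plain CQL bytes as a length-prefixed first section means a
reader can still recover the ordinary value before looking at the
per-element entries.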

This commit stands alone with no user-visible effect: nothing yet sets
the new partition-slice option.  The next patch adds the selection-layer
code that sets the option and parses the extended response.
2026-04-12 11:49:06 +03:00
Nadav Har'El
38b675737d cql3: extend WRITETIME/TTL grammar to accept collection and UDT elements
Previously, WRITETIME() and TTL() only accepted a simple column name
(cident), so WRITETIME(m['key']) or WRITETIME(x.a) was a syntax error.
This patch begins to implement support for applying WRITETIME() and
TTL() to individual elements of a non-frozen map, set or UDT, as
requested in issue #15427.

On its own this commit only changes the parser (Cql.g). The prepare
step still rejects subscript and field-selection nodes with an
invalid_request_exception, so there is no user-visible behavior change
yet - just that a syntax error is replaced by a different error.

Upcoming patches add the extended wire format for per-element timestamps
(commit 2), the selection layer that consumes it (commit 3), and the
prepare/evaluate logic that ties everything together (commit 4), after
which WRITETIME(col[key]) and TTL(col[key]) for collection or UDT elements
will finally be fully functional.

The parser change in this patch expands the subscriptExpr rule to
support the col.field syntax, not only col[key]. This change also
allows the UDT field syntax to be used in LWT conditions, which is
another long-standing missing feature (#13624); but to correctly
support this feature we'll need an additional patch to fix a couple
of remaining bugs - this will be the eighth commit in this series.
2026-04-12 11:10:23 +03:00
Benny Halevy
e4f0539acf query: result_set: change row member to a chunked vector
To prevent large memory allocations.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-04-12 10:00:49 +03:00
Benny Halevy
b433a5bcf8 query: result_set_row: make noexcept
Remove the const specifier from the result_set_row._cells member to make
the class nothrow_move_constructible and nothrow_move_assignable.

To be used later in query result_set and friends.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-04-12 10:00:39 +03:00
Benny Halevy
c0607110c4 query: non_null_data_value: assert is_nothrow_move_constructible and assignable
To be used later in query result_set{row,} and friends.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-04-12 10:00:34 +03:00
Benny Halevy
afa438d60d types: data_value: assert is_nothrow_move_constructible and assignable
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2026-04-12 10:00:13 +03:00
Piotr Smaron
477353b15c auth: sanitize {USER} substitution in LDAP URL templates
LDAPRoleManager interpolated usernames directly into ldap_url_template.
That allowed LDAP filter metacharacters to change the query, and URL
metacharacters such as %, ?, and # to change how ldap_url_parse()
split the URL.

Apply two layers of encoding when substituting {USER}:
  1. RFC 4515 filter escaping -- neutralises filter operators.
  2. URL percent-encoding    -- prevents ldap_url_parse from
     misinterpreting %-sequences, ? delimiters, or # fragments.
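
The two layers can be sketched in Python as below. This is not the
LDAPRoleManager code: the helper names are hypothetical, the set of
escaped filter characters follows RFC 4515, and percent-encoding every
reserved character via urllib is an assumption about how strict the
second layer is.

```python
from urllib.parse import quote


def escape_ldap_filter(value: str) -> str:
    """Layer 1: RFC 4515 escaping of filter metacharacters as \\XX hex."""
    out = []
    for ch in value:
        if ch in "\\*()\x00":
            out.append("\\%02x" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)


def substitute_user(template: str, username: str) -> str:
    """Apply filter escaping, then URL percent-encoding, then substitute."""
    # Layer 2: safe="" also encodes '%', '?', '#' and the backslash
    # introduced by layer 1, so URL parsing cannot be redirected.
    encoded = quote(escape_ldap_filter(username), safe="")
    return template.replace("{USER}", encoded)
```

The order matters: the filter escaping runs first, so the URL layer
encodes its backslashes and they are restored when the URL is decoded.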

Add validate_query_template() (called from start()) which uses a
sentinel round-trip through ldap_url_parse to reject templates
that place {USER} outside the filter component.  Templates that
previously placed {USER} in the host or base DN were silently
accepted; they are now rejected at startup with a descriptive
error.

Change parse_url() to take const sstring& instead of string_view
to enforce the null-termination requirement of ldap_url_parse()
at the type level.

Add regression coverage for %2a, ?, #, and invalid {USER}
placement in the base DN, host, attributes, and extensions.

Update LDAP authorization docs to document the escaping behavior
and the {USER} placement restriction.

Fixes: SCYLLADB-1309
2026-04-10 14:00:47 +02:00
Dawid Pawlik
2dd8eef38c vector_index: store index_version as creation timeuuid
Vector indexes currently store the base table schema version in
`index_version`. That value is name-based, not time-based,
so it does not represent when the index was created.

Store a timeuuid instead and change the relevant interfaces from
`table_schema_version` to `utils::UUID`. This is a prerequisite
for supporting multiple vector indexes on the same column where
the oldest index must be selected deterministically via routing
implemented in Vector Store.

Update the cqlpy tests to check the new semantics directly:
recreating the index changes `index_version`, while ALTER TABLE does not.
2026-04-10 13:05:21 +02:00
Tomasz Grabiec
88bea5aaf3 cql: Include parallelized queries in the scylla_cql_select_partition_range_scan_no_bypass_cache metric
This metric is used to catch scans that go through the row
cache, which can hurt performance.

Since f344bd0aaa, aggregate queries go
via a new statement class: parallelized_select_statement. This class
inherits from select_statement directly rather than from
primary_key_select_statement. The range scan detection logic
(_range_scan, _range_scan_no_bypass_cache) was only in
primary_key_select_statement's constructor, so parallelized queries
were not counted in select_partition_range_scan and
select_partition_range_scan_no_bypass_cache metrics.

Fix by moving the range scan detection into select_statement's
constructor, so that all subclasses get it.
2026-04-10 02:12:48 +02:00
Tomasz Grabiec
dc95d26464 test: cluster: dtest: Fix double-counting of metrics
get_node_metrics() in test/cluster/dtest/tools/metrics.py used
re.search(metric_name, metric) to match Prometheus metric lines. The
metric name select_partition_range_scan is a substring of
select_partition_range_scan_no_bypass_cache. So when querying for
select_partition_range_scan, the regex matched both Prometheus lines:

scylla_cql_select_partition_range_scan{shard="0",...} 1
scylla_cql_select_partition_range_scan_no_bypass_cache{shard="0",...} 1

And because the code does metrics_res[metric_name] += val, it summed
both values, making it look like the counter was incremented
by 2 when it was actually incremented by 1. The fix appends r"[\s{]"
to the regex so the metric name must be followed by { (labels) or
whitespace (value), preventing substring matches.
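
The substring problem and the fix can be reproduced with a small standalone sketch. The sum_metric() helper and its anchored flag are hypothetical and only mimic the summing behaviour described above; they are not the code in metrics.py.

```python
import re

# Two Prometheus-style lines; the first metric name is a prefix of the second.
LINES = [
    'scylla_cql_select_partition_range_scan{shard="0"} 1',
    'scylla_cql_select_partition_range_scan_no_bypass_cache{shard="0"} 1',
]


def sum_metric(lines, metric_name, anchored=True):
    """Sum the values of all lines whose metric name matches metric_name.

    With anchored=False the bare name is used as the pattern, so a name
    that is a prefix of another metric matches both lines. With
    anchored=True the name must be followed by '{' (labels) or whitespace
    (value), which rules out the longer metric.
    """
    pattern = metric_name + (r"[\s{]" if anchored else "")
    total = 0.0
    for line in lines:
        if re.search(pattern, line):
            # The value is the last whitespace-separated field on the line.
            total += float(line.rsplit(maxsplit=1)[-1])
    return total
```

With these two lines, the unanchored pattern counts select_partition_range_scan twice, while the anchored pattern counts it once.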
2026-04-10 02:12:48 +02:00
Pavel Emelyanov
5ffd3ccc8e test_backup: Remove create_ks_and_cf helper Test
Remove the create_ks_and_cf() helper function and its now-unused import
of format_tuples(). All callers have been converted to use the new
async patterns with new_test_keyspace() and cql.run_async().

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 20:40:13 +03:00
Pavel Emelyanov
2c81e54d6d test_backup: Replace create_ks_and_cf with async patterns Test
Replace all 6 calls to create_ks_and_cf() with new async patterns:
- Use new_test_keyspace() context manager for keyspace creation
- Use cql.run_async() for CREATE TABLE statement
- Use asyncio.gather() with cql.run_async() for data insertion

The test_restore_with_non_existing_sstable only needs the ks:table
structure to exist; it doesn't use the pre-populated data.

This change makes the code more explicit and maintains proper async
semantics throughout.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 20:39:56 +03:00
Pavel Emelyanov
66d9f6e042 test_backup: Add if-True blocks for indentation Test
Add if-True blocks to wrap code that uses create_ks_and_cf() in all 6
test functions. This is a mechanical change to set up the next step
where the helper will be replaced with new async patterns. All code
after the create_ks_and_cf() call until the end of each test is now
indented under the if-True block.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
2026-04-09 20:35:21 +03:00
Piotr Smaron
ecc3bcabd4 test/ldap: add LDAP filter-injection reproducers
Add tests that reproduce LDAP filter injection via unescaped {USER}
substitution (SCYLLADB-1309).  A wildcard username ('*') matches
every group entry, and a parenthesis payload (")(uid=*") breaks the
search filter.

Extend the LDAP test fixture (ldap_server.py, slapd.conf) with
memberUid attributes and the NIS schema so the new tests can
exercise direct filter-value substitution.
2026-04-08 13:53:49 +02:00
Ferenc Szili
7b308f3aa0 test: verify hints are delivered during tablet RF reduction
Add test_hint_to_leaving_when_reducing_rf which verifies that mutations
stored as hints are delivered to the correct replicas when a tablet is
removed due to RF reduction. The test sets up a 3-node cluster with
RF=2, drops the hint for one replica via error injection, then reduces
RF to 1 while hints are pending. It asserts that the mutation is
readable after the topology change completes.

Also adds a "drop_hint_for_host" error injection point in
hint_endpoint_manager to selectively drop hints for a specific host.
2026-03-31 09:18:42 +02:00
Ferenc Szili
1d64ddbdd3 hint_sender: use per-tablet is_leaving() to avoid losing hints on RF reduction
hint_sender decides whether to send a hint directly to its destination
or to re-mutate from scratch based on token_metadata::is_leaving(),
which only checks whether the *host* is leaving the cluster. When a
tablet is dropped from a host due to RF reduction (RF--), the host
is still alive and is_leaving() returns false, so hint_sender sends
directly to a replica that will no longer own the data -- effectively
losing the hint.

Switch to the new ermp->is_leaving(host, token) which is tablet-aware.
When the destination's tablet is being migrated away *and* there are
pending endpoints, send directly (the pending endpoints will receive
the data via streaming); otherwise fall through to the re-mutate path
so all current replicas receive the mutation.
2026-03-30 15:49:59 +02:00
Ferenc Szili
7db239b2ed erm: add is_leaving() to effective_replication_map
token_metadata::is_leaving() only knows whether a *host* is leaving the
cluster, which is insufficient for tablets -- a tablet can be migrated
away from a host (e.g. during RF reduction) without the host itself
leaving.

Add a pure virtual is_leaving(host, token) to effective_replication_map
so callers can ask per-token questions. The vnode implementation
delegates to token_metadata::is_leaving() (host-level, as before). The
tablet implementation checks whether the tablet owning the token has a
transition whose leaving replica matches the given host.
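
As a rough illustration of the difference between the two implementations, here is a Python sketch. The class names, the Transition record and the modulo token-to-tablet mapping are all hypothetical simplifications; the real classes are C++ and map tokens to tablets differently.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set


@dataclass
class Transition:
    """Hypothetical record of an in-flight tablet transition."""
    leaving_host: Optional[str]


class VnodeReplicationMap:
    """Host-level answer: the token is ignored, as before."""

    def __init__(self, leaving_hosts: Set[str]):
        self._leaving_hosts = leaving_hosts

    def is_leaving(self, host: str, token: int) -> bool:
        return host in self._leaving_hosts


class TabletReplicationMap:
    """Per-token answer: look at the transition of the owning tablet."""

    def __init__(self, tablet_count: int, transitions: Dict[int, Transition]):
        self._tablet_count = tablet_count
        self._transitions = transitions

    def is_leaving(self, host: str, token: int) -> bool:
        # Assumption for this sketch: tokens map to tablets by modulo.
        tablet_id = token % self._tablet_count
        transition = self._transitions.get(tablet_id)
        return transition is not None and transition.leaving_host == host
```

In this sketch a host can be "leaving" for one token and not for another, which is exactly the case a host-level check cannot express.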
2026-03-30 15:49:01 +02:00
Aleksandra Martyniuk
166b293d06 test: add test_failed_tablet_rebuild_is_retried_on_alter
Test that an ALTER KEYSPACE statement with the current RF values
fixes the state of the replicas.
2026-03-27 17:29:31 +01:00
Aleksandra Martyniuk
9ec54a8207 test: add a test to ensure that failed rebuilds are retried 2026-03-27 17:29:31 +01:00
Aleksandra Martyniuk
200dc084c5 service: fail ALTER KEYSPACE if replicas do not satisfy the replication
An RF change of a tablet keyspace starts tablet rebuilds. Even if any of
the rebuilds is rolled back (because the pending replica was excluded),
the RF change request finishes successfully, yet we are left with too
few replicas. The next RF change request handler would then
generate a rebuild of two replicas of the same tablet. Such a transition
would not be applied, as we don't allow multiple pending replicas.
An exception would be thrown and the request would be retried infinitely,
blocking the topology coordinator.

Throw and fail the RF change request if there are not enough replicas.
The request should be retried later, after the issue is fixed
by the mechanism introduced in previous changes.
2026-03-27 17:29:26 +01:00
Aleksandra Martyniuk
7951f92270 service: retry failed tablet rebuilds
RF change of tablet keyspace starts tablet rebuilds. Even if any
of the rebuilds is rolled back (because pending replica was excluded),
rf change request finishes successfully. In this case we end up with
the state of the replicas that isn't compatible with the expected
keyspace replication.

After this change, if topology_coordinator has nothing to do, it
proceeds to check if the state of replicas reflects the keyspace
replication. If there are any mismatches, the tablet rebuilds are
scheduled. All required rebuilds of a single keyspace are scheduled
together without respecting the node's load (just as it happens
in case of keyspace rf change).
2026-03-27 17:26:45 +01:00
Aleksandra Martyniuk
6f1bba8faf service: maybe_start_tablet_migration returns std::optional<group0_guard>
maybe_start_tablet_migration takes ownership of group0_guard and
does not give it back, even if no work was done.

In the following patches, we will proceed with different operations,
if there are no migrations to be started. Thus, the guard would be needed.

Return group0_guard from maybe_start_tablet_migration if no work
was done.
2026-03-27 17:26:45 +01:00
Artsiom Mishuta
0ede308a04 test/pylib: save logs on success only during teardown phase
Previously, when --save-log-on-success was enabled, logs were saved
for every test phase (setup, call, teardown) in 3 files. Restrict it to only
the teardown phase, which contains all 3 when the test succeeds,
to avoid redundant log entries.
2026-03-19 16:35:22 +01:00
Artsiom Mishuta
cbc07569c0 test: Lower default log level from DEBUG to INFO
1. test.py — Removed --log-level=DEBUG flag from pytest args
2. test/pytest.ini — Changed log_level to INFO (that was set DEBUG in test.py), changed log_file_level from DEBUG to INFO, added clarifying comments
2026-03-19 16:32:30 +01:00
Amnon Heiman
03d7ab17c9 storage_proxy: migrate CAS contention histograms to estimated_histogram_with_max
Replace CAS contention histograms in storage proxy stats with
estimated_histogram_with_max<128> and switch metrics/API aggregation to the
new histogram path.

Introduce a dedicated cas_contention_histogram alias and use it for
cas_read_contention and cas_write_contention.

Update API histogram reduction to merge the new histogram type via
estimated_histogram_with_max_merge.

Convert API JSON serialization to explicit offsets/counts using
get_buckets_offsets() and get_buckets_counts().

Export CAS contention metrics with to_metrics_histogram(...) instead of the
legacy get_histogram(1, 8) path for consistent bucket handling.
2026-03-12 14:10:35 +01:00
Amnon Heiman
cedd049218 estimated_histogram.hh: Add bucket offset and count to approx_exponential_histogram
Add utility accessors to approx_exponential_histogram to export bucket
boundaries and bucket counts in a form suitable for display/tests when
Min < Precision causes repeated integer limits.

Add MAX compile-time constant alias for the template Max parameter.
Add get_buckets_offsets() to return bucket lower limits with duplicate
adjacent limits removed.

Add get_buckets_counts() to return counts aligned with the deduplicated
limits, merging counts from buckets that share the same lower limit.
Keep existing histogram behavior unchanged.
This new functionality is intended for API use and not for
performance-critical paths.
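
The deduplication described above can be sketched as follows. This is a Python illustration of the idea only; the function names and sample data are hypothetical and the real accessors are C++ members of approx_exponential_histogram.

```python
from typing import List, Sequence


def dedup_offsets(limits: Sequence[int]) -> List[int]:
    """Return the bucket lower limits with adjacent duplicates removed."""
    result: List[int] = []
    for limit in limits:
        if not result or result[-1] != limit:
            result.append(limit)
    return result


def merged_counts(limits: Sequence[int], counts: Sequence[int]) -> List[int]:
    """Return counts aligned with dedup_offsets(limits).

    Buckets that share the same lower limit as their predecessor have
    their counts added to the predecessor's entry.
    """
    result: List[int] = []
    previous = None
    for limit, count in zip(limits, counts):
        if result and limit == previous:
            result[-1] += count
        else:
            result.append(count)
        previous = limit
    return result


# Hypothetical raw buckets where integer rounding repeats some limits.
RAW_LIMITS = [1, 1, 2, 2, 3, 4]
RAW_COUNTS = [5, 1, 0, 2, 7, 3]
```

Both functions walk the buckets once, so the two returned lists always have the same length and the total count is preserved.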

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
2026-03-12 14:04:40 +01:00
Nadav Har'El
b411d436de config: move named_value<T> method bodies out-of-line
The previous commit added extern template declarations to suppress
named_value<T> instantiation in every translation unit, but those only
suppress non-inline members. All method bodies defined inside the class
body were inline and thus exempt from extern template, so they were
still emitted as weak symbols in every TU that used them.

Fix this by moving all named_value<T> method definitions out of the class
body in config_file.hh and into config_file_impl.hh as out-of-line template
definitions.  Since config_file_impl.hh is included only by db/config.cc,
utils/config_file.cc, sstables/compressor.cc, and
ent/encryption/encryption_config.cc, the method bodies are now compiled
in only those four TUs.

Also add the two missing explicit instantiation pairs that caused linker
errors:
- named_value<vector<object_storage_endpoint_param>> in db/config.cc
- named_value<encryption_config::string_string_map> in encryption_config.cc
2026-03-11 13:20:03 +02:00
Nadav Har'El
e0c13518ae config: suppress named_value<T> instantiation in every source file
config.hh is included by a large fraction of the codebase. It pulls in
utils/config_file.hh, whose named_value<T> template has its method
bodies defined in config_file_impl.hh. Those bodies depend on three of
the heaviest Boost headers – boost/program_options.hpp,
boost/lexical_cast.hpp, and boost/regex.hpp – as well as yaml-cpp.
Because the method bodies are reachable from config.hh, every
translation unit that includes config.hh was silently instantiating all
of named_value<T>'s methods (for each distinct T) and compiling that
Boost/yaml-cpp machinery from scratch.

Fix this by adding extern template struct declarations for all 32
distinct named_value<T> specialisations used by db::config:
- the 14 primitive/stdlib types go into utils/config_file.hh
- the 18 db-specific types (enum_option<…>, seed_provider_type, etc.)
  go into db/config.hh

Matching explicit template struct instantiation definitions are added in
db/config.cc, which is already the only translation unit that includes
config_file_impl.hh.  As a result the Boost/yaml-cpp template machinery
is compiled exactly once (in config.o) instead of being re-instantiated
in every including TU.

One subtlety: named_value<seed_provider_type> has an explicit member
specialisation of add_command_line_option.  Per [temp.expl.spec], such
a specialisation must be declared before any extern template declaration
of the enclosing class template, so a forward declaration of the
specialisation is added to config.hh ahead of the extern template line.

Also, for some of the types we explicitly instantiated in db/config.cc,
the named_value<T> constructor calls config_type_for<T>(), which we
also need to provide explicit specializations - some of them we already
had but some were missing.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2026-03-11 11:30:39 +02:00
379 changed files with 19008 additions and 5297 deletions

4
.github/CODEOWNERS vendored
View File

@@ -92,6 +92,10 @@ test/boost/querier_cache_test.cc @denesb
# PYTEST-BASED CQL TESTS
test/cqlpy/* @nyh
# TEST FRAMEWORK
test/pylib/* @xtrey
test.py @xtrey
# RAFT
raft/* @kbr-scylla @gleb-cloudius @kostja
test/raft/* @kbr-scylla @gleb-cloudius @kostja

View File

@@ -10,6 +10,9 @@ on:
types: [labeled, unlabeled]
branches: [master, next, enterprise]
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
check-commit:
runs-on: ubuntu-latest
@@ -30,7 +33,7 @@ jobs:
echo "DEFAULT_BRANCH=master" >> $GITHUB_ENV
fi
- name: Checkout repository
uses: actions/checkout@v4
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
repository: ${{ github.repository }}
ref: ${{ env.DEFAULT_BRANCH }}

View File

@@ -5,6 +5,9 @@ on:
types: [opened, reopened, edited]
branches: [branch-*]
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
check-fixes-prefix:
runs-on: ubuntu-latest
@@ -13,7 +16,7 @@ jobs:
issues: write
steps:
- name: Check PR body for "Fixes" prefix patterns
uses: actions/github-script@v7
uses: actions/github-script@ed597411d8f924073f98dfc5c65a23a2325f34cd # v8.0.0
with:
script: |
const body = context.payload.pull_request.body;

View File

@@ -12,6 +12,9 @@ on:
description: 'the md5sum for scylla executable'
value: ${{ jobs.build.outputs.md5sum }}
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
read-toolchain:
uses: ./.github/workflows/read-toolchain.yaml
@@ -24,7 +27,7 @@ jobs:
outputs:
md5sum: ${{ steps.checksum.outputs.md5sum }}
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
submodules: recursive
- name: Generate the building system

View File

@@ -9,6 +9,7 @@ env:
HEADER_CHECK_LINES: 10
LICENSE: "LicenseRef-ScyllaDB-Source-Available-1.1"
CHECKED_EXTENSIONS: ".cc .hh .py"
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
check-license-headers:
@@ -19,7 +20,7 @@ jobs:
steps:
- name: Checkout code
uses: actions/checkout@v4
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 0
@@ -40,7 +41,7 @@ jobs:
- name: Comment on PR if check fails
if: failure()
uses: actions/github-script@v7
uses: actions/github-script@ed597411d8f924073f98dfc5c65a23a2325f34cd # v8.0.0
with:
script: |
const license = '${{ env.LICENSE }}';

View File

@@ -9,6 +9,7 @@ env:
# use the development branch explicitly
CLANG_VERSION: 21
BUILD_DIR: build
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
permissions: {}
@@ -32,7 +33,7 @@ jobs:
steps:
- run: |
sudo dnf -y install git
- uses: actions/checkout@v4
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
submodules: true
- name: Install build dependencies

View File

@@ -18,6 +18,7 @@ env:
BUILD_TYPE: RelWithDebInfo
BUILD_DIR: build
CLANG_TIDY_CHECKS: '-*,bugprone-use-after-move'
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
permissions: {}
@@ -42,7 +43,7 @@ jobs:
IMAGE: ${{ needs.read-toolchain.image }}
run: |
echo ${{ needs.read-toolchain.image }}
- uses: actions/checkout@v4
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
submodules: true
- run: |

View File

@@ -7,13 +7,16 @@ on:
permissions:
issues: write
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
comment-and-close:
runs-on: ubuntu-latest
steps:
- name: Comment and close if author email is scylladb.com
uses: actions/github-script@v7
uses: actions/github-script@ed597411d8f924073f98dfc5c65a23a2325f34cd # v8.0.0
with:
github-token: ${{ secrets.GITHUB_TOKEN }}
script: |

View File

@@ -4,13 +4,15 @@ on:
branches:
- master
permissions: {}
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
codespell:
name: Check for spelling errors
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- uses: codespell-project/actions-codespell@master
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
- uses: codespell-project/actions-codespell@8f01853be192eb0f849a5c7d721450e7a467c579 # v2.2
with:
only_warn: 1
ignore_words_list: "ans,datas,fo,ser,ue,crate,nd,reenable,strat,stap,te,raison,iif,tread"

View File

@@ -0,0 +1,38 @@
name: Compare Build Systems
on:
pull_request:
branches:
- master
paths:
- 'configure.py'
- '**/CMakeLists.txt'
- 'cmake/**'
- 'scripts/compare_build_systems.py'
workflow_dispatch:
permissions:
contents: read
# cancel the in-progress run upon a repush
concurrency:
group: ${{ github.workflow }}-${{ github.ref }}
cancel-in-progress: true
jobs:
read-toolchain:
uses: ./.github/workflows/read-toolchain.yaml
compare:
name: Compare configure.py vs CMake
needs:
- read-toolchain
runs-on: ubuntu-latest
container: ${{ needs.read-toolchain.outputs.image }}
steps:
- uses: actions/checkout@v4
with:
submodules: true
- name: Compare build systems
run: |
git config --global --add safe.directory $GITHUB_WORKSPACE
python3 scripts/compare_build_systems.py --ci

View File

@@ -12,13 +12,16 @@ on:
schedule:
- cron: '0 10 * * 1' # Runs every Monday at 10:00am
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
notify_conflict_prs:
runs-on: ubuntu-latest
steps:
- name: Notify PR Authors of Conflicts
uses: actions/github-script@v7
uses: actions/github-script@ed597411d8f924073f98dfc5c65a23a2325f34cd # v8.0.0
with:
script: |
console.log("Starting conflict reminder script...");

View File

@@ -13,6 +13,9 @@ on:
permissions:
contents: read
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
lint:
runs-on: ubuntu-latest
@@ -21,12 +24,12 @@ jobs:
security-events: write
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 0
- name: Differential ShellCheck
uses: redhat-plumbers-in-action/differential-shellcheck@v5
uses: redhat-plumbers-in-action/differential-shellcheck@d965e66ec0b3b2f821f75c8eff9b12442d9a7d1e # v5.5.6
with:
severity: warning
token: ${{ secrets.GITHUB_TOKEN }}

View File

@@ -5,6 +5,7 @@ name: "Docs / Publish"
env:
FLAG: ${{ github.repository == 'scylladb/scylla-enterprise' && 'enterprise' || 'opensource' }}
DEFAULT_BRANCH: ${{ github.repository == 'scylladb/scylla-enterprise' && 'enterprise' || 'master' }}
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
on:
push:
@@ -25,17 +26,17 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
ref: ${{ env.DEFAULT_BRANCH }}
persist-credentials: false
fetch-depth: 0
- name: Set up Python
uses: actions/setup-python@v5
uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
with:
python-version: "3.12"
- name: Install uv
uses: astral-sh/setup-uv@v6
uses: astral-sh/setup-uv@cec208311dfd045dd5311c1add060b2062131d57 # v8.0.0
- name: Set up env
run: make -C docs FLAG="${{ env.FLAG }}" setupenv
- name: Build docs

View File

@@ -7,6 +7,7 @@ permissions:
env:
FLAG: ${{ github.repository == 'scylladb/scylla-enterprise' && 'enterprise' || 'opensource' }}
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
on:
pull_request:
@@ -22,16 +23,16 @@ jobs:
runs-on: ubuntu-latest
steps:
- name: Checkout
uses: actions/checkout@v4
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
persist-credentials: false
fetch-depth: 0
- name: Set up Python
uses: actions/setup-python@v5
uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
with:
python-version: "3.12"
- name: Install uv
uses: astral-sh/setup-uv@v6
uses: astral-sh/setup-uv@cec208311dfd045dd5311c1add060b2062131d57 # v8.0.0
- name: Set up env
run: make -C docs FLAG="${{ env.FLAG }}" setupenv
- name: Build docs

View File

@@ -3,6 +3,9 @@ name: Docs / Validate metrics
permissions:
contents: read
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
on:
pull_request:
branches:
@@ -21,12 +24,12 @@ jobs:
steps:
- name: Checkout code
uses: actions/checkout@v4
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
submodules: true
- name: Set up Python
uses: actions/setup-python@v6
uses: actions/setup-python@a309ff8b426b58ec0e2a45f0f869d46889d02405 # v6.2.0
with:
python-version: '3.10'

View File

@@ -13,6 +13,7 @@ env:
# supposed to be processed by idl-compiler.py, so we don't check them using the cleaner
CLEANER_DIRS: test/unit exceptions alternator api auth cdc compaction db dht gms index lang message mutation mutation_writer node_ops raft redis replica service
SEASTAR_BAD_INCLUDE_OUTPUT_PATH: build/seastar-bad-include.log
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
permissions:
contents: read
@@ -32,7 +33,7 @@ jobs:
runs-on: ubuntu-latest
container: ${{ needs.read-toolchain.outputs.image }}
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
submodules: true
- name: Generate compilation database
@@ -89,7 +90,7 @@ jobs:
| tee "$SEASTAR_BAD_INCLUDE_OUTPUT_PATH"
- run: |
echo "::remove-matcher owner=seastar-bad-include::"
- uses: actions/upload-artifact@v4
- uses: actions/upload-artifact@bbbca2ddaa5d8feaa63e36b76fdaad77386f024f # v7.0.0
with:
name: Logs
path: |

View File

@@ -7,6 +7,7 @@ on:
env:
DEFAULT_BRANCH: 'master'
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
mark-ready:
@@ -17,7 +18,7 @@ jobs:
steps:
- name: Checkout repository
uses: actions/checkout@v4
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
repository: ${{ github.repository }}
ref: ${{ env.DEFAULT_BRANCH }}

View File

@@ -5,6 +5,8 @@ on:
branches:
- master
- next
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
label:
if: github.event.pull_request.draft == false
@@ -15,7 +17,7 @@ jobs:
steps:
- name: Wait for label to be added
run: sleep 1m
- uses: mheap/github-action-required-labels@v5
- uses: mheap/github-action-required-labels@0ac283b4e65c1fb28ce6079dea5546ceca98ccbe # v5.5.2
with:
mode: minimum
count: 1

View File

@@ -7,6 +7,9 @@ on:
description: "the toolchain docker image"
value: ${{ jobs.read-toolchain.outputs.image }}
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
read-toolchain:
runs-on: ubuntu-latest
@@ -15,7 +18,7 @@ jobs:
outputs:
image: ${{ steps.read.outputs.image }}
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
sparse-checkout: tools/toolchain/image
sparse-checkout-cone-mode: false

View File

@@ -13,6 +13,7 @@ concurrency:
env:
BUILD_DIR: build
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
read-toolchain:
@@ -29,12 +30,12 @@ jobs:
- RelWithDebInfo
- Dev
steps:
- uses: actions/checkout@v4
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
submodules: true
- run: |
rm -rf seastar
- uses: actions/checkout@v4
- uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
repository: scylladb/seastar
submodules: true

View File

@@ -7,6 +7,9 @@ on:
issues:
types: [labeled, unlabeled]
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
label-sync:
if: ${{ github.repository == 'scylladb/scylladb' }}
@@ -21,7 +24,7 @@ jobs:
GITHUB_CONTEXT: ${{ toJson(github) }}
run: echo "$GITHUB_CONTEXT"
- name: Checkout repository
uses: actions/checkout@v4
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
sparse-checkout: |
.github/scripts/sync_labels.py

View File

@@ -5,7 +5,10 @@ on:
types: [opened, reopened, synchronize]
issue_comment:
types: [created]
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
trigger-ci:
runs-on: ubuntu-latest
@@ -15,7 +18,7 @@ jobs:
GITHUB_CONTEXT: ${{ toJson(github) }}
run: echo "$GITHUB_CONTEXT"
- name: Checkout PR code
uses: actions/checkout@v3
uses: actions/checkout@de0fac2e4500dabe0009e67214ff5f5447ce83dd # v6.0.2
with:
fetch-depth: 0 # Needed to access full history
ref: ${{ github.event.pull_request.head.ref }}

View File

@@ -4,13 +4,16 @@ on:
schedule:
- cron: '10 8 * * *' # Runs daily at 8 AM
env:
FORCE_JAVASCRIPT_ACTIONS_TO_NODE24: true
jobs:
reminder:
runs-on: ubuntu-latest
steps:
- name: Send reminders
uses: actions/github-script@v7
uses: actions/github-script@ed597411d8f924073f98dfc5c65a23a2325f34cd # v8.0.0
with:
script: |
const labelFilters = ['P0', 'P1', 'Field-Tier1','status/release blocker', 'status/regression'];

View File

@@ -9,6 +9,8 @@ target_sources(alternator
controller.cc
server.cc
executor.cc
executor_read.cc
executor_util.cc
stats.cc
serialization.cc
expressions.cc

View File

@@ -0,0 +1,253 @@
/*
* Copyright 2019-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
#include <map>
#include <memory>
#include <optional>
#include <string>
#include <unordered_map>
#include <variant>
#include "utils/rjson.hh"
#include "utils/overloaded_functor.hh"
#include "alternator/error.hh"
#include "alternator/expressions_types.hh"
namespace alternator {
// An attribute_path_map object is used to hold data for various attribute
// paths (parsed::path) in a hierarchy of attribute paths. Each attribute path
// has a root attribute, which is then modified by member and index operators -
// for example in "a.b[2].c" we have "a" as the root, then ".b" member, then
// "[2]" index, and finally ".c" member.
// Data can be added to an attribute_path_map using the add() function, but
// requires that attributes with data not be *overlapping* or *conflicting*:
//
// 1. Two attribute paths which are identical or an ancestor of one another
// are considered *overlapping* and not allowed. If a.b.c has data,
// we can't add more data in a.b.c or any of its descendants like a.b.c.d.
//
// 2. Two attribute paths which need the same parent to have both a member and
// an index are considered *conflicting* and not allowed. E.g., if a.b has
// data, you can't add a[1]. The meaning of adding both would be that the
// attribute a is both a map and an array, which isn't sensible.
//
// These two requirements are common to the two places where Alternator uses
// this abstraction to describe how a hierarchical item is to be transformed:
//
// 1. In ProjectionExpression: for filtering from a full top-level attribute
// only the parts the user asked for in ProjectionExpression.
//
// 2. In UpdateExpression: for taking the previous value of a top-level
// attribute, and modifying it based on the instructions the user
// wrote in UpdateExpression.
template<typename T>
class attribute_path_map_node {
public:
using data_t = T;
// We need the extra unique_ptr<> here because libstdc++ unordered_map
// doesn't work with incomplete types :-(
using members_t = std::unordered_map<std::string, std::unique_ptr<attribute_path_map_node<T>>>;
// The indexes list is sorted because DynamoDB requires handling writes
// beyond the end of a list in index order.
using indexes_t = std::map<unsigned, std::unique_ptr<attribute_path_map_node<T>>>;
// The prohibition on "overlap" and "conflict" explained above means
// that only one of data, members or indexes is non-empty.
std::optional<std::variant<data_t, members_t, indexes_t>> _content;
bool is_empty() const { return !_content; }
bool has_value() const { return _content && std::holds_alternative<data_t>(*_content); }
bool has_members() const { return _content && std::holds_alternative<members_t>(*_content); }
bool has_indexes() const { return _content && std::holds_alternative<indexes_t>(*_content); }
// get_members() assumes that has_members() is true
members_t& get_members() { return std::get<members_t>(*_content); }
const members_t& get_members() const { return std::get<members_t>(*_content); }
indexes_t& get_indexes() { return std::get<indexes_t>(*_content); }
const indexes_t& get_indexes() const { return std::get<indexes_t>(*_content); }
T& get_value() { return std::get<T>(*_content); }
const T& get_value() const { return std::get<T>(*_content); }
};
template<typename T>
using attribute_path_map = std::unordered_map<std::string, attribute_path_map_node<T>>;
using attrs_to_get_node = attribute_path_map_node<std::monostate>;
// attrs_to_get lists which top-level attribute are needed, and possibly also
// which part of the top-level attribute is really needed (when nested
// attribute paths appeared in the query).
// Most code actually uses optional<attrs_to_get>. There, a disengaged
// optional means we should get all attributes, not specific ones.
using attrs_to_get = attribute_path_map<std::monostate>;
// takes a given JSON value and drops its parts which weren't asked to be
// kept. It modifies the given JSON value, or returns false to signify that
// the entire object should be dropped.
// Note that the JSON value is assumed to be encoded using the DynamoDB
// conventions - i.e., it is really a map whose key has a type string,
// and the value is the real object.
template<typename T>
bool hierarchy_filter(rjson::value& val, const attribute_path_map_node<T>& h) {
if (!val.IsObject() || val.MemberCount() != 1) {
// This shouldn't happen. We shouldn't have stored malformed objects.
// But today Alternator does not validate the structure of nested
// documents before storing them, so this can happen on read.
throw api_error::internal(format("Malformed value object read: {}", val));
}
const char* type = val.MemberBegin()->name.GetString();
rjson::value& v = val.MemberBegin()->value;
if (h.has_members()) {
const auto& members = h.get_members();
if (type[0] != 'M' || !v.IsObject()) {
// If v is not an object (dictionary, map), none of the members
// can match.
return false;
}
rjson::value newv = rjson::empty_object();
for (auto it = v.MemberBegin(); it != v.MemberEnd(); ++it) {
std::string attr = rjson::to_string(it->name);
auto x = members.find(attr);
if (x != members.end()) {
if (x->second) {
// Only a part of this attribute is to be filtered, do it.
if (hierarchy_filter(it->value, *x->second)) {
// because newv started empty and attr are unique
// (keys of v), we can use add() here
rjson::add_with_string_name(newv, attr, std::move(it->value));
}
} else {
// The entire attribute is to be kept
rjson::add_with_string_name(newv, attr, std::move(it->value));
}
}
}
if (newv.MemberCount() == 0) {
return false;
}
v = newv;
} else if (h.has_indexes()) {
const auto& indexes = h.get_indexes();
if (type[0] != 'L' || !v.IsArray()) {
return false;
}
rjson::value newv = rjson::empty_array();
const auto& a = v.GetArray();
for (unsigned i = 0; i < v.Size(); i++) {
auto x = indexes.find(i);
if (x != indexes.end()) {
if (x->second) {
if (hierarchy_filter(a[i], *x->second)) {
rjson::push_back(newv, std::move(a[i]));
}
} else {
// The entire attribute is to be kept
rjson::push_back(newv, std::move(a[i]));
}
}
}
if (newv.Size() == 0) {
return false;
}
v = newv;
}
return true;
}
// Add a path to an attribute_path_map. Throws a validation error if the path
// "overlaps" with one already in the map (one is a sub-path of the other)
// or "conflicts" with it (both a member and an index are requested).
template<typename T>
void attribute_path_map_add(const char* source, attribute_path_map<T>& map, const parsed::path& p, T value = {}) {
using node = attribute_path_map_node<T>;
// The first step is to look for the top-level attribute (p.root()):
auto it = map.find(p.root());
if (it == map.end()) {
if (p.has_operators()) {
it = map.emplace(p.root(), node {std::nullopt}).first;
} else {
(void) map.emplace(p.root(), node {std::move(value)}).first;
// Value inserted for top-level node. We're done.
return;
}
} else if(!p.has_operators()) {
// If p is top-level and we already have it or a part of it
// in map, it's a forbidden overlapping path.
throw api_error::validation(fmt::format(
"Invalid {}: two document paths overlap at {}", source, p.root()));
} else if (it->second.has_value()) {
// If we're here, it != map.end() && p.has_operators && it->second.has_value().
// This means the top-level attribute already has a value, and we're
// trying to add a non-top-level value. It's an overlap.
throw api_error::validation(fmt::format("Invalid {}: two document paths overlap at {}", source, p.root()));
}
node* h = &it->second;
// The second step is to walk h from the top-level node to the inner node
// where we're supposed to insert the value:
for (const auto& op : p.operators()) {
std::visit(overloaded_functor {
[&] (const std::string& member) {
if (h->is_empty()) {
*h = node {typename node::members_t()};
} else if (h->has_indexes()) {
throw api_error::validation(format("Invalid {}: two document paths conflict at {}", source, p));
} else if (h->has_value()) {
throw api_error::validation(format("Invalid {}: two document paths overlap at {}", source, p));
}
typename node::members_t& members = h->get_members();
auto it = members.find(member);
if (it == members.end()) {
it = members.insert({member, std::make_unique<node>()}).first;
}
h = it->second.get();
},
[&] (unsigned index) {
if (h->is_empty()) {
*h = node {typename node::indexes_t()};
} else if (h->has_members()) {
throw api_error::validation(format("Invalid {}: two document paths conflict at {}", source, p));
} else if (h->has_value()) {
throw api_error::validation(format("Invalid {}: two document paths overlap at {}", source, p));
}
typename node::indexes_t& indexes = h->get_indexes();
auto it = indexes.find(index);
if (it == indexes.end()) {
it = indexes.insert({index, std::make_unique<node>()}).first;
}
h = it->second.get();
}
}, op);
}
// Finally, insert the value in the node h.
if (h->is_empty()) {
*h = node {std::move(value)};
} else {
throw api_error::validation(format("Invalid {}: two document paths overlap at {}", source, p));
}
}
// A very simplified version of the above function for the special case of
// adding only a top-level attribute. It's not only simpler, we also use a
// different error message, referring to a "duplicate attribute" instead of
// "overlapping paths". DynamoDB also has this distinction (errors in
// AttributesToGet refer to duplicates, not overlaps, but errors in
// ProjectionExpression refer to overlap - even if it's an exact duplicate).
template<typename T>
void attribute_path_map_add(const char* source, attribute_path_map<T>& map, const std::string& attr, T value = {}) {
using node = attribute_path_map_node<T>;
auto it = map.find(attr);
if (it == map.end()) {
map.emplace(attr, node {std::move(value)});
} else {
throw api_error::validation(fmt::format(
"Invalid {}: Duplicate attribute: {}", source, attr));
}
}
} // namespace alternator
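
The overlap and conflict rules enforced by attribute_path_map_add() above can be illustrated with a minimal, self-contained sketch. It is not the Alternator implementation: sketch_node, sketch_op, sketch_add() and try_add() are hypothetical names, the node uses a plain enum tag instead of std::optional<std::variant<...>>, it stores no payload, and it throws local exception types rather than api_error::validation.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <variant>
#include <vector>

// Hypothetical, simplified stand-in for attribute_path_map_node<T>.
// A node is empty, a leaf value, a set of members, or a set of indexes.
struct sketch_node {
    enum class kind { empty, value, members, indexes };
    kind k = kind::empty;
    std::unordered_map<std::string, std::unique_ptr<sketch_node>> members;
    std::map<unsigned, std::unique_ptr<sketch_node>> indexes;
};
using sketch_map = std::unordered_map<std::string, sketch_node>;
// One path operator: ".name" (string) or "[n]" (unsigned). Hypothetical.
using sketch_op = std::variant<std::string, unsigned>;

struct overlap_error : std::runtime_error { using std::runtime_error::runtime_error; };
struct conflict_error : std::runtime_error { using std::runtime_error::runtime_error; };

// Adds root followed by ops (e.g. "a", {"b", 2u} for a.b[2]) to the map.
void sketch_add(sketch_map& map, const std::string& root, const std::vector<sketch_op>& ops) {
    auto [it, inserted] = map.try_emplace(root);
    sketch_node* h = &it->second;
    // An existing root overlaps a bare top-level path, and an existing
    // top-level value overlaps any nested path under it.
    if (!inserted && (ops.empty() || h->k == sketch_node::kind::value)) {
        throw overlap_error("two document paths overlap at " + root);
    }
    for (const sketch_op& op : ops) {
        const bool is_member = std::holds_alternative<std::string>(op);
        const auto wanted = is_member ? sketch_node::kind::members : sketch_node::kind::indexes;
        if (h->k == sketch_node::kind::empty) {
            h->k = wanted;
        } else if (h->k == sketch_node::kind::value) {
            throw overlap_error("two document paths overlap under " + root);
        } else if (h->k != wanted) {
            // The same parent would have to be both a map and a list.
            throw conflict_error("two document paths conflict under " + root);
        }
        std::unique_ptr<sketch_node>& child = is_member
                ? h->members[std::get<std::string>(op)]
                : h->indexes[std::get<unsigned>(op)];
        if (!child) {
            child = std::make_unique<sketch_node>();
        }
        h = child.get();
    }
    // The final node must still be empty, otherwise the new path is an
    // ancestor or a duplicate of an existing one.
    if (h->k != sketch_node::kind::empty) {
        throw overlap_error("two document paths overlap under " + root);
    }
    h->k = sketch_node::kind::value;
}

// Test helper: reports the outcome as a string instead of throwing.
std::string try_add(sketch_map& map, const std::string& root, const std::vector<sketch_op>& ops) {
    try {
        sketch_add(map, root, ops);
        return "ok";
    } catch (const overlap_error&) {
        return "overlap";
    } catch (const conflict_error&) {
        return "conflict";
    }
}
```

Under these assumptions, adding a.b succeeds, a later a[1] is reported as a conflict, and a later a.b.c or a bare a is reported as an overlap.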


@@ -18,6 +18,7 @@
#include "service/memory_limiter.hh"
#include "auth/service.hh"
#include "service/qos/service_level_controller.hh"
#include "vector_search/vector_store_client.hh"
using namespace seastar;
@@ -31,10 +32,12 @@ controller::controller(
sharded<service::storage_service>& ss,
sharded<service::migration_manager>& mm,
sharded<db::system_distributed_keyspace>& sys_dist_ks,
sharded<db::system_keyspace>& sys_ks,
sharded<cdc::generation_service>& cdc_gen_svc,
sharded<service::memory_limiter>& memory_limiter,
sharded<auth::service>& auth_service,
sharded<qos::service_level_controller>& sl_controller,
sharded<vector_search::vector_store_client>& vsc,
const db::config& config,
seastar::scheduling_group sg)
: protocol_server(sg)
@@ -43,10 +46,12 @@ controller::controller(
, _ss(ss)
, _mm(mm)
, _sys_dist_ks(sys_dist_ks)
, _sys_ks(sys_ks)
, _cdc_gen_svc(cdc_gen_svc)
, _memory_limiter(memory_limiter)
, _auth_service(auth_service)
, _sl_controller(sl_controller)
, _vsc(vsc)
, _config(config)
{
}
@@ -91,8 +96,8 @@ future<> controller::start_server() {
auto get_timeout_in_ms = [] (const db::config& cfg) -> utils::updateable_value<uint32_t> {
return cfg.alternator_timeout_in_ms;
};
_executor.start(std::ref(_gossiper), std::ref(_proxy), std::ref(_ss), std::ref(_mm), std::ref(_sys_dist_ks),
sharded_parameter(get_cdc_metadata, std::ref(_cdc_gen_svc)), _ssg.value(),
_executor.start(std::ref(_gossiper), std::ref(_proxy), std::ref(_ss), std::ref(_mm), std::ref(_sys_dist_ks), std::ref(_sys_ks),
sharded_parameter(get_cdc_metadata, std::ref(_cdc_gen_svc)), std::ref(_vsc), _ssg.value(),
sharded_parameter(get_timeout_in_ms, std::ref(_config))).get();
_server.start(std::ref(_executor), std::ref(_proxy), std::ref(_gossiper), std::ref(_auth_service), std::ref(_sl_controller)).get();
// Note: from this point on, if start_server() throws for any reason,


@@ -22,6 +22,7 @@ class memory_limiter;
namespace db {
class system_distributed_keyspace;
class system_keyspace;
class config;
}
@@ -43,6 +44,10 @@ namespace qos {
class service_level_controller;
}
namespace vector_search {
class vector_store_client;
}
namespace alternator {
// This is the official DynamoDB API version.
@@ -61,10 +66,12 @@ class controller : public protocol_server {
sharded<service::storage_service>& _ss;
sharded<service::migration_manager>& _mm;
sharded<db::system_distributed_keyspace>& _sys_dist_ks;
sharded<db::system_keyspace>& _sys_ks;
sharded<cdc::generation_service>& _cdc_gen_svc;
sharded<service::memory_limiter>& _memory_limiter;
sharded<auth::service>& _auth_service;
sharded<qos::service_level_controller>& _sl_controller;
sharded<vector_search::vector_store_client>& _vsc;
const db::config& _config;
std::vector<socket_address> _listen_addresses;
@@ -79,10 +86,12 @@ public:
sharded<service::storage_service>& ss,
sharded<service::migration_manager>& mm,
sharded<db::system_distributed_keyspace>& sys_dist_ks,
sharded<db::system_keyspace>& sys_ks,
sharded<cdc::generation_service>& cdc_gen_svc,
sharded<service::memory_limiter>& memory_limiter,
sharded<auth::service>& auth_service,
sharded<qos::service_level_controller>& sl_controller,
sharded<vector_search::vector_store_client>& vsc,
const db::config& config,
seastar::scheduling_group sg);

File diff suppressed because it is too large


@@ -9,7 +9,9 @@
#pragma once
#include <seastar/core/future.hh>
#include "audit/audit.hh"
#include "seastarx.hh"
#include <seastar/core/future.hh>
#include <seastar/core/sharded.hh>
#include <seastar/util/noncopyable_function.hh>
@@ -20,15 +22,23 @@
#include "db/config.hh"
#include "alternator/error.hh"
#include "stats.hh"
#include "alternator/attribute_path.hh"
#include "alternator/stats.hh"
#include "alternator/executor_util.hh"
#include "utils/rjson.hh"
#include "utils/updateable_value.hh"
#include "utils/simple_value_with_expiry.hh"
#include "tracing/trace_state.hh"
namespace db {
class system_distributed_keyspace;
class system_keyspace;
}
namespace audit {
class audit_info_alternator;
}
namespace query {
@@ -46,6 +56,10 @@ namespace service {
class storage_service;
}
namespace vector_search {
class vector_store_client;
}
namespace cdc {
class metadata;
}
@@ -58,82 +72,13 @@ class gossiper;
class schema_builder;
namespace alternator {
enum class table_status;
class rmw_operation;
class put_or_delete_item;
schema_ptr get_table(service::storage_proxy& proxy, const rjson::value& request);
bool is_alternator_keyspace(const sstring& ks_name);
// Wraps the db::get_tags_of_table and throws if the table is missing the tags extension.
const std::map<sstring, sstring>& get_tags_of_table_or_throw(schema_ptr schema);
// An attribute_path_map object is used to hold data for various attribute
// paths (parsed::path) in a hierarchy of attribute paths. Each attribute path
// has a root attribute, which is then modified by member and index operators -
// for example in "a.b[2].c" we have "a" as the root, then ".b" member, then
// "[2]" index, and finally ".c" member.
// Data can be added to an attribute_path_map using the add() function, but
// requires that attributes with data not be *overlapping* or *conflicting*:
//
// 1. Two attribute paths which are identical or an ancestor of one another
// are considered *overlapping* and not allowed. If a.b.c has data,
// we can't add more data in a.b.c or any of its descendants like a.b.c.d.
//
// 2. Two attribute paths which need the same parent to have both a member and
// an index are considered *conflicting* and not allowed. E.g., if a.b has
// data, you can't add a[1]. The meaning of adding both would be that the
// attribute a is both a map and an array, which isn't sensible.
//
// These two requirements are common to the two places where Alternator uses
// this abstraction to describe how a hierarchical item is to be transformed:
//
// 1. In ProjectionExpression: for filtering from a full top-level attribute
// only the parts the user asked for in ProjectionExpression.
//
// 2. In UpdateExpression: for taking the previous value of a top-level
// attribute, and modifying it based on the instructions the user
// wrote in UpdateExpression.
template<typename T>
class attribute_path_map_node {
public:
using data_t = T;
// We need the extra unique_ptr<> here because libstdc++ unordered_map
// doesn't work with incomplete types :-(
using members_t = std::unordered_map<std::string, std::unique_ptr<attribute_path_map_node<T>>>;
// The indexes list is sorted because DynamoDB requires handling writes
// beyond the end of a list in index order.
using indexes_t = std::map<unsigned, std::unique_ptr<attribute_path_map_node<T>>>;
// The prohibition on "overlap" and "conflict" explained above means
// that only one of data, members or indexes is non-empty.
std::optional<std::variant<data_t, members_t, indexes_t>> _content;
bool is_empty() const { return !_content; }
bool has_value() const { return _content && std::holds_alternative<data_t>(*_content); }
bool has_members() const { return _content && std::holds_alternative<members_t>(*_content); }
bool has_indexes() const { return _content && std::holds_alternative<indexes_t>(*_content); }
// get_members() assumes that has_members() is true
members_t& get_members() { return std::get<members_t>(*_content); }
const members_t& get_members() const { return std::get<members_t>(*_content); }
indexes_t& get_indexes() { return std::get<indexes_t>(*_content); }
const indexes_t& get_indexes() const { return std::get<indexes_t>(*_content); }
T& get_value() { return std::get<T>(*_content); }
const T& get_value() const { return std::get<T>(*_content); }
};
template<typename T>
using attribute_path_map = std::unordered_map<std::string, attribute_path_map_node<T>>;
using attrs_to_get_node = attribute_path_map_node<std::monostate>;
// attrs_to_get lists which top-level attributes are needed, and possibly also
// which part of the top-level attribute is really needed (when nested
// attribute paths appeared in the query).
// Most code actually uses optional<attrs_to_get>. There, a disengaged
// optional means we should get all attributes, not specific ones.
using attrs_to_get = attribute_path_map<std::monostate>;
namespace parsed {
class expression_cache;
}
@@ -144,9 +89,12 @@ class executor : public peering_sharded_service<executor> {
service::storage_proxy& _proxy;
service::migration_manager& _mm;
db::system_distributed_keyspace& _sdks;
db::system_keyspace& _system_keyspace;
cdc::metadata& _cdc_metadata;
vector_search::vector_store_client& _vsc;
utils::updateable_value<bool> _enforce_authorization;
utils::updateable_value<bool> _warn_authorization;
seastar::sharded<audit::audit>& _audit;
// An smp_service_group to be used for limiting the concurrency when
// forwarding Alternator requests between shards - if necessary for LWT.
smp_service_group _ssg;
@@ -171,7 +119,6 @@ public:
// is written in chunks to the output_stream. This allows for efficient
// handling of large responses without needing to allocate a large buffer
// in memory.
using body_writer = noncopyable_function<future<>(output_stream<char>&&)>;
using request_return_type = std::variant<std::string, body_writer, api_error>;
stats _stats;
// The metric_groups object holds this stat object's metrics registered
@@ -186,53 +133,60 @@ public:
service::storage_service& ss,
service::migration_manager& mm,
db::system_distributed_keyspace& sdks,
db::system_keyspace& system_keyspace,
cdc::metadata& cdc_metadata,
vector_search::vector_store_client& vsc,
smp_service_group ssg,
utils::updateable_value<uint32_t> default_timeout_in_ms);
~executor();
future<request_return_type> create_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> describe_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> delete_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> update_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> put_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> get_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> delete_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> update_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> list_tables(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> scan(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> describe_endpoints(client_state& client_state, service_permit permit, rjson::value request, std::string host_header);
future<request_return_type> batch_write_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> batch_get_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> query(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request);
future<request_return_type> tag_resource(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> untag_resource(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> list_tags_of_resource(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> update_time_to_live(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> describe_time_to_live(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> list_streams(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> describe_stream(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> get_shard_iterator(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> get_records(client_state& client_state, tracing::trace_state_ptr, service_permit permit, rjson::value request);
future<request_return_type> describe_continuous_backups(client_state& client_state, service_permit permit, rjson::value request);
future<request_return_type> create_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> describe_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> delete_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> update_table(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> put_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> get_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> delete_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> update_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> list_tables(client_state& client_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> scan(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> describe_endpoints(client_state& client_state, service_permit permit, rjson::value request, std::string host_header, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> batch_write_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> batch_get_item(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> query(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> tag_resource(client_state& client_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> untag_resource(client_state& client_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> list_tags_of_resource(client_state& client_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> update_time_to_live(client_state& client_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> describe_time_to_live(client_state& client_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> list_streams(client_state& client_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> describe_stream(client_state& client_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> get_shard_iterator(client_state& client_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> get_records(client_state& client_state, tracing::trace_state_ptr, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<request_return_type> describe_continuous_backups(client_state& client_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<> start();
future<> stop();
static sstring table_name(const schema&);
static db::timeout_clock::time_point default_timeout();
private:
static thread_local utils::updateable_value<uint32_t> s_default_timeout_in_ms;
public:
static schema_ptr find_table(service::storage_proxy&, std::string_view table_name);
static schema_ptr find_table(service::storage_proxy&, const rjson::value& request);
private:
friend class rmw_operation;
static void describe_key_schema(rjson::value& parent, const schema&, std::unordered_map<std::string,std::string> * = nullptr, const std::map<sstring, sstring> *tags = nullptr);
// Helper to set up auditing for an Alternator operation. Checks whether
// the operation should be audited (via will_log()) and if so, allocates
// and populates audit_info. No allocation occurs when auditing is disabled.
void maybe_audit(std::unique_ptr<audit::audit_info_alternator>& audit_info,
audit::statement_category category,
std::string_view ks_name,
std::string_view table_name,
std::string_view operation_name,
const rjson::value& request,
std::optional<db::consistency_level> cl = std::nullopt);
future<rjson::value> fill_table_description(schema_ptr schema, table_status tbl_status, service::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit);
future<executor::request_return_type> create_table_on_shard0(service::client_state&& client_state, tracing::trace_state_ptr trace_state, rjson::value request, bool enforce_authorization, bool warn_authorization, const db::tablets_mode_t::mode tablets_mode);
future<executor::request_return_type> create_table_on_shard0(service::client_state&& client_state, tracing::trace_state_ptr trace_state, rjson::value request, bool enforce_authorization,
bool warn_authorization, const db::tablets_mode_t::mode tablets_mode, std::unique_ptr<audit::audit_info_alternator>& audit_info);
future<> do_batch_write(
std::vector<std::pair<schema_ptr, put_or_delete_item>> mutation_builders,
@@ -245,60 +199,34 @@ private:
tracing::trace_state_ptr trace_state, service_permit permit);
public:
static void describe_key_schema(rjson::value& parent, const schema& schema, std::unordered_map<std::string,std::string>&, const std::map<sstring, sstring> *tags = nullptr);
static std::optional<rjson::value> describe_single_item(schema_ptr,
const query::partition_slice&,
const cql3::selection::selection&,
const query::result&,
const std::optional<attrs_to_get>&,
uint64_t* = nullptr);
// Converts a multi-row selection result to JSON compatible with DynamoDB.
// For each row, this method calls item_callback, which takes the size of
// the item as the parameter.
static future<std::vector<rjson::value>> describe_multi_item(schema_ptr schema,
const query::partition_slice&& slice,
shared_ptr<cql3::selection::selection> selection,
foreign_ptr<lw_shared_ptr<query::result>> query_result,
shared_ptr<const std::optional<attrs_to_get>> attrs_to_get,
noncopyable_function<void(uint64_t)> item_callback = {});
static void describe_single_item(const cql3::selection::selection&,
const std::vector<managed_bytes_opt>&,
const std::optional<attrs_to_get>&,
rjson::value&,
uint64_t* item_length_in_bytes = nullptr,
bool = false);
static bool add_stream_options(const rjson::value& stream_spec, schema_builder&, service::storage_proxy& sp);
static void supplement_table_info(rjson::value& descr, const schema& schema, service::storage_proxy& sp);
static void supplement_table_stream_info(rjson::value& descr, const schema& schema, const service::storage_proxy& sp);
};
// is_big() checks approximately if the given JSON value is "bigger" than
// the given big_size number of bytes. The goal is to *quickly* detect
// oversized JSON that, for example, is too large to be serialized to a
// contiguous string - we don't need an accurate size for that. Moreover,
// as soon as we detect that the JSON is indeed "big", we can return true
// and don't need to continue calculating its exact size.
// For simplicity, we use a recursive implementation. This is fine because
// Alternator limits the depth of JSONs it reads from inputs, and doesn't
// add more than a couple of levels in its own output construction.
bool is_big(const rjson::value& val, int big_size = 100'000);
// returns table creation time in seconds since epoch for `db_clock`
double get_table_creation_time(const schema &schema);
// Check CQL's Role-Based Access Control (RBAC) permission (MODIFY,
// SELECT, DROP, etc.) on the given table. When permission is denied an
// appropriate user-readable api_error::access_denied is thrown.
future<> verify_permission(bool enforce_authorization, bool warn_authorization, const service::client_state&, const schema_ptr&, auth::permission, alternator::stats& stats);
/**
* Make return type for serializing the object "streamed",
* i.e. direct to HTTP output stream. Note: only useful for
* (very) large objects as there are overhead issues with this
* as well, but for massive lists of return objects this can
* help avoid large allocations/many re-allocs
*/
executor::body_writer make_streamed(rjson::value&&);
// result of parsing ARN (Amazon Resource Name)
// ARN format is `arn:<partition>:<service>:<region>:<account-id>:<resource-type>/<resource-id>/<postfix>`
// we ignore partition, service and account-id
// resource-type must be string "table"
// resource-id will be returned as table_name
// region will be returned as keyspace_name
// postfix is a string after resource-id and will be returned as is (whole), including separator.
struct arn_parts {
std::string_view keyspace_name;
std::string_view table_name;
std::string_view postfix;
};
// arn - arn to parse
// arn_field_name - identifier of the ARN, used only when reporting an error (in error messages), for example "Incorrect resource identifier `<arn_field_name>`"
// type_name - used only when reporting an error (in error messages), for example "... is not a valid <type_name> ARN ..."
// expected_postfix - optional filter of postfix value (part of ARN after resource-id, including separator, see comments for struct arn_parts).
// If it is empty, the postfix value must be empty as well.
// If it is not empty, the postfix value must start with expected_postfix, but may be longer.
arn_parts parse_arn(std::string_view arn, std::string_view arn_field_name, std::string_view type_name, std::string_view expected_postfix);
// The format is ks1|ks2|ks3... and table1|table2|table3...
sstring print_names_for_audit(const std::set<sstring>& names);
}
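
The ARN layout described in the comments on arn_parts and parse_arn() above can be sketched as follows. This is only an illustration of the documented format and is not taken from the real parse_arn(), whose body is not shown in this diff: parse_arn_sketch() and arn_parts_sketch are hypothetical names, the arn_field_name and type_name parameters are omitted, and the sketch returns std::nullopt where the real function presumably reports an error.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <optional>
#include <string_view>

// Hypothetical mirror of struct arn_parts; the views point into the input.
struct arn_parts_sketch {
    std::string_view keyspace_name;
    std::string_view table_name;
    std::string_view postfix;
};

// Parses arn:<partition>:<service>:<region>:<account-id>:table/<resource-id>[/<postfix>].
// Returns std::nullopt on any mismatch (an assumption of this sketch).
std::optional<arn_parts_sketch> parse_arn_sketch(std::string_view arn, std::string_view expected_postfix) {
    // Split off the five colon-terminated fields:
    // "arn", partition, service, region, account-id.
    std::array<std::string_view, 5> head;
    for (std::string_view& field : head) {
        const std::size_t colon = arn.find(':');
        if (colon == std::string_view::npos) {
            return std::nullopt;
        }
        field = arn.substr(0, colon);
        arn.remove_prefix(colon + 1);
    }
    if (head[0] != "arn") {
        return std::nullopt;
    }
    // What remains is "<resource-type>/<resource-id>[/<postfix>]".
    const std::size_t slash = arn.find('/');
    if (slash == std::string_view::npos || arn.substr(0, slash) != "table") {
        return std::nullopt;
    }
    arn.remove_prefix(slash + 1);
    const std::size_t end = arn.find('/');
    const std::string_view table = arn.substr(0, end);
    // The postfix keeps its leading separator, as the comments describe.
    const std::string_view postfix =
            end == std::string_view::npos ? std::string_view{} : arn.substr(end);
    if (table.empty()) {
        return std::nullopt;
    }
    // Empty expected_postfix requires an empty postfix; otherwise the
    // postfix must start with expected_postfix but may be longer.
    const bool postfix_ok = expected_postfix.empty()
            ? postfix.empty()
            : postfix.substr(0, expected_postfix.size()) == expected_postfix;
    if (!postfix_ok) {
        return std::nullopt;
    }
    // Partition, service and account-id are ignored; region is the keyspace.
    return arn_parts_sketch{head[3], table, postfix};
}
```

The returned views alias the input string, so in this sketch the caller must keep the ARN alive for as long as the parts are used.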

1957
alternator/executor_read.cc Normal file

File diff suppressed because it is too large

559
alternator/executor_util.cc Normal file

@@ -0,0 +1,559 @@
/*
* Copyright 2019-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#include "alternator/executor_util.hh"
#include "alternator/executor.hh"
#include "alternator/error.hh"
#include "auth/resource.hh"
#include "auth/service.hh"
#include "cdc/log.hh"
#include "data_dictionary/data_dictionary.hh"
#include "db/tags/utils.hh"
#include "replica/database.hh"
#include "cql3/selection/selection.hh"
#include "cql3/result_set.hh"
#include "serialization.hh"
#include "service/storage_proxy.hh"
#include "types/map.hh"
#include <fmt/format.h>
namespace alternator {
extern logging::logger elogger; // from executor.cc
std::optional<int> get_int_attribute(const rjson::value& value, std::string_view attribute_name) {
const rjson::value* attribute_value = rjson::find(value, attribute_name);
if (!attribute_value)
return {};
if (!attribute_value->IsInt()) {
throw api_error::validation(fmt::format("Expected integer value for attribute {}, got: {}",
attribute_name, value));
}
return attribute_value->GetInt();
}
std::string get_string_attribute(const rjson::value& value, std::string_view attribute_name, const char* default_return) {
const rjson::value* attribute_value = rjson::find(value, attribute_name);
if (!attribute_value)
return default_return;
if (!attribute_value->IsString()) {
throw api_error::validation(fmt::format("Expected string value for attribute {}, got: {}",
attribute_name, value));
}
return rjson::to_string(*attribute_value);
}
bool get_bool_attribute(const rjson::value& value, std::string_view attribute_name, bool default_return) {
const rjson::value* attribute_value = rjson::find(value, attribute_name);
if (!attribute_value) {
return default_return;
}
if (!attribute_value->IsBool()) {
throw api_error::validation(fmt::format("Expected boolean value for attribute {}, got: {}",
attribute_name, value));
}
return attribute_value->GetBool();
}
std::optional<std::string> find_table_name(const rjson::value& request) {
const rjson::value* table_name_value = rjson::find(request, "TableName");
if (!table_name_value) {
return std::nullopt;
}
if (!table_name_value->IsString()) {
throw api_error::validation("Non-string TableName field in request");
}
std::string table_name = rjson::to_string(*table_name_value);
return table_name;
}
std::string get_table_name(const rjson::value& request) {
auto name = find_table_name(request);
if (!name) {
throw api_error::validation("Missing TableName field in request");
}
return *name;
}
schema_ptr find_table(service::storage_proxy& proxy, const rjson::value& request) {
auto table_name = find_table_name(request);
if (!table_name) {
return nullptr;
}
return find_table(proxy, *table_name);
}
schema_ptr find_table(service::storage_proxy& proxy, std::string_view table_name) {
try {
return proxy.data_dictionary().find_schema(sstring(executor::KEYSPACE_NAME_PREFIX) + sstring(table_name), table_name);
} catch(data_dictionary::no_such_column_family&) {
// DynamoDB returns validation error even when table does not exist
// and the table name is invalid.
validate_table_name(table_name);
throw api_error::resource_not_found(
fmt::format("Requested resource not found: Table: {} not found", table_name));
}
}
schema_ptr get_table(service::storage_proxy& proxy, const rjson::value& request) {
auto schema = find_table(proxy, request);
if (!schema) {
// If we get here, TableName was missing from the request: an invalid name or
// a nonexistent table would already have thrown in find_table(). This is the
// slow path, so just call get_table_name() to generate the exception.
get_table_name(request);
}
return schema;
}
map_type attrs_type() {
static thread_local auto t = map_type_impl::get_instance(utf8_type, bytes_type, true);
return t;
}
const std::map<sstring, sstring>& get_tags_of_table_or_throw(schema_ptr schema) {
auto tags_ptr = db::get_tags_of_table(schema);
if (tags_ptr) {
return *tags_ptr;
} else {
throw api_error::validation(format("Table {} does not have valid tagging information", schema->ks_name()));
}
}
bool is_alternator_keyspace(std::string_view ks_name) {
return ks_name.starts_with(executor::KEYSPACE_NAME_PREFIX);
}
// This tag is set on a GSI when the user did not specify a range key, causing
// Alternator to add the base table's range key as a spurious range key. It is
// used by describe_key_schema() to suppress reporting that key.
extern const sstring SPURIOUS_RANGE_KEY_ADDED_TO_GSI_AND_USER_DIDNT_SPECIFY_RANGE_KEY_TAG_KEY;
void describe_key_schema(rjson::value& parent, const schema& schema, std::unordered_map<std::string, std::string>* attribute_types, const std::map<sstring, sstring>* tags) {
rjson::value key_schema = rjson::empty_array();
const bool ignore_range_keys_as_spurious = tags != nullptr && tags->contains(SPURIOUS_RANGE_KEY_ADDED_TO_GSI_AND_USER_DIDNT_SPECIFY_RANGE_KEY_TAG_KEY);
for (const column_definition& cdef : schema.partition_key_columns()) {
rjson::value key = rjson::empty_object();
rjson::add(key, "AttributeName", rjson::from_string(cdef.name_as_text()));
rjson::add(key, "KeyType", "HASH");
rjson::push_back(key_schema, std::move(key));
if (attribute_types) {
(*attribute_types)[cdef.name_as_text()] = type_to_string(cdef.type);
}
}
if (!ignore_range_keys_as_spurious) {
// NOTE: user requested key (there can be at most one) will always come first.
// There might be more keys following it, which were added, but those were
// not requested by the user, so we ignore them.
for (const column_definition& cdef : schema.clustering_key_columns()) {
rjson::value key = rjson::empty_object();
rjson::add(key, "AttributeName", rjson::from_string(cdef.name_as_text()));
rjson::add(key, "KeyType", "RANGE");
rjson::push_back(key_schema, std::move(key));
if (attribute_types) {
(*attribute_types)[cdef.name_as_text()] = type_to_string(cdef.type);
}
break;
}
}
rjson::add(parent, "KeySchema", std::move(key_schema));
}
// Check if the given string has valid characters for a table name, i.e. only
// a-z, A-Z, 0-9, _ (underscore), - (dash), . (dot). Note that this function
// does not check the length of the name - instead, use validate_table_name()
// to validate both the characters and the length.
static bool valid_table_name_chars(std::string_view name) {
for (auto c : name) {
if ((c < 'a' || c > 'z') &&
(c < 'A' || c > 'Z') &&
(c < '0' || c > '9') &&
c != '_' &&
c != '-' &&
c != '.') {
return false;
}
}
return true;
}
std::string view_name(std::string_view table_name, std::string_view index_name, const std::string& delim, bool validate_len) {
if (index_name.length() < 3) {
throw api_error::validation("IndexName must be at least 3 characters long");
}
if (!valid_table_name_chars(index_name)) {
throw api_error::validation(
fmt::format("IndexName '{}' must satisfy regular expression pattern: [a-zA-Z0-9_.-]+", index_name));
}
std::string ret = std::string(table_name) + delim + std::string(index_name);
if (ret.length() > max_auxiliary_table_name_length && validate_len) {
throw api_error::validation(
fmt::format("The total length of TableName ('{}') and IndexName ('{}') cannot exceed {} characters",
table_name, index_name, max_auxiliary_table_name_length - delim.size()));
}
return ret;
}
std::string gsi_name(std::string_view table_name, std::string_view index_name, bool validate_len) {
return view_name(table_name, index_name, ":", validate_len);
}
std::string lsi_name(std::string_view table_name, std::string_view index_name, bool validate_len) {
return view_name(table_name, index_name, "!:", validate_len);
}
void check_key(const rjson::value& key, const schema_ptr& schema) {
if (key.MemberCount() != (schema->clustering_key_size() == 0 ? 1 : 2)) {
throw api_error::validation("Given key attribute not in schema");
}
}
void verify_all_are_used(const rjson::value* field,
const std::unordered_set<std::string>& used, const char* field_name, const char* operation) {
if (!field) {
return;
}
for (auto it = field->MemberBegin(); it != field->MemberEnd(); ++it) {
if (!used.contains(rjson::to_string(it->name))) {
throw api_error::validation(
format("{} has spurious '{}', not used in {}",
field_name, rjson::to_string_view(it->name), operation));
}
}
}
// This function increments the authorization_failures counter, and may also
// log a warn-level message and/or throw an access_denied exception, depending
// on what enforce_authorization and warn_authorization are set to.
// Note that if enforce_authorization is false, this function will return
// without throwing. So a caller that doesn't want to continue after an
// authorization_error must explicitly return after calling this function.
static void authorization_error(stats& stats, bool enforce_authorization, bool warn_authorization, std::string msg) {
stats.authorization_failures++;
if (enforce_authorization) {
if (warn_authorization) {
elogger.warn("alternator_warn_authorization=true: {}", msg);
}
throw api_error::access_denied(std::move(msg));
} else {
if (warn_authorization) {
elogger.warn("If you set alternator_enforce_authorization=true the following will be enforced: {}", msg);
}
}
}
future<> verify_permission(
bool enforce_authorization,
bool warn_authorization,
const service::client_state& client_state,
const schema_ptr& schema,
auth::permission permission_to_check,
stats& stats) {
if (!enforce_authorization && !warn_authorization) {
co_return;
}
// Unfortunately, the fix for issue #23218 did not modify the function
// that we use here - check_has_permissions(). So if we want to allow
// writes to internal tables (from try_get_internal_table()) only to a
// superuser, we need to explicitly check it here.
if (permission_to_check == auth::permission::MODIFY && is_internal_keyspace(schema->ks_name())) {
if (!client_state.user() ||
!client_state.user()->name ||
!co_await client_state.get_auth_service()->underlying_role_manager().is_superuser(*client_state.user()->name)) {
sstring username = "<anonymous>";
if (client_state.user() && client_state.user()->name) {
username = client_state.user()->name.value();
}
authorization_error(stats, enforce_authorization, warn_authorization, fmt::format(
"Write access denied on internal table {}.{} to role {} because it is not a superuser",
schema->ks_name(), schema->cf_name(), username));
co_return;
}
}
auto resource = auth::make_data_resource(schema->ks_name(), schema->cf_name());
if (!client_state.user() || !client_state.user()->name ||
!co_await client_state.check_has_permission(auth::command_desc(permission_to_check, resource))) {
sstring username = "<anonymous>";
if (client_state.user() && client_state.user()->name) {
username = client_state.user()->name.value();
}
// Using exceptions for errors makes this function faster in the
// success path (when the operation is allowed).
authorization_error(stats, enforce_authorization, warn_authorization, fmt::format(
"{} access on table {}.{} is denied to role {}, client address {}",
auth::permissions::to_string(permission_to_check),
schema->ks_name(), schema->cf_name(), username, client_state.get_client_address()));
}
}
// Similar to verify_permission() above, but just for CREATE operations.
// Those do not operate on any specific table, so require permissions on
// ALL KEYSPACES instead of any specific table.
future<> verify_create_permission(bool enforce_authorization, bool warn_authorization, const service::client_state& client_state, stats& stats) {
if (!enforce_authorization && !warn_authorization) {
co_return;
}
auto resource = auth::resource(auth::resource_kind::data);
if (!co_await client_state.check_has_permission(auth::command_desc(auth::permission::CREATE, resource))) {
sstring username = "<anonymous>";
if (client_state.user() && client_state.user()->name) {
username = client_state.user()->name.value();
}
authorization_error(stats, enforce_authorization, warn_authorization, fmt::format(
"CREATE access on ALL KEYSPACES is denied to role {}", username));
}
}
schema_ptr try_get_internal_table(const data_dictionary::database& db, std::string_view table_name) {
size_t it = table_name.find(executor::INTERNAL_TABLE_PREFIX);
if (it != 0) {
return schema_ptr{};
}
table_name.remove_prefix(executor::INTERNAL_TABLE_PREFIX.size());
size_t delim = table_name.find_first_of('.');
if (delim == std::string_view::npos) {
return schema_ptr{};
}
std::string_view ks_name = table_name.substr(0, delim);
table_name.remove_prefix(ks_name.size() + 1);
// Only internal keyspaces can be accessed to avoid leakage
auto ks = db.try_find_keyspace(ks_name);
if (!ks || !ks->is_internal()) {
return schema_ptr{};
}
try {
return db.find_schema(ks_name, table_name);
} catch (data_dictionary::no_such_column_family&) {
// DynamoDB returns validation error even when table does not exist
// and the table name is invalid.
validate_table_name(table_name);
throw api_error::resource_not_found(
fmt::format("Requested resource not found: Internal table: {}.{} not found", ks_name, table_name));
}
}
schema_ptr get_table_from_batch_request(const service::storage_proxy& proxy, const rjson::value::ConstMemberIterator& batch_request) {
sstring table_name = rjson::to_sstring(batch_request->name); // JSON keys are always strings
try {
return proxy.data_dictionary().find_schema(sstring(executor::KEYSPACE_NAME_PREFIX) + table_name, table_name);
} catch(data_dictionary::no_such_column_family&) {
// DynamoDB returns validation error even when table does not exist
// and the table name is invalid.
validate_table_name(table_name);
throw api_error::resource_not_found(format("Requested resource not found: Table: {} not found", table_name));
}
}
lw_shared_ptr<stats> get_stats_from_schema(service::storage_proxy& sp, const schema& schema) {
try {
replica::table& table = sp.local_db().find_column_family(schema.id());
if (!table.get_stats().alternator_stats) {
table.get_stats().alternator_stats = seastar::make_shared<table_stats>(schema.ks_name(), schema.cf_name());
}
return table.get_stats().alternator_stats->_stats;
} catch (std::runtime_error&) {
// If we're here it means that a table we are currently working on was deleted before the
// operation completed. Returning a temporary object is fine: if the table gets deleted, so will its metrics.
return make_lw_shared<stats>();
}
}
void describe_single_item(const cql3::selection::selection& selection,
const std::vector<managed_bytes_opt>& result_row,
const std::optional<attrs_to_get>& attrs_to_get,
rjson::value& item,
uint64_t* item_length_in_bytes,
bool include_all_embedded_attributes)
{
const auto& columns = selection.get_columns();
auto column_it = columns.begin();
for (const managed_bytes_opt& cell : result_row) {
if (!cell) {
++column_it;
continue;
}
std::string column_name = (*column_it)->name_as_text();
if (column_name != executor::ATTRS_COLUMN_NAME) {
if (item_length_in_bytes) {
(*item_length_in_bytes) += column_name.length() + cell->size();
}
if (!attrs_to_get || attrs_to_get->contains(column_name)) {
// item is expected to start empty, and column_name are unique
// so add() makes sense
rjson::add_with_string_name(item, column_name, rjson::empty_object());
rjson::value& field = item[column_name.c_str()];
cell->with_linearized([&] (bytes_view linearized_cell) {
rjson::add_with_string_name(field, type_to_string((*column_it)->type), json_key_column_value(linearized_cell, **column_it));
});
}
} else {
auto deserialized = attrs_type()->deserialize(*cell);
auto keys_and_values = value_cast<map_type_impl::native_type>(deserialized);
for (auto entry : keys_and_values) {
std::string attr_name = value_cast<sstring>(entry.first);
if (item_length_in_bytes) {
(*item_length_in_bytes) += attr_name.length();
}
if (include_all_embedded_attributes || !attrs_to_get || attrs_to_get->contains(attr_name)) {
bytes value = value_cast<bytes>(entry.second);
if (item_length_in_bytes && value.length()) {
// ScyllaDB uses one extra byte compared to DynamoDB for the bytes length
(*item_length_in_bytes) += value.length() - 1;
}
rjson::value v = deserialize_item(value);
if (attrs_to_get) {
auto it = attrs_to_get->find(attr_name);
if (it != attrs_to_get->end()) {
// attrs_to_get may have asked for only part of
// this attribute. hierarchy_filter() modifies v,
// and returns false when nothing is to be kept.
if (!hierarchy_filter(v, it->second)) {
continue;
}
}
}
// item is expected to start empty, and attribute
// names are unique so add() makes sense
rjson::add_with_string_name(item, attr_name, std::move(v));
} else if (item_length_in_bytes) {
(*item_length_in_bytes) += value_cast<bytes>(entry.second).length() - 1;
}
}
}
++column_it;
}
}
std::optional<rjson::value> describe_single_item(schema_ptr schema,
const query::partition_slice& slice,
const cql3::selection::selection& selection,
const query::result& query_result,
const std::optional<attrs_to_get>& attrs_to_get,
uint64_t* item_length_in_bytes) {
rjson::value item = rjson::empty_object();
cql3::selection::result_set_builder builder(selection, gc_clock::now());
query::result_view::consume(query_result, slice, cql3::selection::result_set_builder::visitor(builder, *schema, selection));
auto result_set = builder.build();
if (result_set->empty()) {
if (item_length_in_bytes) {
// An empty result is counted as having a minimal length of 1 byte.
(*item_length_in_bytes) += 1;
}
// If there is no matching item, we're supposed to return an empty
// object without an Item member - not one with an empty Item member
return {};
}
if (result_set->size() > 1) {
// If the result set contains multiple rows, the code should have
// called describe_multi_item(), not this function.
throw std::logic_error("describe_single_item() asked to describe multiple items");
}
describe_single_item(selection, *result_set->rows().begin(), attrs_to_get, item, item_length_in_bytes);
return item;
}
static void check_big_array(const rjson::value& val, int& size_left);
static void check_big_object(const rjson::value& val, int& size_left);
// For simplicity, we use a recursive implementation. This is fine because
// Alternator limits the depth of JSONs it reads from inputs, and doesn't
// add more than a couple of levels in its own output construction.
bool is_big(const rjson::value& val, int big_size) {
if (val.IsString()) {
return ssize_t(val.GetStringLength()) > big_size;
} else if (val.IsObject()) {
check_big_object(val, big_size);
return big_size < 0;
} else if (val.IsArray()) {
check_big_array(val, big_size);
return big_size < 0;
}
return false;
}
static void check_big_array(const rjson::value& val, int& size_left) {
// Assume a fixed size of 10 bytes for each number, boolean, etc., or
// beginning of a sub-object. This doesn't have to be accurate.
size_left -= 10 * val.Size();
for (const auto& v : val.GetArray()) {
if (size_left < 0) {
return;
}
// Note that we avoid recursive calls for the leaves (anything except
// array or object) because usually those greatly outnumber the trunk.
if (v.IsString()) {
size_left -= v.GetStringLength();
} else if (v.IsObject()) {
check_big_object(v, size_left);
} else if (v.IsArray()) {
check_big_array(v, size_left);
}
}
}
static void check_big_object(const rjson::value& val, int& size_left) {
size_left -= 10 * val.MemberCount();
for (const auto& m : val.GetObject()) {
if (size_left < 0) {
return;
}
size_left -= m.name.GetStringLength();
if (m.value.IsString()) {
size_left -= m.value.GetStringLength();
} else if (m.value.IsObject()) {
check_big_object(m.value, size_left);
} else if (m.value.IsArray()) {
check_big_array(m.value, size_left);
}
}
}
void validate_table_name(std::string_view name, const char* source) {
if (name.length() < 3 || name.length() > max_table_name_length) {
throw api_error::validation(
format("{} must be at least 3 characters long and at most {} characters long", source, max_table_name_length));
}
if (!valid_table_name_chars(name)) {
throw api_error::validation(
format("{} must satisfy regular expression pattern: [a-zA-Z0-9_.-]+", source));
}
}
void validate_cdc_log_name_length(std::string_view table_name) {
if (cdc::log_name(table_name).length() > max_auxiliary_table_name_length) {
// CDC will add cdc_log_suffix ("_scylla_cdc_log") to the table name
// to create its log table, and this will exceed the maximum allowed
// length. To provide a more helpful error message, we assume that
// cdc::log_name() always adds a suffix of the same length.
int suffix_len = cdc::log_name(table_name).length() - table_name.length();
throw api_error::validation(fmt::format("Streams or vector search cannot be enabled on a table whose name is longer than {} characters: {}",
max_auxiliary_table_name_length - suffix_len, table_name));
}
}
body_writer make_streamed(rjson::value&& value) {
return [value = std::move(value)](output_stream<char>&& _out) mutable -> future<> {
auto out = std::move(_out);
std::exception_ptr ex;
try {
co_await rjson::print(value, out);
} catch (...) {
ex = std::current_exception();
}
co_await out.close();
co_await rjson::destroy_gently(std::move(value));
if (ex) {
co_await coroutine::return_exception_ptr(std::move(ex));
}
};
}
} // namespace alternator
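
To illustrate how view_name(), gsi_name() and lsi_name() above compose auxiliary table names, here is a minimal standalone sketch. It is not the Alternator code: sketch_view_name(), valid_name_chars() and kMaxAuxiliaryNameLength are hypothetical names, and std::invalid_argument stands in for api_error::validation so the snippet compiles without ScyllaDB headers. The 222 limit and the ":" and "!:" delimiters are taken from the code in this diff.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

// Assumption: mirrors max_auxiliary_table_name_length from executor_util.hh.
constexpr std::size_t kMaxAuxiliaryNameLength = 222;

// Same character class as valid_table_name_chars(): [a-zA-Z0-9_.-]
bool valid_name_chars(std::string_view name) {
    for (char c : name) {
        const bool ok = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                        (c >= '0' && c <= '9') || c == '_' || c == '-' || c == '.';
        if (!ok) {
            return false;
        }
    }
    return true;
}

// Hypothetical stand-in for view_name(). Throws std::invalid_argument where
// the real function throws api_error::validation.
std::string sketch_view_name(std::string_view table, std::string_view index,
                             std::string_view delim) {
    if (index.length() < 3) {
        throw std::invalid_argument("IndexName must be at least 3 characters long");
    }
    if (!valid_name_chars(index)) {
        throw std::invalid_argument("IndexName must match [a-zA-Z0-9_.-]+");
    }
    std::string ret = std::string(table) + std::string(delim) + std::string(index);
    // The limit applies to the combined name, not to each component.
    if (ret.length() > kMaxAuxiliaryNameLength) {
        throw std::invalid_argument("TableName plus IndexName is too long");
    }
    return ret;
}
```

With these definitions, sketch_view_name("orders", "by_date", ":") yields "orders:by_date" (the GSI form) and the "!:" delimiter yields "orders!:by_date" (the LSI form). Because ':' and '!' are rejected by the character check, a composed view name cannot collide with an ordinary table name.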

247
alternator/executor_util.hh Normal file

@@ -0,0 +1,247 @@
/*
* Copyright 2019-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
// This header file, and the implementation file executor_util.cc, contain
// various utility functions that are reused in many different operations
// (API requests) across Alternator's code - in files such as executor.cc,
// executor_read.cc, streams.cc, ttl.cc, and more. These utility functions
// include things like extracting and validating pieces from a JSON request,
// checking permissions, constructing auxiliary table names, and more.
#pragma once
#include <map>
#include <optional>
#include <string>
#include <string_view>
#include <unordered_map>
#include <unordered_set>
#include <seastar/core/future.hh>
#include <seastar/util/noncopyable_function.hh>
#include "utils/rjson.hh"
#include "schema/schema_fwd.hh"
#include "types/types.hh"
#include "auth/permission.hh"
#include "alternator/stats.hh"
#include "alternator/attribute_path.hh"
#include "utils/managed_bytes.hh"
namespace query { class partition_slice; class result; }
namespace cql3::selection { class selection; }
namespace data_dictionary { class database; }
namespace service { class storage_proxy; class client_state; }
namespace alternator {
/// The body_writer is used for streaming responses - where the response body
/// is written in chunks to the output_stream. This allows for efficient
/// handling of large responses without needing to allocate a large buffer in
/// memory. It is one of the variants of executor::request_return_type.
using body_writer = noncopyable_function<future<>(output_stream<char>&&)>;
/// Get the value of an integer attribute, or an empty optional if it is
/// missing. If the attribute exists, but is not an integer, a descriptive
/// api_error is thrown.
std::optional<int> get_int_attribute(const rjson::value& value, std::string_view attribute_name);
/// Get the value of a string attribute, or a default value if it is missing.
/// If the attribute exists, but is not a string, a descriptive api_error is
/// thrown.
std::string get_string_attribute(const rjson::value& value, std::string_view attribute_name, const char* default_return);
/// Get the value of a boolean attribute, or a default value if it is missing.
/// If the attribute exists, but is not a bool, a descriptive api_error is
/// thrown.
bool get_bool_attribute(const rjson::value& value, std::string_view attribute_name, bool default_return);
/// Extract table name from a request.
/// Most requests expect the table's name to be listed in a "TableName" field.
/// get_table_name() returns the name, or throws an api_error if the table
/// name is missing or not a string.
std::string get_table_name(const rjson::value& request);
/// find_table_name() is like get_table_name() except that it returns an
/// optional table name - it returns an empty optional when the TableName
/// is missing from the request, instead of throwing as get_table_name()
/// does. However, find_table_name() still throws if a TableName exists but
/// is not a string.
std::optional<std::string> find_table_name(const rjson::value& request);
/// Extract table schema from a request.
/// Many requests expect the table's name to be listed in a "TableName" field
/// and need to look it up as an existing table. The get_table() function
/// does this, with the appropriate validation and api_error in case the table
/// name is missing, invalid or the table doesn't exist. If everything is
/// successful, it returns the table's schema.
schema_ptr get_table(service::storage_proxy& proxy, const rjson::value& request);
/// This find_table() variant is like get_table() except that it returns a
/// nullptr instead of throwing if the request does not mention a TableName.
/// In other cases of errors (i.e., a table is mentioned but doesn't exist)
/// this function throws too.
schema_ptr find_table(service::storage_proxy& proxy, const rjson::value& request);
/// This find_table() variant is like the previous one except that it takes
/// the table name directly instead of a request object. It is used in cases
/// where we already have the table name extracted from the request.
schema_ptr find_table(service::storage_proxy& proxy, std::string_view table_name);
// We would have liked to support table names up to 255 bytes, like DynamoDB.
// But Scylla creates a directory whose name is the table's name plus 33
// bytes (dash and UUID), and since directory names are limited to 255 bytes,
// we need to limit table names to 222 bytes, instead of 255. See issue #4480.
// We actually have two limits here,
// * max_table_name_length is the limit that Alternator will impose on names
// of new Alternator tables.
// * max_auxiliary_table_name_length is the potentially higher absolute limit
// that Scylla imposes on the names of auxiliary tables that Alternator
// wants to create internally - i.e. materialized views or CDC log tables.
// The second limit might mean that it is not possible to add a GSI to an
// existing table, because the name of the new auxiliary table may go over
// the limit. The second limit is also one of the reasons why the first limit
// is set lower than 222 - to have room to enable streams which add the extra
// suffix "_scylla_cdc_log" to the table name.
inline constexpr int max_table_name_length = 192;
inline constexpr int max_auxiliary_table_name_length = 222;
/// validate_table_name() validates the TableName parameter in a request - it
/// should be called in CreateTable, and in other requests only when noticing
/// that the named table doesn't exist.
/// The DynamoDB developer guide, https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.NamingRulesDataTypes.html#HowItWorks.NamingRules
/// specifies that table "names must be between 3 and 255 characters long and
/// can contain only the following characters: a-z, A-Z, 0-9, _ (underscore),
/// - (dash), . (dot)". However, Alternator only allows max_table_name_length
/// characters (see above) - not 255.
/// validate_table_name() throws the appropriate api_error if this validation
/// fails.
void validate_table_name(std::string_view name, const char* source = "TableName");
/// Validate that a CDC log table could be created for the base table with a
/// given table_name, and if not, throw a user-visible api_error::validation.
/// It is not possible to create a CDC log table if the table name is so long
/// that adding the 15-character suffix "_scylla_cdc_log" (cdc_log_suffix)
/// makes it go over max_auxiliary_table_name_length.
/// Note that if max_table_name_length is set to less than 207 (which is
/// max_auxiliary_table_name_length-15), then this function will never
/// fail. However, it's still important to call it in UpdateTable, in case
/// we have pre-existing tables with names longer than this to avoid #24598.
void validate_cdc_log_name_length(std::string_view table_name);
/// Checks if a keyspace, given by its name, is an Alternator keyspace.
/// This just checks if the name begins with executor::KEYSPACE_NAME_PREFIX,
/// a prefix that all keyspaces created by Alternator's CreateTable use.
bool is_alternator_keyspace(std::string_view ks_name);
/// Wraps db::get_tags_of_table() and throws api_error::validation if the
/// table is missing the tags extension.
const std::map<sstring, sstring>& get_tags_of_table_or_throw(schema_ptr schema);
/// Returns a type object representing the type of the ":attrs" column used
/// by Alternator to store all non-key attributes. This type is a map from
/// string (attribute name) to bytes (serialized attribute value).
map_type attrs_type();
// In DynamoDB index names are local to a table, while in Scylla, materialized
// view names are global (in a keyspace). So we need to compose a unique name
// for the view taking into account both the table's name and the index name.
// We concatenate the table and index name separated by a delim character
// (a character not allowed by DynamoDB in ordinary table names, default: ":").
// The downside of this approach is that it limits the sum of the lengths,
// instead of each component individually as DynamoDB does.
// The view_name() function assumes the table_name has already been validated
// but validates the legality of index_name and the combination of both.
std::string view_name(std::string_view table_name, std::string_view index_name,
const std::string& delim = ":", bool validate_len = true);
std::string gsi_name(std::string_view table_name, std::string_view index_name,
bool validate_len = true);
std::string lsi_name(std::string_view table_name, std::string_view index_name,
bool validate_len = true);
/// After calling pk_from_json() and ck_from_json() to extract the pk and ck
/// components of a key, and if that succeeded, call check_key() to further
/// check that the key doesn't have any spurious components.
void check_key(const rjson::value& key, const schema_ptr& schema);
/// Fail with api_error::validation if the expression has unused attribute
/// names or values. This is how DynamoDB behaves, so we do too.
void verify_all_are_used(const rjson::value* field,
const std::unordered_set<std::string>& used,
const char* field_name,
const char* operation);
/// Check CQL's Role-Based Access Control (RBAC) permission (MODIFY,
/// SELECT, DROP, etc.) on the given table. When permission is denied an
/// appropriate user-readable api_error::access_denied is thrown.
future<> verify_permission(bool enforce_authorization, bool warn_authorization, const service::client_state&, const schema_ptr&, auth::permission, stats& stats);
/// Similar to verify_permission() above, but just for CREATE operations.
/// Those do not operate on any specific table, so require permissions on
/// ALL KEYSPACES instead of any specific table.
future<> verify_create_permission(bool enforce_authorization, bool warn_authorization, const service::client_state&, stats& stats);
// Sets a KeySchema JSON array inside the given parent object describing the
// key attributes of the given schema as HASH or RANGE keys. Additionally,
// adds mappings from key attribute names to their DynamoDB type string into
// attribute_types.
void describe_key_schema(rjson::value& parent, const schema&, std::unordered_map<std::string, std::string>* attribute_types = nullptr, const std::map<sstring, sstring>* tags = nullptr);
/// is_big() checks approximately if the given JSON value is "bigger" than
/// the given big_size number of bytes. The goal is to *quickly* detect
/// oversized JSON that, for example, is too large to be serialized to a
/// contiguous string - we don't need an accurate size for that. Moreover,
/// as soon as we detect that the JSON is indeed "big", we can return true
/// and don't need to continue calculating its exact size.
bool is_big(const rjson::value& val, int big_size = 100'000);
/// try_get_internal_table() handles the special case that the given table_name
/// begins with INTERNAL_TABLE_PREFIX (".scylla.alternator."). In that case,
/// this function assumes that the rest of the name refers to an internal
/// Scylla table (e.g., system table) and returns the schema of that table -
/// or throws an exception if it doesn't exist. Otherwise, if table_name does not
/// start with INTERNAL_TABLE_PREFIX, this function returns an empty schema_ptr
/// and the caller should look for a normal Alternator table with that name.
schema_ptr try_get_internal_table(const data_dictionary::database& db, std::string_view table_name);
/// get_table_from_batch_request() is used by batch write/read operations to
/// look up the schema for a table named in a batch request, by the JSON member
/// name (which is the table name in a BatchWriteItem or BatchGetItem request).
schema_ptr get_table_from_batch_request(const service::storage_proxy& proxy, const rjson::value::ConstMemberIterator& batch_request);
/// Returns (or lazily creates) the per-table stats object for the given schema.
/// If the table has been deleted, returns a temporary stats object.
lw_shared_ptr<stats> get_stats_from_schema(service::storage_proxy& sp, const schema& schema);
/// Writes one item's attributes into `item` from the given selection result
/// row. If include_all_embedded_attributes is true, all attributes from the
/// ATTRS_COLUMN map column are included regardless of attrs_to_get.
void describe_single_item(const cql3::selection::selection&,
const std::vector<managed_bytes_opt>&,
const std::optional<attrs_to_get>&,
rjson::value&,
uint64_t* item_length_in_bytes = nullptr,
bool include_all_embedded_attributes = false);
/// Converts a single result row to a JSON item, or returns an empty optional
/// if the result is empty.
std::optional<rjson::value> describe_single_item(schema_ptr,
const query::partition_slice&,
const cql3::selection::selection&,
const query::result&,
const std::optional<attrs_to_get>&,
uint64_t* item_length_in_bytes = nullptr);
/// Make a body_writer (function that can write output incrementally to the
/// HTTP stream) from the given JSON object.
/// Note: only useful for (very) large objects as there are overhead issues
/// with this as well, but for massive lists of return objects this can
/// help avoid large allocations/many re-allocs.
body_writer make_streamed(rjson::value&&);
} // namespace alternator
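
The name handling documented for try_get_internal_table() can be sketched in isolation as follows. This is only an illustration of the string splitting, under stated assumptions: split_internal_name() and kInternalPrefix are hypothetical names, the prefix value is the one quoted in the comment above (".scylla.alternator."), and the real function additionally checks that the keyspace is internal and looks up the schema.

```cpp
#include <cassert>
#include <optional>
#include <string_view>
#include <utility>

// Assumption: same value as executor::INTERNAL_TABLE_PREFIX, as quoted in the
// header comment of try_get_internal_table().
constexpr std::string_view kInternalPrefix = ".scylla.alternator.";

// Hypothetical helper: splits "<prefix><keyspace>.<table>" into its keyspace
// and table parts. Returns an empty optional when the name does not start
// with the prefix or has no '.' after it, which is where the real function
// returns an empty schema_ptr.
std::optional<std::pair<std::string_view, std::string_view>>
split_internal_name(std::string_view name) {
    if (name.substr(0, kInternalPrefix.size()) != kInternalPrefix) {
        return std::nullopt;
    }
    name.remove_prefix(kInternalPrefix.size());
    const auto delim = name.find('.');
    if (delim == std::string_view::npos) {
        return std::nullopt;
    }
    return std::pair<std::string_view, std::string_view>{
        name.substr(0, delim), name.substr(delim + 1)};
}
```

For an illustrative input such as ".scylla.alternator.system.clients" this returns the pair ("system", "clients"), while an ordinary name such as "my_table" returns an empty optional, so the caller would fall back to the normal Alternator table lookup.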


@@ -744,7 +744,7 @@ void validate_attr_name_length(std::string_view supplementary_context, size_t at
constexpr const size_t DYNAMODB_NONKEY_ATTR_NAME_SIZE_MAX = 65535;
const size_t max_length = is_key ? DYNAMODB_KEY_ATTR_NAME_SIZE_MAX : DYNAMODB_NONKEY_ATTR_NAME_SIZE_MAX;
-if (attr_name_length > max_length) {
+if (attr_name_length > max_length || attr_name_length == 0) {
std::string error_msg;
if (!error_msg_prefix.empty()) {
error_msg += error_msg_prefix;
@@ -754,7 +754,11 @@ void validate_attr_name_length(std::string_view supplementary_context, size_t at
error_msg += supplementary_context;
error_msg += " - ";
}
error_msg += fmt::format("Attribute name is too large, must be less than {} bytes", std::to_string(max_length + 1));
if (attr_name_length == 0) {
error_msg += "Empty attribute name";
} else {
error_msg += fmt::format("Attribute name is too large, must be less than {} bytes", std::to_string(max_length + 1));
}
throw api_error::validation(error_msg);
}
}
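
The hunk above makes `validate_attr_name_length` reject empty attribute names in addition to over-long ones, with a separate message for each case. A minimal standalone sketch of that check follows. The function name `check_attr_name_length`, the string-returning interface, and the key limit of 255 are assumptions made for illustration; only the non-key limit of 65535 and the two message texts come from the hunk.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumed value: the key-attribute limit is not shown in the hunk.
constexpr std::size_t key_attr_name_max = 255;
// Taken from the hunk (DYNAMODB_NONKEY_ATTR_NAME_SIZE_MAX).
constexpr std::size_t nonkey_attr_name_max = 65535;

// Hypothetical helper: returns an empty string when the length is acceptable,
// otherwise the error text. The real code throws api_error::validation instead.
std::string check_attr_name_length(std::size_t length, bool is_key) {
    const std::size_t max_length = is_key ? key_attr_name_max : nonkey_attr_name_max;
    if (length == 0) {
        return "Empty attribute name";
    }
    if (length > max_length) {
        return "Attribute name is too large, must be less than "
            + std::to_string(max_length + 1) + " bytes";
    }
    return {};
}

// Small usage example covering the new empty-name case and one boundary.
inline void example_attr_name_checks() {
    assert(check_attr_name_length(0, false) == "Empty attribute name");
    assert(check_attr_name_length(65535, false).empty());
}
```

Checking for zero before the upper bound keeps the two messages distinct, which mirrors the `if`/`else` added in the hunk.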

@@ -264,7 +264,7 @@ private:
}
};
executor::body_writer compress(response_compressor::compression_type ct, const db::config& cfg, executor::body_writer&& bw) {
body_writer compress(response_compressor::compression_type ct, const db::config& cfg, body_writer&& bw) {
return [bw = std::move(bw), ct, level = cfg.alternator_response_gzip_compression_level()](output_stream<char>&& out) mutable -> future<> {
output_stream_options opts;
opts.trim_to_size = true;
@@ -287,7 +287,7 @@ executor::body_writer compress(response_compressor::compression_type ct, const d
};
}
future<std::unique_ptr<http::reply>> response_compressor::generate_reply(std::unique_ptr<http::reply> rep, sstring accept_encoding, const char* content_type, executor::body_writer&& body_writer) {
future<std::unique_ptr<http::reply>> response_compressor::generate_reply(std::unique_ptr<http::reply> rep, sstring accept_encoding, const char* content_type, body_writer&& body_writer) {
response_compressor::compression_type ct = find_compression(accept_encoding, std::numeric_limits<size_t>::max());
if (ct != response_compressor::compression_type::none) {
rep->add_header("Content-Encoding", get_encoding_name(ct));

@@ -85,7 +85,7 @@ public:
future<std::unique_ptr<http::reply>> generate_reply(std::unique_ptr<http::reply> rep,
sstring accept_encoding, const char* content_type, std::string&& response_body);
future<std::unique_ptr<http::reply>> generate_reply(std::unique_ptr<http::reply> rep,
sstring accept_encoding, const char* content_type, executor::body_writer&& body_writer);
sstring accept_encoding, const char* content_type, body_writer&& body_writer);
};
}

@@ -14,12 +14,12 @@
#include "types/concrete_types.hh"
#include "types/json_utils.hh"
#include "mutation/position_in_partition.hh"
#include "alternator/executor_util.hh"
static logging::logger slogger("alternator-serialization");
namespace alternator {
bool is_alternator_keyspace(const sstring& ks_name);
type_info type_info_from_string(std::string_view type) {
static thread_local const std::unordered_map<std::string_view, type_info> type_infos = {

@@ -7,6 +7,8 @@
*/
#include "alternator/server.hh"
#include "audit/audit.hh"
#include "alternator/executor_util.hh"
#include "gms/application_state.hh"
#include "utils/log.hh"
#include <fmt/ranges.h>
@@ -142,7 +144,7 @@ public:
return _response_compressor.generate_reply(std::move(rep), std::move(accept_encoding),
REPLY_CONTENT_TYPE, std::move(str));
},
[&] (executor::body_writer&& body_writer) {
[&] (body_writer&& body_writer) {
return _response_compressor.generate_reply(std::move(rep), std::move(accept_encoding),
REPLY_CONTENT_TYPE, std::move(body_writer));
},
@@ -785,12 +787,25 @@ future<executor::request_return_type> server::handle_api_request(std::unique_ptr
auto f = [this, content = std::move(content), &callback = callback_it->second,
client_state = std::move(client_state), trace_state = std::move(trace_state),
units = std::move(units), req = std::move(req)] () mutable -> future<executor::request_return_type> {
rjson::value json_request = co_await _json_parser.parse(std::move(content));
if (!json_request.IsObject()) {
co_return api_error::validation("Request content must be an object");
}
co_return co_await callback(_executor, client_state, trace_state,
make_service_permit(std::move(units)), std::move(json_request), std::move(req));
rjson::value json_request = co_await _json_parser.parse(std::move(content));
if (!json_request.IsObject()) {
co_return api_error::validation("Request content must be an object");
}
std::unique_ptr<audit::audit_info_alternator> audit_info;
std::exception_ptr ex = {};
executor::request_return_type ret;
try {
ret = co_await callback(_executor, client_state, trace_state, make_service_permit(std::move(units)), std::move(json_request), std::move(req), audit_info);
} catch (...) {
ex = std::current_exception();
}
if (audit_info) {
co_await audit::inspect(*audit_info, client_state, ex != nullptr);
}
if (ex) {
co_return coroutine::exception(std::move(ex));
}
co_return ret;
};
co_return co_await _sl_controller.with_user_service_level(user, std::ref(f));
}
@@ -834,77 +849,77 @@ server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gos
, _pending_requests("alternator::server::pending_requests")
, _timeout_config(_proxy.data_dictionary().get_config())
, _callbacks{
{"CreateTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.create_table(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
{"CreateTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.create_table(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);
}},
{"DescribeTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.describe_table(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
{"DescribeTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.describe_table(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);
}},
{"DeleteTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.delete_table(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
{"DeleteTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.delete_table(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);
}},
{"UpdateTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.update_table(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
{"UpdateTable", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.update_table(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);
}},
{"PutItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.put_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
{"PutItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.put_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);
}},
{"UpdateItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.update_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
{"UpdateItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.update_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);
}},
{"GetItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.get_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
{"GetItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.get_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);
}},
{"DeleteItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.delete_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
{"DeleteItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.delete_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);
}},
{"ListTables", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.list_tables(client_state, std::move(permit), std::move(json_request));
{"ListTables", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.list_tables(client_state, std::move(permit), std::move(json_request), audit_info);
}},
{"Scan", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.scan(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
{"Scan", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.scan(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);
}},
{"DescribeEndpoints", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.describe_endpoints(client_state, std::move(permit), std::move(json_request), req->get_header("Host"));
{"DescribeEndpoints", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.describe_endpoints(client_state, std::move(permit), std::move(json_request), req->get_header("Host"), audit_info);
}},
{"BatchWriteItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.batch_write_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
{"BatchWriteItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.batch_write_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);
}},
{"BatchGetItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.batch_get_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
{"BatchGetItem", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.batch_get_item(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);
}},
{"Query", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.query(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
{"Query", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.query(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);
}},
{"TagResource", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.tag_resource(client_state, std::move(permit), std::move(json_request));
{"TagResource", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.tag_resource(client_state, std::move(permit), std::move(json_request), audit_info);
}},
{"UntagResource", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.untag_resource(client_state, std::move(permit), std::move(json_request));
{"UntagResource", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.untag_resource(client_state, std::move(permit), std::move(json_request), audit_info);
}},
{"ListTagsOfResource", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.list_tags_of_resource(client_state, std::move(permit), std::move(json_request));
{"ListTagsOfResource", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.list_tags_of_resource(client_state, std::move(permit), std::move(json_request), audit_info);
}},
{"UpdateTimeToLive", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.update_time_to_live(client_state, std::move(permit), std::move(json_request));
{"UpdateTimeToLive", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.update_time_to_live(client_state, std::move(permit), std::move(json_request), audit_info);
}},
{"DescribeTimeToLive", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.describe_time_to_live(client_state, std::move(permit), std::move(json_request));
{"DescribeTimeToLive", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.describe_time_to_live(client_state, std::move(permit), std::move(json_request), audit_info);
}},
{"ListStreams", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.list_streams(client_state, std::move(permit), std::move(json_request));
{"ListStreams", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.list_streams(client_state, std::move(permit), std::move(json_request), audit_info);
}},
{"DescribeStream", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.describe_stream(client_state, std::move(permit), std::move(json_request));
{"DescribeStream", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.describe_stream(client_state, std::move(permit), std::move(json_request), audit_info);
}},
{"GetShardIterator", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.get_shard_iterator(client_state, std::move(permit), std::move(json_request));
{"GetShardIterator", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.get_shard_iterator(client_state, std::move(permit), std::move(json_request), audit_info);
}},
{"GetRecords", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.get_records(client_state, std::move(trace_state), std::move(permit), std::move(json_request));
{"GetRecords", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.get_records(client_state, std::move(trace_state), std::move(permit), std::move(json_request), audit_info);
}},
{"DescribeContinuousBackups", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req) {
return e.describe_continuous_backups(client_state, std::move(permit), std::move(json_request));
{"DescribeContinuousBackups", [] (executor& e, executor::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value json_request, std::unique_ptr<request> req, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
return e.describe_continuous_backups(client_state, std::move(permit), std::move(json_request), audit_info);
}},
} {
}
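
The changes to this file give every callback an extra `std::unique_ptr<audit::audit_info_alternator>&` out-parameter, and `handle_api_request` now audits after the callback finishes, whether it returned or threw. The pattern can be sketched without Seastar coroutines as below. Everything here is a simplified stand-in: `audit_record`, `audit_log`, `handler` and `dispatch` are hypothetical names, and the real code is asynchronous and calls `audit::inspect`.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for audit::audit_info_alternator.
struct audit_record {
    std::string operation;
};

// Hypothetical audit sink: remembers each operation and whether it failed.
struct audit_log {
    std::vector<std::pair<std::string, bool>> entries;
    void inspect(const audit_record& rec, bool failed) {
        entries.emplace_back(rec.operation, failed);
    }
};

// A handler may fill in audit_info; leaving it null means "do not audit".
using handler = std::function<std::string(const std::string&, std::unique_ptr<audit_record>&)>;

std::string dispatch(const handler& h, const std::string& request, audit_log& log) {
    std::unique_ptr<audit_record> audit_info;
    std::exception_ptr ex;
    std::string ret;
    try {
        ret = h(request, audit_info);
    } catch (...) {
        // Hold on to the failure so the audit step still runs.
        ex = std::current_exception();
    }
    if (audit_info) {
        log.inspect(*audit_info, ex != nullptr);
    }
    if (ex) {
        std::rethrow_exception(ex);
    }
    return ret;
}

// Usage example for the success path.
inline void example_dispatch() {
    audit_log log;
    handler put_item = [](const std::string& req, std::unique_ptr<audit_record>& info) {
        info = std::make_unique<audit_record>(audit_record{"PutItem"});
        return "ok:" + req;
    };
    assert(dispatch(put_item, "r1", log) == "ok:r1");
    assert(log.entries.size() == 1 && !log.entries[0].second);
}
```

Passing the audit record by reference, instead of returning it, lets a handler that throws partway through still leave behind the record it had already prepared.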

@@ -34,7 +34,7 @@ class server : public peering_sharded_service<server> {
// DynamoDB also has the same limit set to 16 MB.
static constexpr size_t request_content_length_limit = 16*MB;
using alternator_callback = std::function<future<executor::request_return_type>(executor&, executor::client_state&,
tracing::trace_state_ptr, service_permit, rjson::value, std::unique_ptr<http::request>)>;
tracing::trace_state_ptr, service_permit, rjson::value, std::unique_ptr<http::request>, std::unique_ptr<audit::audit_info_alternator>&)>;
using alternator_callbacks_map = std::unordered_map<std::string_view, alternator_callback>;
httpd::http_server _http_server;

@@ -7,6 +7,8 @@
*/
#include <type_traits>
#include <ranges>
#include <generator>
#include <boost/lexical_cast.hpp>
#include <boost/io/ios_state.hpp>
#include <boost/multiprecision/cpp_int.hpp>
@@ -24,12 +26,15 @@
#include "cql3/selection/selection.hh"
#include "cql3/result_set.hh"
#include "cql3/column_identifier.hh"
#include "replica/database.hh"
#include "schema/schema_builder.hh"
#include "service/storage_proxy.hh"
#include "gms/feature.hh"
#include "gms/feature_service.hh"
#include "executor.hh"
#include "streams.hh"
#include "alternator/executor_util.hh"
#include "data_dictionary/data_dictionary.hh"
#include "utils/rjson.hh"
@@ -91,45 +96,117 @@ static sstring stream_label(const schema& log_schema) {
return seastar::json::formatter::to_json(tm);
}
namespace alternator {
// Debug printer for cdc::stream_id - used only for logging/debugging, not for
// serialization or user-visible output. We print both signed and unsigned value
// as we use both.
template <>
struct fmt::formatter<cdc::stream_id> : fmt::formatter<string_view> {
template <typename FormatContext>
auto format(const cdc::stream_id &id, FormatContext& ctx) const {
fmt::format_to(ctx.out(), "{} ", id.token());
// stream arn _has_ to be 37 or more characters long. ugh...
// see https://docs.aws.amazon.com/amazondynamodb/latest/APIReference/API_streams_DescribeStream.html#API_streams_DescribeStream_RequestSyntax
for (auto b : id.to_bytes()) {
fmt::format_to(ctx.out(), "{:02x}", (unsigned char)b);
}
return ctx.out();
}
};
namespace alternator {
// stream arn has certain format (see https://docs.aws.amazon.com/IAM/latest/UserGuide/reference-arns.html)
// we need to follow it as Kinesis Client Library does check
// NOTE: we're holding inside a name of cdc log table, not a user table
class stream_arn {
std::string _arn;
size_t _table_name_offset, _table_name_size;
size_t _keyspace_name_offset, _keyspace_name_size;
void _initialize_offsets() {
auto parts = parse_arn(_arn, "StreamArn", "stream", "/stream/");
_table_name_offset = parts.table_name.data() - _arn.data();
_table_name_size = parts.table_name.size();
_keyspace_name_offset = parts.keyspace_name.data() - _arn.data();
_keyspace_name_size = parts.keyspace_name.size();
}
public:
// ARN to get table name from
stream_arn(std::string arn) : _arn(std::move(arn)) {
_initialize_offsets();
}
// NOTE: it must be a schema of a CDC log table, not a base table, because that's what we are encoding in ARN and returning to users.
// we need base schema for creation time
stream_arn(schema_ptr s, schema_ptr base_schema) {
auto creation_time = get_table_creation_time(*base_schema);
auto now = std::chrono::system_clock::time_point{ std::chrono::duration_cast<std::chrono::system_clock::duration>(std::chrono::duration<double>(creation_time)) };
// KCL checks for arn / aws / dynamodb and account-id being a number
_arn = fmt::format("arn:aws:dynamodb:us-east-1:000000000000:table/{}@{}/stream/{:%FT%T}", s->ks_name(), s->cf_name(), now);
_initialize_offsets();
}
std::string_view unparsed() const { return _arn; }
std::string_view table_name() const { return std::string_view{ _arn }.substr(_table_name_offset, _table_name_size); }
std::string_view keyspace_name() const { return std::string_view{ _arn }.substr(_keyspace_name_offset, _keyspace_name_size); }
friend std::ostream& operator<<(std::ostream& os, const stream_arn& arn) {
os << arn._arn;
return os;
}
};
// NOTE: this will return schema for cdc log table, not the base table.
static schema_ptr get_schema_from_arn(service::storage_proxy& proxy, const stream_arn& arn)
{
if (!cdc::is_log_name(arn.table_name())) {
throw api_error::resource_not_found(fmt::format("{} as found in ARN {} is not a valid name for a CDC table", arn.table_name(), arn.unparsed()));
}
try {
return proxy.data_dictionary().find_schema(arn.keyspace_name(), arn.table_name());
} catch(data_dictionary::no_such_column_family&) {
throw api_error::resource_not_found(fmt::format("`{}` is not a valid StreamArn - table {} not found", arn.unparsed(), arn.table_name()));
}
}
// ShardId. Must be between 28 and 65 characters inclusive.
// UUID is 36 bytes as string (including dashes).
// Prepend a version/type marker -> 37
class stream_arn : public utils::UUID {
// Prepend a version/type marker (`S`) -> 37
class stream_shard_id : public utils::UUID {
public:
using UUID = utils::UUID;
static constexpr char marker = 'S';
stream_arn() = default;
stream_arn(const UUID& uuid)
stream_shard_id() = default;
stream_shard_id(const UUID& uuid)
: UUID(uuid)
{}
stream_arn(const table_id& tid)
stream_shard_id(const table_id& tid)
: UUID(tid.uuid())
{}
stream_arn(std::string_view v)
stream_shard_id(std::string_view v)
: UUID(v.substr(1))
{
if (v[0] != marker) {
throw std::invalid_argument(std::string(v));
}
}
friend std::ostream& operator<<(std::ostream& os, const stream_arn& arn) {
friend std::ostream& operator<<(std::ostream& os, const stream_shard_id& arn) {
const UUID& uuid = arn;
return os << marker << uuid;
}
friend std::istream& operator>>(std::istream& is, stream_arn& arn) {
friend std::istream& operator>>(std::istream& is, stream_shard_id& arn) {
std::string s;
is >> s;
arn = stream_arn(s);
arn = stream_shard_id(s);
return is;
}
};
} // namespace alternator
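
The `stream_shard_id` comments above rely on simple arithmetic: a UUID is 36 characters as text, and prepending the one-character marker `S` gives 37, which falls inside the 28 to 65 character range required for a ShardId. A standalone sketch of that encoding follows. It works on plain strings, not `utils::UUID`; the helper names are hypothetical and the UUID syntax check is deliberately simplified.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <stdexcept>
#include <string>
#include <string_view>

constexpr char shard_marker = 'S';
constexpr std::size_t uuid_text_length = 36;  // 32 hex digits plus 4 dashes

// Simplified syntax check: dashes at offsets 8, 13, 18 and 23, hex digits elsewhere.
bool looks_like_uuid(std::string_view s) {
    if (s.size() != uuid_text_length) {
        return false;
    }
    for (std::size_t i = 0; i < s.size(); ++i) {
        const bool dash_pos = (i == 8 || i == 13 || i == 18 || i == 23);
        const bool ok = dash_pos ? (s[i] == '-')
                                 : (std::isxdigit(static_cast<unsigned char>(s[i])) != 0);
        if (!ok) {
            return false;
        }
    }
    return true;
}

// Hypothetical encoder: marker followed by the UUID text, 37 characters in total.
std::string encode_shard_id(std::string_view uuid_text) {
    if (!looks_like_uuid(uuid_text)) {
        throw std::invalid_argument(std::string(uuid_text));
    }
    std::string out(1, shard_marker);
    out += uuid_text;
    return out;
}

// Hypothetical decoder: validates emptiness and the marker before slicing.
std::string decode_shard_id(std::string_view v) {
    if (v.empty() || v.front() != shard_marker || !looks_like_uuid(v.substr(1))) {
        throw std::invalid_argument(std::string(v));
    }
    return std::string(v.substr(1));
}

// Usage example: the encoded form is 37 characters and round-trips.
inline void example_shard_id() {
    const std::string uuid = "123e4567-e89b-12d3-a456-426614174000";
    const std::string id = encode_shard_id(uuid);
    assert(id.size() == 37);
    assert(decode_shard_id(id) == uuid);
}
```

The decoder in this sketch tests for an empty input before reading the first character, a choice made here for safety and not something taken from the diff.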
template<typename ValueType>
struct rapidjson::internal::TypeHelper<ValueType, alternator::stream_shard_id>
: public from_string_helper<ValueType, alternator::stream_shard_id>
{};
template<typename ValueType>
struct rapidjson::internal::TypeHelper<ValueType, alternator::stream_arn>
: public from_string_helper<ValueType, alternator::stream_arn>
@@ -137,11 +214,11 @@ struct rapidjson::internal::TypeHelper<ValueType, alternator::stream_arn>
namespace alternator {
future<alternator::executor::request_return_type> alternator::executor::list_streams(client_state& client_state, service_permit permit, rjson::value request) {
future<alternator::executor::request_return_type> alternator::executor::list_streams(client_state& client_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
_stats.api_operations.list_streams++;
auto limit = rjson::get_opt<int>(request, "Limit").value_or(100);
auto streams_start = rjson::get_opt<stream_arn>(request, "ExclusiveStartStreamArn");
auto streams_start = rjson::get_opt<stream_shard_id>(request, "ExclusiveStartStreamArn");
auto table = find_table(_proxy, request);
auto db = _proxy.data_dictionary();
@@ -149,6 +226,11 @@ future<alternator::executor::request_return_type> alternator::executor::list_str
throw api_error::validation("Limit must be 1 or more");
}
// Audit the input table name (if specified), not the output table names.
maybe_audit(audit_info, audit::statement_category::QUERY,
table ? table->ks_name() : "", table ? table->cf_name() : "",
"ListStreams", request);
std::vector<data_dictionary::table> cfs;
if (table) {
@@ -189,26 +271,23 @@ future<alternator::executor::request_return_type> alternator::executor::list_str
auto ret = rjson::empty_object();
auto streams = rjson::empty_array();
std::optional<stream_arn> last;
std::optional<stream_shard_id> last;
for (;limit > 0 && i != e; ++i) {
auto s = i->schema();
auto& ks_name = s->ks_name();
auto& cf_name = s->cf_name();
if (!is_alternator_keyspace(ks_name)) {
continue;
}
if (cdc::is_log_for_some_table(db.real_database(), ks_name, cf_name)) {
rjson::value new_entry = rjson::empty_object();
last = i->schema()->id();
rjson::add(new_entry, "StreamArn", *last);
auto arn = stream_arn{ i->schema(), cdc::get_base_table(db.real_database(), *i->schema()) };
rjson::add(new_entry, "StreamArn", arn);
rjson::add(new_entry, "StreamLabel", rjson::from_string(stream_label(*s)));
rjson::add(new_entry, "TableName", rjson::from_string(cdc::base_name(table_name(*s))));
rjson::add(new_entry, "TableName", rjson::from_string(cdc::base_name(s->cf_name())));
rjson::push_back(streams, std::move(new_entry));
--limit;
}
}
@@ -218,7 +297,6 @@ future<alternator::executor::request_return_type> alternator::executor::list_str
if (last) {
rjson::add(ret, "LastEvaluatedStreamArn", *last);
}
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
}
@@ -430,7 +508,7 @@ using namespace std::chrono_literals;
// Dynamo docs say no data shall live longer than 24h.
static constexpr auto dynamodb_streams_max_window = 24h;
// find the parent shard in previous generation for the given child shard
// find the parent Streams shard in previous generation for the given child Streams shard
// takes care of wrap-around case in vnodes
// prev_streams must be sorted by token
const cdc::stream_id& find_parent_shard_in_previous_generation(db_clock::time_point prev_timestamp, const utils::chunked_vector<cdc::stream_id> &prev_streams, const cdc::stream_id &child) {
@@ -449,7 +527,305 @@ const cdc::stream_id& find_parent_shard_in_previous_generation(db_clock::time_po
return *it;
}
future<executor::request_return_type> executor::describe_stream(client_state& client_state, service_permit permit, rjson::value request) {
// The function compare_lexicographically() below sorts stream shard ids in the
// way we need to present them in our output. However, when processing lists of
// shards internally, especially for finding child shards, it's more convenient
// for us to sort the shard ids by the different function defined here -
// compare_by_token(). It sorts the ids by numeric token (the end token of the
// token range belonging to this shard), and makes algorithms like lower_bound()
// possible.
static bool compare_by_token(const cdc::stream_id& id1, const cdc::stream_id& id2) {
return id1.token() < id2.token();
}
// #7409 - shards must be returned in lexicographical order.
// Normal bytes compare is string_traits<int8_t>::compare,
// thus bytes 0x8000 is less than 0x0000. Instead, we need to use unsigned compare.
// KCL depends on this ordering, so we need to adhere.
static bool compare_lexicographically(const cdc::stream_id& id1, const cdc::stream_id& id2) {
return compare_unsigned(id1.to_bytes(), id2.to_bytes()) < 0;
}
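
The comment on `compare_lexicographically` notes that a signed byte comparison orders 0x8000 before 0x0000, which is why an unsigned comparison is needed. The difference can be reproduced with the standard library alone. In this sketch `byte_string` is a stand-in for the signed-byte container and both comparison helpers are hypothetical; the diff itself uses `compare_unsigned`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Stand-in for a container of signed bytes.
using byte_string = std::vector<std::int8_t>;

// Signed comparison: a byte with the high bit set (0x80 is -128) sorts before 0x00.
bool less_signed(const byte_string& a, const byte_string& b) {
    return std::lexicographical_compare(a.begin(), a.end(), b.begin(), b.end());
}

// Unsigned comparison: each byte is compared as a value in 0..255.
bool less_unsigned(const byte_string& a, const byte_string& b) {
    return std::lexicographical_compare(
        a.begin(), a.end(), b.begin(), b.end(),
        [](std::int8_t x, std::int8_t y) {
            return static_cast<std::uint8_t>(x) < static_cast<std::uint8_t>(y);
        });
}

// Usage example with the two byte strings mentioned in the comment.
inline void example_ordering() {
    const byte_string high{-128, 0};  // bytes 0x80 0x00
    const byte_string zero{0, 0};     // bytes 0x00 0x00
    assert(less_signed(high, zero));    // signed view: 0x80 is negative
    assert(less_unsigned(zero, high));  // unsigned view: 0x00 < 0x80
}
```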
stream_id_range::stream_id_range(
utils::chunked_vector<cdc::stream_id> &items,
utils::chunked_vector<cdc::stream_id>::iterator lo1,
utils::chunked_vector<cdc::stream_id>::iterator end1) : stream_id_range(items, lo1, end1, items.end(), items.end()) {}
stream_id_range::stream_id_range(
utils::chunked_vector<cdc::stream_id> &items,
utils::chunked_vector<cdc::stream_id>::iterator lo1,
utils::chunked_vector<cdc::stream_id>::iterator end1,
utils::chunked_vector<cdc::stream_id>::iterator lo2,
utils::chunked_vector<cdc::stream_id>::iterator end2)
: _lo1(lo1)
, _end1(end1)
, _lo2(lo2)
, _end2(end2)
{
if (_lo2 != items.end()) {
if (_lo1 != items.begin()) {
on_internal_error(slogger, fmt::format("Invalid stream_id_range: _lo1 != items.begin()"));
}
if (_end2 != items.end()) {
on_internal_error(slogger, fmt::format("Invalid stream_id_range: _end2 != items.end()"));
}
}
if (_end1 > _lo2)
on_internal_error(slogger, fmt::format("Invalid stream_id_range: _end1 > _lo2"));
}
void stream_id_range::set_starting_position(const cdc::stream_id &update_to) {
_skip_to = &update_to;
}
void stream_id_range::prepare_for_iterating()
{
if (_prepared) return;
_prepared = true;
// Handle a possibly wrapped-around range, in which case we actually have
// two ranges [lo1, end1) and [lo2, end2), where lo1 is begin() and end2 is end().
// The whole range must be sorted by `compare_lexicographically`, so we first move the two ranges next to each other and then sort them.
// If a starting position was set, we apply it after merging and sorting.
if (_end1 > _lo2)
on_internal_error(slogger, fmt::format("Invalid stream_id_range: _end1 > _lo2"));
auto tgt = _end1;
auto src = _lo2;
// just try to move second range just after first one - if we have only one range,
// second range will be empty and nothing will happen here
for(; src != _end2; ++src, ++tgt) {
std::swap(*tgt, *src);
}
// sort merged ranges by compare_lexicographically
std::sort(_lo1, tgt, compare_lexicographically);
// apply starting position update if it was set
// as a sanity check we require an EXACT token match
if (_skip_to) {
auto it = std::lower_bound(_lo1, tgt, *_skip_to, compare_lexicographically);
if (it == tgt || it->token() != _skip_to->token()) {
slogger.info("Could not find starting position update shard id {}", *_skip_to);
} else {
_lo1 = std::next(it);
}
}
_end1 = tgt;
}
// Returns a `stream_id_range` that allows iterating over the child Streams shards of the Streams shard `parent`.
// A child Streams shard is a Streams shard that touches the token range previously covered by the `parent` Streams shard.
// A Streams shard contains a token that represents the end of its token range (inclusive);
// the beginning of the token range is the previous Streams shard's token + 1.
// NOTE: With vnodes, ranges of Streams' shards wrap, while with tablets the biggest allowed token number is always a range end.
// NOTE: both stream generations are guaranteed to cover the whole range and be non-empty
// NOTE: it's possible to get more than one stream shard with the same token value (thus some of those stream shards will be empty) -
// for simplicity we will emit empty stream shards as well.
//
// to find children we will first find parent Streams shard in parent_streams by its token
// then we will find previous Streams shard in parent stream - that will determine range
// then based on the range we will find children Streams shards in current_streams
// NOTE: function sorts / reorders current_streams
// NOTE: function assumes parent_streams is sorted by compare_by_token and it doesn't modify it
stream_id_range find_children_range_from_parent_token(
const utils::chunked_vector<cdc::stream_id>& parent_streams,
utils::chunked_vector<cdc::stream_id>& current_streams,
cdc::stream_id parent,
bool uses_tablets
) {
// sanity checks for required preconditions
if (parent_streams.empty()) {
on_internal_error(slogger, fmt::format("parent_streams is empty") );
}
if (current_streams.empty()) {
on_internal_error(slogger, fmt::format("current_streams is empty") );
}
// first let's cover obvious cases
// if we have only one parent Streams shard, then all children belong to it
if (parent_streams.size() == 1) {
return stream_id_range{ current_streams, current_streams.begin(), current_streams.end() };
}
// if we have only one current Streams shard, then every parent maps to it
if (current_streams.size() == 1) {
return stream_id_range{ current_streams, current_streams.begin(), current_streams.end() };
}
// find parent Streams shard in parent_streams, it must be present and have exact match
auto parent_shard_end_it = std::lower_bound(parent_streams.begin(), parent_streams.end(), parent.token(), [](const cdc::stream_id& id, const dht::token& t) {
return id.token() < t;
});
if (parent_shard_end_it == parent_streams.end() || parent_shard_end_it->token() != parent.token()) {
throw api_error::validation(fmt::format("Invalid ShardFilter.ShardId value - shard {} not found", parent));
}
std::sort(current_streams.begin(), current_streams.end(), compare_by_token);
utils::chunked_vector<cdc::stream_id>::iterator child_shard_begin_it;
// upper_bound gives us the first element with token strictly greater than
// parent's end token - this is the correct one-past-end for an inclusive
// boundary and handles duplicate tokens (multiple children sharing a token)
auto child_shard_end_it = std::upper_bound(current_streams.begin(), current_streams.end(), parent_shard_end_it->token(), [](const dht::token& t, const cdc::stream_id& id) {
return t < id.token();
});
if (uses_tablets) {
// tablets version - tablets don't wrap around and last token is always present
// let's assume we've parent (first line) and child generation (second line):
// NOTE: token space doesn't wrap around - instead we have a guarantee that last token
// will be present as one of the shards
// P=| 1 2 3 4|
// C=| a b c d e|
// we want to find children for each token from parent:
// 1 -> a,b
// 2 -> c
// 3 -> d
// 4 -> d, e
// first we find token in P that is end of range of parent - parent_shard_end_it
// - if parent_shard_end_it - 1 exists
// - we take it as parent_shard_begin_it
// - find the first child with token > parent_shard_begin_it and set it to child_shard_begin_it
// - else previous one to parent_shard_end_it does not exist
// - set child_shard_begin_it = C.begin()
// - find the first child with token > parent_shard_end_it and set it to child_shard_end_it
// - range [child_shard_begin_it, child_shard_end_it) represents children
// When the parent's end token is not directly present in the children
// (merge scenario: several parent shards merged into fewer children),
// the child whose range absorbs the parent's end is the first child
// with token > parent_end_token. upper_bound already points there,
// so we advance past it to include it in the [begin, end) range.
if (child_shard_end_it == current_streams.begin() || std::prev(child_shard_end_it)->token() != parent_shard_end_it->token()) {
if (child_shard_end_it == current_streams.end()) {
on_internal_error(slogger, fmt::format("parent end token not present in children tokens and no child with greater token exists, for parent shard id {}, got parent shards [{}] and children shards [{}]",
parent, fmt::join(parent_streams, "; "), fmt::join(current_streams, "; ")));
}
++child_shard_end_it;
}
// end of parent token is also first token in parent streams - it means beginning of the parent's range
// is the beginning of the token space - this means first child stream will be start of the children range
if (parent_shard_end_it == parent_streams.begin()) {
child_shard_begin_it = current_streams.begin();
} else {
// normal case - we have previous parent Streams shard that determines beginning of the range (exclusive)
// upper_bound skips past all children at the previous parent's token (including duplicates)
auto parent_shard_begin_it = std::prev(parent_shard_end_it);
child_shard_begin_it = std::upper_bound(current_streams.begin(), current_streams.end(), parent_shard_begin_it->token(), [](const dht::token& t, const cdc::stream_id& id) {
return t < id.token();
});
}
// simple range
return stream_id_range{ current_streams, child_shard_begin_it, child_shard_end_it };
} else {
// vnodes version - vnodes wrap around
// wrapping around makes the whole algorithm extremely confusing, because we wrap around on two levels:
// both the parent Streams shard might wrap around and the children range might wrap around as well
// helper function to find a range in current_streams based on range from parent_streams, but without wrap around
// if lo is not set, it means start from beginning of current_streams
// if end is not set, it means go until end of current_streams
auto find_range_in_children = [&](std::optional<utils::chunked_vector<cdc::stream_id>::const_iterator> lo, std::optional<utils::chunked_vector<cdc::stream_id>::const_iterator> end) -> std::pair<utils::chunked_vector<cdc::stream_id>::iterator, utils::chunked_vector<cdc::stream_id>::iterator> {
utils::chunked_vector<cdc::stream_id>::iterator res_lo, res_end;
if (!lo) {
// beginning of the range
res_lo = current_streams.begin();
} else {
// we use upper_bound as beginning of the range is exclusive
res_lo = std::upper_bound(current_streams.begin(), current_streams.end(), (*lo)->token(), [](const dht::token& t, const cdc::stream_id& id) {
return t < id.token();
});
}
if (!end) {
// end of the range
res_end = current_streams.end();
} else {
// end of the range is inclusive, so we use upper_bound to find the first element
// with token strictly greater than the end token - this correctly handles the case
// where multiple children share the same token (e.g. small vnodes where several
// shards fall back to the vnode-end token)
res_end = std::upper_bound(current_streams.begin(), current_streams.end(), (*end)->token(), [](const dht::token& t, const cdc::stream_id& id) {
return t < id.token();
});
// When the parent's end token is not directly present in the
// children (merge scenario), the child whose range absorbs the
// parent's end is at res_end. Advance past it so that the
// half-open range [res_lo, res_end) includes it.
if (res_end != current_streams.end() &&
(res_end == current_streams.begin() || std::prev(res_end)->token() != (*end)->token())) {
++res_end;
}
}
return { res_lo, res_end };
};
auto parent_shard_begin_it = parent_shard_end_it;
if (parent_shard_begin_it == parent_streams.begin()) {
// end of the parent Streams shard is also first token in parent streams - it means wrap around case for parent
// beginning of the parent's range is the last token in the parent streams
// for example:
// P=| 0 10 |
// C=| -20 -10 |
// searching for parent Streams shard at 0 will get us here - end of the parent is the first parent Streams shard
// so beginning of the parent's range is the last parent Streams shard (10)
parent_shard_begin_it = std::prev(parent_streams.end());
// we find two unwrapped ranges here - from beginning of current_streams to the end of the parent's range
// (end is inclusive) - in our example it's (-inf, 0]
auto [ lo1, end1 ] = find_range_in_children(std::nullopt, parent_shard_end_it);
// and from the beginning of the parent's range (exclusive) to the end of current_streams
// our example is (10, +inf)
auto [ lo2, end2 ] = find_range_in_children(parent_shard_begin_it, std::nullopt);
// in rare cases those two ranges might overlap - so we check and merge if needed
// for example:
// P=| -30 -20 |
// C=| -40 -10 |
// searching for parent Streams shard at -30 will get us here - end of the parent is -30, beginning is -20
// first search will give us (-inf, +inf) with end1 pointing to current_streams.end()
// (because the range needs to include -10 position, so the iterator will point to the next one after - end of the current_streams)
// second search will give us [-10, +inf) with lo2 pointing to current_streams[1]
// which is less than end1 - so we need to merge those two ranges
if (lo2 < end1) {
assert(lo1 <= lo2);
assert(end1 <= end2);
end1 = end2;
lo2 = end2 = current_streams.end();
}
return stream_id_range{ current_streams, lo1, end1, lo2, end2 };
} else {
// simpler case - parent doesn't wrap around and we have both begin and end in normal order
// we search for single unwrapped range and adjust later if needed
--parent_shard_begin_it;
auto [ lo1, end1 ] = find_range_in_children(parent_shard_begin_it, parent_shard_end_it);
auto lo2 = current_streams.end();
auto end2 = current_streams.end();
// it's possible for simple case to still wrap around, when parent range lies after all children Streams shards
// for example:
// P=| 0 10 |
// C=| -20 -10 |
// when searching for the parent shard at 10, we get parent range (0, 10]
// unwrapped search will produce empty range and miss -20 child Streams shard, which is actually
// owner of (0, 10] range (and is also a first Streams shard in current generation)
// note that searching for parent 0 gives the correct result, because in that case the algorithm
// detects the wrap around case and takes the other branch
if (parent_shard_end_it->token() > current_streams.back().token() && lo1 != current_streams.begin()) {
// wrap around case - children at the beginning of the sorted array
// wrap around the ring and cover the parent's range. Include all
// children sharing the first token (duplicate tokens are possible
// for small vnodes where multiple shards fall back to the same token)
end2 = lo2 = current_streams.begin();
while(end2 != current_streams.end() && end2->token() == current_streams.front().token()) {
++end2;
}
std::swap(lo1, lo2);
std::swap(end1, end2);
}
return stream_id_range{ current_streams, lo1, end1, lo2, end2 };
}
}
}
future<executor::request_return_type> executor::describe_stream(client_state& client_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
_stats.api_operations.describe_stream++;
auto limit = rjson::get_opt<int>(request, "Limit").value_or(100); // according to spec
@@ -459,12 +835,11 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
// I.e. unparsable arn -> error.
auto stream_arn = rjson::get<alternator::stream_arn>(request, "StreamArn");
schema_ptr schema, bs;
schema_ptr bs;
auto db = _proxy.data_dictionary();
auto schema = get_schema_from_arn(_proxy, stream_arn);
try {
auto cf = db.find_column_family(table_id(stream_arn));
schema = cf.schema();
bs = cdc::get_base_table(db.real_database(), *schema);
} catch (...) {
}
@@ -472,6 +847,12 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
if (!schema || !bs || !is_alternator_keyspace(schema->ks_name())) {
throw api_error::resource_not_found("Invalid StreamArn");
}
auto normal_token_owners = _proxy.get_token_metadata_ptr()->count_normal_token_owners();
// _sdks.cdc_get_versioned_streams() uses quorum_if_many() underneath, which uses CL=QUORUM for many token owners and CL=ONE otherwise.
auto describe_cl = (normal_token_owners > 1) ? db::consistency_level::QUORUM : db::consistency_level::ONE;
maybe_audit(audit_info, audit::statement_category::QUERY, schema->ks_name(),
bs->cf_name() + "|" + schema->cf_name(), "DescribeStream", request, describe_cl);
if (limit < 1) {
throw api_error::validation("Limit must be 1 or more");
@@ -496,6 +877,8 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
} else {
status = "ENABLED";
}
} else if (opts.enable_requested()) {
status = "ENABLING";
}
auto ttl = std::chrono::seconds(opts.ttl());
@@ -504,9 +887,9 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
stream_view_type type = cdc_options_to_steam_view_type(opts);
rjson::add(stream_desc, "StreamArn", alternator::stream_arn(schema->id()));
rjson::add(stream_desc, "StreamArn", stream_arn);
rjson::add(stream_desc, "StreamViewType", type);
rjson::add(stream_desc, "TableName", rjson::from_string(table_name(*bs)));
rjson::add(stream_desc, "TableName", rjson::from_string(bs->cf_name()));
describe_key_schema(stream_desc, *bs);
@@ -518,13 +901,48 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
// TODO: label
// TODO: creation time
auto normal_token_owners = _proxy.get_token_metadata_ptr()->count_normal_token_owners();
std::map<db_clock::time_point, cdc::streams_version> topologies;
// filter out cdc generations older than the table or now() - cdc::ttl (typically dynamodb_streams_max_window - 24h)
auto low_ts = std::max(as_timepoint(schema->id()), db_clock::now() - ttl);
if (schema->table().uses_tablets()) {
// We can't use table creation time here, as tablets might report a
// generation timestamp just before table creation. This is safe
// because CDC generations are per-table and cannot pre-date the
// table, so expanding the window won't pull in unrelated data.
auto low_ts = db_clock::now() - ttl;
topologies = co_await _system_keyspace.read_cdc_for_tablets_versioned_streams(bs->ks_name(), bs->cf_name(), low_ts);
} else {
auto normal_token_owners = _proxy.get_token_metadata_ptr()->count_normal_token_owners();
auto low_ts = std::max(as_timepoint(schema->id()), db_clock::now() - ttl);
topologies = co_await _sdks.cdc_get_versioned_streams(low_ts, { normal_token_owners });
}
const auto e = topologies.end();
std::optional<shard_id> shard_filter;
if (const rjson::value *shard_filter_obj = rjson::find(request, "ShardFilter")) {
if (!shard_filter_obj->IsObject()) {
throw api_error::validation("Invalid ShardFilter value - must be object");
}
std::string type;
try {
type = rjson::get<std::string>(*shard_filter_obj, "Type");
} catch (...) {
throw api_error::validation("Invalid ShardFilter.Type value - must be string `CHILD_SHARDS`");
}
if (type != "CHILD_SHARDS") {
throw api_error::validation("Invalid ShardFilter.Type value - must be string `CHILD_SHARDS`");
}
try {
shard_filter = rjson::get<shard_id>(*shard_filter_obj, "ShardId");
} catch (const std::exception &e) {
throw api_error::validation(fmt::format("Invalid ShardFilter.ShardId value - not a valid ShardId: {}", e.what()));
}
if (topologies.find(shard_filter->time) == topologies.end()) {
throw api_error::validation(fmt::format("Invalid ShardFilter.ShardId value - corresponding generation not found: {}", shard_filter->id));
}
}
std::map<db_clock::time_point, cdc::streams_version> topologies = co_await _sdks.cdc_get_versioned_streams(low_ts, { normal_token_owners });
auto e = topologies.end();
auto prev = e;
auto shards = rjson::empty_array();
@@ -536,25 +954,6 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
i = topologies.find(shard_start->time);
}
// for parent-child stuff we need id:s to be sorted by token
// (see explanation above) since we want to find closest
// token boundary when determining parent.
// #7346 - we processed and searched children/parents in
// stored order, which is not necessarily token order,
// so the finding of "closest" token boundary (using upper bound)
// could give somewhat weird results.
static auto token_cmp = [](const cdc::stream_id& id1, const cdc::stream_id& id2) {
return id1.token() < id2.token();
};
// #7409 - shards must be returned in lexicographical order,
// normal bytes compare is string_traits<int8_t>::compare.
// thus bytes 0x8000 is less than 0x0000. By doing unsigned
// compare instead we inadvertently will sort in string lexical.
static auto id_cmp = [](const cdc::stream_id& id1, const cdc::stream_id& id2) {
return compare_unsigned(id1.to_bytes(), id2.to_bytes()) < 0;
};
// need a prev even if we are skipping stuff
if (i != topologies.begin()) {
prev = std::prev(i);
@@ -563,24 +962,18 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
for (; limit > 0 && i != e; prev = i, ++i) {
auto& [ts, sv] = *i;
if (shard_filter && (prev == e || prev->first != shard_filter->time)) {
shard_start = std::nullopt;
continue;
}
last = std::nullopt;
auto lo = sv.streams.begin();
auto end = sv.streams.end();
// #7409 - shards must be returned in lexicographical order,
std::sort(lo, end, id_cmp);
if (shard_start) {
// find next shard position
lo = std::upper_bound(lo, end, shard_start->id, id_cmp);
shard_start = std::nullopt;
}
if (lo != end && prev != e) {
std::sort(sv.streams.begin(), sv.streams.end(), compare_lexicographically);
if (prev != e) {
// We want older stuff sorted in token order so we can find matching
// token range when determining parent shard.
std::stable_sort(prev->second.streams.begin(), prev->second.streams.end(), token_cmp);
// token range when determining parent Streams shard.
std::stable_sort(prev->second.streams.begin(), prev->second.streams.end(), compare_by_token);
}
auto expired = [&]() -> std::optional<db_clock::time_point> {
@@ -593,9 +986,29 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
return j->first + confidence_interval(db);
}();
while (lo != end) {
auto& id = *lo++;
std::optional<stream_id_range> shard_range;
if (shard_filter) {
// sanity check - we should never get here because of the check above (`shard_filter && prev == e` => `continue`)
if (prev == e) {
on_internal_error(slogger, fmt::format("Could not find parent generation for shard id {}, got generations [{}]", shard_filter->id, fmt::join(topologies | std::ranges::views::keys, "; ")));
}
const bool uses_tablets = schema->table().uses_tablets();
shard_range = find_children_range_from_parent_token(
prev->second.streams,
i->second.streams,
shard_filter->id,
uses_tablets
);
} else {
shard_range = stream_id_range{ i->second.streams, i->second.streams.begin(), i->second.streams.end() };
}
if (shard_start) {
shard_range->set_starting_position(shard_start->id);
}
shard_range->prepare_for_iterating();
for(const auto &id : *shard_range) {
auto shard = rjson::empty_object();
if (prev != e) {
@@ -620,6 +1033,7 @@ future<executor::request_return_type> executor::describe_stream(client_state& cl
last = std::nullopt;
}
shard_start = std::nullopt;
}
if (last) {
@@ -720,7 +1134,7 @@ struct rapidjson::internal::TypeHelper<ValueType, alternator::shard_iterator_typ
namespace alternator {
future<executor::request_return_type> executor::get_shard_iterator(client_state& client_state, service_permit permit, rjson::value request) {
future<executor::request_return_type> executor::get_shard_iterator(client_state& client_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
_stats.api_operations.get_shard_iterator++;
auto type = rjson::get<shard_iterator_type>(request, "ShardIteratorType");
@@ -736,18 +1150,22 @@ future<executor::request_return_type> executor::get_shard_iterator(client_state&
auto stream_arn = rjson::get<alternator::stream_arn>(request, "StreamArn");
auto db = _proxy.data_dictionary();
schema_ptr schema = nullptr;
std::optional<shard_id> sid;
auto schema = get_schema_from_arn(_proxy, stream_arn);
schema_ptr base_schema = nullptr;
try {
auto cf = db.find_column_family(table_id(stream_arn));
schema = cf.schema();
base_schema = cdc::get_base_table(db.real_database(), *schema);
sid = rjson::get<shard_id>(request, "ShardId");
} catch (...) {
}
if (!schema || !cdc::get_base_table(db.real_database(), *schema) || !is_alternator_keyspace(schema->ks_name())) {
if (!schema || !base_schema || !is_alternator_keyspace(schema->ks_name())) {
throw api_error::resource_not_found("Invalid StreamArn");
}
// Uses only node-local context (the metadata) to generate response
maybe_audit(audit_info, audit::statement_category::QUERY, schema->ks_name(),
base_schema->cf_name() + "|" + schema->cf_name(), "GetShardIterator", request);
if (!sid) {
throw api_error::resource_not_found("Invalid ShardId");
}
@@ -776,11 +1194,10 @@ future<executor::request_return_type> executor::get_shard_iterator(client_state&
break;
}
shard_iterator iter(stream_arn, *sid, threshold, inclusive_of_threshold);
shard_iterator iter(schema->id().uuid(), *sid, threshold, inclusive_of_threshold);
auto ret = rjson::empty_object();
rjson::add(ret, "ShardIterator", iter);
return make_ready_future<executor::request_return_type>(rjson::print(std::move(ret)));
}
@@ -823,7 +1240,7 @@ namespace alternator {
};
}
future<executor::request_return_type> executor::get_records(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request) {
future<executor::request_return_type> executor::get_records(client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
_stats.api_operations.get_records++;
auto start_time = std::chrono::steady_clock::now();
@@ -849,16 +1266,17 @@ future<executor::request_return_type> executor::get_records(client_state& client
if (!schema || !base || !is_alternator_keyspace(schema->ks_name())) {
co_return api_error::resource_not_found(fmt::to_string(iter.table));
}
db::consistency_level cl = db::consistency_level::LOCAL_QUORUM;
maybe_audit(audit_info, audit::statement_category::QUERY, schema->ks_name(),
base->cf_name() + "|" + schema->cf_name(), "GetRecords", request, cl);
tracing::add_table_name(trace_state, schema->ks_name(), schema->cf_name());
co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, schema, auth::permission::SELECT, _stats);
db::consistency_level cl = db::consistency_level::LOCAL_QUORUM;
partition_key pk = iter.shard.id.to_partition_key(*schema);
dht::partition_range_vector partition_ranges{ dht::partition_range::make_singular(dht::decorate_key(*schema, pk)) };
auto high_ts = db_clock::now() - confidence_interval(db);
auto high_uuid = utils::UUID_gen::min_time_UUID(high_ts.time_since_epoch());
auto lo = clustering_key_prefix::from_exploded(*schema, { iter.threshold.serialize() });
@@ -938,17 +1356,17 @@ future<executor::request_return_type> executor::get_records(client_state& client
auto& metadata = result_set->get_metadata();
auto op_index = std::distance(metadata.get_names().begin(),
auto op_index = std::distance(metadata.get_names().begin(),
std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {
return cdef->name->name() == op_column_name;
})
);
auto ts_index = std::distance(metadata.get_names().begin(),
auto ts_index = std::distance(metadata.get_names().begin(),
std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {
return cdef->name->name() == timestamp_column_name;
})
);
auto eor_index = std::distance(metadata.get_names().begin(),
auto eor_index = std::distance(metadata.get_names().begin(),
std::find_if(metadata.get_names().begin(), metadata.get_names().end(), [](const lw_shared_ptr<cql3::column_specification>& cdef) {
return cdef->name->name() == eor_column_name;
})
@@ -993,19 +1411,19 @@ future<executor::request_return_type> executor::get_records(client_state& client
/**
* We merge rows with same timestamp into a single event.
* This is pretty much needed, because a CDC row typically
* encodes ~half the info of an alternator write.
*
* encodes ~half the info of an alternator write.
*
* A big, big downside to how alternator records are written
* (i.e. CQL), is that the distinction between INSERT and UPDATE
* is somewhat lost/unmappable to actual eventName.
* is somewhat lost/unmappable to actual eventName.
* A write (currently) always looks like an insert+modify
* regardless whether we wrote existing record or not.
*
* Maybe RMW ops could be done slightly differently so
* regardless whether we wrote existing record or not.
*
* Maybe RMW ops could be done slightly differently so
* we can distinguish them here...
*
*
* For now, all writes will become MODIFY.
*
*
* Note: we do not check the current pre/post
* flags on CDC log, instead we use data to
* drive what is returned. This is (afaict)
@@ -1084,9 +1502,15 @@ future<executor::request_return_type> executor::get_records(client_state& client
}
// ugh. figure out if we are at end-of-shard
auto normal_token_owners = _proxy.get_token_metadata_ptr()->count_normal_token_owners();
db_clock::time_point ts = co_await _sdks.cdc_current_generation_timestamp({ normal_token_owners });
db_clock::time_point ts;
if (schema->table().uses_tablets()) {
ts = co_await _system_keyspace.read_cdc_for_tablets_current_generation_timestamp(base->ks_name(), base->cf_name());
} else {
auto normal_token_owners = _proxy.get_token_metadata_ptr()->count_normal_token_owners();
ts = co_await _sdks.cdc_current_generation_timestamp({ normal_token_owners });
}
auto& shard = iter.shard;
if (shard.time < ts && ts < high_ts) {
@@ -1122,6 +1546,7 @@ bool executor::add_stream_options(const rjson::value& stream_specification, sche
cdc::options opts;
opts.enabled(true);
opts.tablet_merge_blocked(true);
// cdc::delta_mode is ignored by Alternator, so aim for the least overhead.
opts.set_delta_mode(cdc::delta_mode::keys);
opts.ttl(std::chrono::duration_cast<std::chrono::seconds>(dynamodb_streams_max_window).count());
@@ -1156,24 +1581,30 @@ void executor::supplement_table_stream_info(rjson::value& descr, const schema& s
if (opts.enabled()) {
auto db = sp.data_dictionary();
auto cf = db.find_table(schema.ks_name(), cdc::log_name(schema.cf_name()));
stream_arn arn(cf.schema()->id());
stream_arn arn(cf.schema(), cdc::get_base_table(db.real_database(), *cf.schema()));
rjson::add(descr, "LatestStreamArn", arn);
rjson::add(descr, "LatestStreamLabel", rjson::from_string(stream_label(*cf.schema())));
auto stream_desc = rjson::empty_object();
rjson::add(stream_desc, "StreamEnabled", true);
auto mode = stream_view_type::KEYS_ONLY;
if (opts.preimage() && opts.postimage()) {
mode = stream_view_type::NEW_AND_OLD_IMAGES;
} else if (opts.preimage()) {
mode = stream_view_type::OLD_IMAGE;
} else if (opts.postimage()) {
mode = stream_view_type::NEW_IMAGE;
}
rjson::add(stream_desc, "StreamViewType", mode);
rjson::add(descr, "StreamSpecification", std::move(stream_desc));
} else if (!opts.enable_requested()) {
return;
}
// For both enabled() and enable_requested():
// DynamoDB returns StreamEnabled=true in StreamSpecification even when
// the stream status is ENABLING (not yet fully active). We mirror this
// behavior: enable_requested means the user asked for streams but CDC
// is not yet finalized, so we still report StreamEnabled=true.
auto stream_desc = rjson::empty_object();
rjson::add(stream_desc, "StreamEnabled", true);
auto mode = stream_view_type::KEYS_ONLY;
if (opts.preimage() && opts.postimage()) {
mode = stream_view_type::NEW_AND_OLD_IMAGES;
} else if (opts.preimage()) {
mode = stream_view_type::OLD_IMAGE;
} else if (opts.postimage()) {
mode = stream_view_type::NEW_IMAGE;
}
rjson::add(stream_desc, "StreamViewType", mode);
rjson::add(descr, "StreamSpecification", std::move(stream_desc));
}
} // namespace alternator
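
The tablet branch of find_children_range_from_parent_token() in alternator/streams.cc can be illustrated with a minimal sketch. It is not the patch's code: plain integers stand in for cdc::stream_id tokens, the name tablet_children is hypothetical, and the single-parent and single-child shortcuts are left out. It only shows how the two upper_bound() searches and the "advance past the absorbing child" step select the children of one parent.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <iterator>
#include <stdexcept>
#include <utility>
#include <vector>

// Hypothetical stand-in: plain integers replace cdc::stream_id tokens.
using token = std::int64_t;

// Returns the half-open index range [first, last) of children that touch the
// token range of the parent ending at parent_end. Both vectors must be sorted
// ascending; every token is the inclusive end of its shard's range.
std::pair<std::size_t, std::size_t> tablet_children(
        const std::vector<token>& parents,
        const std::vector<token>& children,
        token parent_end) {
    assert(std::is_sorted(parents.begin(), parents.end()));
    assert(std::is_sorted(children.begin(), children.end()));
    auto p = std::lower_bound(parents.begin(), parents.end(), parent_end);
    if (p == parents.end() || *p != parent_end) {
        throw std::invalid_argument("parent not found");
    }
    // One past the last child whose end token is <= parent_end.
    auto last = std::upper_bound(children.begin(), children.end(), parent_end);
    // If no child ends exactly at parent_end, the next child absorbs the
    // tail of the parent's range, so it has to be included as well.
    if (last == children.begin() || *std::prev(last) != parent_end) {
        if (last == children.end()) {
            throw std::logic_error("children do not cover the parent range");
        }
        ++last;
    }
    // The parent's range starts just after the previous parent's end token;
    // the first parent starts at the beginning of the token space.
    auto first = (p == parents.begin())
        ? children.begin()
        : std::upper_bound(children.begin(), children.end(), *std::prev(p));
    return { static_cast<std::size_t>(std::distance(children.begin(), first)),
             static_cast<std::size_t>(std::distance(children.begin(), last)) };
}
```

With parents {10, 20, 30, 40} and children {5, 10, 20, 35, 40}, the parent ending at 30 maps to the single child ending at 35, and the parent ending at 40 maps to the children ending at 35 and 40, which matches the "3 -> d" and "4 -> d, e" rows of the example in the source comment.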

alternator/streams.hh Normal file

@@ -0,0 +1,62 @@
/*
* Copyright 2026-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
#include "utils/chunked_vector.hh"
#include "cdc/generation.hh"
#include <generator>
namespace cdc {
class stream_id;
}
namespace alternator {
class stream_id_range {
// helper class for manipulating (possibly wrapped around) range of stream_ids
// it holds one or two ranges [lo1, end1) and [lo2, end2)
// if the range doesn't wrap around, then lo2 == end2 == items.end()
// if the range wraps around, then
// `lo1 == items.begin() and end2 == items.end()` must be true
// the object doesn't own `items`, but it does manipulate it - it will
// reorder elements (so that both ranges end up next to each other) and sort them by unsigned comparison
// usage - create an object with needed ranges. before iteration call `prepare_for_iterating` method -
// it will reorder elements of `items` array to what is needed and then call begin / end pair.
// note - `items` array will be modified - elements will be reordered, but no elements will be added or removed.
// `items` array must stay intact as long as iteration is in progress.
utils::chunked_vector<cdc::stream_id>::iterator _lo1 = {}, _end1 = {}, _lo2 = {}, _end2 = {};
const cdc::stream_id* _skip_to = nullptr;
bool _prepared = false;
public:
stream_id_range(
utils::chunked_vector<cdc::stream_id> &items,
utils::chunked_vector<cdc::stream_id>::iterator lo1,
utils::chunked_vector<cdc::stream_id>::iterator end1);
stream_id_range(
utils::chunked_vector<cdc::stream_id> &items,
utils::chunked_vector<cdc::stream_id>::iterator lo1,
utils::chunked_vector<cdc::stream_id>::iterator end1,
utils::chunked_vector<cdc::stream_id>::iterator lo2,
utils::chunked_vector<cdc::stream_id>::iterator end2);
void set_starting_position(const cdc::stream_id &update_to);
// Must be called after construction and after set_starting_position()
// (if used), but before begin()/end() iteration.
void prepare_for_iterating();
utils::chunked_vector<cdc::stream_id>::iterator begin() const { return _lo1; }
utils::chunked_vector<cdc::stream_id>::iterator end() const { return _end1; }
};
stream_id_range find_children_range_from_parent_token(
const utils::chunked_vector<cdc::stream_id>& parent_streams,
utils::chunked_vector<cdc::stream_id>& current_streams,
cdc::stream_id parent,
bool uses_tablets
);
}
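
The class comment in alternator/streams.hh describes reordering `items` so that a wrapped-around pair of ranges becomes adjacent, then sorting by unsigned comparison. The implementation itself is not part of this header, so the following is only a minimal sketch of one way to do that. It assumes a plain `std::vector<int64_t>` in place of `utils::chunked_vector<cdc::stream_id>`, and `make_contiguous_and_sort` is a hypothetical name, not a function from the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the real element type (cdc::stream_id).
using item = std::int64_t;

// Given a wrapped-around selection [0, end1) plus [lo2, size), reorder
// `items` so the selected elements form one contiguous prefix, then sort
// that prefix by unsigned comparison. Returns the length of the prefix.
// Elements are only reordered; none are added or removed.
std::size_t make_contiguous_and_sort(std::vector<item>& items, std::size_t end1, std::size_t lo2) {
    assert(end1 <= lo2 && lo2 <= items.size());
    const std::size_t tail = items.size() - lo2;
    // Bring [lo2, size) to the front; the old [0, end1) now follows it.
    std::rotate(items.begin(), items.begin() + static_cast<std::ptrdiff_t>(lo2), items.end());
    const std::size_t selected = tail + end1;
    std::sort(items.begin(), items.begin() + static_cast<std::ptrdiff_t>(selected),
              [](item a, item b) {
                  return static_cast<std::uint64_t>(a) < static_cast<std::uint64_t>(b);
              });
    return selected;
}
```

With unsigned comparison, negative values sort after positive ones, which is why the comparator casts before comparing.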


@@ -44,6 +44,7 @@
#include "cql3/query_options.hh"
#include "cql3/column_identifier.hh"
#include "alternator/executor.hh"
#include "alternator/executor_util.hh"
#include "alternator/controller.hh"
#include "alternator/serialization.hh"
#include "alternator/ttl_tag.hh"
@@ -58,13 +59,17 @@ static logging::logger tlogger("alternator_ttl");
namespace alternator {
future<executor::request_return_type> executor::update_time_to_live(client_state& client_state, service_permit permit, rjson::value request) {
future<executor::request_return_type> executor::update_time_to_live(client_state& client_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
_stats.api_operations.update_time_to_live++;
if (!_proxy.features().alternator_ttl) {
co_return api_error::unknown_operation("UpdateTimeToLive not yet supported. Upgrade all nodes to a version that supports it.");
}
schema_ptr schema = get_table(_proxy, request);
maybe_audit(audit_info, audit::statement_category::DDL,
schema->ks_name(), schema->cf_name(), "UpdateTimeToLive", request);
rjson::value* spec = rjson::find(request, "TimeToLiveSpecification");
if (!spec || !spec->IsObject()) {
co_return api_error::validation("UpdateTimeToLive missing mandatory TimeToLiveSpecification");
@@ -114,9 +119,13 @@ future<executor::request_return_type> executor::update_time_to_live(client_state
co_return rjson::print(std::move(response));
}
future<executor::request_return_type> executor::describe_time_to_live(client_state& client_state, service_permit permit, rjson::value request) {
future<executor::request_return_type> executor::describe_time_to_live(client_state& client_state, service_permit permit, rjson::value request, std::unique_ptr<audit::audit_info_alternator>& audit_info) {
_stats.api_operations.describe_time_to_live++;
schema_ptr schema = get_table(_proxy, request);
maybe_audit(audit_info, audit::statement_category::QUERY,
schema->ks_name(), schema->cf_name(), "DescribeTimeToLive", request);
std::map<sstring, sstring> tags_map = get_tags_of_table_or_throw(schema);
rjson::value desc = rjson::empty_object();
auto i = tags_map.find(TTL_TAG_KEY);


@@ -82,15 +82,16 @@ void set_config(std::shared_ptr < api_registry_builder20 > rb, http_context& ctx
});
});
cs::find_config_id.set(r, [&cfg] (const_req r) {
auto id = r.get_path_param("id");
for (auto&& cfg_ref : cfg.values()) {
auto&& cfg = cfg_ref.get();
if (id == cfg.name()) {
return cfg.value_as_json();
}
cs::find_config_id.set(r, [&cfg] (std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto id = req->get_path_param("id");
auto value = co_await cfg.value_as_json_string_for_name(id);
if (!value) {
throw bad_param_exception(sstring("No such config entry: ") + id);
}
throw bad_param_exception(sstring("No such config entry: ") + id);
// value is already a JSON string
json::json_return_type ret{json::json_void()};
ret._res = std::move(*value);
co_return ret;
});
sp::get_rpc_timeout.set(r, [&cfg](const_req req) {


@@ -123,12 +123,13 @@ static future<json::json_return_type> sum_estimated_histogram(sharded<service::
});
}
static future<json::json_return_type> sum_estimated_histogram(sharded<service::storage_proxy>& proxy, utils::estimated_histogram service::storage_proxy_stats::stats::*f) {
static future<json::json_return_type> sum_estimated_histogram(sharded<service::storage_proxy>& proxy, service::storage_proxy_stats::cas_contention_histogram service::storage_proxy_stats::stats::*f) {
return two_dimensional_map_reduce(proxy, f, utils::estimated_histogram_merge,
utils::estimated_histogram()).then([](const utils::estimated_histogram& val) {
return two_dimensional_map_reduce(proxy, f, utils::estimated_histogram_with_max_merge<service::storage_proxy_stats::cas_contention_histogram::MAX>,
service::storage_proxy_stats::cas_contention_histogram()).then([](const service::storage_proxy_stats::cas_contention_histogram& val) {
utils_json::estimated_histogram res;
res = val;
res.bucket_offsets = val.get_buckets_offsets();
res.buckets = val.get_buckets_counts();
return make_ready_future<json::json_return_type>(res);
});
}


@@ -1743,11 +1743,11 @@ rest_get_vnode_tablet_migration(http_context& ctx, sharded<service::storage_serv
throw std::runtime_error("vnodes-to-tablets migration requires all nodes to support the VNODES_TO_TABLETS_MIGRATIONS cluster feature");
}
auto keyspace = validate_keyspace(ctx, req);
auto status = co_await ss.local().get_tablets_migration_status(keyspace);
auto status = co_await ss.local().get_tablets_migration_status_with_node_details(keyspace);
ss::vnode_tablet_migration_status result;
result.keyspace = status.keyspace;
result.status = status.status;
result.status = fmt::format("{}", status.status);
result.nodes._set = true;
for (const auto& node : status.nodes) {
ss::vnode_tablet_migration_node_status n;


@@ -126,6 +126,13 @@ static std::map<sstring, std::set<sstring>> parse_audit_tables(const sstring& da
}
boost::trim(parts[0]);
boost::trim(parts[1]);
// The real keyspace name of an Alternator table T is
// "alternator_T". The audit_tables config flag uses the format
// "alternator.T" to refer to such tables, so we expand it here
// to the real keyspace name.
if (parts[0] == "alternator") {
parts[0] = "alternator_" + parts[1];
}
result[parts[0]].insert(std::move(parts[1]));
}
}
@@ -228,27 +235,55 @@ future<> audit::shutdown() {
return make_ready_future<>();
}
future<> audit::log(const audit_info* audit_info, service::query_state& query_state, const cql3::query_options& options, bool error) {
const service::client_state& client_state = query_state.get_client_state();
socket_address node_ip = _token_metadata.get()->get_topology().my_address().addr();
db::consistency_level cl = options.get_consistency();
future<> audit::log(const audit_info& audit_info, const service::client_state& client_state, std::optional<db::consistency_level> cl, bool error) {
thread_local static sstring no_username("undefined");
static const sstring anonymous_username("anonymous");
const sstring& username = client_state.user() ? client_state.user()->name.value_or(anonymous_username) : no_username;
socket_address client_ip = client_state.get_client_address().addr();
socket_address node_ip = _token_metadata.get()->get_topology().my_address().addr();
if (logger.is_enabled(logging::log_level::debug)) {
logger.debug("Log written: node_ip {} category {} cl {} error {} keyspace {} query '{}' client_ip {} table {} username {}",
node_ip, audit_info->category_string(), cl, error, audit_info->keyspace(),
audit_info->query(), client_ip, audit_info->table(), username);
node_ip, audit_info.category_string(), cl, error, audit_info.keyspace(),
audit_info.query(), client_ip, audit_info.table(), username);
}
return futurize_invoke(std::mem_fn(&storage_helper::write), _storage_helper_ptr, audit_info, node_ip, client_ip, cl, username, error)
return futurize_invoke(std::mem_fn(&storage_helper::write), _storage_helper_ptr, &audit_info, node_ip, client_ip, cl, username, error)
.handle_exception([audit_info, node_ip, client_ip, cl, username, error] (auto ep) {
logger.error("Unexpected exception when writing log with: node_ip {} category {} cl {} error {} keyspace {} query '{}' client_ip {} table {} username {} exception {}",
node_ip, audit_info->category_string(), cl, error, audit_info->keyspace(),
audit_info->query(), client_ip, audit_info->table(),username, ep);
node_ip, audit_info.category_string(), cl, error, audit_info.keyspace(),
audit_info.query(), client_ip, audit_info.table(), username, ep);
});
}
static future<> maybe_log(const audit_info& audit_info, const service::client_state& client_state, std::optional<db::consistency_level> cl, bool error) {
if(audit::audit_instance().local_is_initialized() && audit::local_audit_instance().should_log(audit_info)) {
return audit::local_audit_instance().log(audit_info, client_state, cl, error);
}
return make_ready_future<>();
}
static future<> inspect(const audit_info& audit_info, const service::query_state& query_state, const cql3::query_options& options, bool error) {
return maybe_log(audit_info, query_state.get_client_state(), options.get_consistency(), error);
}
future<> inspect(shared_ptr<cql3::cql_statement> statement, const service::query_state& query_state, const cql3::query_options& options, bool error) {
const auto audit_info = statement->get_audit_info();
if (audit_info == nullptr) {
return make_ready_future<>();
}
if (audit_info->batch()) {
cql3::statements::batch_statement* batch = static_cast<cql3::statements::batch_statement*>(statement.get());
return do_for_each(batch->statements().begin(), batch->statements().end(), [&query_state, &options, error] (auto&& m) {
return inspect(m.statement, query_state, options, error);
});
} else {
return inspect(*audit_info, query_state, options, error);
}
}
future<> inspect(const audit_info_alternator& ai, const service::client_state& client_state, bool error) {
return maybe_log(static_cast<const audit_info&>(ai), client_state, ai.get_cl(), error);
}
future<> audit::log_login(const sstring& username, socket_address client_ip, bool error) noexcept {
socket_address node_ip = _token_metadata.get()->get_topology().my_address().addr();
if (logger.is_enabled(logging::log_level::debug)) {
@@ -262,24 +297,6 @@ future<> audit::log_login(const sstring& username, socket_address client_ip, boo
});
}
future<> inspect(shared_ptr<cql3::cql_statement> statement, service::query_state& query_state, const cql3::query_options& options, bool error) {
auto audit_info = statement->get_audit_info();
if (!audit_info) {
return make_ready_future<>();
}
if (audit_info->batch()) {
cql3::statements::batch_statement* batch = static_cast<cql3::statements::batch_statement*>(statement.get());
return do_for_each(batch->statements().begin(), batch->statements().end(), [&query_state, &options, error] (auto&& m) {
return inspect(m.statement, query_state, options, error);
});
} else {
if (audit::local_audit_instance().should_log(audit_info)) {
return audit::local_audit_instance().log(audit_info, query_state, options, error);
}
return make_ready_future<>();
}
}
future<> inspect_login(const sstring& username, socket_address client_ip, bool error) {
if (!audit::audit_instance().local_is_initialized() || !audit::local_audit_instance().should_log_login()) {
return make_ready_future<>();
@@ -292,13 +309,21 @@ bool audit::should_log_table(const sstring& keyspace, const sstring& name) const
return keyspace_it != _audited_tables.cend() && keyspace_it->second.find(name) != keyspace_it->second.cend();
}
bool audit::should_log(const audit_info* audit_info) const {
return _audited_categories.contains(audit_info->category())
&& (_audited_keyspaces.find(audit_info->keyspace()) != _audited_keyspaces.cend()
|| should_log_table(audit_info->keyspace(), audit_info->table())
|| audit_info->category() == statement_category::AUTH
|| audit_info->category() == statement_category::ADMIN
|| audit_info->category() == statement_category::DCL);
bool audit::should_log(const audit_info& audit_info) const {
return will_log(audit_info.category(), audit_info.keyspace(), audit_info.table());
}
bool audit::will_log(statement_category cat, std::string_view keyspace, std::string_view table) const {
// If keyspace is empty (e.g., ListTables, or batch operations spanning
// multiple tables), the operation cannot be filtered by keyspace/table,
// so it is logged whenever the category matches.
return _audited_categories.contains(cat)
&& (keyspace.empty()
|| _audited_keyspaces.find(sstring(keyspace)) != _audited_keyspaces.cend()
|| should_log_table(sstring(keyspace), sstring(table))
|| cat == statement_category::AUTH
|| cat == statement_category::ADMIN
|| cat == statement_category::DCL);
}
template<class T>
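
The filtering rule introduced by `will_log`, together with the `alternator.T` expansion in `parse_audit_tables`, can be sketched in isolation. This is an illustration only: `audit_filter` and the reduced `category` enum are hypothetical stand-ins, and standard containers replace the member sets used in the patch.

```cpp
#include <cassert>
#include <map>
#include <set>
#include <string>
#include <utility>

// Reduced, hypothetical stand-in for audit::statement_category.
enum class category { QUERY, DDL, DCL, AUTH, ADMIN };

struct audit_filter {
    std::set<category> categories;
    std::set<std::string> keyspaces;
    std::map<std::string, std::set<std::string>> tables;

    // "alternator.T" in the config refers to table T in keyspace "alternator_T".
    void add_table(std::string keyspace, std::string table) {
        if (keyspace == "alternator") {
            keyspace = "alternator_" + table;
        }
        tables[keyspace].insert(std::move(table));
    }

    // The category must be audited. An empty keyspace cannot be filtered
    // by keyspace/table, so it is logged whenever the category matches.
    bool will_log(category cat, const std::string& keyspace = {}, const std::string& table = {}) const {
        if (categories.count(cat) == 0) {
            return false;
        }
        if (keyspace.empty() || keyspaces.count(keyspace) != 0) {
            return true;
        }
        auto it = tables.find(keyspace);
        if (it != tables.end() && it->second.count(table) != 0) {
            return true;
        }
        return cat == category::AUTH || cat == category::ADMIN || cat == category::DCL;
    }
};
```

The empty-keyspace shortcut is what lets operations such as ListTables be audited even though they target no single table.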


@@ -10,14 +10,15 @@
#include "seastarx.hh"
#include "utils/log.hh"
#include "utils/observable.hh"
#include "db/consistency_level.hh"
#include "locator/token_metadata_fwd.hh"
#include "service/client_state.hh"
#include "db/consistency_level_type.hh"
#include <seastar/core/sharded.hh>
#include <seastar/util/log.hh>
#include "enum_set.hh"
#include <memory>
#include <optional>
namespace db {
@@ -70,12 +71,15 @@ using category_set = enum_set<super_enum<statement_category, statement_category:
statement_category::AUTH,
statement_category::ADMIN>>;
class audit_info final {
// Holds the audit metadata for a single request: the operation category,
// target keyspace/table, and the query string to be logged.
class audit_info {
protected:
statement_category _category;
sstring _keyspace;
sstring _table;
sstring _query;
bool _batch;
bool _batch; // used only for unpacking batches in CQL, not relevant for Alternator
public:
audit_info(statement_category cat, sstring keyspace, sstring table, bool batch)
: _category(cat)
@@ -83,8 +87,17 @@ public:
, _table(std::move(table))
, _batch(batch)
{ }
void set_query_string(const std::string_view& query_string) {
_query = sstring(query_string);
// 'operation' covers cases where the query string does not contain the operation name, as with Alternator
audit_info& set_query_string(std::string_view query_string, std::string_view operation = {}) {
return set_query_string(sstring(query_string), sstring(operation));
}
audit_info& set_query_string(const sstring& query_string, const sstring& operation = "") {
if(!operation.empty()) {
_query = operation + "|" + query_string;
} else {
_query = query_string;
}
return *this;
}
const sstring& keyspace() const { return _keyspace; }
const sstring& table() const { return _table; }
@@ -96,6 +109,23 @@ public:
using audit_info_ptr = std::unique_ptr<audit_info>;
// Audit info for Alternator requests.
// Unlike CQL, where the consistency level is available from query_options and
// passed separately to audit::log(), Alternator has no query_options, so we
// store the CL inside the audit_info object.
// Consistency level is optional: only data read/write operations (GetItem,
// PutItem, Query, Scan, etc.) have a meaningful CL. Schema operations and
// metadata queries pass std::nullopt.
class audit_info_alternator final : public audit_info {
std::optional<db::consistency_level> _cl;
public:
audit_info_alternator(statement_category cat, sstring keyspace, sstring table, std::optional<db::consistency_level> cl = std::nullopt)
: audit_info(cat, std::move(keyspace), std::move(table), false), _cl(cl)
{}
std::optional<db::consistency_level> get_cl() const { return _cl; }
};
class storage_helper;
class audit final : public seastar::async_sharded_service<audit> {
@@ -142,13 +172,15 @@ public:
future<> start(const db::config& cfg);
future<> stop();
future<> shutdown();
bool should_log(const audit_info* audit_info) const;
bool should_log(const audit_info& audit_info) const;
bool will_log(statement_category cat, std::string_view keyspace = {}, std::string_view table = {}) const;
bool should_log_login() const { return _audited_categories.contains(statement_category::AUTH); }
future<> log(const audit_info* audit_info, service::query_state& query_state, const cql3::query_options& options, bool error);
future<> log(const audit_info& audit_info, const service::client_state& client_state, std::optional<db::consistency_level> cl, bool error);
future<> log_login(const sstring& username, socket_address client_ip, bool error) noexcept;
};
future<> inspect(shared_ptr<cql3::cql_statement> statement, service::query_state& query_state, const cql3::query_options& options, bool error);
future<> inspect(const audit_info_alternator& audit_info, const service::client_state& client_state, bool error);
future<> inspect(shared_ptr<cql3::cql_statement> statement, const service::query_state& query_state, const cql3::query_options& options, bool error);
future<> inspect_login(const sstring& username, socket_address client_ip, bool error);
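
Two small behaviours from this header change are easy to check on their own: the optional operation prefix built by `set_query_string`, and the optional consistency level that the storage helpers render as an empty string when absent. The sketch below is a standalone approximation using `std::string`; `make_query_field`, `format_cl` and the two-value `consistency` enum are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical, reduced stand-in for db::consistency_level.
enum class consistency { ONE, QUORUM };

// Prefix the query with "operation|" only when an operation name is given.
std::string make_query_field(std::string_view query, std::string_view operation = {}) {
    if (operation.empty()) {
        return std::string(query);
    }
    std::string out(operation);
    out += '|';
    out += query;
    return out;
}

// An absent consistency level (schema or metadata operations) becomes "".
std::string format_cl(std::optional<consistency> cl) {
    if (!cl) {
        return "";
    }
    switch (*cl) {
    case consistency::ONE:
        return "ONE";
    case consistency::QUORUM:
        return "QUORUM";
    }
    return "";
}
```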


@@ -38,7 +38,8 @@ audit_cf_storage_helper::audit_cf_storage_helper(cql3::query_processor& qp, serv
"source inet, "
"username text, "
"error boolean, "
"PRIMARY KEY ((date, node), event_time))",
"PRIMARY KEY ((date, node), event_time))"
" WITH caching = {{'keys': 'NONE', 'rows_per_partition': 'NONE', 'enabled': 'false'}}",
KEYSPACE_NAME, TABLE_NAME),
fmt::format("INSERT INTO {}.{} ("
"date,"
@@ -129,7 +130,7 @@ future<> audit_cf_storage_helper::stop() {
future<> audit_cf_storage_helper::write(const audit_info* audit_info,
socket_address node_ip,
socket_address client_ip,
db::consistency_level cl,
std::optional<db::consistency_level> cl,
const sstring& username,
bool error) {
return _table.insert(_qp, _mm, _dummy_query_state, make_data, audit_info, node_ip, client_ip, cl, username, error);
@@ -145,7 +146,7 @@ future<> audit_cf_storage_helper::write_login(const sstring& username,
cql3::query_options audit_cf_storage_helper::make_data(const audit_info* audit_info,
socket_address node_ip,
socket_address client_ip,
db::consistency_level cl,
std::optional<db::consistency_level> cl,
const sstring& username,
bool error) {
auto time = std::chrono::system_clock::now();
@@ -154,7 +155,7 @@ cql3::query_options audit_cf_storage_helper::make_data(const audit_info* audit_i
auto date = millis_since_epoch / ticks_per_day * ticks_per_day;
thread_local static int64_t last_nanos = 0;
auto time_id = utils::UUID_gen::get_time_UUID(table_helper::make_monotonic_UUID_tp(last_nanos, time));
auto consistency_level = fmt::format("{}", cl);
auto consistency_level = cl ? format("{}", *cl) : sstring("");
std::vector<cql3::raw_value> values {
cql3::raw_value::make_value(timestamp_type->decompose(date)),
cql3::raw_value::make_value(inet_addr_type->decompose(node_ip.addr())),


@@ -37,7 +37,7 @@ class audit_cf_storage_helper : public storage_helper {
static cql3::query_options make_data(const audit_info* audit_info,
socket_address node_ip,
socket_address client_ip,
db::consistency_level cl,
std::optional<db::consistency_level> cl,
const sstring& username,
bool error);
static cql3::query_options make_login_data(socket_address node_ip,
@@ -55,7 +55,7 @@ public:
virtual future<> write(const audit_info* audit_info,
socket_address node_ip,
socket_address client_ip,
db::consistency_level cl,
std::optional<db::consistency_level> cl,
const sstring& username,
bool error) override;
virtual future<> write_login(const sstring& username,


@@ -42,7 +42,7 @@ future<> audit_composite_storage_helper::stop() {
future<> audit_composite_storage_helper::write(const audit_info* audit_info,
socket_address node_ip,
socket_address client_ip,
db::consistency_level cl,
std::optional<db::consistency_level> cl,
const sstring& username,
bool error) {
return seastar::parallel_for_each(


@@ -25,7 +25,7 @@ public:
virtual future<> write(const audit_info* audit_info,
socket_address node_ip,
socket_address client_ip,
db::consistency_level cl,
std::optional<db::consistency_level> cl,
const sstring& username,
bool error) override;
virtual future<> write_login(const sstring& username,


@@ -101,18 +101,19 @@ future<> audit_syslog_storage_helper::stop() {
future<> audit_syslog_storage_helper::write(const audit_info* audit_info,
socket_address node_ip,
socket_address client_ip,
db::consistency_level cl,
std::optional<db::consistency_level> cl,
const sstring& username,
bool error) {
auto now = std::chrono::system_clock::to_time_t(std::chrono::system_clock::now());
tm time;
localtime_r(&now, &time);
auto cl_str = cl ? format("{}", *cl) : sstring("");
sstring msg = seastar::format(R"(<{}>{:%h %e %T} scylla-audit: node="{}", category="{}", cl="{}", error="{}", keyspace="{}", query="{}", client_ip="{}", table="{}", username="{}")",
LOG_NOTICE | LOG_USER,
time,
node_ip,
audit_info->category_string(),
cl,
cl_str,
(error ? "true" : "false"),
audit_info->keyspace(),
json_escape(audit_info->query()),


@@ -35,7 +35,7 @@ public:
virtual future<> write(const audit_info* audit_info,
socket_address node_ip,
socket_address client_ip,
db::consistency_level cl,
std::optional<db::consistency_level> cl,
const sstring& username,
bool error) override;
virtual future<> write_login(const sstring& username,


@@ -22,7 +22,7 @@ public:
virtual future<> write(const audit_info* audit_info,
socket_address node_ip,
socket_address client_ip,
db::consistency_level cl,
std::optional<db::consistency_level> cl,
const sstring& username,
bool error) = 0;
virtual future<> write_login(const sstring& username,


@@ -31,6 +31,8 @@ namespace {
logger mylog{"ldap_role_manager"}; // `log` is taken by math.
constexpr std::string_view user_placeholder = "{USER}";
struct url_desc_deleter {
void operator()(LDAPURLDesc *p) {
ldap_free_urldesc(p);
@@ -39,9 +41,141 @@ struct url_desc_deleter {
using url_desc_ptr = std::unique_ptr<LDAPURLDesc, url_desc_deleter>;
url_desc_ptr parse_url(std::string_view url) {
/// Escapes LDAP filter assertion value per RFC 4515 Section 3.
/// The characters *, (, ), \, and NUL must be backslash-hex-escaped
/// to prevent filter injection when interpolating untrusted input.
sstring escape_filter_value(std::string_view value) {
size_t escapable_chars = 0;
for (unsigned char ch : value) {
switch (ch) {
case '*':
case '(':
case ')':
case '\\':
case '\0':
++escapable_chars;
break;
default:
break;
}
}
if (escapable_chars == 0) {
return sstring(value);
}
sstring escaped(value.size() + escapable_chars * 2, 0);
size_t pos = 0;
for (unsigned char ch : value) {
switch (ch) {
case '*':
escaped[pos++] = '\\';
escaped[pos++] = '2';
escaped[pos++] = 'a';
break;
case '(':
escaped[pos++] = '\\';
escaped[pos++] = '2';
escaped[pos++] = '8';
break;
case ')':
escaped[pos++] = '\\';
escaped[pos++] = '2';
escaped[pos++] = '9';
break;
case '\\':
escaped[pos++] = '\\';
escaped[pos++] = '5';
escaped[pos++] = 'c';
break;
case '\0':
escaped[pos++] = '\\';
escaped[pos++] = '0';
escaped[pos++] = '0';
break;
default:
escaped[pos++] = static_cast<char>(ch);
break;
}
}
return escaped;
}
/// Percent-encodes characters that are not RFC 3986 "unreserved"
/// (ALPHA / DIGIT / '-' / '.' / '_' / '~').
///
/// Uses explicit ASCII range checks instead of std::isalnum() because
/// the latter is locale-dependent and could pass non-ASCII characters
/// through unencoded under certain locale settings.
///
/// This is applied AFTER RFC 4515 filter escaping when the value is
/// substituted into an LDAP URL. It serves two purposes:
/// 1. Prevents URL-level metacharacters ('?', '#') from breaking
/// the URL structure parsed by ldap_url_parse.
/// 2. Prevents percent-decoding (which ldap_url_parse performs on
/// each component) from undoing the filter escaping, e.g. a
/// literal "%2a" in the username would otherwise decode to '*'.
sstring percent_encode_for_url(std::string_view value) {
static constexpr char hex[] = "0123456789ABCDEF";
size_t chars_to_encode = 0;
for (unsigned char ch : value) {
if (!((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z') || (ch >= '0' && ch <= '9')
|| ch == '-' || ch == '.' || ch == '_' || ch == '~')) {
++chars_to_encode;
}
}
if (chars_to_encode == 0) {
return sstring(value);
}
sstring encoded(value.size() + chars_to_encode * 2, 0);
size_t pos = 0;
for (unsigned char ch : value) {
if ((ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z') || (ch >= '0' && ch <= '9')
|| ch == '-' || ch == '.' || ch == '_' || ch == '~') {
encoded[pos++] = static_cast<char>(ch);
} else {
encoded[pos++] = '%';
encoded[pos++] = hex[ch >> 4];
encoded[pos++] = hex[ch & 0x0F];
}
}
return encoded;
}
/// Checks whether \p sentinel appears in any parsed URL component
/// other than the filter (host, DN, attributes, extensions).
bool sentinel_outside_filter(const LDAPURLDesc& desc, std::string_view sentinel) {
auto contains = [&](const char* field) {
return field && std::string_view(field).find(sentinel) != std::string_view::npos;
};
if (contains(desc.lud_host) || contains(desc.lud_dn)) {
return true;
}
if (desc.lud_attrs) {
for (int i = 0; desc.lud_attrs[i]; ++i) {
if (contains(desc.lud_attrs[i])) {
return true;
}
}
}
if (desc.lud_exts) {
for (int i = 0; desc.lud_exts[i]; ++i) {
if (contains(desc.lud_exts[i])) {
return true;
}
}
}
return false;
}
url_desc_ptr parse_url(const sstring& url) {
LDAPURLDesc *desc = nullptr;
if (ldap_url_parse(url.data(), &desc)) {
if (ldap_url_parse(url.c_str(), &desc)) {
mylog.error("error in ldap_url_parse({})", url);
}
return url_desc_ptr(desc);
@@ -112,6 +246,7 @@ const resource_set& ldap_role_manager::protected_resources() const {
}
future<> ldap_role_manager::start() {
validate_query_template();
if (!parse_url(get_url("dummy-user"))) { // Just need host and port -- any user should do.
return make_exception_future(
std::runtime_error(fmt::format("error getting LDAP server address from template {}", _query_template)));
@@ -216,7 +351,7 @@ future<> ldap_role_manager::revoke(std::string_view, std::string_view, ::service
}
future<role_set> ldap_role_manager::query_granted(std::string_view grantee_name, recursive_role_query) {
const auto url = get_url(grantee_name.data());
const auto url = get_url(grantee_name);
auto desc = parse_url(url);
if (!desc) {
return make_exception_future<role_set>(std::runtime_error(format("Error parsing URL {}", url)));
@@ -348,7 +483,46 @@ future<> ldap_role_manager::remove_attribute(std::string_view role_name, std::st
}
sstring ldap_role_manager::get_url(std::string_view user) const {
return boost::replace_all_copy(_query_template, "{USER}", user);
// Two-layer encoding protects against injection:
// 1. RFC 4515 filter escaping neutralizes filter metacharacters (*, (, ), \, NUL)
// 2. URL percent-encoding prevents URL structure injection (?, #) and blocks
// ldap_url_parse's percent-decoding from undoing the filter escaping (%2a -> *)
return boost::replace_all_copy(_query_template, user_placeholder,
percent_encode_for_url(escape_filter_value(user)));
}
void ldap_role_manager::validate_query_template() const {
if (_query_template.find(user_placeholder) == sstring::npos) {
return;
}
// Substitute {USER} with a sentinel and let ldap_url_parse tell us
// which URL component it landed in. The sentinel is purely
// alphanumeric so it cannot affect URL parsing.
static constexpr std::string_view sentinel = "XLDAPSENTINELX";
sstring test_url = boost::replace_all_copy(_query_template, user_placeholder, sentinel);
auto desc = parse_url(test_url);
if (!desc) {
throw url_error(format("LDAP URL template is not a valid URL when {{USER}} is substituted: {}", _query_template));
}
// The sentinel must appear in the filter ...
if (!desc->lud_filter
|| std::string_view(desc->lud_filter).find(sentinel) == std::string_view::npos) {
throw url_error(format(
"LDAP URL template places {{USER}} outside the filter component. "
"RFC 4515 filter escaping only protects the filter; other components "
"(e.g. the base DN) require different escaping and are not supported. "
"Template: {}", _query_template));
}
// ... and nowhere else (host, DN, attributes, extensions).
if (sentinel_outside_filter(*desc, sentinel)) {
throw url_error(format(
"LDAP URL template places {{USER}} outside the filter component. "
"RFC 4515 filter escaping only protects the filter; other components "
"(e.g. the host) require different escaping and are not supported. "
"Template: {}", _query_template));
}
}
future<std::vector<cql3::description>> ldap_role_manager::describe_role_grants() {
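
The two-layer encoding applied by `get_url` can be exercised without libldap. The sketch below mirrors the order used in the patch (RFC 4515 filter escaping first, then percent-encoding of everything outside the RFC 3986 unreserved set), but it is a simplified approximation: it uses `std::string` instead of `sstring`, builds the output by appending rather than pre-sizing it, and `expand_template` is a hypothetical replacement for the `boost::replace_all_copy` call.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <string_view>

// RFC 4515: backslash-hex-escape *, (, ), \ and NUL.
std::string escape_filter_value(std::string_view value) {
    static constexpr char hex[] = "0123456789abcdef";
    std::string out;
    out.reserve(value.size());
    for (unsigned char ch : value) {
        switch (ch) {
        case '*': case '(': case ')': case '\\': case '\0':
            out += '\\';
            out += hex[ch >> 4];
            out += hex[ch & 0x0F];
            break;
        default:
            out += static_cast<char>(ch);
        }
    }
    return out;
}

// Percent-encode every byte that is not RFC 3986 "unreserved".
std::string percent_encode_for_url(std::string_view value) {
    static constexpr char hex[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(value.size());
    for (unsigned char ch : value) {
        const bool unreserved = (ch >= 'A' && ch <= 'Z') || (ch >= 'a' && ch <= 'z')
            || (ch >= '0' && ch <= '9') || ch == '-' || ch == '.' || ch == '_' || ch == '~';
        if (unreserved) {
            out += static_cast<char>(ch);
        } else {
            out += '%';
            out += hex[ch >> 4];
            out += hex[ch & 0x0F];
        }
    }
    return out;
}

// Hypothetical helper: substitute every {USER} with the doubly encoded value.
std::string expand_template(std::string tmpl, std::string_view user) {
    constexpr std::string_view placeholder = "{USER}";
    const std::string encoded = percent_encode_for_url(escape_filter_value(user));
    for (std::size_t pos = tmpl.find(placeholder); pos != std::string::npos;
         pos = tmpl.find(placeholder, pos + encoded.size())) {
        tmpl.replace(pos, placeholder.size(), encoded);
    }
    return tmpl;
}
```

A username of `a*` becomes `a\2a` after filter escaping and `a%5C2a` after percent-encoding, so the URL parser's percent-decoding restores the escaped filter value rather than a bare wildcard. A literal `%2a` in the username becomes `%252a` and therefore decodes back to the text `%2a`, not to `*`.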


@@ -115,6 +115,9 @@ class ldap_role_manager : public role_manager {
/// Macro-expands _query_template, returning the result.
sstring get_url(std::string_view user) const;
/// Validates that {USER}, if present, is used only in the LDAP filter component.
void validate_query_template() const;
/// Used to auto-create roles returned by ldap.
future<> create_role(std::string_view role_name);


@@ -35,6 +35,15 @@ enum class image_mode : uint8_t {
class options final {
std::optional<bool> _enabled;
bool _enable_requested = false;
// When CDC is used for Alternator Streams on a tablet table,
// tablet merges must be blocked due to a limitation of the DynamoDB Streams API:
// it allows only a single parent to be specified for a stream.
// In ScyllaDB, there is a one-to-one association between streams and tablets,
// so merging tablets also means merging streams. A merged stream has two parents, and both
// must be read to completion before reading from the newly merged tablet. This cannot
// be conveyed through the DynamoDB Streams API, and the result can be reordering of events in Streams.
bool _tablet_merge_blocked = false;
image_mode _preimage = image_mode::off;
bool _postimage = false;
delta_mode _delta_mode = delta_mode::full;
@@ -48,6 +57,8 @@ public:
bool enabled() const { return _enabled.value_or(false); }
bool is_enabled_set() const { return _enabled.has_value(); }
bool enable_requested() const { return _enable_requested; }
bool tablet_merge_blocked() const { return _tablet_merge_blocked; }
bool preimage() const { return _preimage != image_mode::off; }
bool full_preimage() const { return _preimage == image_mode::full; }
bool postimage() const { return _postimage; }
@@ -56,6 +67,17 @@ public:
int ttl() const { return _ttl; }
void enabled(bool b) { _enabled = b; }
// When enabling cannot be enforced immediately, as with Alternator Streams,
// which are incompatible with tablet merges, actual enablement must be deferred
// until any in-progress tablet merges complete. We expect finalization to happen
// promptly: the on_update_column_family callback in topology_coordinator.cc wakes up
// the topology coordinator to run maybe_finalize_pending_stream_enables shortly
// after the DDL. However, there is SCYLLADB-1304
void enable_requested(bool b = true) { _enable_requested = b; }
// Persistent flag checked by the tablet allocator to suppress new merge
// decisions. Always set when Alternator Streams are enabled; inert on
// vnode tables.
void tablet_merge_blocked(bool b = true) { _tablet_merge_blocked = b; }
void preimage(bool b) { preimage(b ? image_mode::on : image_mode::off); }
void preimage(image_mode m) { _preimage = m; }
void postimage(bool b) { _postimage = b; }


@@ -16,8 +16,11 @@
#include "keys/keys.hh"
#include "replica/database.hh"
#include "db/system_keyspace.hh"
#include "db/schema_tables.hh"
#include "dht/token-sharding.hh"
#include "locator/token_metadata.hh"
#include "locator/tablets.hh"
#include "schema/schema_builder.hh"
#include "types/set.hh"
#include "utils/assert.hh"
#include "utils/error_injection.hh"
@@ -29,6 +32,7 @@
#include "cdc/cdc_options.hh"
#include "cdc/generation_service.hh"
#include "cdc/log.hh"
#include "service/migration_listener.hh"
extern logging::logger cdc_log;
@@ -776,4 +780,59 @@ future<> generation_service::garbage_collect_cdc_streams(utils::chunked_vector<c
}
}
future<utils::chunked_vector<canonical_mutation>> generation_service::maybe_finalize_pending_stream_enables(const locator::token_metadata& tm, api::timestamp_type ts) {
utils::chunked_vector<canonical_mutation> muts;
if (utils::get_local_injector().enter("delay_cdc_stream_finalization")) {
co_return std::move(muts);
}
co_await _db.get_tables_metadata().for_each_table_gently([&] (table_id id, lw_shared_ptr<replica::table> t) -> future<> {
auto s = t->schema();
if (!s->cdc_options().enable_requested()) {
co_return;
}
// Only tablet tables can have enable_requested set
if (!tm.tablets().has_tablet_map(id)) {
co_return;
}
auto& tmap = tm.tablets().get_tablet_map(id);
if (tmap.needs_merge()) {
cdc_log.debug("Table {}.{}: deferring stream enablement, tablet merge still in progress", s->ks_name(), s->cf_name());
co_return;
}
cdc_log.info("Table {}.{}: finalizing deferred stream enablement (no in-progress merges)", s->ks_name(), s->cf_name());
// Build a new schema with enabled=true, enable_requested=false
schema_builder builder(s);
cdc::options new_opts = s->cdc_options();
new_opts.enabled(true);
new_opts.enable_requested(false);
new_opts.tablet_merge_blocked(true);
builder.with_cdc_options(new_opts);
auto new_schema = builder.build();
// Generate the schema mutation (table metadata update only, no columns/indices changed)
utils::chunked_vector<mutation> schema_muts;
db::schema_tables::add_table_or_view_to_schema_mutation(new_schema, ts, false, schema_muts);
// Trigger the CDC migration listener hook which creates the CDC log table.
// This runs on_before_update_column_family listeners (including CDC's own
// listener that creates/updates the log table schema).
co_await seastar::async([&] {
_db.get_notifier().before_update_column_family(*new_schema, *s, schema_muts, ts);
});
for (auto& m : schema_muts) {
muts.emplace_back(canonical_mutation(m));
co_await coroutine::maybe_yield();
}
});
co_return std::move(muts);
}
} // namespace cdc

View File

@@ -18,6 +18,7 @@ class system_keyspace;
namespace locator {
class tablet_map;
class token_metadata;
}
namespace cdc {
@@ -64,6 +65,12 @@ public:
future<> generate_tablet_resize_update(utils::chunked_vector<canonical_mutation>& muts, table_id table, const locator::tablet_map& new_tablet_map, api::timestamp_type ts);
// Check for tables with enable_requested CDC option and finalize their
// stream enablement if no in-progress tablet merges remain.
// Returns schema mutations that transition enable_requested -> enabled,
// including CDC log table creation side effects.
future<utils::chunked_vector<canonical_mutation>> maybe_finalize_pending_stream_enables(const locator::token_metadata& tm, api::timestamp_type ts);
future<utils::chunked_vector<mutation>> garbage_collect_cdc_streams_for_table(table_id table, std::optional<std::chrono::seconds> ttl, api::timestamp_type ts);
future<> garbage_collect_cdc_streams(utils::chunked_vector<canonical_mutation>& muts, api::timestamp_type ts);

View File

@@ -8,7 +8,7 @@
#include <utility>
#include <algorithm>
#include <unordered_set>
#include <boost/range/irange.hpp>
#include <seastar/core/thread.hh>
#include <seastar/core/metrics.hh>
@@ -47,6 +47,7 @@
#include "tracing/trace_state.hh"
#include "stats.hh"
#include "utils/labels.hh"
#include "alternator/executor.hh"
namespace std {
@@ -195,7 +196,7 @@ public:
for (auto sp : cfms) {
const auto& schema = *sp;
if (!schema.cdc_options().enabled()) {
if (!cdc_enabled(schema)) {
continue;
}
@@ -464,6 +465,18 @@ cdc::options::options(const std::map<sstring, sstring>& map) {
if (_ttl < 0) {
throw exceptions::configuration_exception("Invalid CDC option: ttl must be >= 0");
}
} else if (key == "enable_requested") {
if (is_true || is_false) {
_enable_requested = is_true;
} else {
throw exceptions::configuration_exception("Invalid value for CDC option \"enable_requested\": " + p.second);
}
} else if (key == "tablet_merge_blocked") {
if (is_true || is_false) {
_tablet_merge_blocked = is_true;
} else {
throw exceptions::configuration_exception("Invalid value for CDC option \"tablet_merge_blocked\": " + p.second);
}
} else {
throw exceptions::configuration_exception("Invalid CDC option: " + p.first);
}
@@ -471,7 +484,7 @@ cdc::options::options(const std::map<sstring, sstring>& map) {
}
std::map<sstring, sstring> cdc::options::to_map() const {
if (!is_enabled_set()) {
if (!is_enabled_set() && !_enable_requested) {
return {};
}
@@ -481,6 +494,8 @@ std::map<sstring, sstring> cdc::options::to_map() const {
{ "postimage", _postimage ? "true" : "false" },
{ "delta", fmt::format("{}", _delta_mode) },
{ "ttl", std::to_string(_ttl) },
{ "enable_requested", enable_requested() ? "true" : "false" },
{ "tablet_merge_blocked", _tablet_merge_blocked ? "true" : "false" },
};
}
@@ -489,7 +504,9 @@ sstring cdc::options::to_sstring() const {
}
bool cdc::options::operator==(const options& o) const {
return enabled() == o.enabled() && _preimage == o._preimage && _postimage == o._postimage && _ttl == o._ttl
return enabled() == o.enabled() && enable_requested() == o.enable_requested()
&& _tablet_merge_blocked == o._tablet_merge_blocked
&& _preimage == o._preimage && _postimage == o._postimage && _ttl == o._ttl
&& _delta_mode == o._delta_mode;
}
@@ -1068,6 +1085,14 @@ public:
return create_ck(_batch_no - 1);
}
api::timestamp_type get_timestamp() const {
return _ts;
}
ttl_opt get_ttl() const {
return _ttl;
}
// A common pattern is to allocate a row and then immediately set its `cdc$operation` column.
clustering_key allocate_new_log_row(operation op) {
auto log_ck = allocate_new_log_row();
@@ -1209,15 +1234,25 @@ struct process_row_visitor {
row_states_map& _clustering_row_states;
const bool _generate_delta_values = true;
// true if we are processing changes that were produced by Alternator
const bool _alternator;
// Set to true if any change to the row is detected. Used only when processing Alternator's changes.
bool _alternator_any_value_changed = false;
// Set to true if Alternator's collection column (:attrs) is modified only by removing elements.
// Used only when processing Alternator's changes.
bool _alternator_only_deletes = false;
process_row_visitor(
const clustering_key& log_ck, stats::part_type_set& touched_parts, log_mutation_builder& builder,
bool enable_updating_state, const clustering_key* base_ck, cell_map* row_state,
row_states_map& clustering_row_states, bool generate_delta_values)
row_states_map& clustering_row_states, bool generate_delta_values, bool alternator = false)
: _log_ck(log_ck), _touched_parts(touched_parts), _builder(builder),
_enable_updating_state(enable_updating_state), _base_ck(base_ck), _row_state(row_state),
_clustering_row_states(clustering_row_states),
_generate_delta_values(generate_delta_values)
_generate_delta_values(generate_delta_values), _alternator(alternator)
{}
void update_row_state(const column_definition& cdef, managed_bytes_opt value) {
@@ -1227,7 +1262,17 @@ struct process_row_visitor {
auto [it, _] = _clustering_row_states.try_emplace(*_base_ck);
_row_state = &it->second;
}
(*_row_state)[&cdef] = std::move(value);
auto [ it, inserted ] = _row_state->insert({ &cdef, std::nullopt });
// we ignore `_alternator_any_value_changed` for non-alternator changes.
// we don't filter if `_enable_updating_state` is false, as on top of needing pre image
// we also need cdc to build post image for us
// we add check for `_alternator` here for performance reasons - no point in byte compare objects
// if the return value will be ignored
if (_alternator && _enable_updating_state) {
_alternator_any_value_changed = _alternator_any_value_changed || it->second != value;
}
it->second = std::move(value);
}
void live_atomic_cell(const column_definition& cdef, const atomic_cell_view& cell) {
@@ -1377,6 +1422,8 @@ struct process_row_visitor {
auto&& deleted_keys = std::get<1>(result);
auto&& added_cells = std::get<2>(result);
_alternator_only_deletes = cdef.name_as_text() == alternator::executor::ATTRS_COLUMN_NAME && !deleted_keys.empty() && !added_cells.has_value();
// FIXME: we're doing redundant work: first we serialize the set of deleted keys into a blob,
// then we deserialize again when merging images below
managed_bytes_opt deleted_elements = std::nullopt;
@@ -1434,12 +1481,31 @@ struct process_change_visitor {
const bool _enable_updating_state = false;
row_states_map& _clustering_row_states;
// Clustering keys (as bytes) of rows that should be ignored when writing CDC log changes.
// Filtering is done in the `clean_up_noop_rows` function. Used only when processing Alternator's changes.
// Since an Alternator clustering key is always at most a single column, we store the unpacked clustering key.
// If the Alternator table has no clustering key, a partition has at most one row, and any value present
// in _alternator_clustering_keys_to_ignore makes us ignore that single row;
// in that case we use an empty bytes object.
std::unordered_set<bytes>& _alternator_clustering_keys_to_ignore;
cell_map& _static_row_state;
const bool _alternator_schema_has_no_clustering_key = false;
const bool _is_update = false;
const bool _generate_delta_values = true;
// Only called when processing Alternator's changes.
void alternator_add_ckey_to_rows_to_ignore(const clustering_key& ckey) {
throwing_assert(_request_options.alternator);
auto res = ckey.explode();
auto ckey_exploded = !res.empty() ? res[0] : bytes{};
_alternator_clustering_keys_to_ignore.insert(ckey_exploded);
}
void static_row_cells(auto&& visit_row_cells) {
_touched_parts.set<stats::part_type::STATIC_ROW>();
@@ -1471,16 +1537,29 @@ struct process_change_visitor {
}
};
auto row_state = get_row_state(_clustering_row_states, ckey);
clustering_row_cells_visitor v(
log_ck, _touched_parts, _builder,
_enable_updating_state, &ckey, get_row_state(_clustering_row_states, ckey),
_clustering_row_states, _generate_delta_values);
_enable_updating_state, &ckey, row_state,
_clustering_row_states, _generate_delta_values, _request_options.alternator);
if (_is_update && _request_options.alternator) {
v._marker_op = operation::update;
v._marker_op = row_state ? operation::update : operation::insert;
}
visit_row_cells(v);
if (_enable_updating_state) {
if (_request_options.alternator && !v._alternator_any_value_changed) {
// we need additional checks here:
// - without `row_state != nullptr` inserting new key without additional fields (so only partition / clustering key) would be
// treated as no-change, because without additional fields given by the user `v` visitor won't visit any cells
// and _alternator_any_value_changed will be false (thus item will be skipped),
// - without `row_state == nullptr && v._alternator_only_deletes` check we won't properly ignore
// column deletes for existing items, but without the column we want to delete -
// item exists (so row_state != nullptr), but we delete non-existing column, so no-op
if (row_state != nullptr || (row_state == nullptr && v._alternator_only_deletes)) {
alternator_add_ckey_to_rows_to_ignore(ckey);
}
}
// #7716: if there are no regular columns, our visitor would not have visited any cells,
// hence it would not have created a row_state for this row. In effect, postimage wouldn't be produced.
// Ensure that the row state exists.
@@ -1497,8 +1576,12 @@ struct process_change_visitor {
auto log_ck = _builder.allocate_new_log_row(_row_delete_op);
_builder.set_clustering_columns(log_ck, ckey);
if (_enable_updating_state && get_row_state(_clustering_row_states, ckey)) {
_clustering_row_states.erase(ckey);
if (_enable_updating_state) {
if (get_row_state(_clustering_row_states, ckey)) {
_clustering_row_states.erase(ckey);
} else if (_request_options.alternator) {
alternator_add_ckey_to_rows_to_ignore(ckey);
}
}
}
@@ -1540,6 +1623,22 @@ struct process_change_visitor {
_touched_parts.set<stats::part_type::PARTITION_DELETE>();
auto log_ck = _builder.allocate_new_log_row(_partition_delete_op);
if (_enable_updating_state) {
if (_request_options.alternator && _alternator_schema_has_no_clustering_key && _clustering_row_states.empty()) {
// Alternator's table can be with or without clustering key. If the clustering key exists,
// delete request will be `clustered_row_delete` and will be handled there.
// If the clustering key doesn't exist, delete request will be `partition_delete` and will be handled here.
// The no-clustering-key case is slightly tricky, because insert of such item is handled by `clustered_row_cells`
// and has some value as clustering_key (the value currently seems to be empty bytes object).
// We don't want to rely on knowing the value exactly, instead we rely on the fact that
// there will be at most one item in a partition. So if `_clustering_row_states` is empty,
// we know the delete is for a non-existing item and we should ignore it.
// If `_clustering_row_states` is not empty, then we know the delete is for an existing item,
// so we should log it and clear `_clustering_row_states`.
// The same logic applies to `alternator_add_ckey_to_rows_to_ignore` call in `clustered_row_delete`
// we need to insert "anything" for no-clustering-key case, so further logic will check
// if map is empty or not and will know if it should ignore the single partition item and keep it.
alternator_add_ckey_to_rows_to_ignore({});
}
_clustering_row_states.clear();
}
}
@@ -1647,6 +1746,47 @@ private:
stats::part_type_set _touched_parts;
std::unordered_set<bytes> _alternator_clustering_keys_to_ignore;
const column_definition* _alternator_clustering_key_column = nullptr;
// Processes the mutation and removes rows whose keys are in _alternator_clustering_keys_to_ignore.
// The remaining rows must be reindexed (cdc$batch_seq_no) so the sequence stays contiguous.
// This is used for Alternator's changes only.
// NOTE: `_alternator_clustering_keys_to_ignore` must not be empty.
mutation clean_up_noop_rows(mutation mut) {
throwing_assert(!_alternator_clustering_keys_to_ignore.empty());
auto after_mut = mutation(_log_schema, mut.key());
if (!_alternator_clustering_key_column) {
// no clustering key - only single row per partition
// since _alternator_clustering_keys_to_ignore is not empty we need to drop that single row
// so we just return empty mutation instead
return after_mut;
}
int batch_seq = 0;
for (rows_entry &row : mut.partition().mutable_non_dummy_rows()) {
auto cell = row.row().cells().find_cell(_alternator_clustering_key_column->id);
if (cell) {
auto val = cell->as_atomic_cell(*_alternator_clustering_key_column).value().linearize();
if (_alternator_clustering_keys_to_ignore.contains(val)) {
continue;
}
}
auto new_key = _builder->create_ck(batch_seq++);
after_mut.partition().clustered_row(*_log_schema, std::move(new_key)) = std::move(row.row());
}
if (batch_seq > 0) {
// update end_of_batch marker
// we don't need to clear previous one, as we only removed rows
// we need to set it on the last row, because original last row might have been deleted
// batch_seq == 0 -> no rows, after_mut is empty, all entries were dropped and there's nothing to write to cdc log
auto last_key = _builder->create_ck(batch_seq - 1);
after_mut.set_cell(last_key, log_meta_column_name_bytes("end_of_batch"), data_value(true), _builder->get_timestamp(), _builder->get_ttl());
}
return after_mut;
}
public:
transformer(db_context ctx, schema_ptr s, dht::decorated_key dk, const per_request_options& options)
: _ctx(ctx)
@@ -1656,7 +1796,20 @@ public:
, _options(options)
, _clustering_row_states(0, clustering_key::hashing(*_schema), clustering_key::equality(*_schema))
, _uses_tablets(ctx._proxy.get_db().local().find_keyspace(_schema->ks_name()).uses_tablets())
, _alternator_clustering_keys_to_ignore()
{
if (_options.alternator) {
auto cks = _schema->clustering_key_columns();
const column_definition *ck_def = nullptr;
if (!cks.empty()) {
auto it = _log_schema->columns_by_name().find(cks.front().name());
if (it == _log_schema->columns_by_name().end()) {
on_internal_error(cdc_log, fmt::format("failed to find clustering key `{}` in cdc log table `{}`", cks.front().name(), _log_schema->id()));
}
ck_def = it->second;
}
_alternator_clustering_key_column = ck_def;
}
}
// DON'T move the transformer after this
@@ -1664,7 +1817,10 @@ public:
const auto stream_id = _uses_tablets ? _ctx._cdc_metadata.get_tablet_stream(_log_schema->id(), ts, _dk.token()) : _ctx._cdc_metadata.get_vnode_stream(ts, _dk.token());
_result_mutations.emplace_back(_log_schema, stream_id.to_partition_key(*_log_schema));
_builder.emplace(_result_mutations.back(), ts, _dk.key(), *_schema);
_enable_updating_state = _schema->cdc_options().postimage() || (!is_last && _schema->cdc_options().preimage());
// alternator_streams_increased_compatibility set to true reads preimage, but we need to set
// _enable_updating_state to true to keep track of changes and produce correct pre/post images even
// if upper layer didn't request them explicitly.
_enable_updating_state = _schema->cdc_options().postimage() || (!is_last && _schema->cdc_options().preimage()) || (_options.alternator && _options.alternator_streams_increased_compatibility);
}
void produce_preimage(const clustering_key* ck, const one_kind_column_set& columns_to_include) override {
@@ -1761,7 +1917,9 @@ public:
._builder = *_builder,
._enable_updating_state = _enable_updating_state,
._clustering_row_states = _clustering_row_states,
._alternator_clustering_keys_to_ignore = _alternator_clustering_keys_to_ignore,
._static_row_state = _static_row_state,
._alternator_schema_has_no_clustering_key = (_alternator_clustering_key_column == nullptr),
._is_update = _is_update,
._generate_delta_values = generate_delta_values(_builder->base_schema())
};
@@ -1771,10 +1929,19 @@ public:
void end_record() override {
SCYLLA_ASSERT(_builder);
_builder->end_record();
}
const row_states_map& clustering_row_states() const override {
return _clustering_row_states;
if (_options.alternator && !_alternator_clustering_keys_to_ignore.empty()) {
// we filter mutations for Alternator's changes here.
// We do it per mutation object (user might submit a batch of those in one go
// and some might be split because of different timestamps),
// ignore key set is cleared afterwards.
// If single mutation object contains two separate changes to the same row
// and at least one of them is ignored, all of them will be ignored.
// This is not possible in Alternator - Alternator spec forbids reusing
// primary key in single batch.
_result_mutations.back() = clean_up_noop_rows(std::move(_result_mutations.back()));
_alternator_clustering_keys_to_ignore.clear();
}
}
// Takes and returns generated cdc log mutations and associated statistics about parts touched during transformer's lifetime.
@@ -2013,7 +2180,7 @@ cdc::cdc_service::impl::augment_mutation_call(lowres_clock::time_point timeout,
tracing::trace(tr_state, "CDC: Preimage not enabled for the table, not querying current value of {}", m.decorated_key());
}
return f.then([alternator_increased_compatibility, trans = std::move(trans), &mutations, idx, tr_state, &details, &options] (lw_shared_ptr<cql3::untyped_result_set> rs) mutable {
return f.then([trans = std::move(trans), &mutations, idx, tr_state, &details, &options] (lw_shared_ptr<cql3::untyped_result_set> rs) mutable {
auto& m = mutations[idx];
auto& s = m.schema();
@@ -2031,10 +2198,10 @@ cdc::cdc_service::impl::augment_mutation_call(lowres_clock::time_point timeout,
if (should_split(m, options)) {
tracing::trace(tr_state, "CDC: Splitting {}", m.decorated_key());
details.was_split = true;
process_changes_with_splitting(m, trans, preimage, postimage, alternator_increased_compatibility);
process_changes_with_splitting(m, trans, preimage, postimage);
} else {
tracing::trace(tr_state, "CDC: No need to split {}", m.decorated_key());
process_changes_without_splitting(m, trans, preimage, postimage, alternator_increased_compatibility);
process_changes_without_splitting(m, trans, preimage, postimage);
}
auto [log_mut, touched_parts] = std::move(trans).finish();
const int generated_count = log_mut.size();
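
Note on `clean_up_noop_rows` above: the renumbering step can be illustrated outside ScyllaDB with a minimal sketch. Everything here is a hypothetical stand-in (`log_row`, `drop_noop_rows`, string keys in place of `bytes`); it only models dropping ignored rows, reassigning contiguous batch sequence numbers from 0, and moving the end-of-batch marker to the last surviving row. It is not the patch's implementation.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for one CDC log row; not a ScyllaDB type.
struct log_row {
    int batch_seq_no;     // models cdc$batch_seq_no
    std::string base_ck;  // clustering key of the base-table row this entry describes
    bool end_of_batch;    // models the end_of_batch marker
};

// Drop rows whose base clustering key is in `ignore`, renumber the survivors
// from 0, and set end_of_batch on the last surviving row (the original last
// row may have been dropped).
std::vector<log_row> drop_noop_rows(const std::vector<log_row>& rows,
                                    const std::unordered_set<std::string>& ignore) {
    assert(!ignore.empty());  // same precondition the patch states for its ignore set
    std::vector<log_row> out;
    int next_seq = 0;
    for (const auto& r : rows) {
        if (ignore.count(r.base_ck) != 0) {
            continue;  // no-op change: leave it out of the log
        }
        out.push_back({next_seq++, r.base_ck, r.end_of_batch});
    }
    if (!out.empty()) {
        out.back().end_of_batch = true;
    }
    return out;  // empty result: nothing to write to the log
}
```

If every row is ignored the result is empty, which corresponds to the `batch_seq == 0` case in the patch where no marker is written.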

View File

@@ -6,26 +6,15 @@
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#include "bytes.hh"
#include "bytes_fwd.hh"
#include "mutation/atomic_cell.hh"
#include "mutation/atomic_cell_or_collection.hh"
#include "mutation/collection_mutation.hh"
#include "mutation/mutation.hh"
#include "mutation/tombstone.hh"
#include "schema/schema.hh"
#include <seastar/core/sstring.hh>
#include "types/concrete_types.hh"
#include "types/types.hh"
#include "types/user.hh"
#include "split.hh"
#include "log.hh"
#include "change_visitor.hh"
#include "utils/managed_bytes.hh"
#include <string_view>
#include <unordered_map>
extern logging::logger cdc_log;
@@ -610,109 +599,8 @@ bool should_split(const mutation& m, const per_request_options& options) {
|| v._ts == api::missing_timestamp;
}
// Returns true if the row state and the atomic and nonatomic entries represent
// an equivalent item.
static bool entries_match_row_state(const schema_ptr& base_schema, const cell_map& row_state, const std::vector<atomic_column_update>& atomic_entries,
std::vector<nonatomic_column_update>& nonatomic_entries) {
for (const auto& update : atomic_entries) {
const column_definition& cdef = base_schema->column_at(column_kind::regular_column, update.id);
const auto it = row_state.find(&cdef);
if (it == row_state.end()) {
return false;
}
if (to_managed_bytes_opt(update.cell.value().linearize()) != it->second) {
return false;
}
}
if (nonatomic_entries.empty()) {
return true;
}
for (const auto& update : nonatomic_entries) {
const column_definition& cdef = base_schema->column_at(column_kind::regular_column, update.id);
const auto it = row_state.find(&cdef);
if (it == row_state.end()) {
return false;
}
// The only collection used by Alternator is a non-frozen map.
auto current_raw_map = cdef.type->deserialize(*it->second);
map_type_impl::native_type current_values = value_cast<map_type_impl::native_type>(current_raw_map);
if (current_values.size() != update.cells.size()) {
return false;
}
std::unordered_map<sstring_view, bytes> current_values_map;
for (const auto& entry : current_values) {
const auto attr_name = std::string_view(value_cast<sstring>(entry.first));
current_values_map[attr_name] = value_cast<bytes>(entry.second);
}
for (const auto& [key, value] : update.cells) {
const auto key_str = to_string_view(key);
if (!value.is_live()) {
if (current_values_map.contains(key_str)) {
return false;
}
} else if (current_values_map[key_str] != value.value().linearize()) {
return false;
}
}
}
return true;
}
bool should_skip(batch& changes, const mutation& base_mutation, change_processor& processor) {
const schema_ptr& base_schema = base_mutation.schema();
// Alternator doesn't use static updates and clustered range deletions.
if (!changes.static_updates.empty() || !changes.clustered_range_deletions.empty()) {
return false;
}
for (clustered_row_insert& u : changes.clustered_inserts) {
const cell_map* row_state = get_row_state(processor.clustering_row_states(), u.key);
if (!row_state) {
return false;
}
if (!entries_match_row_state(base_schema, *row_state, u.atomic_entries, u.nonatomic_entries)) {
return false;
}
}
for (clustered_row_update& u : changes.clustered_updates) {
const cell_map* row_state = get_row_state(processor.clustering_row_states(), u.key);
if (!row_state) {
return false;
}
if (!entries_match_row_state(base_schema, *row_state, u.atomic_entries, u.nonatomic_entries)) {
return false;
}
}
// Skip only if the row being deleted does not exist (i.e. the deletion is a no-op).
for (const auto& row_deletion : changes.clustered_row_deletions) {
if (processor.clustering_row_states().contains(row_deletion.key)) {
return false;
}
}
// Don't skip if the item exists.
//
// Increased DynamoDB Streams compatibility guarantees that single-item
// operations will read the item and store it in the clustering row states.
// If it is not found there, we may skip CDC. This is safe as long as the
// assumptions of this operation's write isolation are not violated.
if (changes.partition_deletions && processor.clustering_row_states().contains(clustering_key::make_empty())) {
return false;
}
cdc_log.trace("Skipping CDC log for mutation {}", base_mutation);
return true;
}
void process_changes_with_splitting(const mutation& base_mutation, change_processor& processor,
bool enable_preimage, bool enable_postimage, bool alternator_strict_compatibility) {
bool enable_preimage, bool enable_postimage) {
const auto base_schema = base_mutation.schema();
auto changes = extract_changes(base_mutation);
auto pk = base_mutation.key();
@@ -732,10 +620,6 @@ void process_changes_with_splitting(const mutation& base_mutation, change_proces
affected_clustered_columns_per_row = btch.get_affected_clustered_columns_per_row(*base_mutation.schema());
}
if (alternator_strict_compatibility && should_skip(btch, base_mutation, processor)) {
continue;
}
const bool is_last = change_ts == last_timestamp;
processor.begin_timestamp(change_ts, is_last);
if (enable_preimage) {
@@ -825,13 +709,7 @@ void process_changes_with_splitting(const mutation& base_mutation, change_proces
}
void process_changes_without_splitting(const mutation& base_mutation, change_processor& processor,
bool enable_preimage, bool enable_postimage, bool alternator_strict_compatibility) {
if (alternator_strict_compatibility) {
auto changes = extract_changes(base_mutation);
if (should_skip(changes.begin()->second, base_mutation, processor)) {
return;
}
}
bool enable_preimage, bool enable_postimage) {
auto ts = find_timestamp(base_mutation);
processor.begin_timestamp(ts, true);

View File

@@ -66,14 +66,12 @@ public:
// Tells processor we have reached end of record - last part
// of a given timestamp batch
virtual void end_record() = 0;
virtual const row_states_map& clustering_row_states() const = 0;
};
bool should_split(const mutation& base_mutation, const per_request_options& options);
void process_changes_with_splitting(const mutation& base_mutation, change_processor& processor,
bool enable_preimage, bool enable_postimage, bool alternator_strict_compatibility);
bool enable_preimage, bool enable_postimage);
void process_changes_without_splitting(const mutation& base_mutation, change_processor& processor,
bool enable_preimage, bool enable_postimage, bool alternator_strict_compatibility);
bool enable_preimage, bool enable_postimage);
}

View File

@@ -1355,6 +1355,35 @@ private:
_sstables.erase(exhausted, _sstables.end());
dynamic_cast<compaction_read_monitor_generator&>(unwrap_monitor_generator()).remove_exhausted_sstables(exhausted_ssts);
}
// Release exhausted garbage collected sstables.
// A GC sstable is exhausted when it doesn't overlap with any remaining input sstable.
// GC sstables serve as safeguards against data resurrection: their tombstones may shadow
// data in not-yet-exhausted input sstables. So a GC sstable can only be released once
// all overlapping input sstables have been exhausted.
auto gc_not_exhausted = [this] (const sstables::shared_sstable& gc_sst) {
auto gc_range = ::wrapping_interval<dht::token>::make(
gc_sst->get_first_decorated_key()._token,
gc_sst->get_last_decorated_key()._token);
for (const auto& input_sst : _sstables) {
auto input_range = ::wrapping_interval<dht::token>::make(
input_sst->get_first_decorated_key()._token,
input_sst->get_last_decorated_key()._token);
if (gc_range.overlaps(input_range, dht::token_comparator())) {
return true; // overlaps with a remaining input sstable, not exhausted yet
}
}
return false; // no overlap with any remaining input sstable, can be released
};
exhausted = std::partition(_used_garbage_collected_sstables.begin(), _used_garbage_collected_sstables.end(), gc_not_exhausted);
if (exhausted != _used_garbage_collected_sstables.end()) {
auto exhausted_gc_ssts = std::vector<sstables::shared_sstable>(exhausted, _used_garbage_collected_sstables.end());
log_debug("Releasing {} exhausted GC sstable(s) earlier: [{}]",
exhausted_gc_ssts.size(),
fmt::join(exhausted_gc_ssts | std::views::transform([] (auto sst) { return to_string(sst, true); }), ","));
_replacer(get_compaction_completion_desc(std::move(exhausted_gc_ssts), {}));
_used_garbage_collected_sstables.erase(exhausted, _used_garbage_collected_sstables.end());
}
}
void replace_remaining_exhausted_sstables() {
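
Note on the GC sstable release hunk above: the "partition by overlap, then release the tail" idea can be sketched in isolation. This is an assumption-laden simplification: `token_range`, `overlaps` and `release_exhausted` are hypothetical names, and ranges are plain closed, non-wrapping integer intervals, whereas the patch uses `wrapping_interval<dht::token>` with a token comparator.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical closed, non-wrapping token range [first, last].
struct token_range {
    std::int64_t first;
    std::int64_t last;
};

// Two closed ranges overlap when each one starts before the other ends.
bool overlaps(const token_range& a, const token_range& b) {
    return a.first <= b.last && b.first <= a.last;
}

// Keep GC ranges that still overlap some remaining input; remove and return
// the rest, which no longer guard any unread input data.
std::vector<token_range> release_exhausted(std::vector<token_range>& gc,
                                           const std::vector<token_range>& remaining_inputs) {
    auto still_needed = [&](const token_range& g) {
        return std::any_of(remaining_inputs.begin(), remaining_inputs.end(),
                           [&](const token_range& in) { return overlaps(g, in); });
    };
    // std::partition moves still-needed entries to the front and returns the
    // start of the exhausted tail.
    auto first_exhausted = std::partition(gc.begin(), gc.end(), still_needed);
    std::vector<token_range> released(first_exhausted, gc.end());
    gc.erase(first_exhausted, gc.end());
    return released;
}
```

`std::partition` does not preserve relative order, which is acceptable here because only set membership matters.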

View File

@@ -1106,7 +1106,8 @@ void compaction_manager::enable() {
_compaction_submission_timer.cancel();
_compaction_submission_timer.arm_periodic(periodic_compaction_submission_interval());
_waiting_reevalution = postponed_compactions_reevaluation();
throwing_assert(!_waiting_reevaluation);
_waiting_reevaluation.emplace(postponed_compactions_reevaluation());
cmlog.info("Enabled");
}
@@ -1154,6 +1155,16 @@ void compaction_manager::reevaluate_postponed_compactions() noexcept {
_postponed_reevaluation.signal();
}
future<> compaction_manager::stop_postponed_compactions() noexcept {
auto waiting_reevaluation = std::exchange(_waiting_reevaluation, std::nullopt);
if (!waiting_reevaluation) {
return make_ready_future();
}
// Trigger a signal to properly exit from postponed_compactions_reevaluation() fiber
reevaluate_postponed_compactions();
return std::move(*waiting_reevaluation);
}
void compaction_manager::postpone_compaction_for_table(compaction_group_view* t) {
_postponed.insert(t);
}
@@ -1237,8 +1248,7 @@ future<> compaction_manager::drain() {
_compaction_submission_timer.cancel();
// Stop ongoing compactions, if the request has not been sent already and wait for them to stop.
co_await stop_ongoing_compactions("drain");
// Trigger a signal to properly exit from postponed_compactions_reevaluation() fiber
reevaluate_postponed_compactions();
co_await stop_postponed_compactions();
cmlog.info("Drained");
}
@@ -1282,8 +1292,7 @@ future<> compaction_manager::really_do_stop() noexcept {
if (!_tasks.empty()) {
on_fatal_internal_error(cmlog, format("{} tasks still exist after being stopped", _tasks.size()));
}
reevaluate_postponed_compactions();
co_await std::move(_waiting_reevalution);
co_await stop_postponed_compactions();
co_await _sys_ks.close();
_weight_tracker.clear();
_compaction_submission_timer.cancel();

View File

@@ -128,7 +128,7 @@ private:
// a sstable from being compacted twice.
std::unordered_set<sstables::shared_sstable> _compacting_sstables;
future<> _waiting_reevalution = make_ready_future<>();
std::optional<future<>> _waiting_reevaluation;
condition_variable _postponed_reevaluation;
// tables that wait for compaction but had its submission postponed due to ongoing compaction.
std::unordered_set<compaction::compaction_group_view*> _postponed;
@@ -231,6 +231,7 @@ private:
future<> postponed_compactions_reevaluation();
void reevaluate_postponed_compactions() noexcept;
future<> stop_postponed_compactions() noexcept;
// Postpone compaction for a table that couldn't be executed due to ongoing
// similar-sized compaction.
void postpone_compaction_for_table(compaction::compaction_group_view* t);
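
Note on the `_waiting_reevaluation` change above: switching the member to `std::optional` and taking it with `std::exchange` makes the stop path idempotent, so `drain()` and `really_do_stop()` can both call it safely. A minimal sketch without Seastar follows; `background_fiber` and `manager` are hypothetical, and a counter stands in for signalling and awaiting the fiber's future.

```cpp
#include <cassert>
#include <optional>
#include <utility>

// Hypothetical handle standing in for the future<> of a background fiber.
struct background_fiber {
    int id;
};

class manager {
    std::optional<background_fiber> _fiber;  // engaged only while the fiber runs
    int _joins = 0;                          // how many times a fiber was actually stopped
public:
    void enable() {
        assert(!_fiber);  // enabling twice would leak the previous fiber
        _fiber.emplace(background_fiber{1});
    }

    // Returns true if a running fiber was stopped, false if there was nothing to stop.
    bool stop() {
        // Take ownership and leave the member disengaged in one step, so a
        // second call sees std::nullopt and returns immediately.
        auto fiber = std::exchange(_fiber, std::nullopt);
        if (!fiber) {
            return false;
        }
        ++_joins;  // stands in for signalling and waiting on the fiber
        return true;
    }

    int joins() const { return _joins; }
};
```

Calling `stop()` before `enable()`, or twice in a row, is a no-op in this model, which is the property the patch relies on.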

View File

@@ -698,12 +698,13 @@ public:
table_resharding_compaction_task_impl(tasks::task_manager::module_ptr module,
std::string keyspace,
std::string table,
tasks::task_id parent_id,
sharded<sstables::sstable_directory>& dir,
sharded<replica::database>& db,
compaction_sstable_creator_fn creator,
compaction::owned_ranges_ptr owned_ranges_ptr,
bool vnodes_resharding) noexcept
: resharding_compaction_task_impl(module, tasks::task_id::create_random_id(), module->new_sequence_number(), "table", std::move(keyspace), std::move(table), "", tasks::task_id::create_null_id())
: resharding_compaction_task_impl(module, tasks::task_id::create_random_id(), parent_id ? 0 : module->new_sequence_number(), "table", std::move(keyspace), std::move(table), "", parent_id)
, _dir(dir)
, _db(db)
, _creator(std::move(creator))

View File

@@ -1438,6 +1438,8 @@ alternator = [
'alternator/controller.cc',
'alternator/server.cc',
'alternator/executor.cc',
'alternator/executor_read.cc',
'alternator/executor_util.cc',
'alternator/stats.cc',
'alternator/serialization.cc',
'alternator/expressions.cc',
@@ -1723,6 +1725,7 @@ deps['test/boost/combined_tests'] += [
'test/boost/view_schema_test.cc',
'test/boost/virtual_reader_test.cc',
'test/boost/virtual_table_test.cc',
'test/boost/vnodes_to_tablets_migration_test.cc',
'tools/schema_loader.cc',
'tools/read_mutation.cc',
'test/lib/expr_test_utils.cc',

View File

@@ -23,7 +23,7 @@ set_property(
$<$<CONFIG:${unoptimized_modes}>:-O1>
# use-after-scope sanitizer also uses large amount of stack space
# and overflows the stack of CqlParser
$<$<CONFIG:${sanitized_modes}>:-fsanitize-address-use-after-scope>)
$<$<CONFIG:${sanitized_modes}>:-fno-sanitize-address-use-after-scope>)
add_library(cql3 STATIC)
target_sources(cql3

View File

@@ -429,10 +429,10 @@ unaliasedSelector returns [uexpression tmp]
: ( c=cident { tmp = unresolved_identifier{std::move(c)}; }
| v=value { tmp = std::move(v); }
| K_COUNT '(' countArgument ')' { tmp = make_count_rows_function_expression(); }
- | K_WRITETIME '(' c=cident ')' { tmp = column_mutation_attribute{column_mutation_attribute::attribute_kind::writetime,
- unresolved_identifier{std::move(c)}}; }
- | K_TTL '(' c=cident ')' { tmp = column_mutation_attribute{column_mutation_attribute::attribute_kind::ttl,
- unresolved_identifier{std::move(c)}}; }
+ | K_WRITETIME '(' a=subscriptExpr ')' { tmp = column_mutation_attribute{column_mutation_attribute::attribute_kind::writetime,
+ std::move(a)}; }
+ | K_TTL '(' a=subscriptExpr ')' { tmp = column_mutation_attribute{column_mutation_attribute::attribute_kind::ttl,
+ std::move(a)}; }
| f=functionName args=selectionFunctionArgs { tmp = function_call{std::move(f), std::move(args)}; }
| K_CAST '(' arg=unaliasedSelector K_AS t=native_type ')' { tmp = cast{.style = cast::cast_style::sql, .arg = std::move(arg), .type = std::move(t)}; }
)
@@ -1794,7 +1794,9 @@ columnRefExpr returns [uexpression e]
subscriptExpr returns [uexpression e]
: col=columnRefExpr { e = std::move(col); }
- ( '[' sub=term ']' { e = subscript{std::move(e), std::move(sub)}; } )?
+ ( '[' sub=term ']' { e = subscript{std::move(e), std::move(sub)}; }
+ | '.' fi=cident { e = field_selection{std::move(e), std::move(fi)}; }
+ )?
;
singleColumnInValuesOrMarkerExpr returns [uexpression e]

@@ -10,6 +10,7 @@
#include "utils/assert.hh"
#include "cql3/column_specification.hh"
#include "cql3/column_identifier.hh"
namespace cql3 {
@@ -31,4 +32,12 @@ bool column_specification::all_in_same_table(const std::vector<lw_shared_ptr<col
});
}
lw_shared_ptr<column_specification> make_column_spec(std::string_view ks_name, std::string_view cf_name, sstring name, data_type type) {
return make_lw_shared<column_specification>(
ks_name,
cf_name,
::make_shared<column_identifier>(std::move(name), true),
std::move(type));
}
}

@@ -42,4 +42,6 @@ public:
static bool all_in_same_table(const std::vector<lw_shared_ptr<column_specification>>& names);
};
lw_shared_ptr<column_specification> make_column_spec(std::string_view ks_name, std::string_view cf_name, sstring name, data_type type);
}

@@ -11,6 +11,11 @@
#pragma once
#include "restrictions/restrictions_config.hh"
#include "cql3/restrictions/replication_restrictions.hh"
#include "cql3/restrictions/twcs_restrictions.hh"
#include "cql3/restrictions/view_restrictions.hh"
#include "db/tri_mode_restriction.hh"
#include "utils/updateable_value.hh"
namespace db { class config; }
@@ -18,9 +23,44 @@ namespace cql3 {
struct cql_config {
restrictions::restrictions_config restrictions;
- explicit cql_config(const db::config& cfg) : restrictions(cfg) {}
+ replication_restrictions replication_restrictions;
+ twcs_restrictions twcs_restrictions;
+ view_restrictions view_restrictions;
+ utils::updateable_value<uint32_t> select_internal_page_size;
+ utils::updateable_value<db::tri_mode_restriction> strict_allow_filtering;
+ utils::updateable_value<bool> enable_parallelized_aggregation;
+ utils::updateable_value<uint32_t> batch_size_warn_threshold_in_kb;
+ utils::updateable_value<uint32_t> batch_size_fail_threshold_in_kb;
+ utils::updateable_value<bool> restrict_future_timestamp;
+ utils::updateable_value<bool> enable_create_table_with_compact_storage;
+ explicit cql_config(const db::config& cfg)
+ : restrictions(cfg)
+ , replication_restrictions(cfg)
+ , twcs_restrictions(cfg)
+ , view_restrictions(cfg)
+ , select_internal_page_size(cfg.select_internal_page_size)
+ , strict_allow_filtering(cfg.strict_allow_filtering)
+ , enable_parallelized_aggregation(cfg.enable_parallelized_aggregation)
+ , batch_size_warn_threshold_in_kb(cfg.batch_size_warn_threshold_in_kb)
+ , batch_size_fail_threshold_in_kb(cfg.batch_size_fail_threshold_in_kb)
+ , restrict_future_timestamp(cfg.restrict_future_timestamp)
+ , enable_create_table_with_compact_storage(cfg.enable_create_table_with_compact_storage)
+ {}
struct default_tag{};
- cql_config(default_tag) : restrictions(restrictions::restrictions_config::default_tag{}) {}
+ cql_config(default_tag)
+ : restrictions(restrictions::restrictions_config::default_tag{})
+ , replication_restrictions(replication_restrictions::default_tag{})
+ , twcs_restrictions(twcs_restrictions::default_tag{})
+ , view_restrictions(view_restrictions::default_tag{})
+ , select_internal_page_size(10000)
+ , strict_allow_filtering(db::tri_mode_restriction(db::tri_mode_restriction_t::mode::WARN))
+ , enable_parallelized_aggregation(true)
+ , batch_size_warn_threshold_in_kb(128)
+ , batch_size_fail_threshold_in_kb(1024)
+ , restrict_future_timestamp(true)
+ , enable_create_table_with_compact_storage(false)
+ {}
};
extern const cql_config default_cql_config;
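
The two-constructor shape of cql_config above (one constructor copying from db::config, one taking default_tag and hard-coding defaults) can be illustrated with a minimal standalone sketch. The names fake_db_config and sketch_cql_config are hypothetical stand-ins, plain values replace utils::updateable_value, and only two of the fields from the diff are modeled, so this is an illustration of the pattern rather than the actual ScyllaDB code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for db::config (assumption: only two fields modeled).
struct fake_db_config {
    uint32_t select_internal_page_size = 5000;
    bool enable_parallelized_aggregation = false;
};

// Hypothetical, reduced version of cql_config using plain values
// instead of utils::updateable_value<T>.
struct sketch_cql_config {
    uint32_t select_internal_page_size;
    bool enable_parallelized_aggregation;

    // Production path: every field is taken from the node configuration.
    explicit sketch_cql_config(const fake_db_config& cfg)
        : select_internal_page_size(cfg.select_internal_page_size)
        , enable_parallelized_aggregation(cfg.enable_parallelized_aggregation)
    {}

    // Tag-dispatched path: no config object is needed; the defaults
    // (10000 and true, as in the diff) are spelled out explicitly.
    struct default_tag {};
    sketch_cql_config(default_tag)
        : select_internal_page_size(10000)
        , enable_parallelized_aggregation(true)
    {}
};
```

The tag type keeps the "defaults" constructor from being selected by accident, which is presumably why the diff has to list a default for every newly added member.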

@@ -0,0 +1,21 @@
// Copyright (C) 2026-present ScyllaDB
// SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.0
#pragma once
#include <map>
#include "bytes.hh"
#include "mutation/timestamp.hh"
namespace cql3::expr {
// Per-element timestamps and TTLs for a cell of a map, set or UDT (populated
// when a WRITETIME() or TTL() of col[key] or col.field is in the query).
// Keys are the raw serialized keys or the serialized field index.
struct collection_cell_metadata {
std::map<bytes, api::timestamp_type> timestamps;
std::map<bytes, int32_t> ttls; // remaining TTL in seconds (-1 if no TTL)
};
} // namespace cql3::expr

@@ -3,6 +3,7 @@
#pragma once
#include "collection_cell_metadata.hh"
#include "expression.hh"
#include "bytes.hh"
@@ -27,6 +28,7 @@ struct evaluation_inputs {
std::span<const api::timestamp_type> static_and_regular_timestamps; // indexes match `selection` member
std::span<const int32_t> static_and_regular_ttls; // indexes match `selection` member
std::span<const cql3::raw_value> temporaries; // indexes match temporary::index
std::span<const collection_cell_metadata> collection_element_metadata; // indexes match `selection` member
};
// Takes a prepared expression and calculates its value.

@@ -1031,7 +1031,7 @@ expression search_and_replace(const expression& e,
return cast{c.style, recurse(c.arg), c.type};
},
[&] (const field_selection& fs) -> expression {
- return field_selection{recurse(fs.structure), fs.field};
+ return field_selection{recurse(fs.structure), fs.field, fs.field_idx, fs.type};
},
[&] (const subscript& s) -> expression {
return subscript {
@@ -1206,6 +1206,58 @@ cql3::raw_value do_evaluate(const field_selection& field_select, const evaluatio
static
cql3::raw_value
do_evaluate(const column_mutation_attribute& cma, const evaluation_inputs& inputs) {
// Helper for WRITETIME/TTL on a collection element or UDT field: given the
// inner column and the serialized element key, validate the index and look
// up the per-element timestamp or TTL in collection_element_metadata.
auto lookup_element_attribute = [&](const column_value* inner_col, std::string_view context, bytes key) -> cql3::raw_value {
int32_t index = inputs.selection->index_of(*inner_col->col);
if (inputs.collection_element_metadata.empty() || index < 0 || size_t(index) >= inputs.collection_element_metadata.size()) {
on_internal_error(expr_logger, fmt::format("evaluating column_mutation_attribute {}: column {} is not in selection",
context, inner_col->col->name_as_text()));
}
const auto& meta = inputs.collection_element_metadata[index];
switch (cma.kind) {
case column_mutation_attribute::attribute_kind::writetime: {
const auto it = meta.timestamps.find(key);
if (it == meta.timestamps.end()) {
return cql3::raw_value::make_null();
}
return raw_value::make_value(data_value(it->second).serialize());
}
case column_mutation_attribute::attribute_kind::ttl: {
const auto it = meta.ttls.find(key);
// The test it->second <= 0 (rather than < 0) matches the
// single-TTL check ttl_v <= 0 below.
if (it == meta.ttls.end() || it->second <= 0) {
return cql3::raw_value::make_null();
}
return raw_value::make_value(data_value(it->second).serialize());
}
}
on_internal_error(expr_logger, fmt::format("evaluating column_mutation_attribute {} with unexpected kind", context));
};
// Handle WRITETIME(x.field) / TTL(x.field) on a UDT field
if (auto fs = expr::as_if<field_selection>(&cma.column)) {
auto inner_col = expr::as_if<column_value>(&fs->structure);
if (!inner_col) {
on_internal_error(expr_logger, fmt::format("evaluating column_mutation_attribute field_selection: inner expression is not a column: {}", fs->structure));
}
return lookup_element_attribute(inner_col, "field_selection", serialize_field_index(fs->field_idx));
}
// Handle WRITETIME(m[key]) / TTL(m[key]) on a map element
if (auto sub = expr::as_if<subscript>(&cma.column)) {
auto inner_col = expr::as_if<column_value>(&sub->val);
if (!inner_col) {
on_internal_error(expr_logger, fmt::format("evaluating column_mutation_attribute subscript: inner expression is not a column: {}", sub->val));
}
auto evaluated_key = evaluate(sub->sub, inputs);
if (evaluated_key.is_null()) {
return cql3::raw_value::make_null();
}
return evaluated_key.view().with_linearized([&] (bytes_view key_bv) {
return lookup_element_attribute(inner_col, "subscript", bytes(key_bv));
});
}
auto col = expr::as_if<column_value>(&cma.column);
if (!col) {
on_internal_error(expr_logger, fmt::format("evaluating column_mutation_attribute of non-column {}", cma.column));
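
The null-handling rules in lookup_element_attribute above can be sketched in isolation: a missing key yields null for both WRITETIME and TTL, and a stored TTL of zero or less also yields null. In this sketch std::string stands in for bytes, int64_t for api::timestamp_type, and std::optional for a nullable cql3::raw_value; element_metadata, element_writetime and element_ttl are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <optional>
#include <string>

// Simplified stand-in for collection_cell_metadata (assumption:
// std::string replaces bytes, int64_t replaces api::timestamp_type).
struct element_metadata {
    std::map<std::string, int64_t> timestamps;
    std::map<std::string, int32_t> ttls;
};

// WRITETIME of one element: nullopt (CQL null) when the key has no entry.
std::optional<int64_t> element_writetime(const element_metadata& meta, const std::string& key) {
    const auto it = meta.timestamps.find(key);
    if (it == meta.timestamps.end()) {
        return std::nullopt;
    }
    return it->second;
}

// TTL of one element: nullopt when the key is missing or the stored
// TTL is <= 0, mirroring the `it->second <= 0` test in the diff.
std::optional<int32_t> element_ttl(const element_metadata& meta, const std::string& key) {
    const auto it = meta.ttls.find(key);
    if (it == meta.ttls.end() || it->second <= 0) {
        return std::nullopt;
    }
    return it->second;
}
```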

@@ -1259,6 +1259,40 @@ prepare_column_mutation_attribute(
receiver->type->name(), receiver->name->text()));
}
auto column = prepare_expression(cma.column, db, keyspace, schema_opt, nullptr);
// Helper for the subscript and field-selection cases below: validates that
// inner_expr is a column, not a primary key column, that its type satisfies
// type_allowed, and that the cluster feature flag is on.
auto validate_and_return =
[&](const expression& inner_expr, std::string_view context,
auto type_allowed, std::string_view type_allowed_str) -> std::optional<expression> {
auto inner_cval = expr::as_if<column_value>(&inner_expr);
if (!inner_cval) {
throw exceptions::invalid_request_exception(fmt::format("{} on a {} expects a column, got {}", cma.kind, context, inner_expr));
}
if (inner_cval->col->is_primary_key()) {
throw exceptions::invalid_request_exception(fmt::format("{} is not legal on primary key component {}", cma.kind, inner_cval->col->name_as_text()));
}
if (!type_allowed(inner_cval->col->type)) {
throw exceptions::invalid_request_exception(fmt::format("{} on a {} is only valid for {}", cma.kind, context, type_allowed_str));
}
if (!db.features().writetime_ttl_individual_element) {
throw exceptions::invalid_request_exception(fmt::format(
"{} on a {} is not supported until all nodes in the cluster are upgraded", cma.kind, context));
}
return column_mutation_attribute{.kind = cma.kind, .column = std::move(column)};
};
// Handle WRITETIME(m[key]) / TTL(m[key]) - a subscript into a non-frozen map or set column
if (auto sub = expr::as_if<subscript>(&column)) {
return validate_and_return(sub->val, "subscript",
[](const data_type& t) { return (t->is_map() || t->is_set()) && t->is_multi_cell(); },
"non-frozen map or set columns");
}
// Handle WRITETIME(x.field) / TTL(x.field) - a field selection into a non-frozen UDT column
if (auto fs = expr::as_if<field_selection>(&column)) {
return validate_and_return(fs->structure, "field selection",
[](const data_type& t) { return t->is_user_type() && t->is_multi_cell(); },
"non-frozen UDT columns");
}
auto cval = expr::as_if<column_value>(&column);
if (!cval) {
throw exceptions::invalid_request_exception(fmt::format("{} expects a column, but {} is a general expression", cma.kind, column));
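
The two type predicates passed to validate_and_return above can be summarized with a small standalone sketch. The enum type_kind and struct sketch_type are hypothetical simplifications of data_type; the predicates only restate the conditions visible in the diff (subscript: non-frozen map or set; field selection: non-frozen UDT).

```cpp
#include <cassert>

// Hypothetical, reduced model of a CQL type: its kind and whether it is
// multi-cell (non-frozen). Not the real data_type interface.
enum class type_kind { map, set, list, user_type, other };

struct sketch_type {
    type_kind kind;
    bool multi_cell;
};

// WRITETIME(col[key]) / TTL(col[key]): only non-frozen maps and sets.
bool subscript_allowed(const sketch_type& t) {
    return (t.kind == type_kind::map || t.kind == type_kind::set) && t.multi_cell;
}

// WRITETIME(col.field) / TTL(col.field): only non-frozen UDTs.
bool field_selection_allowed(const sketch_type& t) {
    return t.kind == type_kind::user_type && t.multi_cell;
}
```

Under this model a frozen map or any list is rejected for the subscript form, which matches the "non-frozen map or set columns" error text in the diff.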
@@ -1654,6 +1688,12 @@ static lw_shared_ptr<column_specification> get_lhs_receiver(const expression& pr
return list_value_spec_of(*sub_col.col->column_specification);
}
},
[&](const field_selection& fs) -> lw_shared_ptr<column_specification> {
return make_lw_shared<column_specification>(
schema.ks_name(), schema.cf_name(),
::make_shared<column_identifier>(fs.field->text(), true),
fs.type);
},
[&](const tuple_constructor& tup) -> lw_shared_ptr<column_specification> {
std::ostringstream tuple_name;
tuple_name << "(";

@@ -560,6 +560,11 @@ query_processor::acquire_strongly_consistent_coordinator() {
return {remote_.get().sc_coordinator, std::move(holder)};
}
service::storage_service& query_processor::storage_service() {
auto [remote_, holder] = remote();
return remote_.get().ss;
}
void query_processor::start_remote(service::migration_manager& mm, service::mapreduce_service& mapreducer,
service::storage_service& ss, service::raft_group0_client& group0_client,
service::strong_consistency::coordinator& sc_coordinator) {
@@ -786,7 +791,7 @@ query_processor::get_statement(const std::string_view& query, const service::cli
cf_stmt->prepare_keyspace(client_state);
}
++_stats.prepare_invocations;
- auto p = statement->prepare(_db, _cql_stats);
+ auto p = statement->prepare(_db, _cql_stats, _cql_config);
p->statement->raw_cql_statement = sstring(query);
auto audit_info = p->statement->get_audit_info();
if (audit_info) {
@@ -901,7 +906,7 @@ query_options query_processor::make_internal_options(
statements::prepared_statement::checked_weak_ptr query_processor::prepare_internal(const sstring& query_string) {
auto& p = _internal_statements[query_string];
if (p == nullptr) {
- auto np = parse_statement(query_string, internal_dialect())->prepare(_db, _cql_stats);
+ auto np = parse_statement(query_string, internal_dialect())->prepare(_db, _cql_stats, _cql_config);
np->statement->raw_cql_statement = query_string;
p = std::move(np); // inserts it into map
}
@@ -1012,7 +1017,7 @@ query_processor::execute_internal(
return execute_with_params(std::move(p), cl, query_state, values);
} else {
// For internal queries, we want the default dialect, not the user provided one
- auto p = parse_statement(query_string, dialect{})->prepare(_db, _cql_stats);
+ auto p = parse_statement(query_string, dialect{})->prepare(_db, _cql_stats, _cql_config);
p->statement->raw_cql_statement = query_string;
auto checked_weak_ptr = p->checked_weak_from_this();
return execute_with_params(std::move(checked_weak_ptr), cl, query_state, values).finally([p = std::move(p)] {});
@@ -1071,6 +1076,11 @@ query_processor::execute_batch_without_checking_exception_message(
query_options& options,
std::unordered_map<prepared_cache_key_type, authorized_prepared_statements_cache::value_type> pending_authorization_entries) {
auto access_future = co_await coroutine::as_future(batch->check_access(*this, query_state.get_client_state()));
bool failed = access_future.failed();
co_await audit::inspect(batch, query_state, options, failed);
if (failed) {
std::rethrow_exception(access_future.get_exception());
}
co_await coroutine::parallel_for_each(pending_authorization_entries, [this, &query_state] (auto& e) -> future<> {
try {
co_await _authorized_prepared_cache.insert(*query_state.get_client_state().user(), e.first, std::move(e.second));
@@ -1078,11 +1088,6 @@ query_processor::execute_batch_without_checking_exception_message(
log.error("failed to cache the entry: {}", std::current_exception());
}
});
- bool failed = access_future.failed();
- co_await audit::inspect(batch, query_state, options, failed);
- if (access_future.failed()) {
- std::rethrow_exception(access_future.get_exception());
- }
- batch->validate();
+ batch->validate(*this, query_state.get_client_state());
_stats.queries_by_cl[size_t(options.get_consistency())] += batch->get_statements().size();

@@ -209,6 +209,8 @@ public:
return _proxy;
}
service::storage_service& storage_service();
std::pair<std::reference_wrapper<service::strong_consistency::coordinator>, gate::holder>
acquire_strongly_consistent_coordinator();

@@ -0,0 +1,47 @@
/*
* Copyright (C) 2026-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
#include "db/config.hh"
#include "utils/updateable_value.hh"
namespace cql3 {
struct replication_restrictions {
utils::updateable_value<db::tri_mode_restriction> restrict_replication_simplestrategy;
utils::updateable_value<std::vector<enum_option<db::replication_strategy_restriction_t>>> replication_strategy_warn_list;
utils::updateable_value<std::vector<enum_option<db::replication_strategy_restriction_t>>> replication_strategy_fail_list;
utils::updateable_value<int> minimum_replication_factor_fail_threshold;
utils::updateable_value<int> minimum_replication_factor_warn_threshold;
utils::updateable_value<int> maximum_replication_factor_fail_threshold;
utils::updateable_value<int> maximum_replication_factor_warn_threshold;
explicit replication_restrictions(const db::config& cfg)
: restrict_replication_simplestrategy(cfg.restrict_replication_simplestrategy)
, replication_strategy_warn_list(cfg.replication_strategy_warn_list)
, replication_strategy_fail_list(cfg.replication_strategy_fail_list)
, minimum_replication_factor_fail_threshold(cfg.minimum_replication_factor_fail_threshold)
, minimum_replication_factor_warn_threshold(cfg.minimum_replication_factor_warn_threshold)
, maximum_replication_factor_fail_threshold(cfg.maximum_replication_factor_fail_threshold)
, maximum_replication_factor_warn_threshold(cfg.maximum_replication_factor_warn_threshold)
{}
struct default_tag{};
replication_restrictions(default_tag)
: restrict_replication_simplestrategy(db::tri_mode_restriction(db::tri_mode_restriction_t::mode::FALSE))
, replication_strategy_warn_list(std::vector<enum_option<db::replication_strategy_restriction_t>>{})
, replication_strategy_fail_list(std::vector<enum_option<db::replication_strategy_restriction_t>>{})
, minimum_replication_factor_fail_threshold(-1)
, minimum_replication_factor_warn_threshold(3)
, maximum_replication_factor_fail_threshold(-1)
, maximum_replication_factor_warn_threshold(-1)
{}
};
} // namespace cql3

@@ -0,0 +1,32 @@
/*
* Copyright (C) 2026-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
#include "db/config.hh"
#include "utils/updateable_value.hh"
namespace cql3 {
struct twcs_restrictions {
utils::updateable_value<uint32_t> twcs_max_window_count;
utils::updateable_value<db::tri_mode_restriction> restrict_twcs_without_default_ttl;
explicit twcs_restrictions(const db::config& cfg)
: twcs_max_window_count(cfg.twcs_max_window_count)
, restrict_twcs_without_default_ttl(cfg.restrict_twcs_without_default_ttl)
{}
struct default_tag{};
twcs_restrictions(default_tag)
: twcs_max_window_count(10000)
, restrict_twcs_without_default_ttl(db::tri_mode_restriction(db::tri_mode_restriction_t::mode::WARN))
{}
};
} // namespace cql3

@@ -0,0 +1,32 @@
/*
* Copyright (C) 2019-present ScyllaDB
*/
/*
* SPDX-License-Identifier: LicenseRef-ScyllaDB-Source-Available-1.1
*/
#pragma once
#include "db/config.hh"
#include "db/tri_mode_restriction.hh"
#include "utils/updateable_value.hh"
namespace db { class config; }
namespace cql3 {
struct view_restrictions {
utils::updateable_value<db::tri_mode_restriction> strict_is_not_null_in_views;
explicit view_restrictions(const db::config& cfg)
: strict_is_not_null_in_views(cfg.strict_is_not_null_in_views)
{}
struct default_tag{};
view_restrictions(default_tag)
: strict_is_not_null_in_views(db::tri_mode_restriction(db::tri_mode_restriction_t::mode::WARN))
{}
};
}

@@ -17,6 +17,7 @@
#include "cql3/expr/expr-utils.hh"
#include "cql3/functions/first_function.hh"
#include "cql3/functions/aggregate_fcts.hh"
#include "types/types.hh"
#include <ranges>
@@ -31,12 +32,14 @@ selection::selection(schema_ptr schema,
std::vector<lw_shared_ptr<column_specification>> metadata_,
bool collect_timestamps,
bool collect_TTLs,
bool collect_collection_timestamps,
trivial is_trivial)
: _schema(std::move(schema))
, _columns(std::move(columns))
, _metadata(::make_shared<metadata>(std::move(metadata_)))
, _collect_timestamps(collect_timestamps)
, _collect_TTLs(collect_TTLs)
, _collect_collection_timestamps(collect_collection_timestamps)
, _contains_static_columns(std::any_of(_columns.begin(), _columns.end(), std::mem_fn(&column_definition::is_static)))
, _is_trivial(is_trivial)
{ }
@@ -46,6 +49,7 @@ query::partition_slice::option_set selection::get_query_options() {
opts.set_if<query::partition_slice::option::send_timestamp>(_collect_timestamps);
opts.set_if<query::partition_slice::option::send_expiry>(_collect_TTLs);
opts.set_if<query::partition_slice::option::send_collection_timestamps>(_collect_collection_timestamps);
opts.set_if<query::partition_slice::option::send_partition_key>(
std::any_of(_columns.begin(), _columns.end(),
@@ -114,7 +118,7 @@ public:
*/
simple_selection(schema_ptr schema, std::vector<const column_definition*> columns,
std::vector<lw_shared_ptr<column_specification>> metadata, bool is_wildcard)
- : selection(schema, std::move(columns), std::move(metadata), false, false, trivial::yes)
+ : selection(schema, std::move(columns), std::move(metadata), false, false, false, trivial::yes)
, _is_wildcard(is_wildcard)
{ }
@@ -178,6 +182,12 @@ contains_column_mutation_attribute(expr::column_mutation_attribute::attribute_ki
});
}
static bool contains_collection_mutation_attribute(const expr::expression& e) {
return expr::find_in_expression<expr::column_mutation_attribute>(e, [](const expr::column_mutation_attribute& cma) {
return expr::is<expr::subscript>(cma.column) || expr::is<expr::field_selection>(cma.column);
});
}
static
bool
contains_writetime(const expr::expression& e) {
@@ -202,7 +212,8 @@ public:
std::vector<expr::expression> selectors)
: selection(schema, std::move(columns), std::move(metadata),
contains_writetime(expr::tuple_constructor{selectors}),
- contains_ttl(expr::tuple_constructor{selectors}))
+ contains_ttl(expr::tuple_constructor{selectors}),
+ contains_collection_mutation_attribute(expr::tuple_constructor{selectors}))
, _selectors(std::move(selectors))
{
auto agg_split = expr::split_aggregation(_selectors);
@@ -391,6 +402,7 @@ protected:
.static_and_regular_timestamps = rs._timestamps,
.static_and_regular_ttls = rs._ttls,
.temporaries = {},
.collection_element_metadata = rs._collection_element_metadata,
};
for (auto&& e : _sel._selectors) {
auto out = expr::evaluate(e, inputs);
@@ -429,6 +441,7 @@ protected:
.static_and_regular_timestamps = rs._timestamps,
.static_and_regular_ttls = rs._ttls,
.temporaries = _temporaries,
.collection_element_metadata = rs._collection_element_metadata,
};
for (size_t i = 0; i != _sel._inner_loop.size(); ++i) {
_temporaries[i] = expr::evaluate(_sel._inner_loop[i], inputs);
@@ -553,6 +566,9 @@ result_set_builder::result_set_builder(const selection& s, gc_clock::time_point
if (s._collect_TTLs) {
_ttls.resize(s._columns.size(), 0);
}
if (s._collect_collection_timestamps) {
_collection_element_metadata.resize(s._columns.size());
}
}
void result_set_builder::add_empty() {
@@ -563,6 +579,9 @@ void result_set_builder::add_empty() {
if (!_ttls.empty()) {
_ttls[current.size() - 1] = -1;
}
if (!_collection_element_metadata.empty()) {
_collection_element_metadata[current.size() - 1] = {};
}
}
void result_set_builder::add(bytes_opt value) {
@@ -585,8 +604,45 @@ void result_set_builder::add(const column_definition& def, const query::result_a
}
void result_set_builder::add_collection(const column_definition& def, bytes_view c) {
size_t col_idx = current.size();
if (!_collection_element_metadata.empty()) {
// Extended format produced by serialize_for_cql_with_timestamps():
// [uint32 cql_len][cql bytes][int32 entry_count]
// followed by entry_count entries, each:
// [int32 key_len][key bytes][int64 timestamp][int64 expiry_raw]
// where expiry_raw is -1 if the element does not expire, otherwise
// it is the serialized gc_clock time used to derive the remaining
// TTL. This extended format is used (instead of a plain CQL
// collection blob) only when _collect_collection_timestamps is
// true, and that flag is only enabled when a feature flag
// guarantees both reader and writer support it.
uint32_t cql_len = read_simple<uint32_t>(c);
bytes_view cql_bytes = read_simple_bytes(c, cql_len);
current.emplace_back(to_bytes(cql_bytes));
auto& meta = _collection_element_metadata[col_idx];
meta = {}; // clear stale data from previous row
int32_t entry_count = read_simple<int32_t>(c);
for (int32_t i = 0; i < entry_count; ++i) {
int32_t key_len = read_simple<int32_t>(c);
bytes key = to_bytes(read_simple_bytes(c, key_len));
int64_t ts = read_simple<int64_t>(c);
int64_t expiry_raw = read_simple<int64_t>(c);
meta.timestamps[key] = ts;
if (expiry_raw != -1) {
auto expiry = gc_clock::time_point(gc_clock::duration(expiry_raw));
auto ttl_left = expiry - _now;
int32_t ttl = int32_t(ttl_left.count());
if (ttl > 0) {
meta.ttls[key] = ttl;
}
// Otherwise (expired, or no TTL at all) the key is omitted from the
// map; the evaluator treats a missing key as null.
}
}
return;
}
current.emplace_back(to_bytes(c));
// timestamps, ttls meaningless for collections
}
void result_set_builder::update_last_group() {
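
The extended wire layout described in the add_collection comment above can be exercised with a small standalone decoder. This is only a sketch: it assumes the integers are big-endian (as read_simple is understood to read them), uses std::string in place of bytes, represents gc_clock times as raw int64_t counts, and introduces the hypothetical helpers read_be, write_be and decode_collection_with_metadata purely for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <map>
#include <stdexcept>
#include <string>
#include <string_view>

// Read a big-endian integer of sizeof(T) bytes and advance the view
// (assumption: the real format is big-endian).
template <typename T>
T read_be(std::string_view& in) {
    if (in.size() < sizeof(T)) {
        throw std::runtime_error("truncated input");
    }
    uint64_t v = 0;
    for (size_t i = 0; i < sizeof(T); ++i) {
        v = (v << 8) | static_cast<unsigned char>(in[i]);
    }
    in.remove_prefix(sizeof(T));
    return static_cast<T>(v);
}

// Append an integer in big-endian order (only needed to build test input).
template <typename T>
void write_be(std::string& out, T value) {
    const auto v = static_cast<uint64_t>(value);
    for (size_t i = sizeof(T); i-- > 0;) {
        out.push_back(static_cast<char>((v >> (8 * i)) & 0xff));
    }
}

struct decoded_collection {
    std::string cql_bytes;                      // plain CQL collection blob
    std::map<std::string, int64_t> timestamps;  // per-element write time
    std::map<std::string, int32_t> ttls;        // remaining TTL, live elements only
};

// Layout: [uint32 cql_len][cql bytes][int32 entry_count], then per entry
// [int32 key_len][key bytes][int64 timestamp][int64 expiry_raw].
decoded_collection decode_collection_with_metadata(std::string_view in, int64_t now) {
    decoded_collection out;
    const auto cql_len = read_be<uint32_t>(in);
    if (in.size() < cql_len) {
        throw std::runtime_error("truncated cql bytes");
    }
    out.cql_bytes = std::string(in.substr(0, cql_len));
    in.remove_prefix(cql_len);
    const auto entry_count = read_be<int32_t>(in);
    for (int32_t i = 0; i < entry_count; ++i) {
        const auto key_len = read_be<int32_t>(in);
        if (key_len < 0 || in.size() < static_cast<size_t>(key_len)) {
            throw std::runtime_error("bad key length");
        }
        std::string key(in.substr(0, static_cast<size_t>(key_len)));
        in.remove_prefix(static_cast<size_t>(key_len));
        const auto ts = read_be<int64_t>(in);
        const auto expiry_raw = read_be<int64_t>(in);
        out.timestamps[key] = ts;
        if (expiry_raw != -1) {
            const auto ttl = static_cast<int32_t>(expiry_raw - now);
            if (ttl > 0) {
                out.ttls[key] = ttl;  // expired elements are left out
            }
        }
    }
    return out;
}
```

Unlike the code in the diff, the sketch adds explicit bounds checks, since it has no feature flag guaranteeing that the input is well formed.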

@@ -12,6 +12,7 @@
#include "utils/assert.hh"
#include "bytes.hh"
#include "cql3/expr/collection_cell_metadata.hh"
#include "schema/schema_fwd.hh"
#include "query/query-result-reader.hh"
#include "selector.hh"
@@ -69,6 +70,7 @@ private:
::shared_ptr<metadata> _metadata;
const bool _collect_timestamps;
const bool _collect_TTLs;
const bool _collect_collection_timestamps;
const bool _contains_static_columns;
bool _is_trivial;
protected:
@@ -78,7 +80,9 @@ protected:
std::vector<const column_definition*> columns,
std::vector<lw_shared_ptr<column_specification>> metadata_,
bool collect_timestamps,
- bool collect_TTLs, trivial is_trivial = trivial::no);
+ bool collect_TTLs,
+ bool collect_collection_timestamps,
+ trivial is_trivial = trivial::no);
virtual ~selection() {}
public:
@@ -197,6 +201,7 @@ public:
std::vector<bytes> current_clustering_key;
std::vector<api::timestamp_type> _timestamps;
std::vector<int32_t> _ttls;
std::vector<cql3::expr::collection_cell_metadata> _collection_element_metadata;
const query_options* _options;
private:
const gc_clock::time_point _now;

@@ -27,6 +27,7 @@
#include "data_dictionary/data_dictionary.hh"
#include "data_dictionary/keyspace_metadata.hh"
#include "cql3/query_processor.hh"
#include "cql3/cql_config.hh"
#include "cql3/statements/ks_prop_defs.hh"
#include "create_keyspace_statement.hh"
#include "gms/feature_service.hh"
@@ -260,14 +261,14 @@ cql3::statements::alter_keyspace_statement::prepare_schema_mutations(query_proce
}
std::unique_ptr<cql3::statements::prepared_statement>
- cql3::statements::alter_keyspace_statement::prepare(data_dictionary::database db, cql_stats& stats) {
+ cql3::statements::alter_keyspace_statement::prepare(data_dictionary::database db, cql_stats& stats, const cql_config& cfg) {
return std::make_unique<prepared_statement>(audit_info(), make_shared<alter_keyspace_statement>(*this));
}
future<::shared_ptr<cql_transport::messages::result_message>>
cql3::statements::alter_keyspace_statement::execute(query_processor& qp, service::query_state& state, const query_options& options, std::optional<service::group0_guard> guard) const {
- std::vector<sstring> warnings = check_against_restricted_replication_strategies(qp, keyspace(), *_attrs, qp.get_cql_stats());
+ std::vector<sstring> warnings = check_against_restricted_replication_strategies(qp, keyspace(), *_attrs, qp.get_cql_stats(), qp.get_cql_config().replication_restrictions);
return schema_altering_statement::execute(qp, state, options, std::move(guard)).then([warnings = std::move(warnings)] (::shared_ptr<messages::result_message> msg) {
for (const auto& warning : warnings) {
msg->add_warning(warning);

@@ -37,7 +37,7 @@ public:
future<> check_access(query_processor& qp, const service::client_state& state) const override;
void validate(query_processor& qp, const service::client_state& state) const override;
virtual future<std::tuple<::shared_ptr<event_t>, cql3::cql_warnings_vec>> prepare_schema_mutations(query_processor& qp, service::query_state& state, const query_options& options, service::group0_batch& mc) const override;
- virtual std::unique_ptr<prepared_statement> prepare(data_dictionary::database db, cql_stats& stats) override;
+ virtual std::unique_ptr<prepared_statement> prepare(data_dictionary::database db, cql_stats& stats, const cql_config& cfg) override;
virtual future<::shared_ptr<messages::result_message>> execute(query_processor& qp, service::query_state& state, const query_options& options, std::optional<service::group0_guard> guard) const override;
bool changes_tablets(query_processor& qp) const;
};

@@ -33,7 +33,7 @@ public:
, _options(std::move(options)) {
}
- std::unique_ptr<prepared_statement> prepare(data_dictionary::database db, cql_stats& stats) override;
+ std::unique_ptr<prepared_statement> prepare(data_dictionary::database db, cql_stats& stats, const cql_config& cfg) override;
virtual future<> check_access(query_processor& qp, const service::client_state&) const override;

@@ -25,7 +25,7 @@ alter_service_level_statement::alter_service_level_statement(sstring service_lev
std::unique_ptr<cql3::statements::prepared_statement>
cql3::statements::alter_service_level_statement::prepare(
- data_dictionary::database db, cql_stats &stats) {
+ data_dictionary::database db, cql_stats &stats, const cql_config& cfg) {
return std::make_unique<prepared_statement>(audit_info(), ::make_shared<alter_service_level_statement>(*this));
}

@@ -23,7 +23,7 @@ class alter_service_level_statement final : public service_level_statement {
public:
alter_service_level_statement(sstring service_level, shared_ptr<sl_prop_defs> attrs);
- std::unique_ptr<cql3::statements::prepared_statement> prepare(data_dictionary::database db, cql_stats &stats) override;
+ std::unique_ptr<cql3::statements::prepared_statement> prepare(data_dictionary::database db, cql_stats &stats, const cql_config& cfg) override;
virtual future<> check_access(query_processor& qp, const service::client_state&) const override;
virtual future<::shared_ptr<cql_transport::messages::result_message>>
execute(query_processor&, service::query_state&, const query_options&, std::optional<service::group0_guard> guard) const override;

@@ -14,6 +14,7 @@
#include "utils/assert.hh"
#include <seastar/core/coroutine.hh>
#include "cql3/query_options.hh"
#include "cql3/cql_config.hh"
#include "cql3/statements/alter_table_statement.hh"
#include "cql3/statements/alter_type_statement.hh"
#include "exceptions/exceptions.hh"
@@ -560,7 +561,7 @@ alter_table_statement::prepare_schema_mutations(query_processor& qp, const query
}
std::unique_ptr<cql3::statements::prepared_statement>
- cql3::statements::alter_table_statement::prepare(data_dictionary::database db, cql_stats& stats) {
+ cql3::statements::alter_table_statement::prepare(data_dictionary::database db, cql_stats& stats, const cql_config& cfg) {
// Cannot happen; alter_table_statement is never instantiated as a raw statement
// (instead we instantiate alter_table_statement::raw_statement)
utils::on_internal_error("alter_table_statement cannot be prepared. Use alter_table_statement::raw_statement instead");
@@ -589,10 +590,10 @@ alter_table_statement::raw_statement::raw_statement(cf_name name,
{}
std::unique_ptr<cql3::statements::prepared_statement>
- alter_table_statement::raw_statement::prepare(data_dictionary::database db, cql_stats& stats) {
+ alter_table_statement::raw_statement::prepare(data_dictionary::database db, cql_stats& stats, const cql_config& cfg) {
auto t = db.try_find_table(keyspace(), column_family());
std::optional<schema_ptr> s = t ? std::make_optional(t->schema()) : std::nullopt;
- std::optional<sstring> warning = check_restricted_table_properties(db, s, keyspace(), column_family(), *_properties);
+ std::optional<sstring> warning = check_restricted_table_properties(s, keyspace(), column_family(), *_properties, cfg.twcs_restrictions);
if (warning) {
// FIXME: should this warning be returned to the caller?
// See https://github.com/scylladb/scylladb/issues/20945

@@ -64,7 +64,7 @@ public:
virtual uint32_t get_bound_terms() const override;
virtual future<> check_access(query_processor& qp, const service::client_state& state) const override;
virtual std::unique_ptr<prepared_statement> prepare(data_dictionary::database db, cql_stats& stats) override;
virtual std::unique_ptr<prepared_statement> prepare(data_dictionary::database db, cql_stats& stats, const cql_config& cfg) override;
virtual future<::shared_ptr<messages::result_message>> execute(query_processor& qp, service::query_state& state, const query_options& options, std::optional<service::group0_guard> guard) const override;
future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, utils::chunked_vector<mutation>, cql3::cql_warnings_vec>> prepare_schema_mutations(query_processor& qp, const query_options& options, api::timestamp_type) const override;
@@ -92,7 +92,7 @@ public:
std::unique_ptr<attributes::raw> attrs,
shared_ptr<column_identifier::raw> ttl_change);
virtual std::unique_ptr<prepared_statement> prepare(data_dictionary::database db, cql_stats& stats) override;
virtual std::unique_ptr<prepared_statement> prepare(data_dictionary::database db, cql_stats& stats, const cql_config& cfg) override;
virtual audit::statement_category category() const override { return audit::statement_category::DDL; }
};


@@ -209,12 +209,12 @@ user_type alter_type_statement::renames::make_updated_type(data_dictionary::data
}
std::unique_ptr<cql3::statements::prepared_statement>
alter_type_statement::add_or_alter::prepare(data_dictionary::database db, cql_stats& stats) {
alter_type_statement::add_or_alter::prepare(data_dictionary::database db, cql_stats& stats, const cql_config& cfg) {
return std::make_unique<prepared_statement>(audit_info(), make_shared<alter_type_statement::add_or_alter>(*this));
}
std::unique_ptr<cql3::statements::prepared_statement>
alter_type_statement::renames::prepare(data_dictionary::database db, cql_stats& stats) {
alter_type_statement::renames::prepare(data_dictionary::database db, cql_stats& stats, const cql_config& cfg) {
return std::make_unique<prepared_statement>(audit_info(), make_shared<alter_type_statement::renames>(*this));
}


@@ -54,7 +54,7 @@ public:
const shared_ptr<column_identifier> field_name,
const shared_ptr<cql3_type::raw> field_type);
virtual user_type make_updated_type(data_dictionary::database db, user_type to_update) const override;
virtual std::unique_ptr<prepared_statement> prepare(data_dictionary::database db, cql_stats& stats) override;
virtual std::unique_ptr<prepared_statement> prepare(data_dictionary::database db, cql_stats& stats, const cql_config& cfg) override;
private:
user_type do_add(data_dictionary::database db, user_type to_update) const;
user_type do_alter(data_dictionary::database db, user_type to_update) const;
@@ -71,7 +71,7 @@ public:
void add_rename(shared_ptr<column_identifier> previous_name, shared_ptr<column_identifier> new_name);
virtual user_type make_updated_type(data_dictionary::database db, user_type to_update) const override;
virtual std::unique_ptr<prepared_statement> prepare(data_dictionary::database db, cql_stats& stats) override;
virtual std::unique_ptr<prepared_statement> prepare(data_dictionary::database db, cql_stats& stats, const cql_config& cfg) override;
};
}


@@ -98,7 +98,7 @@ future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, utils::chun
}
std::unique_ptr<cql3::statements::prepared_statement>
alter_view_statement::prepare(data_dictionary::database db, cql_stats& stats) {
alter_view_statement::prepare(data_dictionary::database db, cql_stats& stats, const cql_config& cfg) {
return std::make_unique<prepared_statement>(audit_info(), make_shared<alter_view_statement>(*this));
}


@@ -35,7 +35,7 @@ public:
future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, utils::chunked_vector<mutation>, cql3::cql_warnings_vec>> prepare_schema_mutations(query_processor& qp, const query_options& options, api::timestamp_type) const override;
virtual std::unique_ptr<prepared_statement> prepare(data_dictionary::database db, cql_stats& stats) override;
virtual std::unique_ptr<prepared_statement> prepare(data_dictionary::database db, cql_stats& stats, const cql_config& cfg) override;
};
}


@@ -30,7 +30,7 @@ bool attach_service_level_statement::needs_guard(query_processor& qp, service::q
std::unique_ptr<cql3::statements::prepared_statement>
cql3::statements::attach_service_level_statement::prepare(
data_dictionary::database db, cql_stats &stats) {
data_dictionary::database db, cql_stats &stats, const cql_config& cfg) {
return std::make_unique<prepared_statement>(audit_info(), ::make_shared<attach_service_level_statement>(*this));
}


@@ -22,7 +22,7 @@ class attach_service_level_statement final : public service_level_statement {
public:
attach_service_level_statement(sstring service_level, sstring role_name);
virtual bool needs_guard(query_processor& qp, service::query_state&) const override;
std::unique_ptr<cql3::statements::prepared_statement> prepare(data_dictionary::database db, cql_stats &stats) override;
std::unique_ptr<cql3::statements::prepared_statement> prepare(data_dictionary::database db, cql_stats &stats, const cql_config& cfg) override;
virtual future<> check_access(query_processor& qp, const service::client_state&) const override;
virtual future<::shared_ptr<cql_transport::messages::result_message>>
execute(query_processor&, service::query_state&, const query_options&, std::optional<service::group0_guard> guard) const override;


@@ -10,7 +10,7 @@
#include "batch_statement.hh"
#include "cql3/util.hh"
#include "raw/batch_statement.hh"
#include "db/config.hh"
#include "cql3/cql_config.hh"
#include "db/consistency_level_validations.hh"
#include "data_dictionary/data_dictionary.hh"
#include <seastar/core/execution_stage.hh>
@@ -195,8 +195,8 @@ void batch_statement::verify_batch_size(query_processor& qp, const utils::chunke
return; // We only warn for batch spanning multiple mutations
}
size_t warn_threshold = qp.db().get_config().batch_size_warn_threshold_in_kb() * 1024;
size_t fail_threshold = qp.db().get_config().batch_size_fail_threshold_in_kb() * 1024;
size_t warn_threshold = qp.get_cql_config().batch_size_warn_threshold_in_kb() * 1024;
size_t fail_threshold = qp.get_cql_config().batch_size_fail_threshold_in_kb() * 1024;
size_t size = 0;
for (auto&m : mutations) {
@@ -242,7 +242,7 @@ future<shared_ptr<cql_transport::messages::result_message>> batch_statement::exe
future<shared_ptr<cql_transport::messages::result_message>> batch_statement::execute_without_checking_exception_message(
query_processor& qp, service::query_state& state, const query_options& options, std::optional<service::group0_guard> guard) const {
cql3::util::validate_timestamp(qp.db().get_config(), options, _attrs);
cql3::util::validate_timestamp(qp.get_cql_config(), options, _attrs);
return batch_stage(this, seastar::ref(qp), seastar::ref(state),
seastar::cref(options), false, options.get_timestamp(state));
}
@@ -441,7 +441,7 @@ void batch_statement::build_cas_result_set_metadata() {
namespace raw {
std::unique_ptr<prepared_statement>
batch_statement::prepare(data_dictionary::database db, cql_stats& stats) {
batch_statement::prepare(data_dictionary::database db, cql_stats& stats, const cql_config& cfg) {
auto&& meta = get_prepare_context();
std::optional<sstring> first_ks;


@@ -197,7 +197,7 @@ void cf_prop_defs::validate(const data_dictionary::database db, sstring ks_name,
if (!db.features().tablet_options) {
throw exceptions::configuration_exception("tablet options cannot be used until all nodes in the cluster enable this feature");
}
db::tablet_options::validate(*tablet_options_map);
db::tablet_options::validate(*tablet_options_map, db.features());
}
if (has_property(KW_STORAGE_ENGINE)) {
@@ -206,9 +206,6 @@ void cf_prop_defs::validate(const data_dictionary::database db, sstring ks_name,
if (!db.features().logstor) {
throw exceptions::configuration_exception(format("The experimental feature 'logstor' must be enabled in order to use the 'logstor' storage engine."));
}
if (!db.get_config().enable_logstor()) {
throw exceptions::configuration_exception(format("The configuration option 'enable_logstor' must be set to true in the configuration in order to use the 'logstor' storage engine."));
}
} else {
throw exceptions::configuration_exception(format("Illegal value for '{}'", KW_STORAGE_ENGINE));
}


@@ -78,7 +78,7 @@ seastar::future<shared_ptr<db::functions::function>> create_aggregate_statement:
co_return ::make_shared<functions::user_aggregate>(_name, initcond, std::move(state_func), std::move(reduce_func), std::move(final_func));
}
std::unique_ptr<prepared_statement> create_aggregate_statement::prepare(data_dictionary::database db, cql_stats& stats) {
std::unique_ptr<prepared_statement> create_aggregate_statement::prepare(data_dictionary::database db, cql_stats& stats, const cql_config& cfg) {
return std::make_unique<prepared_statement>(audit_info(), make_shared<create_aggregate_statement>(*this));
}


@@ -24,7 +24,7 @@ namespace functions {
namespace statements {
class create_aggregate_statement final : public create_function_statement_base {
virtual std::unique_ptr<prepared_statement> prepare(data_dictionary::database db, cql_stats& stats) override;
virtual std::unique_ptr<prepared_statement> prepare(data_dictionary::database db, cql_stats& stats, const cql_config& cfg) override;
future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, utils::chunked_vector<mutation>, cql3::cql_warnings_vec>> prepare_schema_mutations(query_processor& qp, const query_options& options, api::timestamp_type) const override;
virtual future<> check_access(query_processor& qp, const service::client_state& state) const override;


@@ -54,7 +54,7 @@ create_function_statement::audit_info() const {
return audit::audit::create_audit_info(category(), sstring(), sstring());
}
std::unique_ptr<prepared_statement> create_function_statement::prepare(data_dictionary::database db, cql_stats& stats) {
std::unique_ptr<prepared_statement> create_function_statement::prepare(data_dictionary::database db, cql_stats& stats, const cql_config& cfg) {
return std::make_unique<prepared_statement>(audit_info(), make_shared<create_function_statement>(*this));
}

Some files were not shown because too many files have changed in this diff.