- Remove manual gzip header parsing - libdeflate handles all format details
- Rename linearize_chunked_content to build_input_buffer and free chunks as we copy
- Add output chunking to split large decompressed data into 1MB chunks
- Add comment explaining libdeflate's whole-buffer requirement
- Use better initial size heuristic based on compression ratio
Co-authored-by: nyh <584227+nyh@users.noreply.github.com>
- Check if total_decompressed >= length_limit before allocating output buffer
- Prevents allocating a zero-sized buffer when limit is already reached
- Ensures clear error message when limit is exceeded
Co-authored-by: nyh <584227+nyh@users.noreply.github.com>
- Removed unused get_gzip_member_size function
- Rely on libdeflate_gzip_decompress to tell us how many input bytes were consumed
- Added check for zero bytes consumed to detect invalid state
- Simplified the logic by removing unnecessary header size tracking
Co-authored-by: nyh <584227+nyh@users.noreply.github.com>
- Created utils/gzip.hh header with ungzip function declaration
- Created utils/gzip.cc implementation using libdeflate
- Updated utils/CMakeLists.txt to include gzip.cc and link libdeflate
- Created comprehensive test suite in test/boost/gzip_test.cc
- Added gzip_test to test/boost/CMakeLists.txt
The implementation:
- Uses libdeflate for high-performance gzip decompression
- Handles chunked_content input/output (vector of temporary_buffer)
- Supports concatenated gzip files
- Validates gzip headers and detects invalid/truncated/corrupted data
- Enforces size limits to prevent memory exhaustion
- Runs in async context to avoid blocking the reactor
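A rough illustration of the concatenated-member and size-limit handling described above. This is a Python sketch built on the standard `zlib` module, not the libdeflate-based C++ code; the function name `ungzip_sketch` and its error messages are assumptions made for the example.

```python
import gzip
import zlib

def ungzip_sketch(data: bytes, length_limit: int) -> bytes:
    """Decompress concatenated gzip members, refusing to exceed length_limit."""
    out = bytearray()
    while data:
        member = zlib.decompressobj(wbits=31)  # 31 selects gzip framing
        budget = length_limit - len(out)
        try:
            # Ask for one byte more than allowed so an overflow is detectable.
            chunk = member.decompress(data, budget + 1)
        except zlib.error as e:
            raise ValueError(f"invalid gzip data: {e}") from e
        if len(chunk) > budget:
            raise ValueError("decompressed size exceeds limit")
        out += chunk
        if not member.eof:
            raise ValueError("truncated gzip data")
        # Bytes after the end of this member start the next member, if any.
        data = member.unused_data
    return bytes(out)

blob = gzip.compress(b"hello ") + gzip.compress(b"world")
result = ungzip_sketch(blob, 100)
```

Each loop iteration consumes exactly one gzip member, which mirrors the idea of letting the decompressor report how much input it used instead of parsing headers by hand.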
Co-authored-by: nyh <584227+nyh@users.noreply.github.com>
Not waiting for nodes to see each other as alive can cause the driver to
fail the request sent in `wait_for_upgrade_state()`.
scylladb/scylladb#19771 has already replaced concurrent restarts with
`ManagerClient.rolling_restart()`, but it has missed this single place,
probably because we do concurrent starts here.
Fixes #27055
Closes scylladb/scylladb#27075
This patch adds the missing warning that the similarity distance cannot
be returned yet. Support for returning it will be added in the next
iteration.
Fixes #27086
This has to be backported to 2025.4, because the limitation exists in 2025.4.
Closes scylladb/scylladb#27096
Following 9b6ce030d0 ("sstables: remove quadratic (and possibly
exponential) compile time in parse()"), where we removed recursion
in reading, we do the same here for variadic write. This results
in a small reduction in compile time.
Note the problem isn't very bad here. This is tail-recursion, so likely
removed by the compiler during optimization, and we don't have additional
amplification due to future::then() double-compiling the ready-future
and unready-future paths. Still, better to avoid quadratic compile
times.
Closes scylladb/scylladb#27050
This reverts commit 43738298be.
This commit causes instability in dtests. Several non-gating dtests
started failing, as well as some gating ones, see #27047.
Closes scylladb/scylladb#27067
Fixes #27047
The updates include:
- adding missing parts like topology states and table rows,
- documenting zero-token nodes,
- replacing the old recovery procedure with the new one.
Fixes #26412
Updates of internal docs (usually read on master) don't require
backporting.
Closes scylladb/scylladb#27022
* github.com:scylladb/scylladb:
docs/dev/topology-over-raft: update the recovery section
docs/dev/topology-over-raft: document zero-token nodes
docs/dev/topology-over-raft: clarify the lack of tablet-specific states
docs/dev/topology-over-raft: add the missing join_group0 state
docs/dev/topology-over-raft: update the topology columns
This series allows an operator to reset the 'cleanup needed' flag if the node has already been cleaned up, so that automatic cleanup will not do it again. It also changes 'nodetool cleanup' back to running cleanup on one node only (resetting the 'cleanup needed' flag at the end), while the new '--global' option runs cleanup simultaneously on all nodes that need it.
Fixes https://github.com/scylladb/scylladb/issues/26866
Backport to all supported versions, since the current automatic cleanup behaviour may create load that the operator does not expect during cluster resizing.
Closes scylladb/scylladb#26868
* https://github.com/scylladb/scylladb:
cleanup: introduce "nodetool cluster cleanup" command to run cleanup on all dirty nodes in the cluster
cleanup: Add RESTful API to allow reset cleanup needed flag
executor::add_stream_options() obtains a local database reference from
the proxy just to get the feature service from it.
A similar chain is used in executor::update_time_to_live().
It's shorter to get the features from the proxy itself.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#26973
…played
Currently, if flushing hints falls within the repair cache timeout, the flush_time is set to batchlog_manager::_last_replay. _last_replay is updated on each replay, even if some batches weren't replayed. Because of that, we risk data resurrection.
Update _last_replay only if all batches were replayed.
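The rule can be sketched as follows. This is a Python illustration, not the batchlog_manager code; `replay_batches` and the `replay_one` callback are hypothetical names.

```python
def replay_batches(batches, replay_one, last_replay, now):
    """Return the new 'last replay' time after attempting every batch.

    replay_one(batch) is assumed to return True when the batch was replayed.
    """
    all_replayed = True
    for batch in batches:
        if not replay_one(batch):
            all_replayed = False  # keep going, but remember the failure
    # Advance the timestamp only when nothing was left behind.
    return now if all_replayed else last_replay
```

With this shape, a later flush_time derived from the returned value never claims that batches which failed to replay were covered.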
Fixes: https://github.com/scylladb/scylladb/issues/24415.
Needs backport to all live versions.
Closes scylladb/scylladb#26793
* github.com:scylladb/scylladb:
test: extend test_batchlog_replay_failure_during_repair
db: batchlog_manager: update _last_replay only if all batches were replayed
This patch adds support for multiple audit log outputs.
If only one audit log output is enabled, the behavior does not change.
If multiple audit log outputs are enabled, then the `audit_composite_storage_helper` class is used. It has a collection
of `storage_helper` objects.
Performance testing shows that read query throughput and auth request throughput are consistent even at high reactor utilization. It can also be observed that read query latency increases a bit.
Read query ops = 60k/s
AUTH ops = 200/s
| Audit Mode   | QUERY latency (p99) | Δ% vs none |
|--------------|---------------------|------------|
| none         | 777                 | 0          |
| table        | 801                 | +3.09%     |
| syslog       | 803                 | +3.35%     |
| table,syslog | 818                 | +5.28%     |
Read query ops = 50k/s
AUTH ops = 200/s
| Audit Mode   | QUERY latency (p99) | Δ% vs none |
|--------------|---------------------|------------|
| none         | 643                 | 0          |
| table        | 647                 | +0.62%     |
| syslog       | 648                 | +0.78%     |
| table,syslog | 656                 | +2.02%     |
Detailed performance results are in the following Confluence document: [Audit performance impact test](https://scylladb.atlassian.net/wiki/spaces/RND/pages/148308005/Audit+performance+impact+test)
Fixes #26022
Backport:
The decision is to not backport for now. After making sure it works on the latest release, and if there is a need, we can do it.
Closes scylladb/scylladb#26613
* github.com:scylladb/scylladb:
test: dtest: audit_test.py: add AuditBackendComposite
test: dtest: audit_test.py: group logs in dict per audit mode
audit: write out to both table and syslog
audit: move storage helper creation from `audit::start` to `audit::audit`
audit: fix formatting in `audit::start_audit`
audit: unify `create_audit` and `start_audit`
97ab3f6622 changed "nodetool cleanup" (without arguments) to run
cleanup on all dirty nodes in the cluster. This was somewhat unexpected,
so this patch changes it back to run cleanup on the target node only (and
reset "cleanup needed" flag afterwards) and it adds "nodetool cluster
cleanup" command that runs the cleanup on all dirty nodes in the
cluster.
A Vector Store node is now considered down if it returns an HTTP 500
server error. This can happen, for example, if the node fails to
connect to the database or has not completed its initial full scan.
The logic for marking a node as 'up' is also enhanced. A node is now
only considered up when its status is explicitly 'SERVING'.
Fixes: VECTOR-187
Backport to 2025.4 as this feature is expected to be available in 2025.4.
Closes scylladb/scylladb#26413
* github.com:scylladb/scylladb:
vector_search: Improve vector-store health checking
vector_search: Move response_content_to_sstring to utils.hh
vector_search: Add unit tests for client error handling
vector_search: Enable mocking of status requests
vector_search: Extract abort_source_timeout and repeat_until
vector_search: Move vs_mock_server to dedicated files
Some of the columns were added, but the doc wasn't updated.
`upgrade_state` was updated in only one of the two places.
`ignore_nodes` was changed to a static column.
This PR fixes staging sstables handling by the view building coordinator in case of intra-node tablet migration or tablet merge.
To support tablet merge, the worker stores the sstables grouped only by `table_id`, instead of by the `(table_id, last_token)` pair.
There shouldn't be that many staging sstables, so selecting the relevant ones for each `process_staging` task is fine.
For intra-node migration support, the patch adds methods to load migrated sstables on the destination shard and to clean them up on the source shard.
The patch should be backported to 2025.4.
Fixes https://github.com/scylladb/scylladb/issues/26244
Closes scylladb/scylladb#26454
* github.com:scylladb/scylladb:
service/storage_service: migrate staging sstables in view building worker during intra-node migration
db/view/view_building_worker: support sstables intra-node migration
db/view_building_worker: fix indent
db/view/view_building_worker: don't organize staging sstables by last token
A Vector Store node is now considered down if it returns an HTTP 5xx status.
This can happen, for example, if the node fails to
connect to the database or has not completed its initial full scan.
The logic for marking a node as 'up' is also enhanced. A node is now
only considered up when its status is 'SERVING'.
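The two rules can be written down as a pair of predicates. This Python sketch assumes the status check yields an HTTP status code plus a status string; the function names and the use of HTTP 200 for a successful check are assumptions for illustration.

```python
from typing import Optional

def marks_node_down(http_status: int) -> bool:
    """Any HTTP 5xx answer counts as a node failure."""
    return 500 <= http_status <= 599

def marks_node_up(http_status: int, reported_status: Optional[str]) -> bool:
    """Only an explicit 'SERVING' status brings a node back up."""
    return http_status == 200 and reported_status == "SERVING"
```

Note that a 4xx response satisfies neither predicate, so a bad client request leaves the node state unchanged.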
Move the response_content_to_sstring utility function from
vector_store_client.cc to utils.hh to enable reuse across
multiple files.
This refactoring prepares for the upcoming `client.cc` implementation
that will also need this functionality.
Introduce dedicated unit tests for the client class to verify existing
functionality and serve as regression tests.
These tests ensure that invalid client requests do not cause nodes to
be marked as down.
Extend the mock server to allow inspecting incoming status requests and
configuring their responses.
This enables client unit tests to simulate various server behaviors,
such as handling node failures and backoff logic.
The `abort_source_timeout` and `repeat_until` functions are moved to
the shared utility header `test/vector_search/utils.hh`.
This allows them to be reused by upcoming `client` unit tests, avoiding
code duplication.
This PR enables integrity check of both checksum and digest for repair/streaming.
In the past, streaming readers only verified the checksum of compressed SSTables.
This change extends the checks to include the digest and the checksum (CRC) for both compressed and uncompressed SSTables. These additional checks require reading the digest and CRC components from disk, which may cause some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data, while for compressed SSTables, where checksums are already verified inline, the only extra cost is reading and verifying the digest. If the reader range doesn't cover the full SSTable, the digest is not loaded and the check is skipped.
To support testing of these changes, a new option was added to the random_mutation_generator that allows disabling compression.
Several new test cases were added to verify that the repair_reader correctly detects corruption. These tests corrupt digest or data component of an SSTable and confirm that the system throws the expected `malformed_sstable_exception`.
Backport is not required, it is an improvement
Refs #21776
Closes scylladb/scylladb#26444
* github.com:scylladb/scylladb:
boost/repair_test: add repair reader integrity verification test cases
test/lib: allow to disable compression in random_mutation_generator
sstables: Skip checksum and digest reads for unlinked SSTables
table: enable integrity checks for streaming reader
table: Add integrity option to table::make_sstable_reader()
sstables: Add integrity option to create_single_key_sstable_reader
This patch fixes two issues in one go:
First, sstables::load currently clears the sharding metadata
(via open_data()), and so scylla-sstable always prints
an empty array for it.
Second, printing token values would generate invalid json
as they are currently printed as binary bytes, and they
should be printed simply as numbers, as we do elsewhere,
for example, for the first and last keys.
Fixes #26982
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Closes scylladb/scylladb#26991
Modify test_batchlog_replay_failure_during_repair to also check
that there isn't data resurrection if flushing hints falls within
the repair cache timeout.
Migration manager depends on storage service. For instance,
it has a reload_schema_in_bg background task which calls
_ss.local() so it expects that storage service is not stopped
before it stops.
To solve this we use a permit approach, and during storage_service
stop:
- we ignore *new* code execution in migration_manager which'd use
storage_service
- but wait with storage_service shutdown until all *existing*
executions are done
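The two bullets above describe gate-like behaviour. A minimal asyncio sketch of that idea follows; it is a Python analogy, not the Seastar code, and the `Gate` class and `demo` coroutine are hypothetical names.

```python
import asyncio

class Gate:
    """Rejects new entries once closed; close() waits for current holders."""
    def __init__(self):
        self._closed = False
        self._active = 0
        self._idle = asyncio.Event()
        self._idle.set()

    def try_enter(self) -> bool:
        if self._closed:
            return False  # new executions are ignored after close
        self._active += 1
        self._idle.clear()
        return True

    def leave(self) -> None:
        self._active -= 1
        if self._active == 0:
            self._idle.set()

    async def close(self) -> None:
        self._closed = True
        await self._idle.wait()  # wait for existing executions to finish

async def demo():
    gate, log = Gate(), []

    async def use_service():
        if not gate.try_enter():
            log.append("rejected")
            return
        try:
            await asyncio.sleep(0.01)
            log.append("used service")
        finally:
            gate.leave()

    task = asyncio.create_task(use_service())
    await asyncio.sleep(0)  # let the task enter the gate
    await gate.close()
    log.append("service stopped")
    await use_service()     # arrives after close
    await task
    return log
```

In `demo`, the in-flight user finishes before "service stopped" is recorded, and the late caller is turned away.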
Fixes scylladb/scylladb#26734
Backport: no need. The problem has existed for a very long time; the code restructure in https://github.com/scylladb/scylladb/commit/389afcd (and following commits) made
it hit more often, as _ss was called earlier, but that change has not been released yet.
Closes scylladb/scylladb#26779
* github.com:scylladb/scylladb:
service: attach storage_service to migration_manager using pluggabe
service: migration_manager: corutinize merge_schema_from
service: migration_manager: corutinize reload_schema
Previously, only nodes in the 'normal' state and decommissioning nodes
were included in the set of nodes participating in barrier and
barrier_and_drain commands. Joining nodes are not included because they
don't coordinate requests, given their cql port is closed.
However, joining nodes may receive mutations from other nodes, for which
they may generate and coordinate materialized view updates. If their
group0 state is not synchronized it could cause lost view updates.
For example:
1. On the topology coordinator, the join completes and the joining node
becomes normal, but the joining node's state lags behind. Since it's
not synchronized by the barrier, it could be in an old state such as
`write_both_read_old`.
2. A normal node coordinates a write and sends it to the new node as the
new replica.
3. The new node applies the base mutation but doesn't generate a view
update for it, because it calculates the base-view pairing according
to its own state and replication map, and determines that it doesn't
participate in the base-view pairing.
Therefore, since the joining node participates as a coordinator for view
updates, it should be included in these barriers as well. This ensures
that before the join completes, the joining node's state is
`write_both_read_new`, where it does generate view updates.
Fixes https://github.com/scylladb/scylladb/issues/26976
Backport to previous versions, since it fixes a bug in MV with vnodes.
Closes scylladb/scylladb#27008
* github.com:scylladb/scylladb:
test: add mv write during node join test
topology_coordinator: include joining node in barrier
This PR refactors excluded nodes handling for tablets and topology. For tablets a dedicated variable `topology::excluded_tablet_nodes` is introduced, for topology operations a method get_excluded_nodes() is inlined into topology_coordinator and renamed to `get_excluded_nodes_for_topology_request`.
The PR improves code readability and efficiency; there are no behavior changes.
backport: this is a refactoring/optimization, no need to backport
Closes scylladb/scylladb#26907
* https://github.com/scylladb/scylladb:
topology_coordinator: drop unused exec_global_command overload
topology_coordinator: rename get_excluded_nodes -> get_excluded_nodes_for_topology_request
topology_state_machine: inline get_excluded_nodes
messaging_service: simplify and optimize ban_host
storage_service: topology_state_load: extract topology variable
topology_coordinator: excluded_tablet_nodes -> ignored_nodes
topology_state_machine: add excluded_tablet_nodes field
vector_search: Add backoff for failed nodes
Introduces logic to mark nodes that fail to answer an ANN request as
"down". Down nodes are omitted from further requests until they
successfully respond to a health check.
Health checks for down nodes are performed in the background using the
`status` endpoint, with an exponential backoff retry policy ranging
from 100ms to 20s.
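As an illustration of the retry schedule, here is a small Python generator for capped exponential backoff. The doubling factor and the name `backoff_delays_ms` are assumptions; the text only fixes the 100ms and 20s bounds.

```python
from itertools import islice

def backoff_delays_ms(initial_ms: int = 100, max_ms: int = 20_000, factor: int = 2):
    """Yield health-check delays that grow geometrically up to max_ms."""
    delay = initial_ms
    while True:
        yield delay
        delay = min(delay * factor, max_ms)  # never exceed the cap

first_ten = list(islice(backoff_delays_ms(), 10))
```

Under these assumptions the cap is reached on the ninth attempt, after which the node is probed every 20 seconds until it answers.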
Client list management is moved to separate files (clients.cc/clients.hh)
to improve code organization and modularity.
References: VECTOR-187.
Backport to 2025.4 as this feature is expected to be available in 2025.4.
Closes scylladb/scylladb#26308
* github.com:scylladb/scylladb:
vector_search: Set max backoff delay to 2x read request timeout
vector_search: Report status check exception via on_internal_error_noexcept
vector_search: Extract client management into dedicated class
vector_search: Add backoff for failed clients
vector_search: Make endpoint available
vector_search: Use std::expected for low-level client errors
vector_search: Extract client class
The service level controller relies on `auth::service` to collect
information about roles and the relation between them and the service
levels (those attached to them). Unfortunately, the service level
controller is initialized way earlier than `auth::service` and so we
had to prevent potential invalid queries of user service levels
(cf. 46193f5e79).
Unfortunately, that came at a price: it made the maintenance socket
incompatible with the current implementation of the service level
controller. The maintenance socket starts early, before the
`auth::service` is fully initialized and registered, and is exposed
almost immediately. If the user attempts to connect to Scylla within
this time window, via the maintenance socket, one of the things that
will happen is choosing the right service level for the connection.
Since the `auth::service` is not registered, Scylla will fail an
assertion and crash.
A similar scenario occurs when using maintenance mode. The maintenance
socket is how the user communicates with the database, and we're not
prepared for that either.
To avoid unnecessary crashes, we add new branches if the passed user is
absent or if it corresponds to the anonymous role. Since the role
corresponding to a connection via the maintenance socket is the anonymous
role, that solves the problem.
Some accesses to `auth::service` are not affected and we do not modify
those.
Fixes scylladb/scylladb#26816
Backport: yes. This is a fix of a regression.
Closes scylladb/scylladb#26856
* github.com:scylladb/scylladb:
test/cluster/test_maintenance_mode.py: Wait for initialization
test: Disable maintenance mode correctly in test_maintenance_mode.py
test: Fix keyspace in test_maintenance_mode.py
service/qos: Do not crash Scylla if auth_integration absent
Remove bootstrap and decommission from allowed_repair_based_node_ops.
Using RBNO over streaming for these operations has no benefits, as they
are not exposed to the out-of-date replica problem that replace,
removenode and rebuild are.
On top of that, RBNO is known to have problems with empty user tables.
Using streaming for bootstrap and decommission is safe and faster
than RBNO in all conditions, especially when the table is small.
One test needs adjustment as it relies on RBNO being used for all node
ops.
Fixes: #24664
Closes scylladb/scylladb#26330
The view building coordinator manages the process by sending RPC
requests to all nodes in the cluster, instructing them what to do.
If processing that message fails, the coordinator decides if it
wants to retry it or (temporarily) abandon the work.
An example of the latter scenario could be if one of the target nodes
dies and any attempts to communicate with it would fail.
Unfortunately, the current approach to it is not perfect and may result
in a storm of warnings, effectively clogging the logs. As an example,
take a look at scylladb/scylladb#26686: the gossiper failed to mark
one of the dead nodes as DOWN fast enough, and it resulted in a warning storm.
To prevent situations like that, we implement a form of backoff.
If processing an RPC message fails, we postpone finishing the task for
a second. That should reduce the number of messages in the logs and avoid
retries that are likely to fail as well.
We provide a reproducer test.
Fixes scylladb/scylladb#26686
Backport: impact on the user. We should backport it to 2025.4.
Closes scylladb/scylladb#26729
* github.com:scylladb/scylladb:
tet/cluster/mv: Clean up test_backoff_when_node_fails_task_rpc
db/view/view_building_coordinator: Rate limit logging failed RPC
db/view: Add backoff when RPC fails
When dropping a column from a CDC log table, set the column drop
timestamp several seconds into the future.
If a value is written to a column concurrently with dropping that
column, the value's timestamp may be after the column drop timestamp. If
this value is also flushed to an SSTable, the SSTable would be
corrupted, because it considers the column missing after the drop
timestamp and doesn't allow values for it.
While this issue affects general tables, it especially impacts CDC tables
because this scenario can occur when writing to a table with CDC preimage
enabled while dropping a column from the base table. This happens even if
the base mutation doesn't write to the dropped column, because CDC log
mutations can generate values for a column even if the base mutation doesn't.
For general tables, this issue can be avoided by simply not writing to a
column while dropping it.
We fix this for the more problematic case of CDC log tables by setting
the column drop timestamp several seconds into the future, ensuring that
writes concurrent with column drops are much less likely to have
timestamps greater than the column drop timestamp.
Fixes https://github.com/scylladb/scylladb/issues/26340
The issue affects all previous releases; backport to improve stability.
Closes scylladb/scylladb#26533
* github.com:scylladb/scylladb:
test: test concurrent writes with column drop with cdc preimage
cdc: check if recreating a column too soon
cdc: set column drop timestamp in the future
migration_manager: pass timestamp to pre_create
Migration manager depends on storage service. For instance,
it has a reload_schema_in_bg background task which calls
_ss.local() so it expects that storage service is not stopped
before it stops.
To solve this we use a permit approach, and during storage_service
stop:
- we ignore *new* code execution in migration_manager which'd use
storage_service
- but wait with storage_service shutdown until all *existing*
executions are done
Fixes scylladb/scylladb#26734
Since the error is not printed to stdout, when working with multiple
files we don't know which sstable the error is associated with.
Closes scylladb/scylladb#27009
The maximum backoff delay for status checking now depends on the
`read_request_timeout_in_ms` configuration option. The delay is set
to twice the value of this parameter.
This exception should only occur due to internal errors, not client or external issues.
If triggered, it indicates an internal problem. Therefore, we notify about this exception
using on_internal_error_noexcept.
Introduces logic to mark clients that fail to answer an ANN request as
"down". Down clients are omitted from further requests until they
successfully respond to a health check.
Health checks for down clients are performed in the background using the
`status` endpoint, with an exponential backoff retry policy ranging
from 100ms to 20s.
In preparation for a new feature, the tests need the ability to make
an endpoint that was previously unavailable, available again.
This is achieved by adding an `unavailable_server::take_socket` method.
This method allows transferring the listening socket from the
`unavailable_server` to the `mock_vs_server`, ensuring they both
operate on the same endpoint.
To unify error handling, the low-level client methods now return
`std::expected` instead of throwing exceptions. This allows for
consistent and explicit error propagation from the client up to the
caller.
The relevant error types have been moved to a new `vector_search/error.hh`
header to centralize their definitions.
This refactoring extracts low-level client logic into a new, dedicated
`client` class. The new class is responsible for connecting to the
server and serializing requests.
This change prepares for extending the `vector_store_client` to check
node status via the `api/v1/status` endpoint.
`/ann` response deserialization remains in the `vector_store_client` as it
is schema-dependent.
During sstable summary parsing, the entire header was read into a single
buffer upfront and then parsed to obtain the positions. If the header
was too large, it could trigger oversized allocation warnings.
This commit updates the parse method to read one position at a time from
the input stream instead of reading the entire header at once. Since
`random_access_reader` already maintains an internal buffer of 128 KB,
there is no need to pre-read the entire header upfront.
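The difference between the two strategies can be sketched in Python. The 4-byte little-endian entry layout and the name `iter_positions` are assumptions for the example and are not taken from the summary file format.

```python
import io
import struct

def iter_positions(stream, count):
    """Yield `count` positions, reading one fixed-size entry at a time."""
    for _ in range(count):
        raw = stream.read(4)  # served from the reader's internal buffer
        if len(raw) != 4:
            raise EOFError("summary header ended early")
        yield struct.unpack("<I", raw)[0]

payload = struct.pack("<3I", 0, 17, 42)
# The buffered reader plays the role of random_access_reader's 128 KB buffer.
reader = io.BufferedReader(io.BytesIO(payload), buffer_size=128 * 1024)
positions = list(iter_positions(reader, 3))
```

Because each read is small and the underlying reader already buffers, no allocation ever needs to be as large as the whole header.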
Fixes #24428
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closes scylladb/scylladb#26846
The view building coordinator sends tasks in form of RPC messages
to other nodes in the cluster. If processing that RPC fails, the
coordinator logs the error.
However, since tasks are per replica (so per shard), it may happen
that we end up with a large number of similar messages, e.g. if the
target node has died, because every shard will fail to process its
RPC message. It might become even worse in the case of a network
partition.
To mitigate that, we rate limit the logging with a 1-second interval.
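A minimal sketch of interval-based log rate limiting, in Python. The `RateLimitedLog` class, its `dropped` counter, and the injectable clock are assumptions made for the example; they do not describe the logger used by the coordinator.

```python
import time

class RateLimitedLog:
    """Allows at most one message per interval and counts the rest."""
    def __init__(self, interval_s: float = 1.0, clock=time.monotonic):
        self._interval = interval_s
        self._clock = clock
        self._next_allowed = None
        self.dropped = 0

    def should_log(self) -> bool:
        now = self._clock()
        if self._next_allowed is not None and now < self._next_allowed:
            self.dropped += 1  # suppressed: too soon after the last message
            return False
        self._next_allowed = now + self._interval
        return True

# Simulate eight shards failing within the same second, then one later failure.
fake_now = [0.0]
limiter = RateLimitedLog(clock=lambda: fake_now[0])
burst = [limiter.should_log() for _ in range(8)]
fake_now[0] = 1.5
later = limiter.should_log()
```

In the simulated burst only the first failure is logged, which is the effect the rate limit is meant to have when every shard of a dead node fails at once.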
We extend the test `test_backoff_when_node_fails_task_rpc` so that
it allows the view building coordinator to have multiple tablet
replica targets. If not for rate limiting the warning messages,
we should start getting more of them, potentially leading to
a test failure.
The view building coordinator manages the process of view building
by sending RPC requests to all nodes in the cluster, instructing them
what to do. If processing that message fails, the coordinator decides
if it wants to retry it or (temporarily) abandon the work.
An example of the latter scenario could be if one of the target nodes
dies and any attempts to communicate with it would fail.
Unfortunately, the current approach to it is not perfect and may result
in a storm of warnings, effectively clogging the logs. As an example,
take a look at scylladb/scylladb#26686: the gossiper failed to mark
one of the dead nodes as DOWN fast enough, and it resulted in a warning storm.
To prevent situations like that, we implement a form of backoff.
If processing an RPC message fails, we postpone finishing the task for
a second. That should reduce the number of messages in the logs and avoid
retries that are likely to fail as well.
We provide a reproducer test: it fails before this commit and succeeds
with it.
Fixes scylladb/scylladb#26686
Currently we do not support paging for vector search queries.
When we get such a query with paging enabled we ignore the paging
and return the entire result. This behavior can be confusing for users,
as there is no warning about paging not working with vector search.
This patch fixes that by adding a warning to the result of ANN queries
with paging enabled.
Closes scylladb/scylladb#26384
Add a test that writes to a table concurrently with dropping a column,
where the table has CDC enabled with preimage.
The test reproduces issue #26340, where this results in a malformed
sstable.
When we drop a column from a CDC log table, we set the column drop
timestamp a few seconds into the future. This can cause unexpected
problems if a user tries to recreate a CDC column too soon, before
the drop timestamp has passed.
To prevent this issue, when creating a CDC column we check its
creation timestamp against the existing drop timestamp, if any, and
fail with an informative error if the recreation attempt is too soon.
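The interaction between the future drop timestamp and the recreation check can be sketched as below. This is a Python illustration; the 5-second margin, the microsecond units, and both function names are assumptions, since the text only says "a few seconds".

```python
from typing import Optional

DROP_MARGIN_US = 5_000_000  # hypothetical margin; the source says "a few seconds"

def column_drop_timestamp(now_us: int) -> int:
    """Place the drop timestamp slightly in the future."""
    return now_us + DROP_MARGIN_US

def check_column_recreation(creation_ts_us: int, drop_ts_us: Optional[int]) -> None:
    """Reject recreating a column before its recorded drop timestamp has passed."""
    if drop_ts_us is not None and creation_ts_us <= drop_ts_us:
        raise ValueError(
            "column was dropped too recently; retry after the drop timestamp passes"
        )

drop_ts = column_drop_timestamp(now_us=1_000_000)
```

A column that was never dropped has no drop timestamp, so the check passes for it unconditionally.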
When dropping a column from a CDC log table, set the column drop
timestamp several seconds into the future.
If a value is written to a column concurrently with dropping that
column, the value's timestamp may be after the column drop timestamp. If
this value is also flushed to an SSTable, the SSTable would be
corrupted, because it considers the column missing after the drop
timestamp and doesn't allow values for it.
While this issue affects general tables, it especially impacts CDC tables
because this scenario can occur when writing to a table with CDC preimage
enabled while dropping a column from the base table. This happens even if
the base mutation doesn't write to the dropped column, because CDC log
mutations can generate values for a column even if the base mutation doesn't.
For general tables, this issue can be avoided by simply not writing to a
column while dropping it.
We fix this for the more problematic case of CDC log tables by setting
the column drop timestamp several seconds into the future, ensuring that
writes concurrent with column drops are much less likely to have
timestamps greater than the column drop timestamp.
Fixes scylladb/scylladb#26340
This patch series re-enables support for speculative retry values `0` and `100`. These values were supported some time ago, before [schema: fix issue 21825: add validation for PERCENTILE values in speculative_retry configuration. #21879](https://github.com/scylladb/scylladb/pull/21879). When that PR started rejecting invalid `101PERCENTILE` values, the valid `100PERCENTILE` and `0PERCENTILE` values were rejected too.
The reproduction steps from [[Bug]: drop schema and all tables after apply speculative_retry = '99.99PERCENTILE' #26369](https://github.com/scylladb/scylladb/issues/26369) no longer reproduce the issue after the fix. A test is added to make sure the inclusive border values `0` and `100` are supported.
Documentation is updated to give more information to the users. It now states that these border values are inclusive, and also that the precision, with automatic rounding, is 1 decimal digit.
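The documented rules (inclusive bounds, one decimal digit of precision) can be illustrated with a small Python parser. This is a sketch under those stated rules, not the schema code; `parse_percentile` and its error messages are hypothetical.

```python
def parse_percentile(option: str) -> float:
    """Parse 'NPERCENTILE' with 0 <= N <= 100, rounded to one decimal digit."""
    suffix = "PERCENTILE"
    text = option.strip().upper()
    if not text.endswith(suffix):
        raise ValueError(f"unsupported speculative_retry value: {option!r}")
    try:
        value = float(text[: -len(suffix)])
    except ValueError:
        raise ValueError(f"non-numeric PERCENTILE value: {option!r}") from None
    if not 0 <= value <= 100:  # both borders are valid; NaN fails this test too
        raise ValueError(f"PERCENTILE must be within [0, 100]: {option!r}")
    return round(value, 1)
```

One consequence of one-decimal rounding is that an input such as `99.99PERCENTILE` becomes `100.0`, so an upper bound that excludes 100 cannot coexist with it.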
Fixes #26369
This is a bug fix. If at any time a client tries to use a value >= 99.5 and < 100, the raft error occurs. Backport is needed. The code which introduced the inconsistency was added in 2025.2, so there is no need to backport to 2025.1.
Closes scylladb/scylladb#26909
* github.com:scylladb/scylladb:
test: cqlpy: add test case for non-numeric PERCENTILE value
schema: speculative_retry: update exception type for sstring ops
docs: cql: ddl.rst: update speculative-retry-options
test: cqlpy: add test for valid speculative_retry values
schema: speculative_retry: allow 0 and 100 PERCENTILE values
This method is specific to topology requests -- node joining, replacing,
decommissioning etc, everything that goes through
topology::transition_state::write_both_read_old and
raft_topology_cmd::command::stream_ranges. It shouldn't be used in
other contexts -- to handle global topology requests
(e.g. truncate table) or for tablets. Rename the method to make this
more explicit.
The method is specific to topology_coordinator, which already contains
a wrapper for it, so inline the topology method into it.
Also, make the logic of the method more explicit and remove multiple
transition_nodes lookups.
Adds test cases to verify that repair_reader correctly detects SSTable (both compressed and uncompressed) checksum mismatches.
Digest mismatch verification is not possible, as the repair reader may skip some sstable data, which automatically disables digest verification.
Each test corrupts the Data component on disk and ensures the reader throws a malformed_sstable_exception with the expected error message.
Adds a compress flag to random_mutation_generator, allowing tests to disable compression in generated mutations.
When set to compress::no, the schema builder uses no_compression() parameters.
Add an _unlinked flag to track SSTable unlink state and check it in
read_digest() and read_checksum() methods to skip file reads for
unlinked SSTables, preventing potential file not found errors.
Add a test that reproduces the issue scylladb/scylladb#26976.
The test adds a new node with delayed group0 apply, and does writes with
MV updates right after the join completes on the coordinator and while
the joining node's state is behind.
The test fails before fixing the issue and passes after.
Previously, only nodes in the 'normal' state and decommissioning nodes
were included in the set of nodes participating in barrier and
barrier_and_drain commands. Joining nodes are not included because they
don't coordinate requests, given their cql port is closed.
However, joining nodes may receive mutations from other nodes, for which
they may generate and coordinate materialized view updates. If their
group0 state is not synchronized it could cause lost view updates.
For example:
1. On the topology coordinator, the join completes and the joining node
becomes normal, but the joining node's state lags behind. Since it's
not synchronized by the barrier, it could be in an old state such as
`write_both_read_old`.
2. A normal node coordinates a write and sends it to the new node as the
new replica.
3. The new node applies the base mutation but doesn't generate a view
update for it, because it calculates the base-view pairing according
to its own state and replication map, and determines that it doesn't
participate in the base-view pairing.
Therefore, since the joining node participates as a coordinator for view
updates, it should be included in these barriers as well. This ensures
that before the join completes, the joining node's state is
`write_both_read_new`, where it does generate view updates.
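The selection rule described above can be sketched as follows. This is a minimal illustration, not the actual topology coordinator code: the `node_state` enum, the `node` struct and `barrier_participants()` are hypothetical names invented for the example.

```cpp
#include <string>
#include <vector>

// Hypothetical, simplified node states; the real topology has more.
enum class node_state { normal, decommissioning, joining, left };

struct node {
    std::string id;
    node_state state;
};

// Returns the ids of the nodes that take part in barrier and
// barrier_and_drain. Joining nodes are included because they may
// coordinate view updates for the base mutations they receive.
std::vector<std::string> barrier_participants(const std::vector<node>& nodes) {
    std::vector<std::string> out;
    for (const auto& n : nodes) {
        switch (n.state) {
        case node_state::normal:
        case node_state::decommissioning:
        case node_state::joining:
            out.push_back(n.id);
            break;
        case node_state::left:
            break;
        }
    }
    return out;
}
```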
Fixes scylladb/scylladb#26976
If we try to perform queries too early, before the call to
`storage_service::start_maintenance_mode` has finished, we will
fail with the following error:
```
ERROR 2025-11-12 20:32:27,064 [shard 0:sl:d] token_metadata - sorted_tokens is empty in first_token_index!
```
To avoid that, we should wait until initialization is complete.
Currently, if flushing hints falls within the repair cache timeout,
then the flush_time is set to batchlog_manager::_last_replay.
_last_replay is updated on each replay, even if some batches weren't
replayed. Due to that, we risk data resurrection.
Update _last_replay only if all batches were replayed.
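A minimal sketch of the rule, assuming a simplified tracker. `replay_tracker`, `replay_all()` and the `replay_one` callback are hypothetical names for illustration and do not mirror the real batchlog_manager interface.

```cpp
#include <chrono>
#include <functional>
#include <vector>

using timestamp = std::chrono::steady_clock::time_point;

// Hypothetical stand-in for the batchlog manager's replay bookkeeping.
struct replay_tracker {
    timestamp last_replay{};

    // replay_one returns true when a batch was replayed successfully.
    // last_replay advances only when every batch succeeded, so a later
    // flush_time can never cover a batch that is still pending.
    void replay_all(const std::vector<int>& batches,
                    const std::function<bool(int)>& replay_one,
                    timestamp now) {
        bool all_replayed = true;
        for (int batch : batches) {
            if (!replay_one(batch)) {
                all_replayed = false;
            }
        }
        if (all_replayed) {
            last_replay = now;
        }
    }
};
```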
Fixes: https://github.com/scylladb/scylladb/issues/24415.
When generating CDC log mutations for some base mutation, use a CDC schema that is compatible with the base schema.
The compatible CDC schema has for every base column a corresponding CDC column with the same name. If using a non-compatible schema, we may encounter a situation, especially during ALTER, that we have a mutation with a base column set with some value, but the CDC schema doesn't have a column by that name. This would cause the user request to fail with an error.
We add to the schema object a schema_ptr that for CDC-enabled tables points to the schema object of the CDC table that is compatible with the schema. It is set by the schema merge algorithm when creating the schema for a table that is created or altered. We use the fact that a base table and its CDC table are created and altered in the same group0 operation, and this way we can find and set the cdc schema for a base table.
When transporting the base schema as a frozen schema between shards, we transport with it the frozen cdc schema as well.
The patch starts with a series of refactoring commits that make extending the frozen schema easier and clean up some duplication in the frozen schema code. We combine the two types `frozen_schema_with_base_info` and `view_schema_and_base_info` into a single type `extended_frozen_schema` that holds a frozen schema with additional data that is not part of the schema mutations but needs to be transported with it to unfreeze it: base_info, and the frozen cdc schema, which is added in a later commit.
Fixes https://github.com/scylladb/scylladb/issues/26405
backport not needed - enhancement
Closes scylladb/scylladb#24960
* github.com:scylladb/scylladb:
test: cdc: test cdc compatible schema
cdc: use compatible cdc schema
db: schema_applier: create schema with pointer to CDC schema
db: schema_applier: extract cdc tables
schema: add pointer to CDC schema
schema_registry: remove base_info from global_schema_ptr
schema_registry: use extended_frozen_schema in schema load
schema_registry: replace frozen_schema+base_info with extended_frozen_schema
frozen_schema: extract info from schema_ptr in the constructor
frozen_schema: rename frozen_schema_with_base_info to extended_frozen_schema
This PR extends the restore API so that it accepts primary_replica_only as parameter and it combines the concepts of primary-replica-only with scoped streaming so that with:
- `scope=all primary_replica_only=true` The restoring node will stream to the global primary replica only
- `scope=dc primary_replica_only=true` The restoring node will stream to the local primary replica only.
- `scope=rack primary_replica_only=true` The restoring node will stream only to the primary replica from within its own rack (with rf=#racks, the restoring node will stream only to itself)
- `scope=node primary_replica_only=true` is not allowed, the restoring node will always stream only to itself so the primary_replica_only parameter wouldn't make sense.
The PR also adjusts the `nodetool refresh` restriction on running restore with both primary_replica_only and scope, it adds primary_replica_only to `nodetool restore` and it adds cluster tests for primary replica within scope.
Fixes #26584
Closes scylladb/scylladb#26609
* github.com:scylladb/scylladb:
Add cluster tests for checking scoped primary_replica_only streaming
Improve choice distribution for primary replica
Refactor cluster/object_store/test_backup
nodetool restore: add primary-replica-only option
nodetool refresh: Enable scope={all,dc,rack} with primary_replica_only
Enable scoped primary replica only streaming
Support primary_replica_only for native restore API
`topology_coordinator::migrate_tablet_size()` was introduced in 10f07fb95a. It has a bug where the has_tablet_size() lambda always returns false, because each iterator returned by the table and tablet search is compared against a second, identical `find()` call instead of `end()`:
```
if (auto table_i = tables.find(gid.table); table_i != tables.find(gid.table)) {
if (auto size_i = table_i->second.find(trange); size_i != table_i->second.find(trange)) {
```
This change also fixes a problem where the `migrate_tablet_size()` would crash with a `std::out_of_range` if the pending node was not present in load_stats.
This change fixes these two problems and moves the functionality into a separate method of `load_stats`. It also adds tests for the new method.
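For illustration, a lookup that avoids both problems might look like the sketch below. The map shapes and the name `get_tablet_size()` are assumptions made for the example and are not the actual `load_stats` types or method.

```cpp
#include <map>
#include <optional>
#include <string>

// Hypothetical, simplified shapes: table id -> (tablet range -> size in bytes).
using tablet_sizes = std::map<std::string, long>;
using table_sizes = std::map<int, tablet_sizes>;

// Looks up a tablet size, comparing each find() result against end()
// rather than against a second, identical find() call.
std::optional<long> get_tablet_size(const table_sizes& tables, int table,
                                    const std::string& trange) {
    if (auto table_i = tables.find(table); table_i != tables.end()) {
        if (auto size_i = table_i->second.find(trange); size_i != table_i->second.end()) {
            return size_i->second;
        }
    }
    return std::nullopt;  // missing table or tablet: no exception is thrown
}
```

Returning `std::optional` instead of calling `at()` means a missing entry is reported to the caller rather than raised as `std::out_of_range`.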
A version containing this bug has not been released yet, so no backport is needed.
Closes scylladb/scylladb#26946
* github.com:scylladb/scylladb:
load_stats: add test for migrate_tablet_size()
load_stats: fix problem with tablet size migration
cql3: Fix std::bad_cast when deserializing vectors of collections
This PR fixes a bug where attempting to INSERT a vector containing collections (e.g., `vector<set<int>,1>`) would fail. On the client side, this manifested as a `ServerError: std::bad_cast`.
The cause was a "type slicing" issue in the reserialize_value function. When retrieving the vector's element type, the result was being assigned by value (using auto) instead of by reference.
This "sliced" the polymorphic abstract_type object, stripping it of its actual derived type information. As a result, a subsequent dynamic_cast would fail, even if the underlying type was correct.
To prevent this entire class of bugs from happening again, I've made the polymorphic base class `abstract_type` explicitly uncopyable.
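The effect can be reproduced in isolation. The sketch below uses hypothetical stand-in types (`base_type`, `set_type`, `noncopyable_base`), not the real `abstract_type` hierarchy, to show why the copy loses its dynamic type and how deleting the copy operations rules the mistake out at compile time.

```cpp
#include <type_traits>

// Hypothetical stand-ins for a polymorphic type hierarchy.
struct base_type {
    virtual ~base_type() = default;
};
struct set_type : base_type {};

// Copying through the base class keeps only the base_type part, so the
// dynamic type of the copy is base_type and the cast yields nullptr.
bool is_set_after_copy(const base_type& element_type) {
    auto sliced = element_type;  // 'auto' deduces base_type: a sliced copy
    return dynamic_cast<const set_type*>(&sliced) != nullptr;
}

// Binding a reference preserves the dynamic type.
bool is_set_by_reference(const base_type& element_type) {
    const auto& ref = element_type;
    return dynamic_cast<const set_type*>(&ref) != nullptr;
}

// Deleting the copy operations turns the first pattern into a compile error.
struct noncopyable_base {
    noncopyable_base() = default;
    noncopyable_base(const noncopyable_base&) = delete;
    noncopyable_base& operator=(const noncopyable_base&) = delete;
    virtual ~noncopyable_base() = default;
};
static_assert(!std::is_copy_constructible_v<noncopyable_base>);
```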
Fixes: #26704
This fix needs to be backported as these releases are affected: `2025.4` , `2025.3`.
Closes scylladb/scylladb#26740
* github.com:scylladb/scylladb:
cql3: Make abstract_type explicitly noncopyable
cql3: Fix std::bad_cast when deserializing vectors of collections
ignored_nodes is sufficient in these cases. excluded_tablet_nodes
also includes left_nodes_rs, which are not needed
here — global_token_metadata_barrier runs the barrier only
on normal and transition nodes, not on left nodes.
The topology_coordinator::is_excluded() creates a temporary hash
map for each call. This is probably not a performance problem since
left_nodes_rs contains only those left nodes that are referenced
from tablet replicas, this happens temporarily while e.g. a replaced
node is being rebuilt. On the other hand, why not just have a
dedicated field in the topology_state_machine, then this code wouldn't
look suspicious.
Cleaning up a node using the per keyspace/table interface does not reset the
cleanup needed flag in the topology. The assumption was that running cleanup on
an already clean node does nothing and completes quickly. But due to
https://github.com/scylladb/scylladb/issues/12215 (which is closed as
WONTFIX) this is not the case. This patch provides the ability to reset
the flag in the topology if the operator has already cleaned up the node
manually.
Recently we enabled tablets by default, but it is necessary to
enable rf_rack_valid_keyspaces if materialized views are to be used
with tablets, and this option is *not* the default.
We did add this option in test/pylib/scylla_cluster.py which is
used by test.py, but we didn't add it to test/cqlpy/run.py, so
the test/cqlpy/run script is no longer able to run tests with
materialized views. So this patch adds the missing configuration
to run.py.
Fixes #26918
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#26919
The polymorphic abstract_type class serves as an interface and should not be copied.
To prevent accidental and unsafe copies, make it explicitly uncopyable.
When deserializing a vector whose elements are collections (e.g., set, list),
the operation raises a `std::bad_cast` exception.
This was caused by type slicing due to an incorrect assignment of a
polymorphic type by value instead of by reference. This resulted in a
failed `dynamic_cast` even when the underlying type was correct.
Refs #26822
AWS says to treat 503 errors, at least in the case of ec2 metadata query, as backoff-retry (generally, we do _not_ retry on provider level, but delegate this to higher levels). This patch adds special treatment for 503:s (service unavailable) for both ec2 meta and actual endpoint, doing exponential backoff.
Note: we do _not_ retry forever.
Not tested as such, since I don't get any errors when testing (doh!). Should try to set up a mock ec2 meta with injected errors maybe.
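A bounded exponential backoff of this kind can be sketched as below. This is only an illustration: `retry_on_503()` is a hypothetical helper that does not use `utils::exponential_backoff_retry`, and the attempt limit and base delay are assumed values, not the ones used in kms_host.

```cpp
#include <chrono>
#include <functional>

// Hypothetical retry helper. 'request' returns an HTTP status code and
// 'sleep' is injected so that the delays can be observed in a test.
int retry_on_503(const std::function<int()>& request,
                 const std::function<void(std::chrono::milliseconds)>& sleep,
                 int max_attempts = 5,
                 std::chrono::milliseconds base_delay = std::chrono::milliseconds(100)) {
    auto delay = base_delay;
    int status = 0;
    for (int attempt = 1; attempt <= max_attempts; ++attempt) {
        status = request();
        if (status != 503 || attempt == max_attempts) {
            break;  // success, a non-retryable error, or out of attempts
        }
        sleep(delay);
        delay *= 2;  // double the wait before each further attempt
    }
    return status;
}
```

Any status other than 503 is returned immediately, which matches the intent of leaving other errors to higher levels.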
Closes scylladb/scylladb#26934
* github.com:scylladb/scylladb:
encryption::kms_host: Add exponential backoff-retry for 503 errors
encryption::kms_host: Include http error code in kms_error
Refs #26822
AWS says to treat 503 errors, at least in the case of ec2 metadata
query, as backoff-retry (generally, we do _not_ retry on provider
level, but delegate this to higher levels). This patch adds special
treatment for 503:s (service unavailable) for both ec2 meta and
actual endpoint, doing exponential backoff.
Note: we do _not_ retry forever.
Not tested as such, since I don't get any errors when testing
(doh!). Should try to set up a mock ec2 meta with injected errors
maybe.
v2:
* Use utils::exponential_backoff_retry
This patch fixes a bug with tablet size migration in load_stats.
has_tablet_size() lambda in topology_coordinator::migrate_tablet_size()
was returning false in all cases due to incorrect search iterator
comparison after a table and tablet search.
This change moves load_stats migrate_tablet_sizes() functionality
into a separate method of load_stats.
Add --blocked-reactor-notify-ms argument to allow overriding the default
blocked reactor notification timeout value of 25 ms.
This change provides users the flexibility to customize the reactor
notification timeout as needed.
Fixes: scylladb/scylla-enterprise#5525
Closes scylladb/scylladb#26892
In the test translated from Cassandra validation/operations/alter_test.py
we had two lines in the beginning of an unrelated test that verified
that CREATE KEYSPACE is not allowed without replication parameters.
But starting recently, ScyllaDB does have defaults and does allow these
CREATE KEYSPACE. So comment out these two test lines.
We didn't notice that this test started to fail, because it was already
marked xfail, because in the main part of this test, it reproduces a
different issue!
The annoying side effect of these no-longer-passing checks was that
because the test expected a CREATE KEYSPACE to fail, it didn't bother
to delete this keyspace when it finished, which causes test.py to
report that there's a problem because some keyspaces still exist at the
end of the test. Now that we fixed this problem, we no longer need to
list this test in test/cqlpy/suite.yaml as a test that leaves behind
undeleted keyspaces.
Fixes #26292
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#26341
The migration manager offers some free functions to prepare mutations
for a new/updated table/view. Most of them include a validation check
for the schema extensions, but in the following ones it's missing:
* `prepare_new_column_family_announcement` (overload with vector as out parameter)
* `prepare_new_column_families_announcement`
Presumably, this was just an omission. It's also not a very important
one since the only extension having validation logic is the
`encryption_schema_extension`, but none of these functions is connected
to user queries where encryption options can be provided in the schema.
User queries go through the other
`prepare_new_column_family_announcement` overload, which does perform a
validation check.
Add validation in the missing places.
Fixes #26470.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closes scylladb/scylladb#26487
Before this series, Alternator's CreateTable operation defaults to creating a table replicated with vnodes, not tablets. The reasons for this default included missing support for LWT, Materialized Views, Alternator TTL and Alternator Streams if tablets are used. But today, all of these (except the still-experimental Alternator Streams) are now fully available with tablets, so we are finally ready to switch Alternator to use tablets by default in new tables.
We will use the same configuration parameter that CQL uses, tablets_mode_for_new_keyspaces, to determine whether new keyspaces use tablets by default. If set to `enabled`, tablets are used by default on new tables. If set to `disabled`, tablets will not be used by default (i.e., vnodes will be used, as before). A third value, `enforced` is similar to `enabled` but forbids overriding the default to vnodes when creating a table.
As before, the user can set a tag during the CreateTable operation to override the default choice of tablets or vnodes (unless in `enforced` mode). This tag is now named `system:initial_tablets` - whereas before this patch it was called `experimental:initial_tablets`. The rules stay the same as with the earlier, experimental:initial_tablets tag: when supplied with a numeric value, the table will use tablets. When supplied with something else (like a string "none"), the table will use vnodes.
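The decision rules above can be summarized in a short sketch. The enums and `choose_replication()` are hypothetical names for illustration, the "numeric" check is simplified to digits only, and `std::invalid_argument` stands in for whatever error Alternator actually reports in `enforced` mode.

```cpp
#include <cctype>
#include <optional>
#include <stdexcept>
#include <string>

enum class tablets_mode { disabled, enabled, enforced };
enum class replication { vnodes, tablets };

// Hypothetical helper: true if the tag value consists only of digits.
bool is_numeric(const std::string& s) {
    if (s.empty()) {
        return false;
    }
    for (unsigned char c : s) {
        if (!std::isdigit(c)) {
            return false;
        }
    }
    return true;
}

// Picks vnodes or tablets for a new table from the config mode and the
// optional value of the system:initial_tablets tag.
replication choose_replication(tablets_mode mode, const std::optional<std::string>& tag) {
    if (!tag) {
        // No tag: follow tablets_mode_for_new_keyspaces.
        return mode == tablets_mode::disabled ? replication::vnodes : replication::tablets;
    }
    if (is_numeric(*tag)) {
        return replication::tablets;
    }
    if (mode == tablets_mode::enforced) {
        throw std::invalid_argument("vnodes requested, but tablets are enforced");
    }
    return replication::vnodes;
}
```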
Fixes https://github.com/scylladb/scylladb/issues/22463
Backport to 2025.4, it's important not to delay phasing out vnodes.
Closes scylladb/scylladb#26836
* github.com:scylladb/scylladb:
test,alternator: use 3-rack clusters in tests
alternator: improve error in tablets_mode_for_new_keyspaces=enforced
config: make tablets_mode_for_new_keyspaces live-updatable
alternator: improve comment about non-hidden system tags
alternator: Fix test_ttl_expiration_streams()
alternator: Fix test_scan_paging_missing_limit()
alternator: Don't require vnodes for TTL tests
alternator: Remove obsolete test from test_table.py
alternator: Fix tag name to request vnodes
alternator: Fix test name clash in test_tablets.py
alternator: test_tablets.py handles new policy reg. tablets
alternator: Update doc regarding tablets support
alternator: Support `tablets_mode_for_new_keyspaces` config flag
Fix incorrect hint for tablets_mode_for_new_keyspaces
Fix comment for tablets_mode_for_new_keyspaces
`test -v` isn't present on the MacOS shell. Since dbuild is intended
as a compatibility bridge between the host environment and the build
environment, don't use it there.
Use ${var+text_if_set} expansion as a workaround.
Fixes #26937
Closes scylladb/scylladb#26939
Add a test to check that paged secondary index queries behave correctly
when pages are short. This is currently failing in Scylla, but passes in
Cassandra 5, therefore marked as "xfailing". Refer to the test's
docstring for more details.
The bug is a regression introduced by commit f6f18b1.
`test/cqlpy/run --release ...` shows that the test passes in 5.1 but
fails in 5.2 onwards.
Refs #25839.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
Closes scylladb/scylladb#25843
This commit adds tests checking various scenarios of restoring
via load and stream with primary_replica_only and a scope specified.
The tests check that in a few topologies, a mutation is replicated
the correct number of times given primary_replica_only and that
streaming happens according to the scope rule passed.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
I noticed during tests that `maybe_get_primary_replica`
would not distribute the choice of primary replica uniformly
because `info.replicas` would be ordered one way on some shards and
differently on others, thus making the function choose
a node as primary replica multiple times when it clearly could've
chosen a different node.
This patch sorts the replica set before passing it through the
scope filter.
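A minimal sketch of why sorting first helps, under simplifying assumptions: the `replica` struct and `pick_primary()` are hypothetical, the scope filter is reduced to a datacenter match, and "first replica in scope" stands in for the real selection logic.

```cpp
#include <algorithm>
#include <optional>
#include <string>
#include <vector>

struct replica {
    std::string host;
    std::string dc;
};

// Sorts first so that every shard sees the same order, then keeps the
// replicas in scope and takes the first one as the primary.
std::optional<std::string> pick_primary(std::vector<replica> replicas,
                                        const std::string& local_dc) {
    std::sort(replicas.begin(), replicas.end(),
              [](const replica& a, const replica& b) { return a.host < b.host; });
    for (const auto& r : replicas) {
        if (r.dc == local_dc) {
            return r.host;
        }
    }
    return std::nullopt;  // no replica falls within the scope
}
```

With the sort in place, two shards holding the same replicas in different orders agree on the result.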
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
This PR splits the support code from test_backup.py
into multiple functions so less duplicated code is
produced by new tests using it. It also makes it a bit
easier to understand.
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Add --primary-replica-only and update docs page for
nodetool restore.
The relationship with the scope parameter is:
- scope=all primary_replica_only=true gets the global primary replica
- scope=dc primary_replica_only=true gets the local primary replica
- scope=rack primary_replica_only=true is like a noop, it gets the only
replica in the rack (rf=#racks)
- scope=node primary_replica_only=true is not allowed
Fixes #26584
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
So far it was not allowed to pass a scope when using
the primary_replica_only option, this patch enables
it because the concepts are now combined so that:
- scope=all primary_replica_only=true gets the global primary replica
- scope=dc primary_replica_only=true gets the local primary replica
- scope=rack primary_replica_only=true is like a noop, it gets the only
replica in the rack (rf=#racks)
- scope=node primary_replica_only=true is not allowed
Fixes #26584
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
This patch removes the restriction for streaming
to primary replica only within a scope.
Node scope streaming to primary replica is disallowed.
Fixes #26584
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Current native restore does not support primary_replica_only, it is
hard-coded disabled and this may lead to data amplification issues.
This patch extends the restore REST API to accept a
primary_replica_only parameter and propagates it to
sstables_loader so it gets correctly passed to
load_and_stream.
Fixes #26584
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Although setting the value of `maintenance_mode` to the string `"false"`
disables maintenance mode, the testing framework misinterprets the value
and thinks that it's actually enabled. As a result, it might try to
connect to Scylla via the maintenance socket, which we don't want.
If the user connects to Scylla via the maintenance socket, it may happen
that `auth_integration` has not been registered in the service level
controller yet. One example is maintenance mode when that will never
happen; another when the connection occurs before Scylla is fully
initialized.
To avoid unnecessary crashes, we add new branches if the passed user is
absent or if it corresponds to the anonymous role. Since the role
corresponding to a connection via the maintenance socket is the anonymous
role, that solves the problem.
In those cases, we completely circumvent any calls to `auth_integration`
and handle them separately. The modified methods are:
* `get_user_scheduling_group`,
* `with_user_service_level`,
* `describe_service_levels`.
For the first two, the new behavior is in line with the previous
implementation of those functions. The last behaves differently now,
but since it's a soft error, crashing the node is not necessary anyway.
We throw an exception instead, whose error message should give the user
a hint of what might be wrong.
The other uses of `auth_integration` within the service level controller
are not problematic:
* `find_effective_service_level`,
* `find_cached_effective_service_level`.
They take the name of a role as their argument. Since the anonymous role
doesn't have a name, it's not possible to call them with it.
Fixes scylladb/scylladb#26816
Refactor scylla-ci so that it is triggered directly from each PR using a GitHub action. This allows us to skip triggering CI when only the PR commit message was updated, which saves unneeded CI runs. We can also remove the `Scylla-CI-route` pipeline, which routes each PR to the proper CI job under the release (the GitHub action will do it automatically), to reduce complexity.
Fixes: https://scylladb.atlassian.net/browse/PKG-69
Closes scylladb/scylladb#26799
* Adds test fixture for AWS KMS
* Adds test fixture for Azure KMS
* Adds key provider proxy for Azure to pytests (ported dtests)
* Make test gather for boost tests handle suites
* Fix GCP test snafu
Fixes #26781
Fixes #26780
Fixes #26776
Fixes #26775
Closes scylladb/scylladb#26785
* github.com:scylladb/scylladb:
gcp_object_storage_test: Re-enable parallelism.
test::pylib: Add azure (mock) testing to EAR matrix
test::boost::encryption_at_rest: Remove redundant azure test indent
test::boost::encryption_at_rest: Move azure tests to use fixture
test::lib: Add azure mock/real server fixture
test::pylib::boost: Fix test gather to handle test suites
utils::gcp::object_storage: Fix typo in semaphore init
test::boost::encryption_at_rest_test: Remove redundant indent
test::boost::test_encryption_at_rest: Move to AWS KMS fixture for kms test
test::boost::test_encryption_at_rest: Reorder tests and helpers
ent::encryption: Make text helper routines take std::string
test::pylib::dockerized_service: Handle docker/podman bind error message
test::lib::aws_kms_fixture: Add a fixture object to run mock AWS KMS
test::lib::gcs_fixture: Only set port if running docker image + more retry
worker during intra-node migration
Use the methods introduced in the previous commit and:
- load staging sstables to the view building worker on the target
shard, at the end of `streaming` stage
- clear migrated staging sstables on source shard in `cleanup` stage
This patch also removes skip mark in `test_staging_sstables_with_tablet_merge`.
Fixes scylladb/scylladb#26244
There was a problem with staging sstables after tablet merge.
Let's say there were 2 tablets and tablet 1 (lower last token)
had a staging sstable. Then a tablet merge occurred, so there is only
one tablet now (higher last token).
But entries in `_staging_sstables`, which are grouped by last token, are
never adjusted.
Since there shouldn't be thousands of sstables, we can just hold a list of
sstables per table and filter necessary entries when doing
`process_staging` view building task.
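The idea can be sketched as follows. `staging_registry` and its members are hypothetical, and filtering by the sstable's last token against the tablet's current token range is an assumption made for the example; the actual filter used by `process_staging` may differ.

```cpp
#include <cstdint>
#include <map>
#include <string>
#include <utility>
#include <vector>

struct staging_sstable {
    std::string name;
    std::int64_t last_token;
};

// One flat list per table, instead of buckets keyed by a tablet's last
// token that would go stale after a tablet merge.
struct staging_registry {
    std::map<std::string, std::vector<staging_sstable>> by_table;

    void add(const std::string& table, staging_sstable sst) {
        by_table[table].push_back(std::move(sst));
    }

    // Returns the names of sstables whose last token lies in (after, up_to],
    // evaluated against the tablet range that is current at call time.
    std::vector<std::string> for_tablet(const std::string& table,
                                        std::int64_t after, std::int64_t up_to) const {
        std::vector<std::string> names;
        auto it = by_table.find(table);
        if (it == by_table.end()) {
            return names;
        }
        for (const auto& sst : it->second) {
            if (sst.last_token > after && sst.last_token <= up_to) {
                names.push_back(sst.name);
            }
        }
        return names;
    }
};
```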
docs/alternator/compatibility.md describes support for global (multi-DC)
tables, and suggests that the CQL command "ALTER TABLE" should be used
to change the replication of an Alternator table. But actually, the
right command is "ALTER KEYSPACE", not "ALTER TABLE". So fix the
document.
Fixes #26737
Closes scylladb/scylladb#26872
Add `AuditBackendComposite`, a test class which allows testing multiple
audit outputs in a single run, implemented in `audit_composite_storage_helper`
class.
Add two more tests.
`test_composite_audit_type_invalid` tests if an invalid audit mode among
correct ones causes the same error as when it is the only specified audit mode.
`test_composite_audit_empty_settings` tests if `'none'` audit mode, when
specified along other audit modes, properly disables audit logging.
Refs #26022
Before this patch audit test could process audit logs from a single
audit output. This patch adds support for multiple audit outputs
in the same run. The change is needed in order to test
`audit_composite_storage_helper`, which can write to multiple
audit outputs.
Refs #26022
This patch adds support for multiple audit log outputs.
If only one audit log output is enabled, the behavior does not change.
If multiple audit log outputs are enabled, then the
`audit_composite_storage_helper` class is used. It has a collection
of `storage_helper` objects.
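The fan-out can be illustrated with a small composite. The interface below is a hypothetical reduction to a single `write()` method and does not reflect the real `storage_helper` API; `in_memory_helper` exists only to make the example observable.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical minimal interface for an audit output.
struct storage_helper {
    virtual ~storage_helper() = default;
    virtual void write(const std::string& entry) = 0;
};

// Test-only output that records entries in memory.
struct in_memory_helper : storage_helper {
    std::vector<std::string> entries;
    void write(const std::string& entry) override { entries.push_back(entry); }
};

// Forwards every audit entry to each configured output.
struct composite_helper : storage_helper {
    std::vector<std::shared_ptr<storage_helper>> helpers;
    void write(const std::string& entry) override {
        for (auto& h : helpers) {
            h->write(entry);
        }
    }
};
```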
Fixes #26022
Add test case for non-numeric PERCENTILE value, which raises an error
different to the out-of-range invalid values. Regex in the test
test_invalid_percentile_speculative_retry_values is expanded.
Refs #26369
Change speculative_retry::to_sstring and speculative_retry::from_sstring
to throw exceptions::configuration_exception instead of std::invalid_argument.
These errors can be triggered by CQL, so appropriate CQL exception should be
used.
Reference: https://github.com/scylladb/scylladb/issues/24748#issuecomment-3025213304
Refs #26369
Clarify how the value of `XPERCENTILE` is handled:
- Values 0 and 100 are supported
- The percentile value is rounded to the nearest 0.1 (1 decimal place)
Refs #26369
test_valid_percentile_speculative_retry_values is introduced to test that
valid values for speculative_retry are properly accepted.
Some of the values are moved from the
test_invalid_percentile_speculative_retry_values test, because
the previous commit added support for them.
Refs #26369
This patch allows specifying 0 and 100 PERCENTILE values in speculative_retry.
It was possible to specify these values before #21825. #21825 prevented specifying
invalid values, like -1 and 101, but also prevented using 0 and 100.
On top of that, speculative_retry::to_sstring function did rounding when
formatting the string, which introduced inconsistency.
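A consistent treatment validates the inclusive range and applies the same rounding everywhere. The sketch below is an illustration only: `normalize_percentile()` is a hypothetical helper, and `std::invalid_argument` is used in place of the CQL exception type.

```cpp
#include <cmath>
#include <stdexcept>
#include <string>

// Accepts a percentile in the inclusive range [0, 100] and rounds it to
// one decimal digit, so that the stored and the printed values agree.
double normalize_percentile(double value) {
    if (!(value >= 0.0 && value <= 100.0)) {  // also rejects NaN
        throw std::invalid_argument(
            "PERCENTILE must be between 0 and 100, got " + std::to_string(value));
    }
    return std::round(value * 10.0) / 10.0;
}
```

With this rule, `99.99` rounds to `100`, which is itself an accepted value, so the rounded form can be parsed back without error.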
Fixes #26369
With tablets enabled, we can't create an Alternator table on a three-
node cluster with a single rack, since Scylla refuses RF=3 with just
one rack and we get the error:
An error occurred (InternalServerError) when calling the CreateTable
operation: ... Replication factor 3 exceeds the number of racks (1) in
dc datacenter1
So in test/cluster/test_alternator.py we need to use the incantation
"auto_rack_dc='dc1'" every time that we create a three-node cluster.
Before this patch, several tests in test/cluster/test_alternator.py
failed on this error, with this patch all of them pass.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
When in tablets_mode_for_new_keyspaces=enforced mode, Alternator is
supposed to fail when CreateTable asks explicitly for vnodes. Before
this patch, this error was an ugly "Internal Server Error" (an
exception thrown from deep inside the implementation), this patch
checks for this case in the right place, to generate a proper
ValidationException with a proper error message.
We also enable the test test_tablets_tag_vs_config which should have
caught this error, but didn't because it was marked xfail because
tablets_mode_for_new_keyspaces had not been live-updatable. Now that
it is, we can enable the test. I also improved the test to be slightly
faster (no need to change the configuration so many times) and also
check the ordinary case - where the schema chooses neither
vnodes nor tablets explicitly and we should just use the default.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We have a configuration option "tablets_mode_for_new_keyspaces" which
determines whether new keyspaces should use tablets or vnodes.
For some reason, this configuration parameter was never marked live-
updatable, so in this patch we add the flag. No other changes are needed -
the existing code that uses this flag always uses it through the
up-to-date configuration.
In the previous patches we start to honor tablets_mode_for_new_keyspaces
also in Alternator CreateTable, and we wanted to test this but couldn't
do this in test/alternator because the option was not live-updatable.
Now that it will be, we'll be able to test this feature in
test/alternator.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The previous patches added a somewhat misleading comment in front of
system:initial_tablets, which this patch improves.
That tag is NOT where Alternator "stores" table properties like the
existing comment claimed. In fact, the whole point is that it's the
opposite - Alternator never writes to this tag - it's a user-writable
tag which Alternator *reads*, to configure the new table. And this is
why it obviously can't be hidden from the user.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
With tablets, the test began failing. The failure was correlated with
the number of initial tablets, which when kept at default, equals
4 tablets per shard in release build and 2 tablets per shard in dev
build.
In this patch we split the test into two - one with more data in
the table to check the original purpose of this test - that Scan
doesn't return the entire table in one page if "Limit" is missing.
The other test reproduces issue #10327 - that when the table is
small, Scan's page size isn't strictly limited to 1MB as it is in
DynamoDB.
Experimentally, 8000 KB of data (compared to 6000 KB before this patch)
is enough when we have up to 4 initial tablets per shard (so 8 initial
tablets on a two-shard node as we typically run in tests).
Original patch by Piotr Szymaniak <piotr.szymaniak@scylladb.com>
modified by Nadav Har'El <nyh@scylladb.com>
Since #23662 Alternator supports TTL with tablets too. Let's clear some
leftovers causing Alternator to test TTL with vnodes instead of with
what is default for Alternator (tablets or vnodes).
Since Alternator is capable of running with tablets according to the
flag in config, remove the obsolete test that is making sure
that Alternator runs with vnodes.
The tag was lately renamed from `experimental:initial_tablets` to
`system:initial_tablets`. This commit fixes both the tests as well as
the exceptions sent to the user instructing how to create table with
vnodes.
Reflect that Alternator honours the value of the config flag
`tablets_mode_for_new_keyspaces`, as well as the renaming of the tag
`experimental:initial_tablets` to `system:initial_tablets`.
Until now, tablets in Alternator were an experimental feature enabled only
when a TAG "experimental:initial_tablets" was present when creating a
table and associated with a numeric value.
After this patch, Alternator honours the value of
`tablets_mode_for_new_keyspaces` config flag.
Each table can be overridden to use tablets or not by supplying a new TAG
"system:initial_tablets". The rules stay the same as with the earlier,
experimental tag: when supplied with a numeric value, the table will use
tablets (as long as they are supported). When supplied with something
else (like a string "none"), the table will use vnodes, provided that
tablets are not `enforced` by the config flag.
Fixes #22463
Those test cases use lister::scan_dir() to validate the contents of snapshot directory of a table against this table's base directory. This PR generalizes the listing code making it shorter.
Also, the snapshot_skip_flush_works case is missing the check for "schema.cql" file. Nothing is wrong with it, but the test is more accurate if checking it.
Also, the snapshot_with_quarantine_works case tries to check if one set of names is sub-set of another using lengthy code. Using std::includes improves the test readability a lot.
Also, the PR replaces lister::scan_dir() with directory_lister. The former is going to be removed some day (see also #26586)
Improving existing working test, no backport is needed.
Closes scylladb/scylladb#26693
* github.com:scylladb/scylladb:
database_test: Simplify snapshot_with_quarantine_works() test
database_test: Improve snapshot_skip_flush_works test
database_test: Simplify snapshot_works() tests
database_test: Use collect_files() to remove files
database_test: Use collect_files() to count files in directory
database_test: Introduce collect_files() helper
_last_key is a multi-fragment buffer.
Some prefix of _last_key (up to _last_key_mismatch) is
unneeded because it's already a part of the trie.
Some suffix of _last_key (after needed_prefix) is unneeded
because _last_key can be differentiated from its neighbors even without it.
The job of write_last_key() is to find the middle fragments,
(containing the range `[_last_key_mismatch, needed_prefix)`)
trim the first and last of the middle fragments appropriately,
and feed them to the trie writer.
But there's an error in the current logic,
in the case where `_last_key_mismatch` falls on a fragment boundary.
To describe it with an example, if the key is fragmented like
`aaa|bbb|ccc`, `_last_key_mismatch == 3`, and `needed_prefix == 7`,
then the intended output to the trie writer is `bbb|c`,
but the actual output is `|bbb|c`. (I.e. the first fragment is empty).
Technically the trie writer could handle empty fragments,
but it has an assertion against them, because they are a questionable thing.
Fix that.
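The intended behaviour can be modelled in a few lines of Python. This is only a sketch of the trimming rule, not the code of write_last_key(); the helper name `middle_fragments` is hypothetical.

```python
def middle_fragments(fragments, start, end):
    """Return the parts of `fragments` that cover bytes [start, end).

    Fragments that contribute no bytes to the range are skipped, so the
    result never contains an empty fragment.
    """
    out = []
    offset = 0  # position of the current fragment within the whole key
    for frag in fragments:
        lo = max(start - offset, 0)
        hi = min(end - offset, len(frag))
        if lo < hi:
            out.append(frag[lo:hi])
        offset += len(frag)
    return out
```

With the key `aaa|bbb|ccc`, `start == 3` and `end == 7`, the first fragment yields an empty slice and is dropped, so the result is `bbb|c`.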
We also extend bti_index_test so that it's able to hit the assert
violation (before the patch). The reason why it wasn't able to do that
before the patch is that the violation requires decorated keys to differ
on the _first_ byte of a partition key column, but the keys generated
by the test only differed on the last byte of the column.
(Because the test was using sequential integers to make the values more
human-readable during debugging). So we modify the key generation
to use random values that can differ on any position.
Fixes scylladb/scylladb#26819
Closes scylladb/scylladb#26839
Currently, execution of the left_token_ring transition state has issues
in the decommissioning path. When concurrent mutations race, we can enter
left_token_ring more than once, and on the second entry we try to
barrier the decommissioned node, which is no longer possible at that
point. That is what causes the errors.
This PR resolves the issue by adding a check at the very start of
left_token_ring that tests whether the first topology state update, which
marks the request as done, has completed. If it has, this is confirmed to
be the second time the flow enters left_token_ring, and the steps
preceding the request status update are skipped. The topology node
status update (which threw the error in the previous attempt) is then
executed directly. The node's removal from group0 is also checked, and
the remove operation is retried if it failed last time.
Although these changes target the decommission operation's behavior in
the `left_token_ring` transition state, the PR doesn't interfere with
the core logic, so it should not derail any rollback-specific logic. The
changes just prevent some non-idempotent operations from re-occurring in
case of failures. The rest of the core logic remains intact.
A test is also added to confirm that this works properly.
Fixes: scylladb/scylladb#20865
Backport is not needed, since this is not a super critical bug fix.
Closes scylladb/scylladb#26717
Fixes: #26440
1. Added description to primary-replica-only option
2. Fixed code text to better reflect the constraints checked in the code
itself, namely that both primary-replica-only and scope must be
applied only if load-and-stream is applied too, and that they are
mutually exclusive.
Note: when https://github.com/scylladb/scylladb/issues/26584 is
implemented (with #26609) there will be a need to align the docs as
well - namely, primary-replica-only and scope will no longer be
mutually exclusive
Signed-off-by: Ran Regev <ran.regev@scylladb.com>
Closes scylladb/scylladb#26480
The variable was unused since cae999c094 ("toolchain: change
optimized clang install method to standard one"), and now causes
the differential shellcheck continuous integration test to fail whenever
it is changed. Remove it.
Closes scylladb/scylladb#26796
In [#26408](https://github.com/scylladb/scylladb/pull/26408) a `write_handler_destroy_promise` class was introduced to wait for `abstract_write_response_handler` instances destruction. We strived to minimize the memory footprint of `abstract_write_response_handler`, with `write_handler_destroy_promise`-es we required only a single additional int. It turned out that in some cases a lot of write handlers can be scheduled for deletion at the same time; in such cases the vector can become big and cause 'oversized allocation' seastar warnings.
Another concern with `write_handler_destroy_promise`-es [was that they were more complicated than it was worth](https://github.com/scylladb/scylladb/pull/26408#pullrequestreview-3361001103).
In this commit we replace `write_handler_destroy_promise` with simple gates. One or more gates can be attached to an `abstract_write_response_handler` to wait for its destruction. We use `utils::small_vector<gate::holder, 2>` to store the attached gates. The limit 2 was chosen because we expect two gates at the same time in most cases. One is `storage_proxy::_write_handlers_gate`, which is used to wait for all handlers in `cancel_all_write_response_handlers`. Another one can be attached by a caller of `cancel_write_handlers`. Nothing stops several `cancel_write_handlers` calls from running at the same time, but it should be rare.
The `sizeof(utils::small_vector<gate::holder, 2>) == 40`, this is `40.0 / 488 * 100 ~ 8%` increase in `sizeof(abstract_write_response_handler)`, which seems acceptable.
Fixes [scylladb/scylladb#26788](https://github.com/scylladb/scylladb/issues/26788)
backport: need to backport to 2025.4 (LWT for tablets release)
Closes scylladb/scylladb#26827
* https://github.com/scylladb/scylladb:
storage_proxy: use coroutine::maybe_yield();
storage_proxy: use gates to track write handlers destruction
* tools/cqlsh f852b1f5...19445a5c (2):
> Update scylla-driver version to 3.29.4
Update tools/cqlsh submodule for scylla-driver 3.29.4
The motivation for this update is to resolve a driver-side serialization bug that was blocking work on #26740. The bug affected vector<collection> types (e.g., vector<set<int>,1>) and is fixed in scylla-driver versions 3.29.2+.
Refs #26704
It is useful to check time spent on tablet repair. It can be used to
compare incremental repair and non-incremental repair. The time does not
include the time waiting for the tablet scheduler to schedule the tablet
repair task.
Fixes #26505
Closes scylladb/scylladb#26502
Extract storage helper creation into `create_storage_helper` function.
Call this function from `audit::audit`. It will be called per shard inside
`sharded<audit>::start` method.
Refs #26022
There is no need to have `create_audit` separate from `start_audit`.
`create_audit` just stores the passed parameters, while `start_audit`
does the actual initialization and startup work.
Refs #26022
Re-enable parallel execution to get better logs.
Note, this is somewhat wasteful, as we won't re-use test fixture here,
but in the end, it is probably an improvement.
When we build a materialized view we read the entire base table from start to
end to generate all required view updates. If a view is created while another view
is being built on the same base table, this is optimized - we start generating
view updates for the new view from the base table rows that we're currently
reading, and we read the missed initial range again after the previous view
finishes building.
The view building progress is only updated after generating view updates for
some read partitions. However, there are scenarios where we'll generate no
view updates for the entire read range. If this was not handled we could
end up in an infinite view building loop like we did in https://github.com/scylladb/scylladb/issues/17293
To handle this, we mark the view as built if the reader generated no partitions.
However, this is not always the correct conclusion. Another scenario where
the reader won't encounter any partitions is when view building is interrupted,
and then we perform a reshard. In this scenario, we set the reader for all
shards to the last unbuilt token for an existing partition before the reshard.
However, this partition may not exist on a shard after reshard, and if there
are also no partitions with higher tokens, the reader will generate no partitions
even though it hasn't finished view building.
Additionally, we already have a check that prevents infinite view building loops
without taking the partitions generated by the reader into account. At the end
of stream, before looping back to the start, we advance current_key to the end
of the built range and check for built views in that range. This handles the case
where the entire range is empty - the conditions for a built view are:
1. the "next_token" is no greater than "first_token" (the view building process
looped back, so we've built all tokens above "first_token")
2. the "current_token" is no less than "first_token" (after looping back, we've
built all tokens below "first_token")
If the range is empty, we'll pass these conditions on an empty range after advancing
"current_key" to the end because:
1. after looping back, "next_token" will be set to `dht::minimum_token`
2. "current_key" will be set to `dht::ring_position::max()`
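The two conditions can be written as a small predicate. The Python sketch below is a model, not the view builder's code: tokens are plain numbers, `MIN_TOKEN` and `MAX_TOKEN` stand in for `dht::minimum_token` and `dht::ring_position::max()`, and the function name `view_is_built` is hypothetical.

```python
MIN_TOKEN = float("-inf")  # stands in for dht::minimum_token
MAX_TOKEN = float("inf")   # stands in for dht::ring_position::max()


def view_is_built(first_token, next_token, current_token):
    """Check the two conditions for a built view listed above."""
    looped_back = next_token <= first_token        # condition 1
    passed_start = current_token >= first_token    # condition 2
    return looped_back and passed_start
```

After looping back over an empty range, `view_is_built(first, MIN_TOKEN, MAX_TOKEN)` holds for any `first`, without consulting how many partitions the reader produced.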
In this patch we remove the check for partitions generated by the reader. This fixes
the issue with resharding and it does not resurrect the issue with infinite view building
that the check was introduced for.
Fixes https://github.com/scylladb/scylladb/issues/26523
Closes scylladb/scylladb#26635
In #26408 a write_handler_destroy_promise class was introduced to
wait for abstract_write_response_handler instances destruction. We
strived to minimize the memory footprint of
abstract_write_response_handler, with write_handler_destroy_promise-es
we required only a single additional int. It turned out that in some
cases a lot of write handlers can be scheduled for deletion
at the same time, in such cases the
vector<write_handler_destroy_promise> can become big and cause
'oversized allocation' seastar warnings.
Another concern with write_handler_destroy_promise-es was that they
were more complicated than it was worth.
In this commit we replace write_handler_destroy_promise with simple
gates. One or more gates can be attached to an
abstract_write_response_handler to wait for its destruction. We use
utils::small_vector<gate::holder, 2> to store the attached gates.
The limit 2 was chosen because we expect two gates at the same time
in most cases. One is storage_proxy::_write_handlers_gate,
which is used to wait for all handlers in
cancel_all_write_response_handlers. Another one can be attached by
a caller of cancel_write_handlers. Nothing stops several
cancel_write_handlers calls from running at the same time, but it
should be rare.
The sizeof(utils::small_vector<gate::holder, 2>) == 40, this is
40.0 / 488 * 100 ~ 8% increase in
sizeof(abstract_write_response_handler), which seems acceptable.
Fixes scylladb/scylladb#26788
In pull request #26384 a discussion started whether page_size=0 really
disables paging, or maybe one needs page_size=-1 to truly disable paging.
The reason for that discussion was commit 08c81427b that started to
use page_size=-1 for internal unpaged queries, and commit 76b31a3 that
incorrectly claimed that page_size>=0 means paging is enabled.
This patch introduces a test that confirms that with page_size=0, paging
is truly disabled - including the size-based (1MB) paging.
The new test is Scylla-only, because Cassandra is anyway missing the
size-based page cutoff (see CASSANDRA-11745).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#26742
Introduced in 9ebdeb2
The problem is specific to node replacing and rack-list RF. The
culprit is in the part of the load balancer which determines rack's
shard count. If we're replacing the last node, the rack will contain
no normal nodes, and shards_per_rack will have no entry for the rack,
on which the table still has replicas. This throws std::out_of_range
and fails the tablet draining stage, so the node replace fails.
No backport because the problem exists only on master.
Fixes #26768
Closes scylladb/scylladb#26783
Python 3.14 changed the multiprocessing fork mode to "forkserver",
presumably for good reasons. However, it conflicts with our
relocatable Python system. "forkserver" forks and execs a Python
process at startup, but it does this without supplying our relocated
ld.so. The system ld.so detects a conflict and crashes.
Fix this by switching back to "fork", which is sufficient for
housekeeping's modest needs.
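A minimal Python sketch of selecting the "fork" start method explicitly follows. It is not the code of the patch: the helper name `housekeeping_context` is hypothetical, and the real change may use `multiprocessing.set_start_method()` rather than a context object.

```python
import multiprocessing


def housekeeping_context():
    """Return a multiprocessing context that uses "fork" where available."""
    methods = multiprocessing.get_all_start_methods()
    # Prefer "fork"; fall back to the first method the platform offers.
    method = "fork" if "fork" in methods else methods[0]
    return multiprocessing.get_context(method)
```

Objects created from the returned context (for example `housekeeping_context().Process(...)`) use the chosen start method regardless of the interpreter's default.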
Closes scylladb/scylladb#26831
The test collects Data files from table dir, then _all_ files from
snapshot dir and then checks whether the former is the subset of the
latter. Using std::includes over two sets makes the code much shorter.
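The same subset check, modelled in Python for illustration. The test itself is C++ and uses std::includes; the helper name `missing_from_snapshot` and the `-Data.db` suffix filter are assumptions of this sketch.

```python
def missing_from_snapshot(table_files, snapshot_files):
    """Return the Data files of the table dir that the snapshot lacks."""
    data_files = {name for name in table_files if name.endswith("-Data.db")}
    # An empty result means data_files is a subset of snapshot_files.
    return data_files - set(snapshot_files)
```

Returning the difference rather than a boolean lets a failing check report exactly which files are missing.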
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It has two inaccuracies.
First, when checking the contents of table directory, it uses
pre-populated expected list with "manifest.json" in it. Weird.
Second, when checking the contents of snapshot directory it doesn't
check if the "schema.cql" is there. It's always there, but if something
breaks in the future it may come unnoticed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
No functional changes here, just make use of the new lister to shorten
the code. A small side effect -- if the test fails because contents of
directories changes, it will print the exact difference in logs, not
just that N files are missing/present.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some test cases remove files from table directory to perform some checks
over the taken snapshots. Using collect_files() helper makes the code
easier to read.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Some test cases want to see that there are more than one file in a
directory, so they can just re-use the new helper. Much shorter this
way.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It returns a set of files in a given directory. Will be used by all next
patches.
Implemented using directory_lister, not lister::scan_dir in order to
help removing the latter one in the future.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Fixes #26781
Makes the test independent of wrapping scripts. Note: retains the
split into "real" and "mock" tests. For other tests, we either all
mock, or allow the environment to select mock or real. Here we have
them combined. More expensive, but otoh more thorough.
Wraps the real/mock azure server for test in a fixture.
Note: retains the current test setup which explicitly runs
some tests with "real" azure, if avail, and some always mock.
Runs local-kms mock AWS KMS server unless overridden by env var.
Allows tests to use real or fake AWS KMS endpoint and shared fixture
for quicker execution.
When a tablet migration is started, we abort the corresponding view
building tasks (i.e. we change the state of those tasks to "ABORTED").
However, we don't change the host and shard of these tasks until the
migration successfully completes. When for some reason we have to
rollback the migration, that means the migration didn't finish and
the aborted task still has the host and shard of the migration
source. So when we recreate tasks that should no longer be aborted
due to a rolled-back migration, we should look at the aborted tasks
of the source (leaving) replica. But we don't do it and we look at
the aborted tasks of the target replica.
In this patch we adjust the rollback mechanism to recreate tasks
for the migration source instead of destination. We also fix the
test that should have detected this issue - the injection that
the test was using didn't make us rollback, but we simply retried
a stage of the tablet migration. By using one_shot=False and adding
a second injection, we can now guarantee that the migration will
eventually fail and we'll continue to the 'cleanup_target' and
'revert_migration' stages.
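The selection rule for the rollback can be sketched in Python as follows. This is a simplified model, not the view building coordinator's code: the `ViewBuildingTask` fields and the helper name `tasks_to_recreate` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ViewBuildingTask:
    task_id: int
    host: str
    shard: int
    state: str  # "ABORTED" or any other state


def tasks_to_recreate(tasks, source_replica):
    """Pick the aborted tasks that still sit on the migration source.

    Aborted tasks keep the host and shard of the source until the
    migration completes, so a rollback has to look there and not at
    the target replica.
    """
    return [
        t for t in tasks
        if t.state == "ABORTED" and (t.host, t.shard) == source_replica
    ]
```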
Fixes https://github.com/scylladb/scylladb/issues/26691
Closes scylladb/scylladb#26825
Currently, all apis that start a compaction have two versions:
synchronous and asynchronous. They share most of the implementation,
but some checks and params have diverged.
Unify the handlers of synchronous and asynchronous cleanup, major
compaction, and upgrade_sstables.
Fixes: https://github.com/scylladb/scylladb/issues/26715.
Requires backports to all live versions
Closes scylladb/scylladb#26746
* github.com:scylladb/scylladb:
api: storage_service: tasks: unify upgrade_sstable
api: storage_service: tasks: force_keyspace_cleanup
api: storage_service: tasks: unify force_keyspace_compaction
An Alternator user was recently "bit" when switching `alternator_enforce_authorization` from "false" to "true": After the configuration change, all application requests suddenly failed because unbeknownst to the user, their application used incorrect secret keys.
This series introduces a solution for users who want to **safely** switch `alternator_enforce_authorization` from "false" to "true": Before switching from "false" to "true", the user can temporarily switch a new option, `alternator_warn_authorization`, to true. In this "warn" mode, authentication and authorization errors are counted in metrics (`scylla_alternator_authentication_failures` and `scylla_alternator_authorization_failures`) and logged as WARNings, but the user's application continues to work. The user can use these metrics or log messages to learn of errors in their application's setup, fix them, and only do the switch of `alternator_enforce_authorization` when the metrics or log messages show there are no more errors.
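The interaction of the two options can be modelled with a short Python sketch. It is not Alternator's implementation: the function name `request_allowed`, the in-memory `metrics` counter, and the choice to let "enforce" take precedence when both options are set are assumptions of this sketch.

```python
import logging
from collections import Counter

log = logging.getLogger("alternator-sketch")
metrics = Counter()


def request_allowed(auth_ok, enforce, warn, kind="authentication"):
    """Decide whether a request proceeds; kind is "authentication" or "authorization"."""
    if auth_ok:
        return True
    if enforce:
        # alternator_enforce_authorization=true: the request fails.
        return False
    if warn:
        # alternator_warn_authorization=true: count and log, but let it through.
        metrics[f"scylla_alternator_{kind}_failures"] += 1
        log.warning("%s failure ignored in warn mode", kind)
    return True
```

In warn mode a request with a bad key still succeeds, while `metrics["scylla_alternator_authentication_failures"]` grows, which is what lets an operator spot misconfigured clients before enabling enforcement.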
The first patch is the implementation of the feature - the new configuration option, the metrics and the log messages, the second patch is a test for the new feature, and the third patch is documentation recommending how to use the warn mode and the associated metrics or log messages to safely switch `alternator_enforce_authorization` from false to true.
Fixes #25308
This is a feature that users need, so it should probably be backported to live branches.
Closes scylladb/scylladb#25457
* github.com:scylladb/scylladb:
docs/alternator: explain alternator_warn_authorization
test/alternator: tests for new auth failure metrics and log messages
alternator: add alternator_warn_authorization config
There's a test that checks if temporary-statistics file is gone at some
point. It does it by listing the directory it expects the file to be in
and then comparing the names met with the temp. stat. file name.
It looks like a single file_exists() call is enough for that purpose.
As a "sanity" check this patch adds a validation that non-temporary
statistics file is there, all the more so this file is removed after the
test.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#26743
Support the counters feature in tablets keyspaces.
The main change is to fix the counter update during tablets intranode migration.
Counter cell is c = map<host_id, value>. A counter update is applied by doing read-modify-write on a leader replica to retrieve the current host's counter value and transform the mutation to contain the updated value for the host, then apply the mutation and replicate it to other hosts. The read-modify-write is protected against concurrent updates by locking the counter cell.
When the counter is migrated between two shards, it's not enough to lock the counter on the read shard, because in the stage write_both_read_new the read shard is switched, and then we can have concurrent updates reach either the old or the new shard. In order to keep the counter update exclusive we lock both shards when in the stage write_both_read_new.
Also, when applying the transformed mutation we need to respect write_both stages and apply the mutation on both shards. We change it to use `apply_on_shards` similarly to other methods in storage proxy.
The change applies to both tablets and vnodes, they use the same implementation, but for vnodes the behavior should remain equivalent up to some small reordering of the code since it doesn't have intranode migration and reduces to single read shard = write shard.
Fixes https://github.com/scylladb/scylladb/issues/18180
no backport - new feature
Closes scylladb/scylladb#26636
* github.com:scylladb/scylladb:
docs: counters now work with tablets
pgo: enable counters with tablets
test: enable counters tests with tablets
test: add counters with tablets test
cql3: remove warning when creating keyspace with tablets
cql3: allow counters with tablets
storage_proxy: lock all read shards for counter update
storage_proxy: apply counter mutation on all write shards
storage_proxy: move counter update coordination to storage proxy
storage_proxy: refactor mutate_counter_on_leader
replica/db: add counter update guard
replica/db: split counter update helper functions
The patch c543059f86 fixed the synchronization issue between tablet
split and load-and-stream. The synchronization worked only with
raft topology, and therefore was disabled with gossip.
To do the check, storage_service::raft_topology_change_enabled() is
called, but the topology kind is only available/set on shard 0, so the
synchronization was bypassed when load-and-stream ran on
any shard other than 0.
The reason the reproducer didn't catch it is that it was restricted
to a single CPU. It will now run with multiple CPUs and catch the
problem observed.
Fixes #22707
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closes scylladb/scylladb#26730
Counters are now supported in tablet-enabled keyspaces, so remove
the documentation that listed counters as an unsupported feature
and the note warning users about the limitation.
Enable all counters-related tests that were disabled for tablets because
counters was not supported with tablets until now.
Some tests were parametrized to run with both vnodes and tablets, and
the tablets case was skipped, in order to not lose coverage. We change
them to run with the default configuration since now counters is
supported with both vnodes and tablets, and the implementation is the
same, so there is no benefit in running them with both configurations.
Add a new test for counters with tablets to test things that are
specific to tablets. Test counter updates that are concurrent with
tablet internode and intranode migrations and verify that the counter
remains consistent and no updates are lost.
When creating a keyspace with tablets, a warning is shown with all the
unsupported features for tablets, which is only counters currently.
Now that counters is also supported with tablets, we can remove this
warning entirely.
Now that counters work with tablets, allow to create a table with
counters in a tablets-enabled keyspace, and remove the warning about
counters not being supported when creating a keyspace with tablets.
We allow to use counters with tablets only when all nodes are upgraded
and support counters with tablets. We add a new feature flag to
determine if this is the case.
Fixes scylladb/scylladb#18180
Previously in a counter update we lock the read shard to protect the
counter's read-modify-write against concurrent updates.
This is not sufficient when the counter is migrated between different
shards, because there is a stage where the read shard switches from the
old shard to the new shard, and during that switch there can be
concurrent counter updates on both shards. If each shard takes only its
own lock, the operations will not be exclusive anymore, and this can
cause lost counter updates.
To fix this, we acquire the counter lock on both shards in the stage
write_both_read_new, when both shards can serve reads. This guarantees
that counter updates continue to be exclusive during intranode
migration.
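The locking rule can be illustrated with a Python model. This is not the storage proxy code: shards are integers, per-shard counter locks are `threading.Lock` objects, the names `shards_to_lock` and `counter_update_guard` are hypothetical, and taking the locks in sorted shard order is a choice of this sketch.

```python
import threading
from contextlib import ExitStack, contextmanager

# One counter lock per shard (two shards are enough for the example).
counter_locks = {shard: threading.Lock() for shard in range(2)}


def shards_to_lock(stage, read_shard, other_shard=None):
    """Return the shards whose counter lock must be held for an update."""
    if stage == "write_both_read_new" and other_shard is not None:
        # Both shards may serve reads in this stage, so lock both.
        return sorted({read_shard, other_shard})
    return [read_shard]


@contextmanager
def counter_update_guard(stage, read_shard, other_shard=None):
    """Hold all required counter locks for the duration of the update."""
    with ExitStack() as stack:
        for shard in shards_to_lock(stage, read_shard, other_shard):
            stack.enter_context(counter_locks[shard])
        yield
```

Acquiring in a fixed order means two updates that start on different shards cannot deadlock while each waits for the other's lock.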
When applying a counter mutation, use apply_on_shards to apply the
mutation on all write shards, similarly to the way other mutations are
applied in the storage proxy. Previously the mutation was applied only
on the current shard which is the read shard.
This is needed to respect the write_both stages of intranode migration
where we need to apply the mutation on both the old and the new shards.
Refactor the counter update to split the functions and have them called
by the storage proxy to prepare for a later change.
Previously in mutate_counter the storage proxy calls the replica
function apply_counter_update that does a few things:
1. checks that the operation can be done: check timeout, disk utilization
2. acquire counter locks
3. do read-modify-write and transform the counter mutation
4. apply the mutation in the replica
In this commit we change it so that these functions are split and called
from the storage proxy, so that we have better control from the storage
proxy when we change it later to work across multiple shards. For
example, we will want to acquire locks on multiple shards, transform it
on one shard, and then apply the mutation on multiple shards.
After the change it works as follows in storage proxy:
1. acquire counter locks
2. call replica prepare to check the operation and transform the mutation
3. call replica apply to apply the transformed mutation
Slightly reorganize the mutate counter function to prepare it for a
later change.
Move the code that finds the read shard and invokes the rest of the
function on the read shard to the caller function. This simplifies the
function mutate_counter_on_leader_and_replicate which now runs on the
read shard and will make it easier to extend.
Add a RAII guard for counter update that holds the counter locks and the
table operation, and extract the creation of the guard to a separate
function.
This prepares it for a later change where we will want to obtain the
guard externally from the storage proxy.
parse() taking a list of elements is quadratic (during compile time) in
that it generates recursive calls to itself, each time with one fewer
parameter. The total size of the parameter lists in all these generated
functions is quadratic in the initial parameter list size.
It's also exponential if we ignore inlining limits, since each .then()
call expands to two branches - a ready future branch and a non-ready
future branch. If the compiler did not give up, we'd have 2^list_len
branches. For sure the compiler does not do so indefinitely, but the effort
getting there is wasted.
Simplify by using a fold expression over the comma operator. Instead
of passing the remaining parameter list in each step, we pass only
the parameter we are processing now, making processing linear, and not
generating unnecessary functions.
It would be better expressed using pack expansion statements, but these
are part of C++26.
The largest offender is probably stats_metadata, with 21 elements.
dev-mode sstables.o:
text data bss dec hex filename
1760059 1312 7673 1769044 1afe54 sstables.o.before
1745533 1312 7673 1754518 1ac596 sstables.o.after
We save about 15k of text with presumably a corresponding (small)
decrease in compile time.
Closes scylladb/scylladb#26735
This check is incorrect: the current shard may be looking at
the old version of tablets map:
* an accept RPC comes to replica shard 0, which is already at write_both_read_new
* the new shard is shard 1, so paxos_state::accept is called on shard 1
* shard 1 is still at "streaming" -> shards_ready_for_reads() returns old
shard 0
Fixes scylladb/scylladb#26801
Closes scylladb/scylladb#26809
If a node is dead and cannot be brought back, tablet migrations are
stuck, until the node is explicitly marked as "permanently dead" /
"ignored node" / "excluded" (name differs in different contexts).
Currently, this is done during removenode and replace operations but
it should be possible to only mark the node as dead, for the purpose
of unblocking migrations or other topology operations, without doing
the actual removenode, because full removal might be currently
impossible, or not desirable due to lack of capacity or priorities.
This patch introduces this kind of API:
```
nodetool excludenode <host-id> [ ... <host-id> ]
```
Having this kind of API is an improvement in user experience in
several cases. For example, when we lose a rack, the only viable
option for recovery is to run removenode with an extra
--ignore-dead-nodes option. This removenode will fail in the tablet
draining phase, as there is no live node in the rack to rebuild
replicas in. This is confusing to the operator. But necessary before
ALTER KEYSPACE can proceed in order to change replication options to
drop the rack from RF.
Having this API allows operators to have more unified procedures,
where "nodetool excludenode" is always the first step of recovery,
which unblocks further topology operations, both those which restore
capacity, but also auto-scaling, tablet split/merge, load balancing,
etc.
Fixes #21281
The PR also changes "nodetool status" to show excluded nodes,
they have 'X' in their status instead of 'D'.
Closes scylladb/scylladb#26659
* github.com:scylladb/scylladb:
nodetool: status: Show excluded nodes as having status 'X'
test: py: Test scenario involving excludenode API
nodetool: Introduce excludenode command
A recent seastar update deprecated the in/out stream usage pattern in which a stream is default-constructed early and then move-assigned with the proper one (see scylladb/seastar#3051). This PR fixes a few places in Scylla that still use it.
Adopting newer seastar API, no need to backport
Closes scylladb/scylladb#26747
* github.com:scylladb/scylladb:
commitlog: Remove unused work::r stream variable
ec2_snitch: Fix indentation after previous patch
ec2_snitch: Coroutinize the aws_api_call_once()
sstable: Construct output_stream for data instantly
test: Don't reuse on-stack input stream
`sstable_compression_user_table_options` allows configuring a node-global SSTable compression algorithm for user tables via scylla.yaml. The current default is LZ4Compressor (inherited from Cassandra).
Make LZ4WithDictsCompressor the new default. Metrics from real datasets in the field have shown significant improvements in compression ratios.
If the dictionary compression feature is not enabled in the cluster (e.g., during an upgrade), fall back to the `LZ4Compressor`. Once the feature is enabled, flip the default back to the dictionary compressor using a listener callback.
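The fallback and the later flip can be modelled in Python as below. This is only a sketch of the described behaviour: the class `DefaultCompressor` and its method `on_dict_feature_enabled` are hypothetical stand-ins for the config default and the feature listener, and the sketch ignores any value the user set explicitly.

```python
class DefaultCompressor:
    """Tracks the default SSTable compressor for user tables."""

    def __init__(self, dict_feature_enabled: bool):
        # Fall back to plain LZ4 while the cluster feature is not enabled.
        self.name = (
            "LZ4WithDictsCompressor" if dict_feature_enabled else "LZ4Compressor"
        )

    def on_dict_feature_enabled(self):
        # Listener callback: the feature became enabled cluster-wide.
        self.name = "LZ4WithDictsCompressor"
```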
Fixes #26610.
Closes scylladb/scylladb#26697
* github.com:scylladb/scylladb:
test/cluster: Add test for default SSTable compressor
db/config: Change default SSTable compressor to LZ4WithDictsCompressor
db/config: Deprecate sstable_compression_dictionaries_allow_in_ddl
boost/cql_query_test: Get expected compressor from config
The compaction manager backlog is exposed via metrics, but if static
shares are set, the backlog is never calculated. As a result, there is
no way to determine the backlog and if the static shares need
adjustment. Fix that by calculating backlog even when static shares are
set.
Fixes #26287
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closes scylladb/scylladb#26778
use utils::chunked_vector instead of std::vector to store cdc stream
sets for tablets.
a cdc stream set usually represents all streams for a specific table and
timestamp, and has a stream id per each tablet of the table. each stream
id is represented by 16 bytes. thus the vector could require quite large
contiguous allocations for a table that has many tablets. change it to
chunked_vector to avoid large contiguous allocations.
Fixes scylladb/scylladb#26791
Closes scylladb/scylladb#26792
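The sizing argument above can be sketched with a little arithmetic. This is only an illustration: the 16-byte stream id comes from the commit message, while the 128 KiB chunk bound (`ASSUMED_CHUNK_BYTES`) and the function name are assumptions made for the example, and vector capacity growth is ignored.

```python
STREAM_ID_BYTES = 16              # from the commit message: one stream id is 16 bytes
ASSUMED_CHUNK_BYTES = 128 * 1024  # assumption: upper bound on a single chunk allocation


def largest_contiguous_allocation(tablet_count: int, chunked: bool) -> int:
    """Return the largest single allocation (bytes) needed for one stream set."""
    total = tablet_count * STREAM_ID_BYTES
    if not chunked:
        # A flat vector keeps all elements in one contiguous buffer.
        return total
    # A chunked container never allocates more than one chunk at a time.
    return min(total, ASSUMED_CHUNK_BYTES)


if __name__ == "__main__":
    for tablets in (1_000, 100_000):
        print(tablets,
              largest_contiguous_allocation(tablets, chunked=False),
              largest_contiguous_allocation(tablets, chunked=True))
```

Under these assumptions, a table with 100,000 tablets needs about 1.6 MB of contiguous memory with a flat vector, while the chunked layout stays at the chunk bound.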
This change adds the ability to move tablet sizes in load_stats after a tablet migration or table resize (split/merge). This is needed because the size-based load balancer needs tablet size data that is as accurate as possible, in order to work on a fresh tablet size distribution and issue correct tablet migrations.
This is the second part of the size based load balancing changes:
- First part for tablet size collection via load_stats: #26035
- Second part reconcile load_stats: #26152
- The third part for load_sketch changes: #26153
- The fourth part which performs tablet load balancing based on tablet size: #26254
This is a new feature and backport is not needed.
Closes scylladb/scylladb#26152
* github.com:scylladb/scylladb:
load_balancer: load_stats reconcile after tablet migration and table resize
load_stats: change data structure which contains tablet sizes
If a node is dead and cannot be brought back, tablet migrations are
stuck, until the node is explicitly marked as "permanently dead" /
"ignored node" / "excluded" (name differs in different contexts).
Currently, this is done during removenode and replace operations but
it should be possible to only mark the node as dead, for the purpose
of unblocking migrations or other topology operations, without doing
the actual removenode, because full removal might be currently
impossible, or not desirable due to lack of capacity or priorities.
This patch introduces this kind of API:
nodetool excludenode <host-id> [ ... <host-id> ]
Having this kind of API is an improvement in user experience in
several cases. For example, when we lose a rack, the only viable
option for recovery is to run removenode with an extra
--ignore-dead-nodes option. This removenode will fail in the tablet
draining phase, as there is no live node in the rack to rebuild
replicas in. This is confusing to the operator. But necessary before
ALTER KEYSPACE can proceed in order to change replication options to
drop the rack from RF.
Having this API allows operators to have more unified procedures,
where "nodetool excludenode" is always the first step of recovery,
which unblocks further topology operations, both those which restore
capacity, but also auto-scaling, tablet split/merge, load balancing,
etc.
Fixes #21281
We want to move towards rack-list based replication factor for tablets being the default mode, and in the future the only supported mode. This PR is a step towards that. We auto-expand numeric RF to rack list on keyspace creation and ALTER when rf_rack_valid_keyspaces option is enabled.
The PR is mostly about adjusting tests. The main logic change is in the last patch, which modifies option post-processing in ks_prop_defs.
Fixes #26397
Closes scylladb/scylladb#26692
* github.com:scylladb/scylladb:
cql3: ks_prop_defs: Expand numeric RF to rack list
locator: Move rack_list to topology.hh
alternator: Do not set RF for zero-token DCs
alternator: Switch keyspace creation to use ks_prop_defs
test: alternator: Adjust for rack lists
cql3: Move validation of invalid ALTER KEYSPACE earlier, to ks_prop_defs
test: cqlpy: Mark tests using rack lists as scylla-only
test: Switch to rack-list based RF
test: Generalize tests to work with both numeric RF and rack lists
test: cluster: test_zero_token_nodes_multidc: Adjust to rack list RF
test: Prepare for handling errors specific to rack list path
test: cluster: dtest: alternator: Force RF=1 in test_putitem_contention
test: Create cluster with multiple racks in multi-dc setups
test: boost: network_topology_strategy_test: Adjust to rack-list RF
test: tablets: Adjust to rack list
test: cluster: test_group0_schema_versioning: Use smaller RF to respect rf-rack-validness
test: tablets_test: Convert test_per_shard_goal_mixed_dc_rf to be rack-valid
test: object_store: test_backup: Adjust for rack lists
test: cluster: tablets: Do not move tablet across racks in test_tablet_transition_sanity
test: cluster: mv: Do not move tablets across racks
test: cluster: util: Fix docstring for parse_replication_options()
tablets, topology_coordinator: Skip tablet draining on replace
When auto compaction is disabled, all ongoing compactions, including
major compactions, are stopped. However, major compactions should not be
stopped, since the disable request applies only to regular auto
compactions.
This PR fixes the issue by tagging major compaction tasks with a newly
introduced `compaction_type::Major` enum. Since
`table::disable_auto_compaction()` already requests the compaction
manager to stop only tasks of type `compaction_type::Compaction`, major
compactions will no longer be stopped.
Fixes #24501
This PR improves how compactions are stopped when a disable auto compaction request is executed.
No need to backport
Closes scylladb/scylladb#26288
* github.com:scylladb/scylladb:
replica/table: do not stop major compaction when disabling auto compaction
compaction/compaction_descriptor: introduce compaction_type::Major
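The effect of tagging can be sketched as follows. This is a simplified model, not the compaction manager's code: `Task`, `CompactionType` and `stop_tasks_of_type` are hypothetical names used only to show why a separate type keeps major compactions running when only regular compactions are asked to stop.

```python
from dataclasses import dataclass
from enum import Enum, auto


class CompactionType(Enum):
    COMPACTION = auto()  # regular (auto) compaction
    MAJOR = auto()       # major compaction, tagged separately


@dataclass
class Task:
    name: str
    type: CompactionType
    stopped: bool = False


def stop_tasks_of_type(tasks, wanted):
    """Stop only the tasks whose type matches `wanted`; return their names."""
    stopped = []
    for task in tasks:
        if task.type is wanted:
            task.stopped = True
            stopped.append(task.name)
    return stopped
```

Before the change, a major compaction would carry the regular type and be caught by the same filter; with its own type it is skipped.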
The previous patch made the default compressor dependent on the
SSTABLE_COMPRESSION_DICTS feature:
* LZ4Compressor if the feature is disabled
* LZ4WithDictsCompressor if the feature is enabled
Add a test to verify that the cluster uses the right default in every
case.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
`sstable_compression_user_table_options` allows configuring a
node-global SSTable compression algorithm for user tables via
scylla.yaml. The current default is `LZ4Compressor` (inherited from
Cassandra).
Make `LZ4WithDictsCompressor` the new default. Metrics from real datasets
in the field have shown significant improvements in compression ratios.
If the dictionary compression feature is not enabled in the cluster
(e.g., during an upgrade), fall back to the `LZ4Compressor`. Once the
feature is enabled, flip the default back to the dictionary compressor
using a listener callback.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
We adjust the test to RF-rack-validity and then re-enable
index random events, which requires the configuration option
`rf_rack_valid_keyspaces` to be enabled.
Fixes scylladb/scylladb#26422
Backport: I'd rather not backport these changes. They're almost a hack and pose too much risk for little gain.
Closes scylladb/scylladb#26591
* github.com:scylladb/scylladb:
test/cluster/random_failures: Re-enable index events
test/cluster/random_failures: Enable rf_rack_valid_keyspaces
test/cluster/random_failures: Adjust to RF-rack-validity
Currently, all apis that start a compaction have two versions:
synchronous and asynchronous. They share most of the implementation,
but some checks and params have diverged.
Unify the handlers of /storage_service/keyspace_upgrade_sstables/{keyspace}
and /tasks/compaction/keyspace_upgrade_sstables/{keyspace}.
Currently, all apis that start a compaction have two versions:
synchronous and asynchronous. They share most of the implementation,
but some checks and params have diverged.
Unify the handlers of /storage_service/keyspace_cleanup/{keyspace}
and /tasks/compaction/keyspace_cleanup/{keyspace}.
Currently, all apis that start a compaction have two versions:
synchronous and asynchronous. They share most of the implementation,
but some checks and params have diverged.
Add consider_only_existing_data parameter to /tasks/compaction/keyspace_compaction/{keyspace},
to match the synchronous version of the api (/storage_service/keyspace_compaction/{keyspace}).
Unify the handlers of both apis.
Auto-expands numeric RF in CREATE/ALTER KEYSPACE statements for
new DCs specified in the statement.
Doesn't auto-expand existing options, as the rack choice may not be in
line with current replica placement. This requires co-locating tablet
replicas, and tracking of co-location state, which is not implemented yet.
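A rough sketch of the rule described above, under stated assumptions: `expand_rf`, its arguments, the "first rf racks in sorted order" selection policy, and the error on too few racks are all hypothetical and chosen only for illustration. The real rack choice in ks_prop_defs may differ.

```python
def expand_rf(options: dict, existing: dict, racks_by_dc: dict) -> dict:
    """Expand numeric RF to a rack list, but only for DCs new to the keyspace.

    options     -- replication options from the statement, dc -> int or rack list
    existing    -- replication options the keyspace already has (empty on CREATE)
    racks_by_dc -- known racks per DC
    """
    result = {}
    for dc, rf in options.items():
        if isinstance(rf, int) and dc not in existing:
            racks = sorted(racks_by_dc[dc])
            if rf > len(racks):
                # Assumption: too few racks is treated as an error.
                raise ValueError(f"RF={rf} exceeds {len(racks)} racks in {dc}")
            result[dc] = racks[:rf]  # assumption: pick the first rf racks
        else:
            # Existing DCs and explicit rack lists are left untouched.
            result[dc] = rf
    return result
```

For example, altering a keyspace that already replicates to dc1 and adding dc2 with RF=1 would expand only the dc2 entry.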
Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>
So that we get the same validation and option post-processing as
during regular keyspace creation.
RF auto-expansion logic happens in ks_prop_defs, and we want that
for tablets.
To achieve RF=3 with tablets and rf_rack_valid_keyspaces, we need 3
racks. So change the test to create 3 racks. Alternator was bypassing
standard keyspace creation path, so it escaped validation. But this
will change, and the test will stop working.
Also, after auto-expansion of RF to rack list, not all of 4 nodes
will host replicas. So need to adjust expectations.
Tests expect this failure in some scenarios, but later changes make us
fail earlier due to topology constraints.
As a rule, more general validation should come before more specific
validation. So syntax validation before topology validation.
Have to do that before we enable auto-expansion of numeric RF to
rack-lists, because those tests alter the replication factor, and
altering from rack-list to numeric will not be allowed.
Two changes here:
1) Allocate nodes in dc2 in separate racks to make the test stronger
- it invites bugs where RF==nr_racks succeeds despite there being
zero-token nodes, and not simply fail due to rack count.
2) Due to auto-expansion to rack list, scylla throws in keyspace
creation rather than table creation.
With rf_rack_valid_keyspaces enabled, RF of alternator tables will be
equal to the number of racks (in this test: nodes). Prior to that, if
number of nodes is smaller than 3, alternator creates the keyspace
with RF=1. Turns out, with RF=2 the test fails with write timeouts due
to contention. Enforce RF=1 by creating the table with one node before
adding the second node.
test_decommission_rack_load_failure expects some tablets to land in
the rack which only has the decommissioning node. Since the table uses
RF=1, auto-expansion may choose the other rack and put all tablets
there, and the expected failure will not happen. Force placement by
using rack-list RF.
Choose old_replica and new_replica so that they're both in rack r1.
After later changes (rack list auto expansion), it's no longer
guaranteed that the first replica will be on r1.
Replace doesn't drain (rebuild) tablets during topology change. They
are rebuilt afterwards when the replaced node is in "left" state and
replacing node is in normal state. So there is no point in attempting
to drain, as nothing will be drained.
Not only that, doing so has a risk, because the load balancer is
invoked on a transitional topology state in which we can end up with
no normal nodes in a rack. That's the case if the replaced node was
the last one in the rack. This tripped one of the algorithms which
computes rack's shard count for the purpose of determining ideal
tablet count, it was not prepared to find an empty rack to which a
table is still replicated. That was fixed separately, but to avoid
this, we better skip tablet draining here.
The option is a knob that allows to reject dictionary-aware compressors
in the validation stage of CREATE/ALTER statements, and in the
validation of `sstable_compression_user_table_options`. It was
introduced in 7d26d3c7cb to allow the admins of Scylla Cloud to
selectively enable it in certain clusters. For more details, check:
https://github.com/scylladb/scylla-enterprise/issues/5435
As of this series, we want to start offering dictionary compression as
the default option in all clusters, i.e., treat it as a generally
available feature. This makes the knob redundant.
Additionally, making dictionary compression the default choice in
`sstable_compression_user_table_options` creates an awkward dependency
with the knob (disabling the knob should cause
`sstable_compression_user_table_options` to fall back to a non-dict
compressor as default). That may not be very clear to the end user.
For these reasons, mark the option as "Deprecated", remove all relevant
tests, and adjust the business logic as if dictionary compression is
always available.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
In commit a3ec6c7d1d we supposedly
implemented the feature of telling TTL expiration events from regular
user-sent deletions. However, that implementation did not actually work
at all... It had two bugs:
1. It created a null rjson::value() instead of an empty dictionary
rjson::empty_object(), so GetRecords failed every time such a
TTL expiration event was generated.
2. In the output, it used lowercase field names "type" and "principalId"
instead of the uppercase "Type" and "PrincipalId". This is not the
correct capitalization, and when boto3 receives such incorrect
fields it silently deletes them and never passes them to the user's
get_records() call.
This patch fixes those two bugs, and importantly - enables a test for
this feature. We did already have such a test but it was marked as
"veryslow" so doesn't run in CI and apparently not even run once to
check the new feature. This test is not actually very long on Alternator
when the TTL period is set very low (as we do in our tests), so I replaced
the "veryslow" marker by "waits_for_expiration". The latter marker means
that the test is still very slow - as much as half an hour - on DynamoDB -
but runs quickly on Scylla in our test setup, and enabled in CI by
default.
The enabled test failed badly before this patch (a server error during
GetRecords), and passes with this patch.
Also, the aforementioned commit forgot to remove the paragraph in
Alternator's compatibility.md that claims we don't have that feature yet.
So we do it now.
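For reference, a client-side check for such events might look like the sketch below. The field names "Type" and "PrincipalId" are the ones named above; the values "Service" and "dynamodb.amazonaws.com" are what DynamoDB documents for TTL deletions, and that Alternator emits exactly the same values is an assumption here. `is_ttl_expiration` is a hypothetical helper, not part of boto3.

```python
def is_ttl_expiration(record: dict) -> bool:
    """Return True if a Streams record looks like a TTL expiration delete."""
    # Regular user deletions carry no userIdentity field at all.
    identity = record.get("userIdentity") or {}
    return (record.get("eventName") == "REMOVE"
            and identity.get("Type") == "Service"
            and identity.get("PrincipalId") == "dynamodb.amazonaws.com")
```

A record built with the lowercase keys "type" and "principalId" would not be recognized by this check, which mirrors the second bug described above.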
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#26633
Until this patch, CDC did not fetch a preimage for mutations
containing only a partition tombstone. Therefore, single-row deletions
in a table without a clustering key didn't include a preimage, which was
inconsistent with single-row clustered deletions. This commit addresses
this inconsistency.
Second reason is compatibility with DynamoDB Streams, which doesn't
support entire-partition deletes. Alternator uses partition tombstones
for single-row deletions, though, and in these cases the 'OldImage' was
missing from REMOVE records.
Fixes https://github.com/scylladb/scylladb/issues/26382
Closes scylladb/scylladb#26578
cql3: Refactor vector search select impl into a dedicated class
The motivation for this change is the crash fixed in https://github.com/scylladb/scylladb/pull/25500.
This commit refactors how ANN ordered select statements are handled to prevent a potential null pointer dereference and improve code organization.
Previously, vector search selects were managed by `indexed_table_select_statement`, which unconditionally dereferenced a `view_ptr`. This assumption is invalid for vector search indexes where no view exists, creating a risk of crashes.
To address this, the refactoring introduces the following changes:
- A new `vector_indexed_table_select_statement` class is created to specifically handle ANN-ordered selects. This class operates without a view_ptr, resolving the null pointer risk.
- The `indexed_table_select_statement` is renamed to `view_indexed_table_select_statement` to more accurately reflect its function with view-based indexes.
- An assertion has been added to `indexed_table_select_statement` constructor to ensure view_ptr is not null, preventing similar issues in the future.
Fixes: VECTOR-162
No backport is needed, as this is refactoring.
Closes scylladb/scylladb#25798
* github.com:scylladb/scylladb:
cql3: Rename indexed_table_select_statement
cql3: Move vector search select to dedicated class
When auto compaction is disabled, all ongoing compactions, including
major compactions, are stopped. However, major compactions should not be
stopped, since the disable request applies only to regular auto
compactions.
This patch fixes the issue by tagging major compaction tasks with the
newly introduced `compaction_type::Major`. Since
`table::disable_auto_compaction()` already requests the compaction
manager to stop only tasks of type `compaction_type::Compaction`, major
compactions will no longer be stopped.
Fixes #24501
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Introduce a new compaction_type enum : `Major`.
This type will be used by the next patches to differentiate between
major compaction and regular compaction (compaction_type::Compaction).
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Since 5b6570be52, the default SSTable compression algorithm for user
tables is no longer hardcoded; it can be configured via the
`sstable_compression_user_table_options.sstable_compression` option in
scylla.yaml.
Modify the `test_table_compression` test to get the expected value from
the configuration.
Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
when loading CDC streams metadata for tablets from the tables, read only
new entries from the history table instead of reading all entries. This
improves the CDC metadata reloading, making it more efficient and
predictable.
the CDC metadata is loaded as part of group0 reload whenever the
internal CDC tables are modified. on tablet split / merge, we create a
new CDC timestamp and streams by writing them to the cdc_streams_history
table by group0 operation, and when it's applied we reload the in-memory
CDC streams map by reading from the tables and constructing the updated map.
Previously, on every update, we would read the entire
cdc_streams_history entries for the changed table, constructing all its
streams and creating a new map from scratch.
We improve this now by reading only new entries from cdc_streams_history
and append them to the existing map. we can do this because we only
append new entries to cdc_streams_history with higher timestamp than all
previous entries.
This makes this reloading more efficient and predictable, because
previously we would read a number of entries that depends on the number
of tablets splits and merges, which increases over time and is
unbounded, whereas now we read only a single stream set on each update.
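The incremental reload can be sketched like this. It is a simplified model, not the actual loader: `load_new_entries`, the dict-based map and the list standing in for cdc_streams_history are assumptions made for the example. It relies on the property stated above, that new history entries always have a higher timestamp than all previous ones.

```python
def load_new_entries(stream_map: dict, history: list) -> int:
    """Append to `stream_map` only history entries newer than what it holds.

    `history` is a list of (timestamp, stream_set) pairs sorted by timestamp,
    standing in for the rows of cdc_streams_history for one table.
    Returns the number of entries that were appended.
    """
    last_ts = max(stream_map) if stream_map else None
    appended = 0
    for ts, stream_set in history:
        if last_ts is not None and ts <= last_ts:
            continue  # already loaded; a real query would not even read it
        stream_map[ts] = stream_set
        appended += 1
    return appended
```

After the initial load, each split or merge adds one entry, so each reload appends exactly one stream set regardless of how long the history has grown.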
Fixes https://github.com/scylladb/scylladb/issues/26732
backport to 2025.4 where cdc with tablets is introduced
Closes scylladb/scylladb#26160
* github.com:scylladb/scylladb:
test: cdc: extend cdc with tablets tests
cdc: improve cdc metadata loading
We currently allow creating multiple vector indexes on one column.
This doesn't make much sense as we do not support picking one when
making ann queries.
To make this less confusing and to make our behavior similar
to Cassandra we disallow the creation of multiple vector indexes
on one column.
We also add a test that checks this behavior.
Fixes: VECTOR-254
Fixes: #26672
Closes scylladb/scylladb#26508
This is a follow-up for https://github.com/scylladb/scylladb/pull/26315. Fixes several review comments that were left unresolved in the original PR.
backport: not needed, this PR contains only renames and code comment fixes
Closes scylladb/scylladb#26745
* https://github.com/scylladb/scylladb:
test_automatic_cleanup: fix comment
storage_proxy: remove stale comment
storage_proxy: improve run_fenceable_write comment
topology_coordinator: rename start_cleanup_on_dirty_nodes -> start_vnodes_cleanup_on_dirty_nodes
storage_service: rename is_cleanup_allowed -> is_vnodes_cleanup_allowed
storage_service: rename do_cluster_cleanup -> do_clusterwide_vnodes_cleanup
The previous patches added the ability to set
alternator_warn_authorization. In this patch we add to our
documentation a recommendation that this setting be used as an
intermediate step when wanting to change alternator_enforce_authorization
from "false" to "true". We explain why this is useful and important.
The new documentation is in docs/alternator/compatibility.md, where
we previously explained the alternator_enforce_authorization configuration.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds to test_metrics.py tests that authentication and
authorization errors increment, respectively, the new metrics
scylla_alternator_authentication_failures
scylla_alternator_authorization_failures
This patch also adds in test_logs.py tests that verify that log
messages are generated on different types of authentication/authorization
failures.
The tests also check how configuring alternator_enforce_authorization
and alternator_warn_authorization changes these behaviors:
* alternator_enforce_authorization determines whether an auth error
will cause the request to fail, or the failure is counted but then
ignored.
* alternator_warn_authorization determines whether an auth error will
cause a WARN-level log message to be generated (and also the failure
is counted).
* If both configuration flags are false, Alternator doesn't even
attempt to check authentication or authorization - so errors aren't
even counted.
Because the new tests live-update the alternator_*_authorization
configuration options, they also serve as a test that live-updating
this option works correctly.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Before this patch, the configuration alternator_enforce_authorization
is a boolean: true means enforce authentication checks (i.e., each
request is signed by a valid user) and authorization checks (the user
who signed the request is allowed by RBAC to perform this request).
This patch adds a second boolean configuration option,
alternator_warn_authorization. When alternator_enforce_authorization
is false but alternator_warn_authorization is true, authentication and
authorization checks are performed as in enforce mode, but failures
are ignored and counted in two new metrics:
scylla_alternator_authentication_failures
scylla_alternator_authorization_failures
Additionally, each authentication or authorization error is logged as
a WARN-level log message. Some users prefer those log messages over
metrics, as the log messages contain additional information about the
failure that can be useful - such as the address of the misconfigured
client, or the username attempted in the request.
All combinations of the two configuration options are allowed:
* If just "enforce" is true, auth failures cause a request failure.
The failures are counted, but not logged.
* If both "enforce" and "warn" are true, auth failures cause a request
failure. The failures are both counted and logged.
* If just "warn" is true, auth failures are ignored (the request
is allowed to complete) but are counted and logged.
* If neither "enforce" nor "warn" is true, no authentication or
authorization checks are done at all. So we don't know about failures,
so naturally we don't count them and don't log them.
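The four combinations above can be summarized as a small decision function. This is only a model of the behavior described in this message; `AuthOutcome` and `on_auth_failure` are hypothetical names, not Alternator code.

```python
from typing import NamedTuple


class AuthOutcome(NamedTuple):
    checked: bool        # was the auth check performed at all?
    request_fails: bool  # does an auth failure fail the request?
    counted: bool        # is the failure counted in the metrics?
    logged: bool         # is a WARN-level message logged?


def on_auth_failure(enforce: bool, warn: bool) -> AuthOutcome:
    """Model what happens to a request that would fail auth."""
    checked = enforce or warn  # with both flags off, no check runs
    return AuthOutcome(
        checked=checked,
        request_fails=enforce,  # only "enforce" rejects the request
        counted=checked,        # any performed check counts failures
        logged=warn,            # only "warn" produces log messages
    )
```

Read this way, "warn" alone gives operators visibility into would-be failures without rejecting any request.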
This patch is fairly straightforward, doing mainly the following
things:
1. Add an alternator_warn_authorization config parameter.
2. Make sure alternator_enforce_authorization is live-updatable (we'll
use this in a test in the next patch). It "almost" was, but a typo
prevented the live update from working properly.
3. Add the two new metrics, and increment them in every type of
authentication or authorization error.
Some code that needs to increment these new metrics didn't have
access to the "stats" object, so we had to pass it around more.
4. Add log messages when alternator_warn_authorization is true.
5. If alternator_enforce_authorization is false, allow the auth check
to allow the request to proceed (after having counted and/or logged
the auth error).
A separate patch will follow and add documentation suggesting to users
how to use the new "warn" options to safely switch between non-enforcing
to enforcing mode. Another patch will add tests for the new configuration
options, new metrics and new log messages.
Fixes #25308.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We migrate `tablets_test.py::TestTablets::test_moving_tablets_replica_on_node`
from dtests to the repository of Scylla. We divide the test into two
steps to make testing easier and even possible with RF-rack-valid
keyspaces being enforced.
Closes scylladb/scylladb#26285
To align with `vector_indexed_table_select_statement`, this commit renames
`indexed_table_select_statement` to `view_indexed_table_select_statement`
to clarify its usage with materialized views.
The execution of SELECT statements with ANN ordering (vector search) was
previously implemented within `indexed_table_select_statement`. This was
not ideal, as vector search logic is independent of secondary index selects.
This resulted in unnecessary complexity because vector search queries don't
use features like aggregates or paging. More importantly,
`indexed_table_select_statement` assumed a non-null `view_schema` pointer,
which doesn't hold for vector indexes (where `view_ptr` is null).
This caused null pointer dereferences during ANN ordered selects, leading
to crashes (VECTOR-179). Other parts of the class still dereference
`view_schema` without null checks.
Moving the vector search select logic out of
`indexed_table_select_statement` simplifies the code and prevents these
null pointer dereferences.
Previously, streaming readers only verified the checksum of compressed SSTables.
This patch extends checks to also include the digest and the uncompressed checksum (CRC).
These additional checks require reading the digest and CRC components from disk,
which may cause some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data,
while for compressed SSTables - where checksums are already verified inline - the only extra cost is reading and verifying the digest.
If the reader range doesn't cover the full SSTable, the digest check is skipped.
Added an sstables::integrity_check parameter to create_single_key_sstable_reader methods across its implementations.
This allows callers to enable SSTable integrity checks during single-key reads.
The method connects a socket, grabs in/out streams from it then writes
HTTP request and reads+parses the response. For that it uses class
variables for socket and streams, but there's no real need for that --
all three actually exist throughout the method's "lifetime".
To fix it, coroutinize the method. The same could be achieved by moving
the connected socket and streams into do_with() context, but coroutine
is better than that.
(indentation is left broken)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
This change makes the local output_stream variable be constructed in the
declaration statement with the help of ternary operator thus avoiding
both -- default-initialization and move-assignment depending on the
standalone condition checking.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The test consists of several snippets, each creating an input_stream for
some short operation and checking the result. Each snippet overwrites
the local `input_stream in` variable with the new one.
This change wraps each of those snippets into own code block in order to
have own new `input_stream in` variable in each.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
- Workload: N workers perform CAS updates
UPDATE … SET s{i}=new WHERE pk=? IF (∀j≠i: s{j}>=guard_j) AND s{i}=prev
at CL=LOCAL_QUORUM / SERIAL=LOCAL_SERIAL. Non-apply without timeout is treated
as contention; “uncertainty” timeouts are resolved via LOCAL_SERIAL read.
- Enable balancing and increase min_tablet_count to force split,
flush and lower min_tablet_count to merge.
- “Uncertainty” timeouts (write timeout due to uncertainty) are resolved via a
LOCAL_SERIAL read to determine whether the CAS actually applied.
- Invariants: after the run, for every pk and column s{i}, the stored value
equals the number of confirmed CAS by worker i (no lost or phantom updates)
despite ongoing tablet moves.
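The final invariant could be checked along these lines. This is a sketch with assumed data shapes: `check_invariant` and the two dictionaries are hypothetical, and how the test actually collects stored values and confirmed counts is not shown here.

```python
def check_invariant(stored: dict, confirmed: dict) -> list:
    """Return (pk, worker) pairs whose stored value != confirmed CAS count.

    stored[pk][i]    -- final value of column s{i} for partition pk
    confirmed[pk][i] -- number of CAS updates by worker i known to have applied
    """
    mismatches = []
    for pk, columns in stored.items():
        for i, value in enumerate(columns):
            if value != confirmed[pk][i]:
                mismatches.append((pk, i))
    return mismatches
```

An empty result means no lost updates (stored below confirmed) and no phantom updates (stored above confirmed).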
Closes scylladb/scylladb#26113
extend and improve the tests of virtual tables for cdc with tablets.
split the existing virtual tables test to one test that validates the
virtual tables against the internal cdc tables, and triggering some
tablet splits in order to create entries in the cdc_streams_history
table, and add another test with basic validation of the virtual tables
when there are multiple cdc tables.
The directory_lister uses utils::lister under the hood which accepts a
callback to put directory_entry-s in. The directory_lister's callback
then puts the entries into a queue and its .get() method pops up entries
from there to return to caller.
This patch simplifies this code by switching the directory_lister to use
experimental generator lister from seastar. With it, the entries to be
returned from .get() are simply co_await-ed from calling the generator
object (which co_yield-s them).
As a result the directory_lister becomes smaller and drops the need for
utils::lister. Since directory_lister was created as a replacement for
that callback-based lister, the latter can be eventually removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#26586
Recently (#26231) there was added a test to check that several API
endpoints, that return tokens and corresponding replica nodes, are
consistent with tablet map. This patch adds one more API endpoint to the
validation -- the /storage_service/tokens_endpoint one.
The extension is pretty straightforward, but the new endpoint returns
back a single (primary) replica for a token, so the test check is
slightly modified to account for that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#26580
We've enabled the configuration option `rf_rack_valid_keyspaces`,
so we can finally re-enable the events creating and dropping secondary
indexes.
Fixes scylladb/scylladb#26422
We adjust the test to work with the configuration option
`rf_rack_valid_keyspaces` enabled. For that, we ensure that there is
always at least one node in each of the three racks. This way, all
keyspaces we create and manipulate will remain RF-rack-valid since they
all use RF=3.
------------------------------------------------------------------------
To achieve that, we only need to adjust the following events:
1. `init_tablet_transfer`
The event creates a new keyspace and table and manually migrates
a tablet belonging to it. As long as we make sure the migration occurs
within the same rack, there will be no problem.
Since RF == #racks, each rack will have exactly one tablet replica,
so we can migrate the tablet to an arbitrary node in the same rack.
Note that there must exist a node that's not a replica. If there weren't
such a node, the test wouldn't have worked before this commit because
it's not possible to migrate a tablet from one node being its replica to
another. In other words, we have a guarantee that there are at least 4 nodes
in the cluster when we try to migrate a tablet replica.
That said, we check it anyway. If there's no viable node to migrate the
tablet replica to, we log that information and do nothing. That should be
an acceptable solution.
2. `add_new_node`
As long as we add a node to an existing rack, there's no way to
violate the invariant imposed by the configuration option, so we pick
a random rack out of the existing three and create a node in it.
3. `decommission_node`
We need to ensure that the node we'll be trying to decommission is
not the only one in its rack.
Following pretty much the same reasoning as in `init_tablet_transfer`,
we conclude there must be a rack with at least two nodes in it. Otherwise
we'd end up having to migrate a tablet from one replica node to another,
which is not possible.
What's more, decommissioning a node is not possible if any node in
the cluster is dead, so we can assume that `manager.running_servers`
returns the whole cluster.
4. `remove_node`
The same as `decommission_node`. Just note that although the node we choose to
remove must first be stopped, no other node can be dead, so the whole
cluster must be returned by `manager.running_servers`.
------------------------------------------------------------------------
There's one more important thing to note. The test may sometimes trigger
a sequence of events where a new node is started, but, due to an error
injection, its initialization is not completed. Among other things, the
node may NOT have a host ID recognized by the rest of the nodes in the
cluster, and operations like tablet migration will fail if they target
it.
Thankfully, there seems to be a way to avoid problems stemming from
that. When a new node is added to the cluster, it should appear at the
end of the list returned by `manager.running_servers`. This most likely
stems from how dictionaries work in Python:
"Keys and values are iterated over in insertion order."
-- https://docs.python.org/3/library/stdtypes.html#dict-views
and the fact that we keep track of running servers using a dictionary.
Furthermore, we rely on the assumption that the test currently works
correctly.
Assume, to the contrary, that among the nodes taking part in the operations
listed above, there is at most one node per rack that has its host ID recognized
by the rest of the cluster. Note that only those nodes can store any tablets.
Let's refer to the set of those nodes as X.
Assume that we're dealing with tablet migration, decommissioning, or removing
a node. Since those operations involve tablet migration, at least one tablet
will need to be migrated from the node in question to another node in X.
However, since X consists of at most three nodes, and one of them is losing
its tablet, there is no viable target for the tablet, so the operation fails.
Using those assumptions, an auxiliary function, `select_viable_rack`,
was designed to carefully choose a correct rack, which we'll then pick nodes
from to perform the topological operations. It's simple: we just find the first
rack in the list that has at least two nodes in it. That should ensure that we
perform an operation that doesn't lead to any unforeseen disaster.
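A minimal sketch of the rack selection described above. It assumes servers arrive as (name, rack) pairs in the order returned by `manager.running_servers`; the signature and data shapes are illustrative, not the test's actual code.

```python
from typing import Optional


def select_viable_rack(servers: list[tuple[str, str]]) -> Optional[str]:
    """Return the first rack (by order of first appearance) with >= 2 nodes.

    `servers` is a hypothetical list of (node_name, rack_name) pairs.
    Returns None when no rack has two nodes, in which case the caller
    is expected to log that fact and skip the operation.
    """
    nodes_by_rack: dict[str, list[str]] = {}
    for name, rack in servers:
        # dicts preserve insertion order, so racks keep their first-seen order
        nodes_by_rack.setdefault(rack, []).append(name)
    for rack, nodes in nodes_by_rack.items():
        if len(nodes) >= 2:
            return rack
    return None
```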
------------------------------------------------------------------------
Since the test effectively becomes more complex due to more care for keeping
the topology of the cluster valid, we extend the log messages to make them
more helpful when debugging a failure.
This patch adds reproducing tests in test/alternator for issue #23438,
which is about missing checks for the length of headers and the URL
in Alternator requests. These should be limited, because Seastar's
HTTP server, which Scylla uses, reads them into memory so they can OOM
Scylla.
The tests demonstrate that DynamoDB enforces a 16 KB limit on the
headers and the URL of the request, but Scylla doesn't (a code
inspection suggests it does not in fact have any limit).
The two tests pass on DynamoDB and currently xfail on Alternator.
Refs #23438.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#23442
Sometimes file::list_directory() returns entries without type set. In
this case lister calls file_type() on the entry name to get it. In case
the call returns disengaged type, the code assumes that some error
occurred and resolves into exception.
That's not correct. The file_type() method returns disengaged type only
if the file being inspected is missing (i.e. on ENOENT errno). But this
can validly happen if a file is removed between readdir and stat. In
that case it's not "some error happened", but an entry that should just be
skipped. If "some error happened", then file_type() would resolve into
exceptional future on its own.
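The intended behavior can be illustrated with a small Python analogue (not the Seastar code itself): a missing file yields "no type" and is skipped, while any other error propagates. The helper names `file_type` and `list_entries` below are hypothetical stand-ins.

```python
import os
import stat
from typing import Iterator, Optional


def file_type(path: str) -> Optional[str]:
    """Return 'dir', 'file' or 'other'; None only if the path is missing."""
    try:
        mode = os.lstat(path).st_mode
    except FileNotFoundError:
        # ENOENT: the entry vanished between readdir and stat
        return None
    # Any other OSError (EACCES, EIO, ...) propagates to the caller.
    if stat.S_ISDIR(mode):
        return "dir"
    if stat.S_ISREG(mode):
        return "file"
    return "other"


def list_entries(directory: str, names: list[str]) -> Iterator[tuple[str, str]]:
    """Yield (name, type) for each entry, skipping entries that disappeared."""
    for name in names:
        kind = file_type(os.path.join(directory, name))
        if kind is None:
            continue  # not an error, just a removed file
        yield name, kind
```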
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#26595
Currently, batchlog replay is considered successful even if all batches fail
to be sent (they are replayed later). However, repair requires all batches
to be sent successfully. Currently, if batchlog isn't cleared, the repair never
learns about it and still updates the repair_time. If GC mode is set to "repair",
this means that the tombstones written before the repair_time (minus
propagation_delay) can be GC'd while not all batches were replayed.
Consider a scenario:
- Table t has a row with (pk=1, v=0);
- There is an entry in the batchlog that sets (pk=1, v=1) in table t;
- The row with pk=1 is deleted from table t;
- Table t is repaired:
- batchlog replay fails;
- repair_time is updated;
- propagation_delay seconds pass and the tombstone of pk=1 is GC'd;
- batchlog is replayed and (pk=1, v=1) inserted - data resurrection!
Do not update repair_time if sending any batch fails. The data is still repaired.
For tablet repair the repair runs, but at the end the exception is passed
to topology coordinator. Thanks to that the repair_time isn't updated.
The repair request isn't removed as well, due to which the repair will need
to rerun.
Apart from that, a batch is removed from the batchlog if its version is invalid
or unknown. The condition on which we consider a batch too fresh to replay
is updated to consider propagation_delay.
Fixes: https://github.com/scylladb/scylladb/issues/24415
Data resurrection fix; needs backport to all versions
Closes scylladb/scylladb#26319
* github.com:scylladb/scylladb:
db: fix indentation
test: add reproducer for data resurrection
repair: fail tablet repair if any batch wasn't sent successfully
db/batchlog_manager: fix making decision to skip batch replay
db: repair: throw if replay fails
db/batchlog_manager: delete batch with incorrect or unknown version
db/batchlog_manager: coroutinize replay_all_failed_batches
Before this commit, when the underlying materialized view was created,
it didn't have the property `tombstone_gc` set to any value. We fix the
bug in this PR.
Implementation strategy:
1. Move code responsible for producing the schema
of a secondary index to the file that handles
`CREATE INDEX`.
2. Set the property when creating the view.
3. Add reproducer tests.
Fixes scylladb/scylladb#26542
Backport: we can discuss it.
Closes scylladb/scylladb#26543
* github.com:scylladb/scylladb:
index: Set tombstone_gc when creating secondary index
index: Make `create_view_for_index` method of `create_index_statement`
index: Move code for creating MV of secondary index to cql3
db, cql3: Move creation of underlying MV for index
Following DynamoDB, Alternator also places a 16 MB limit on the size of
a request. Such a limit is necessary to avoid running out of memory -
because the AWS message authentication protocol requires reading the
entire request into memory before its signature can be verified.
Our implementation for this limit used Seastar's HTTP server's
content_length_limit feature. However, this Seastar feature is
incomplete - it only works when the request uses the Content-Length
header, and doesn't do anything if the request doesn't have a
Content-Length (it may use chunked encoding, or have no length at all).
So malicious users can cause Scylla to OOM by sending a huge request
without a Content-Length.
So in this patch we stop using the incomplete Seastar feature, and
implement the length limit in Scylla in a way that works correctly with
or without Content-Length: We read from the input stream and if we go
over 16MB, we generate an error.
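As a language-neutral illustration of that approach (a Python sketch, not the Alternator code), the reader counts bytes as they arrive and fails as soon as the limit is exceeded, regardless of whether a Content-Length was announced. The names `read_limited` and `PayloadTooLarge` and the chunk size are assumptions made for this example.

```python
import io

REQUEST_LIMIT = 16 * 1024 * 1024  # 16 MB, the limit named in the text


class PayloadTooLarge(Exception):
    """Hypothetical stand-in for the Payload Too Large error response."""


def read_limited(stream, limit: int = REQUEST_LIMIT, chunk_size: int = 64 * 1024) -> bytes:
    """Read the whole body from `stream`, raising once more than `limit` bytes arrive."""
    chunks = []
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break  # end of input
        total += len(chunk)
        if total > limit:
            # Stop before buffering the oversized part of the body.
            raise PayloadTooLarge(f"request body exceeds {limit} bytes")
        chunks.append(chunk)
    return b"".join(chunks)
```

For instance, `read_limited(io.BytesIO(b"x" * 10), limit=9)` raises `PayloadTooLarge`, while the same call with `limit=10` returns the ten bytes.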
Because we dropped Seastar's protection against a long Content-Length,
we also need to fix a piece of code which used Content-Length to reserve
some semaphore units to prevent reading many large requests in parallel.
We fix two problems in the code:
1. If Content-Length is over the limit, we shouldn't attempt to reserve
semaphore units - this should just be a Payload Too Large error.
2. If Content-Length is missing, the existing code did nothing and had
a TODO that we should. In this patch we implement what was suggested
in that TODO: We temporarily reserve the whole 16 MB limit, and
after reading the actual request, we return part of the reservation
according to the real request size.
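The two fixes above can be sketched as follows. This is a simplified, synchronous Python model: `MemoryBudget` stands in for the real semaphore (it raises instead of waiting), and all names are hypothetical.

```python
from typing import Optional

LIMIT = 16 * 1024 * 1024  # per-request limit from the text


class PayloadTooLarge(Exception):
    """Hypothetical stand-in for the Payload Too Large error."""


class MemoryBudget:
    """Toy replacement for a semaphore: tracks available units, never blocks."""

    def __init__(self, total: int) -> None:
        self.available = total

    def reserve(self, units: int) -> None:
        if units > self.available:
            raise RuntimeError("a real semaphore would wait here")
        self.available -= units

    def release(self, units: int) -> None:
        self.available += units


def reserve_for_request(budget: MemoryBudget, content_length: Optional[int]) -> int:
    """Reserve units before reading the body; return how many were reserved."""
    if content_length is not None and content_length > LIMIT:
        # Fix 1: reject without touching the budget.
        raise PayloadTooLarge()
    # Fix 2: with no Content-Length, reserve the whole limit for now.
    units = LIMIT if content_length is None else content_length
    budget.reserve(units)
    return units


def shrink_reservation(budget: MemoryBudget, reserved: int, actual_size: int) -> int:
    """After reading the body, return the unused part; return what is still held."""
    excess = reserved - actual_size
    if excess > 0:
        budget.release(excess)
        return actual_size
    return reserved
```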
That last fix is important, because typically the largest requests will be
BatchWriteItem where a well-written client would want to use chunked
encoding, not Content-Length, to avoid materializing the entire request
up-front. For such clients, the memory use semaphore did nothing, and
now it does the right thing.
Note that this patch does *not* solve the problem #12166 that existed
with Seastar's length-limiting implementation but still exists in the
new in-Scylla length-limiting implementation: The fact we send an
error response in the middle of the request and then close the
connection, while the client continues to send the request, can lead
to an RST being sent by the server kernel. Usually this will be fine -
well-written client libraries will be able to read the response before
the RST. But even with a well-written library in some rare timings
the client may get the RST before the response, and will miss the
response, and get an empty or partial response or "connection reset
by peer". This issue existed before this patch, and still exists, but
is probably of minor impact.
Fixes #8196
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#23434
The utils library requires OpenSSL's libcrypto for cryptographic
operations and without linking libcrypto, builds fail with undefined
symbol errors. Fix that by linking `crypto` to `utils` library when
compiled with cmake. The build files generated with configure.py already
have `crypto` lib linked, so they do not have this issue.
Fixes #26705
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closes scylladb/scylladb#26707
This change adds the ability to move tablets sizes in load_stats after a
tablet migration or table resize (split/merge). This is needed because
the size based load balancer needs to have tablet size data which is as
accurate as possible, in order to issue migrations which improve
load balance.
In #24031 users complained that the trace message is truncated: it's
no longer JSON-parsable and the table name might not be part of the output.
This patch enables users to configure the maximum size of the trace message.
In case the user wants the `table` name but doesn't care about message size,
#26634 will help.
- add configuration variable `alternator_max_users_query_size_in_trace_output`
with default value of 4096 (4 times the old default value).
- modify `truncated_content_view` function to use new configuration
variable for truncation limit
- update `truncated_content_view` to consistently truncate at given
size, previously truncation would also happen when data arrived in
more than one chunk
- update `truncated_content_view` to better handle truncated value
(limit number of copies)
- fix `scylla_config_read` call - a `query` call for a nonexistent
configuration name returns an empty (but present) `Items` array,
which would raise an array access exception a few lines
below.
- add test
Refs #26634
Refs #24031
Closes scylladb/scylladb#26618
When streaming SSTables across tablets, a single SSTable may be streamed to multiple tablets. The previous implementation unlinked SSTables immediately after streaming them for the first tablet, potentially making them partially unavailable for subsequent tablets. This patch replaces unlink() with mark_for_deletion() deferring actual unlinking till sstable::close_files.
test_tablets2::test_tablet_load_and_stream was enhanced to also verify that SSTables are removed after being streamed.
Fixes #26606
Backport is not required: although this is a bug fix, the bug isn't visible. This is more of a preparatory fix for https://github.com/scylladb/scylladb/pull/26444.
Closes scylladb/scylladb#26622
* github.com:scylladb/scylladb:
test_tablets2: verify SSTable cleanup after tablet load and stream
tablet_sstable_streamer: replace unlink() call with mark_for_deletion()
It turns out that #21477 wasn't sufficient to fix the issue. The driver
may still decide to reconnect the connection after `rolling_restart`
returns. One possible explanation is that the driver sometimes handles
the DOWN notification after all nodes consider each other UP.
Reconnecting the driver after restarting nodes seems to be a reliable
workaround that many tests use. We also use it here.
Fixes #19959
Closes scylladb/scylladb#26638
With the recent introduction of retry_strategy to Seastar, the pure virtual class previously defined in ScyllaDB is now redundant. This change allows us to streamline our codebase by directly inheriting from Seastar’s implementation, eliminating duplication in ScyllaDB.
Although this update is purely a refactoring effort and does not introduce functional changes, it should be backported to 2025.3 and 2025.4; otherwise future backports of bugfixes/improvements related to `s3_client` will be nearly impossible.
ref: https://github.com/scylladb/seastar/issues/2803
depends on: https://github.com/scylladb/seastar/pull/2960
Closes scylladb/scylladb#25801
* github.com:scylladb/scylladb:
s3_client: remove unnecessary `co_await` in `make_request`
s3 cleanup: remove obsolete retry-related classes
s3_client: remove unused `filler_exception`
s3_client: fix indentation
s3_client: simplify chunked download error handling using `make_request`
s3_client: reformat `make_request` functions for readability
s3_client: eliminate duplication in `make_request` by using overload
s3_client: reformat `make_request` function declarations for readability
s3_client: reorder `make_request` and helper declarations
s3_client: add `make_request` override with custom retry and error handler
s3_client: migrate s3_client to Seastar HTTP client
s3_client: fix crash in `copy_s3_object` due to dangling stream
s3_client: coroutinize `copy_s3_object` response callback
aws_error: handle missing `unexpected_status_error` case
s3_creds: use Seastar HTTP client with retry strategy
retry_strategy: add exponential backoff to `default_aws_retry_strategy`
retry_strategy: introduce Seastar-based retry strategy
retry_strategy: update CMake and configure.py for new strategy
retry_strategy: rename `default_retry_strategy` to `default_aws_retry_strategy`
retry_strategy: fix include
retry_strategy: Copied utils/s3/retry_strategy.hh to utils/s3/default_aws_retry_strategy.hh
retry_strategy: Copied utils/s3/retry_strategy.cc to utils/s3/default_aws_retry_strategy.cc
When a tablet is migrated between shards on the same node, during the write_both_read_new state we begin switching reads to the new shard. Until the corresponding global barrier completes, some requests may still use write_both_read_old erm, while others already use the write_both_read_new erm. To ensure mutual exclusion between these two types of requests, we must acquire locks on both the old and new shards. Once the global barrier completes, no requests remain on the old shard, so we can safely switch to acquiring locks only on the new shard.
The idea came from the similar locking problem in the [counters for tablets PR](https://github.com/scylladb/scylladb/pull/26636#discussion_r2463932395).
Fixes scylladb/scylladb#26727
backport: need to backport to 2025.4
Closes scylladb/scylladb#26719
* https://github.com/scylladb/scylladb:
paxos_state: use shards_ready_for_reads
paxos_state: inline shards_for_writes into get_replica_lock
Before this patch, when a base table has many materialized views,
each write to this table can start up to 128 view updates in parallel.
With high client write concurrency, the actual concurrency of writes
executed on the node may grow unexpectedly, which can lead to higher
latency and higher memory usage compared to a sequential approach.
In this patch we add a per-shard, per-service-level semaphore which
limits the number of concurrent view updates processed on the shard
in this service level to a constant value. We take one unit from the
semaphore for each local view update write, and release it when it
finishes. The remote view updates do not take units from the semaphore
because they don't consume nearly as much processing power and they
are limited by another semaphore based on their memory usage.
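The effect of such a semaphore can be sketched with Python's asyncio (an analogy only; the real code uses Seastar semaphores, one per shard and service level, which this toy does not model). The function name and the instrumentation counters are made up for the example.

```python
import asyncio


async def run_view_updates(n_updates: int, max_concurrent: int) -> int:
    """Run `n_updates` fake local view updates; return the peak concurrency seen."""
    sem = asyncio.Semaphore(max_concurrent)
    running = 0
    peak = 0

    async def local_view_update() -> None:
        nonlocal running, peak
        async with sem:  # take one unit for the duration of the write
            running += 1
            peak = max(peak, running)
            await asyncio.sleep(0)  # stand-in for the actual write
            running -= 1
        # the unit is released here, when the update finishes

    await asyncio.gather(*(local_view_update() for _ in range(n_updates)))
    return peak
```

Running `asyncio.run(run_view_updates(50, 8))` never reports a peak above 8, however many updates are started at once.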
Fixes https://github.com/scylladb/scylladb/issues/25341
Closes scylladb/scylladb#25456
* github.com:scylladb/scylladb:
mv: limit concurrent view updates from all sources
database: rename _view_update_concurrency_sem to _view_update_memory_sem
Currently, tablet_sstable_streamer::get_primary_endpoints is out of
sync with tablet_map::get_primary_replica. The get_primary_replica
optimizes the choice of the replica so that the work is fairly
distributed among nodes. Meanwhile, get_primary_endpoints always
chooses the first replica.
Use get_primary_replica for get_primary_endpoints.
Fixes: https://github.com/scylladb/scylladb/issues/21883.
Closes scylladb/scylladb#26385
when loading CDC streams metadata for tablets from the tables, read only
new entries from the history table instead of reading all entries. This
improves the CDC metadata reloading, making it more efficient and
predictable.
the CDC metadata is loaded as part of group0 reload whenever the
internal CDC tables are modified. on tablet split / merge, we create a
new CDC timestamp and streams by writing them to the cdc_streams_history
table by group0 operation, and when it's applied we reload the in-memory
CDC streams map by reading from the tables and constructing the updated map.
Previously, on every update, we would read the entire
cdc_streams_history entries for the changed table, constructing all its
streams and creating a new map from scratch.
We improve this now by reading only new entries from cdc_streams_history
and append them to the existing map. we can do this because we only
append new entries to cdc_streams_history with higher timestamp than all
previous entries.
This makes this reloading more efficient and predictable, because
previously we would read a number of entries that depends on the number
of tablet splits and merges, which increases over time and is
unbounded, whereas now we read only a single stream set on each update.
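A minimal sketch of the incremental reload, assuming the in-memory map is keyed by timestamp and that history rows can be filtered by timestamp. The function name and the data shapes are hypothetical, and the filter below stands in for a query that reads only the new rows.

```python
def append_new_entries(
    stream_map: dict[int, list[str]],
    history: list[tuple[int, list[str]]],
) -> int:
    """Append history entries newer than anything already in `stream_map`.

    `history` models the cdc_streams_history rows of one table as
    (timestamp, stream ids) pairs. Because new rows always carry a higher
    timestamp than all previous ones, appending is enough; existing map
    entries are never rebuilt. Returns the number of entries added.
    """
    last_known = max(stream_map) if stream_map else None
    added = 0
    for timestamp, streams in history:
        if last_known is None or timestamp > last_known:
            stream_map[timestamp] = streams
            added += 1
    return added
```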
Fixes scylladb/scylladb#26732
Before this patch, when a base table has many materialized views,
each write to this table can start up to 128 view updates in parallel.
With high client write concurrency, the actual concurrency of writes
executed on the node may grow unexpectedly, which can lead to higher
latency and higher memory usage compared to a sequential approach.
In this patch we add a per-shard, per-service-level semaphore which
limits the number of concurrent view updates processed on the shard
in this service level to a constant value. We take one unit from the
semaphore for each local view update write, and release it when it
finishes. The remote view updates do not take units from the semaphore
because they don't consume nearly as much processing power and they
are limited by another semaphore based on their memory usage.
The effect of this patch can also be observed when writing to a base
table with a large number of materialized views, like in the
materialized_views_test.py::TestMaterializedViews::test_many_mv_concurrent
dtest. In that test, if we perform a full scan in parallel to a write
workload with a concurrency of 100 to a table with 100 views, the scan
would sometimes time out because it would effectively get 1/10000 of cpu.
With this patch, the cpu concurrency of view updates was limited to 128
(we ran both writes and scan in the same service level), and the scan
no longer timed out.
Fixes https://github.com/scylladb/scylladb/issues/25341
The std::optional<T> inject_parameter(...) method is a template, and in
dev/debug modes this parameter is defaulted to std::string_view, but for
release mode it's not. This patch makes it symmetrical.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#26706
When streaming SSTables across tablets, a single SSTable may be streamed to multiple tablets.
The previous implementation unlinked SSTables immediately after streaming them for the first tablet,
potentially making them partially unavailable for subsequent tablets.
This patch replaces the unlink() call with mark_for_deletion().
Acquiring locks on both shards for the entire tablet migration period
is redundant. In most cases, locking only the old shard or only the new
shard is sufficient. Using shards_ready_for_reads reduces the
situations in which we need to lock both shards to:
* intra-node migrations only
* only during the write_both_read_new state
Once the global barrier completes in the write_both_read_new state, no
requests remain on the old shard, so we can safely acquire locks
only on the new shard.
Fixes scylladb/scylladb#26727
introduce helper functions that can be used for garbage collecting old
cdc streams for tablets-based keyspaces.
add a background fiber to the topology coordinator that runs
periodically and checks for old CDC streams for tablets keyspaces that
can be garbage collected.
the garbage collection works by finding the newest cdc timestamp that has been
closed for more than the configured cdc TTL, and removing all information from
the cdc internal tables about cdc timestamps and streams up to this timestamp.
in general it should be safe to remove information about these streams because
they are closed for more than TTL, therefore all rows that were written to these streams
with the configured TTL should be dead.
the exception is if the TTL is altered to a smaller value, and then we may remove information
about streams that still have live rows that were written with the longer ttl.
Fixes https://github.com/scylladb/scylladb/issues/26669
Closes scylladb/scylladb#26410
* github.com:scylladb/scylladb:
cdc: garbage collect CDC streams periodically
cdc: helpers for garbage collecting old streams for tablets
Currently when a null vector is passed to an ANN query we fail with a
quite confusing error ("NoHostAvailable: ('Unable to complete the
operation against any hosts', {<Host: 127.0.0.1:9042 datacenter1>:
<Error from server: code=0000 [Server error] message="to_bytes() called
on raw value that is null">})").
This patch fixes that by throwing an InvalidRequestException with an
appropriate message instead.
We also add a test case that validates this behavior.
Fixes: VECTOR-257
Closes scylladb/scylladb#26510
Fixes #26641
* Adds shared abstraction for dockerized mock services for our pytests (not using python docker, due to both library and podman)
* Adds test fixtures for our key providers (except GCS KMS, for which we have no mock server) to do local testing
* Ports (and prunes and sharpens) the test cases from dtest::encryption_at_rest_test to our pytest.
* Shares the KMIP mock between boost test and pytest and speeds up boost test shutdown.
When merged, the dtest counterpart can be decommissioned.
Closes scylladb/scylladb#26642
* github.com:scylladb/scylladb:
test::cluster::object_store::conftest: Make GS proxy use shared docker mock server wrapper
test::cluster::test_encryption: Port dtest EAR tests
test::cluster::conftest: Add key_provider fixture
test::pylib::encryption_provider: Port dtest encryption provider classes
test::pylib::dockerized_service: Add helper for running docker/podman
test::pylib::kmip_wrapper: Modify to be usable by pytest fixtures
test::boost::kmip_wrapper: Move python script for PyKMIP to pylib
Problems addressed by this PR
* Missing barrier before cleanup: If a node was bootstrapped before cleanup, some request coordinators could still be in `write_both_read_new` and send stale requests to replicas being cleaned up.
* Sessions not drained before cleanup: We lacked protection against stale streaming or repair operations.
* `sstable_vnodes_cleanup_fiber()` calling `flush_all_tables()` under group0 lock: This caused SCT test failures (see [this comment](https://github.com/scylladb/scylladb/issues/25333#issuecomment-3298859046) for details).
* Issues with `storage_proxy::start_write()` used by `sstable_vnodes_cleanup_fiber`:
* The result of `start_write()` was not held during `abstract_write_response_handler::apply_locally`, so coordinator-local writes were not properly awaited.
* Synchronization was racy — `start_write()` was not atomic with the fence check, allowing stale writes to sneak in if `fence_version` changed in between.
* It waited for all writes, including local tables and tablet-based tables, which is redundant because `sstable_vnodes_cleanup_fiber` does not apply to them.
* It also waited for writes with versions greater than the current `fence_version`, which is unnecessary.
Fixes scylladb/scylladb#26150
backport: this PR fixes several issues with the vnodes cleanup procedure, but it doesn't seem they are critical enough to deserve backporting
Closes scylladb/scylladb#26315
* https://github.com/scylladb/scylladb:
test_automatic_cleanup: add test_cleanup_waits_for_stale_writes
test_fencing: fix due to new version increment
test_automatic_cleanup: clean it up
storage_proxy: wait for closing sessions in sstable cleanup fiber
storage_proxy: rename await_pending_writes -> await_stale_pending_writes
storage_proxy: use run_fenceable_write
storage_proxy: abstract_write_response_handler: apply_locally: extract post fence check
storage_proxy: introduce run_fenceable_write
storage_proxy: move update_fence_version from shared_token_metadata
storage_proxy: fix start_write() operation scope in apply_locally
storage_proxy: move post fence check into handle_write
storage_proxy: move fencing into mutate_counter_on_leader_and_replicate
storage_proxy::handle_read: add fence check before get_schema
storage_service: rebrand cleanup_fiber to vnodes_cleanup_fiber
sstable_cleanup_fiber: use coroutine::parallel_for_each
storage_service: sstable_cleanup_fiber: move flush_all_tables out of the group0 lock
topology_coordinator: barrier before cleanup
topology_coordinator: small start_cleanup refactoring
global_token_metadata_barrier: add fenced flag
No need to have two functions since both callers of get_replica_lock()
use shards_for_writes() to compute the shards where the locks
must be acquired.
Also while at it, inline the acquire() lambda in get_replica_lock()
and replace it with a loop over shards. This makes the code
more straightforward.
add a background fiber to the topology coordinator that runs
periodically and checks for old CDC streams for tablets keyspaces that
can be garbage collected.
introduce helper functions that can be used for garbage collecting old
cdc streams for tablets-based keyspaces.
- get_new_base_for_gc: finds a new base timestamp given a TTL, such that
all older timestamps and streams can be removed.
- get_cdc_stream_gc_mutations: given new base timestamp and streams,
builds mutations that update the internal cdc tables and remove the
older streams.
- garbage_collect_cdc_streams_for_table: combines the two functions
above to find a new base and build mutations to update it for a
specific table
- garbage_collect_cdc_streams: builds gc mutations for all cdc tables
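One way to picture the first helper is sketched below. It assumes that a stream set is closed at the moment the next timestamp starts, so everything older than the chosen base has been closed for more than the TTL. The signature, the integer timestamps and the "nothing to collect" convention are assumptions for this sketch, not the actual C++ interface.

```python
import bisect
from typing import Optional


def get_new_base_for_gc(timestamps: list[int], now: int, ttl: int) -> Optional[int]:
    """Pick a new base timestamp, or None if nothing can be collected.

    `timestamps` is sorted ascending. The new base is the newest timestamp
    that started more than `ttl` ago; every older timestamp was closed when
    the base started, hence closed for more than `ttl`, and can be removed.
    """
    cutoff = now - ttl
    # Number of timestamps strictly older than the cutoff.
    idx = bisect.bisect_left(timestamps, cutoff)
    if idx <= 1:
        # Zero candidates, or the only candidate is already the oldest entry.
        return None
    return timestamps[idx - 1]


def removable_timestamps(timestamps: list[int], new_base: int) -> list[int]:
    """Timestamps whose stream information would be dropped for this base."""
    return [ts for ts in timestamps if ts < new_base]
```

With timestamps `[10, 20, 30, 40]`, `now=100` and `ttl=65`, the cutoff is 35, the new base is 30, and the entries for 10 and 20 become removable.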
In theory, scylla-sstable write is an awesome and flexible tool to generate sstables with arbitrary content. This is convenient for tests and could come in clutch in a disaster scenario, where the content of certain system tables needs to be manually re-created, namely system tables that are not writable directly via CQL.
In practice, in its current form this operation is so convoluted to use that even its own author shuns it. This is because the JSON specification of the sstable content is the same as that of the scylla-sstable dump-data: containing every single piece of information on the mutation content. Where this is an advantage for dump-data, allowing users to inspect the data in its entirety -- it is a huge disadvantage for write, because all these details have to be filled in, down to the last timestamp, to generate an sstable. On top of that, the tool doesn't even support any of the more advanced data types, like collections, UDF and counters.
This PR proposes a new way of generating sstables: based on the success of scylla-sstable query, it introduces CQL support for scylla-sstable write. The content of the sstable can now be specified via standard INSERT, UPDATE and DELETE statements, which are applied to a memtable, then flushed into the sstable.
To avoid boundless memory consumption, the memtable is flushed every time it reaches 1MiB in size, consequently the command can generate multiple output sstables.
The new CQL input-format is made default, this is safe as nobody is using this command anyway. Hopefully this PR will change that.
Fixes: https://github.com/scylladb/scylladb/issues/26506
New feature, no backport.
Closes scylladb/scylladb#26515
* github.com:scylladb/scylladb:
test/cqlpy/test_tools.py: add test for scylla-sstable write --input-format=cql
replica/mutation_dump: add support for virtual tables
tools/scylla-sstable: print_query_results_json(): handle empty value buffer
tools/scylla-sstable: add cql support to write operation
tools/scylla-sstable: write_operation(): fix indentation
tools/scylla-sstable: write_operation(): prepare for a new input-format
tools/scylla-sstable: generalize query_operation_validate_query()
tools/scylla-sstable: move query_operation_validate_query()
tools/scylla-sstable: extract schema transformation from query operation
replica/table: add virtual write hook to the other apply() overload too
`select * from mutation_fragment()` queries don't return partitions which are completely empty or only contain tombstones which are all garbage collectible. This is because the underlying `mutation_dump` mechanism has a separate query to discover partitions for scans. This query is a regular mutation scan, which is subject to query compaction and garbage collection. Disable the query compaction for mutation queries executed on behalf of mutation fragment queries, so *all* data is visible in the result, even that which is fully garbage collectible.
Fixes scylladb/scylladb#23707.
Scans for mutation-fragment are very rare, so a backport is not necessary. We can backport on-demand.
Closes scylladb/scylladb#26227
* github.com:scylladb/scylladb:
replica/mutation_dump: multi_range_partition_generator: disable garbage-collection
replica: add tombstone_gc_enabled parameter to mutation query methods
mutation/mutation_compactor: remove _can_gc member
tombstone_gc: add tombstone_gc_state factory methods for gc_all and no_gc
This patch changes the tablet size map in load_stats. Previously, this
data structure was:
std::unordered_map<range_based_tablet_id, uint64_t> tablet_sizes;
and is changed into:
std::unordered_map<table_id, std::unordered_map<dht::token_range, uint64_t>> tablet_sizes;
This allows for improved performance of tablet size reconciliation.
The primary goal of this test is to reproduce scylladb/scylladb#26040
so the fix (278019c328) can be backported
to older branches.
Scenario: connect via CQL as an anonymous user and verify that the
`sl:default` scheduling group is used. Before the fix for #26040
`main` scheduling group was incorrectly used instead of `sl:default`.
Control connections may legitimately use `sl:driver`, so the test
accepts those occurrences while still asserting that regular anonymous
queries use `sl:default`.
This adds explicit coverage on master. After scylladb#24411 was
implemented, some other tests started to fail when scylladb#26040
was unfixed. However, none of the tests asserted this exact behavior.
Refs: scylladb/scylladb#26040
Refs: scylladb/scylladb#26581
Closes scylladb/scylladb#26589
Refactor `chunked_download_source` to eliminate redundant exception
handling by leveraging the new `make_request` override with custom
retry strategy. This streamlines the download fiber logic, improving
readability and maintainability.
Reformats the `make_request` function declarations to improve readability
due to the large number of arguments. This aligns with our formatting
guidelines and makes the code easier to maintain.
Introduce an override for `make_request` in `s3_client` to support
custom retry strategies and error handlers, enabling flexibility
beyond the default client behavior and improving control over request
handling
In the `copy_part` method, move the `input_stream<char>` argument
into a local variable before use. Failing to do so can lead to a
SIGSEGV or trigger an abort under address sanitizer.
Add a missing `case` clause to the `switch` statement to correctly
handle scenarios where `unexpected_status_error` is thrown. This
fixes overlooked error handling and improves robustness.
In AWS credentials providers, replace `retryable_http_client` with
Seastar's native HTTP client. Integrate the newly added
`default_aws_retry_strategy` to handle retries more efficiently and
reduce dependency on external retry logic.
Add a new class derived from Seastar's `default_retry_strategy`.
Relocate the `should_retry` implementation from Scylla's
`default_retry_strategy` into the new class to centralize and
standardize retry behavior.
Renames the `default_retry_strategy` class to `default_aws_retry_strategy`
to clarify its association with the S3 client implementation. This avoids
confusion with the unrelated `seastar::default_retry_strategy` class.
Add a reproducer to check that the repair_time isn't updated
if the batchlog replay fails.
If repair_time was updated, tombstones could be GC'd before the
batchlog is replayed. The replay could later cause data
resurrection.
If any batch replay failed, we cannot update repair_time, as that would
risk data resurrection.
If replay of any batch needs to be retried, run the whole repair but
fail at the very end, so that the repair_time for it won't be updated.
Currently, we skip batch replay if less than batch_log_timeout has passed
since the moment the batch was written. The batch_log_timeout value can
be configured. If it is large, a batch won't be replayed for a long time.
If a tombstone is GC'd before the batch is replayed, we
risk data resurrection.
To ensure safety we can skip only the batches that won't be GC'd.
In this patch we skip replay of the batches for which:
now() < written_at + min(timeout, propagation_delay)
repair_time is set at the start of batchlog replay, so at the moment
of the check we will have:
repair_time <= now()
So we know that:
repair_time < written_at + propagation_delay
With this condition we are sure that GC won't happen.
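The skip condition can be written out as a small Python sketch. It reads the bound as the minimum of the two values (timeout and propagation_delay); the function name and the plain numeric timestamps are assumptions made for illustration, not the batchlog_manager code.

```python
def should_skip_replay(now: float, written_at: float,
                       timeout: float, propagation_delay: float) -> bool:
    """Return True if the batch is still fresh enough to be skipped safely."""
    return now < written_at + min(timeout, propagation_delay)


def gc_safe(repair_time: float, written_at: float,
            propagation_delay: float) -> bool:
    """The property derived above: repair_time < written_at + propagation_delay."""
    return repair_time < written_at + propagation_delay
```

Because repair_time <= now() at the time of the check, every batch for which `should_skip_replay` returns True also satisfies `gc_safe`, however large the configured timeout is.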
Return a flag determining whether all the batches were sent successfully in
batchlog_manager::replay_all_failed_batches (batches skipped due to being
too fresh are not counted). Throw in repair_flush_hints_batchlog_handler
if not all batches were replayed, to ensure that repair_time isn't updated.
batchlog_manager::replay_all_failed_batches skips batches that have
unknown or incorrect version. Next round will process these batches
again.
Such batches will probably be skipped every time, so there is no point
in keeping them. Even if at some point the version becomes correct,
we should not replay the batch - it might be old and this may lead
to data resurrection.
This PR introduces support for a new scrub option: `--drop-unfixable-sstables`, which enables the dropping of corrupted SSTables during scrub only in segregate mode. The patch includes implementation, validation, and a set of tests to ensure correct behavior and error handling.
Fixes #19060
Backport is not required, it is a new feature.
Closes scylladb/scylladb#26579
* github.com:scylladb/scylladb:
sstable_compaction_test: add segregate mode tests for drop-unfixable-sstables option
test/nodetool: add scrub drop-unfixable-sstables option testcase
scrub: add support for dropping unfixable sstables in segregate mode
In the following commit, we'll introduce a new semaphore for view updates
that limits their concurrency by view update count. To avoid confusion,
we rename the existing semaphore that tracks the memory used by concurrent
view updates and related objects accordingly.
The guard should stop refreshing the ERM when the number of tablets changes. Tablet splits or merges invalidate the `tablet_id` field (`_tablet`), which means the guard can no longer correctly protect ongoing operations from tablet migrations.
The problem is specific to LWT, since `tablet_metadata_guard` is used mostly for heavy topology operations, which are mutually exclusive with split and merge. The guard was used for LWT as an optimization -- we don't need to block topology operations or migrations of unrelated tablets. In the future, we could use the guard for regular reads/writes as well (via the `token_metadata_guard` wrapper).
Fixes [scylladb/scylladb#26437](https://github.com/scylladb/scylladb/issues/26437)
backports: need to backport to 2025.4 since the bug is relevant to LWT over tablets.
Closes scylladb/scylladb#26619
* github.com:scylladb/scylladb:
test_tablets_lwt: add test_tablets_merge_waits_for_lwt
test.py: add universalasync_typed_wrap
tablet_metadata_guard: fix split/merge handling
tablet_metadata_guard: add debug logs
paxos_state: shards_for_writes: improve the error message
storage_service: barrier_and_drain – change log level to info
topology_coordinator: fix log message
Added a new test case, sstable_scrub_segregate_mode_drop_unfixable_sstables_test,
which verifies that when the drop-unfixable-sstables flag is enabled in segregate
mode, corrupted SSTables are correctly dropped.
This patch introduces the test_scrub_drop_unfixable_sstables_option testcase,
which verifies that the correct request is generated when the --drop-unfixable-sstables flag is used.
It also validates that an error is thrown if the drop-unfixable-sstables
flag is enabled and the mode is not set to SEGREGATE.
This patch adds a new flag `drop-unfixable-sstables` to the scrub operation
in segregate mode, allowing it to automatically drop SSTables that
cannot be fixed during scrub. It also includes API support for the 'drop_unfixable_sstables'
parameter and validation to ensure this flag is not enabled in modes other than segregate.
Topology version is now bumped when a node finishes bootstrapping.
As a result, fence_version == version - 1, and decrementing version
in the test no longer triggers a stale topology exception.
Fix: run cleanup_all to invoke the global barrier, which synchronizes
fence_version := version on all nodes.
Remove redundant imports and variables. Extract cleanup_all
function. Add logs. Remove pytest.mark.prepare_3_racks_cluster --
the test doesn't actually need a 3 node cluster, one initial
node is enough.
All mutation_holder::apply_locally() implementations now do the same
post-fence check. In this commit we hoist this check up to
abstract_write_response_handler::apply_locally().
This function is intended to replace start_write() in subsequent
commits. It provides the following benefits:
* Remove duplication: All start_write() call sites must run the fence
check after the operation completes. run_fenceable_write() encapsulates
this pattern.
* Fix a race: To ensure no new stale write operations occur during
cleanup, a fence check before start_write() was previously used.
However, yields in several code paths between the check and
start_write() made it non-atomic, allowing a stale operation to slip in
if the fence_version was updated in between.
* Optimize waiting: We do not need to wait for all operations—only for
vnode-based, non-local tables with versions smaller than the current
fence_version.
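The pattern described in the list above can be sketched with Python's asyncio. This is a hypothetical illustration, not the storage_proxy implementation: `FencedStore`, `StaleTopologyError` and the counter of pending writes are invented names. It shows a fence check and the registration of the operation done with no suspension point in between, followed by a second check after the write completes.

```python
import asyncio
from typing import Awaitable, Callable


class StaleTopologyError(Exception):
    """Raised when a request carries a version older than the fence."""


class FencedStore:
    def __init__(self) -> None:
        self.fence_version = 0
        self.pending_writes = 0

    def _check_fence(self, request_version: int) -> None:
        if request_version < self.fence_version:
            raise StaleTopologyError(f"{request_version} < {self.fence_version}")

    async def run_fenceable_write(
        self, request_version: int, write: Callable[[], Awaitable[str]]
    ) -> str:
        # No await between the check and the registration, so a fence
        # update cannot interleave here on a single-threaded event loop.
        self._check_fence(request_version)
        self.pending_writes += 1
        try:
            result = await write()
        finally:
            self.pending_writes -= 1
        # The fence may have advanced while the write was in flight.
        self._check_fence(request_version)
        return result


async def demo() -> tuple:
    store = FencedStore()

    async def plain_write() -> str:
        return "applied"

    async def write_during_fence_bump() -> str:
        store.fence_version = 2  # simulates the fence advancing mid-write
        return "applied"

    first = await store.run_fenceable_write(1, plain_write)
    try:
        second = await store.run_fenceable_write(1, write_during_fence_bump)
    except StaleTopologyError:
        second = "fenced"
    return first, second
```

Running `asyncio.run(demo())` exercises both paths: the first write is accepted, while the second is rejected by the check that runs after the write.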
Future commits will extend update_fence_version, and it is simpler to do
so if the function resides in storage_proxy. Additionally, fence_version
is the only field this function accesses, and it is used solely within
storage_proxy, making this change natural on its own.
The operation must be held during the local write. Before this commit,
its scope ended after returning from apply_locally(), so it
did not actually provide any protection.
handle_write() is invoked from receive_mutation_handler() and
handle_paxos_learn(), and both previously performed a fence check in
apply_fn. This commit hoists the fence check into handle_write() to
reduce code duplication.
Additionally, move start_write() after get_schema_for_write(), since
there is no need to hold the operation while querying the schema.
As noted in the code comments, start_write() does not need to be held
during counter replication; it is required only while performing local
storage modifications. Move the start_write() call and the fence
check down to mutate_counter_on_leader_and_replicate().
Additionally, mutate_counters_on_leader() is updated to check for
possible stale_topology_exception() and properly package them
in the resulting exception_variant structure.
The flush_all_tables() call ensures that no obsolete, cleanup-eligible
writes remain in the commitlog. This does not need to run under the
group0 lock, so move it outside.
Also, run await_pending_writes() before flush_all_tables(), since
pending writes may include data that must be cleaned up.
Finally, add more detailed info-level logs to trace the stages of the
cleanup procedure.
Cleanup needs a barrier to make sure that no request coordinators
are sending requests to old replicas/ranges that we're going to cleanup.
For example, during node bootstrap, the cleanup
process on replicas must be protected against coordinators running
write_both_read_new and sending requests to old ranges.
We run a barrier to ensure that most data-plane requests with the old
topology finish before cleanup starts. At the same time, we do not want
to block cleanup if the barrier fails on some replicas. Once the fence
is committed to group0, we can safely proceed, since any late request
with the old topology will be fenced out on the replica.
The test for this case is added in a separate commit
"test_automatic_cleanup: add test_cleanup_waits_for_stale_writes"
Rename start_cleanup -> start_vnodes_cleanup for clarity.
Pass topology_request and server_id to start_vnodes_cleanup; we will
need them for better logging later.
Cleanup needs a barrier. For example, during node bootstrap, the cleanup
process on replicas must be protected against coordinators running
write_both_read_new and sending requests to old ranges.
We run a barrier to ensure that most data-plane requests with the old
topology finish before cleanup starts. At the same time, we do not want
to block cleanup if the barrier fails on some replicas. Once the fence is
committed to group0, we can safely proceed, since any late request with
the old topology will be fenced out on the replica.
To support this, introduce a "fenced" flag. The client can pass a pointer
to a bool, which will be set to true after the new fenced_version is
committed.
The patch e34deb72f9 (repair: Rename incremental mode name)
missed one place that references the removed regular mode name.
Fixes #26503
Closes scylladb/scylladb#26660
Group0 tombstone GC considers only the current group 0 members
while computing the group 0 tombstone GC time. It's not enough
because in the Raft-based recovery procedure, there can be nodes
that haven't joined the current group 0 yet, but they have belonged
to a different group 0 and thus have a non-empty group 0 state ID.
The current code can cause a data resurrection in group 0 tables.
We fix this issue in this PR and add a regression test.
This issue was uncovered by `test_raft_recovery_entry_loss`, which
became flaky recently. We skipped this test for now. We will unskip
it in a following PR because it's skipped only on master, while we
want to backport this PR.
Fixes #26534
This PR contains an important bugfix, so we should backport it
to all branches with the Raft-based recovery procedure (2025.2
and newer).
Closes scylladb/scylladb#26612
* github.com:scylladb/scylladb:
test: test group0 tombstone GC in the Raft-based recovery procedure
group0_state_id_handler: remove unused group0_server_accessor
group0_state_id_handler: consider state IDs of all non-ignored topology members
The universalasync.wrap function doesn't preserve the
type information, which confuses the VS Code Pylance
plugin and makes code navigation hard.
In this commit we fix the problem by adding a typed
wrapper around universalasync.wrap.
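The idea of a typed wrapper can be sketched as follows. `untyped_wrap` is a stand-in invented for this example (it is not the universalasync API), and `typed_wrap` and `Client` are likewise hypothetical; the point is only that a `TypeVar` lets a type checker carry the argument's type through to the result.

```python
from typing import Any, TypeVar

T = TypeVar("T")


def untyped_wrap(obj: Any) -> Any:
    """Stand-in for a wrapper whose signature erases type information."""
    return obj


def typed_wrap(obj: T) -> T:
    """Same runtime behaviour, but the declared return type follows the argument."""
    return untyped_wrap(obj)


class Client:
    def ping(self) -> str:
        return "pong"


# A type checker infers `Client` here, so navigation and completion work.
client = typed_wrap(Client())
```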
Fixes: scylladb/scylladb#26639
The guard should stop refreshing the ERM when the number of tablets
changes. Tablet splits or merges invalidate the tablet_id field
(_tablet), which means the guard can no longer correctly protect
ongoing operations from tablet migrations.
Fixes scylladb/scylladb#26437
Debugging global barrier issues is difficult without these logs.
Since barriers do not occur frequently, increasing the log level should not produce excessive output.
Among other things, the merge includes the patch "http: add "Connection:
close" header to final server response.". This fixes #26298: A missing
response header meant that a test's client code sometimes didn't notice
that the server closed the connection (since the client didn't need to
use the connection again), which made one test flaky.
* seastar bd74b3fa...63900e03 (6):
> Merge 'Rework output_stream::slow_write()' from Pavel Emelyanov
output_stream: Fix indentation of the slow_write() method
output_stream: Remove pointless else
output_stream: Replace std::swap with std::exchange
output_stream: Unify some code-paths of slow_write()
> Merge 'Deprecate in/out streams move-assignment operator' from Pavel Emelyanov
iostream: Deprecate input/output stream default constructor and move-assignment operator
test: Sub-split test-cases
test: Don't reuse output_stream in file demo
test: Keep input_/output_stream as optional
util: Construct file_data_source in with_file_input_stream()
websocket: Construct in/out in initializer list
rpc: Wrap socket and buffers
> scripts/perftune.py: detect corrupted NUMA topology information
> Merge 'memory, smp: support more than 256 shards' from Avi Kivity
reactor, smp: allocate smp queues across all shards
memory: increase maximum shard count
memory: make cpu_id_shift and related mask dynamic
resource, memory: move memory limit calculation to memory.cc
resource: don't error if --overprovisioned and asking for more vcpus than available
> Merge 'Update perf_test text output, make columns selectable' from Travis Downs
perf_tests: enhance text output
perf_test_tests: add some check_output tests
When a `CreateTable` request creates GSI/LSI together with the base table,
the base table is empty and we don't need to actually build the view.
In tablet-based keyspaces we can simply skip creating view building tasks
and mark the view build status as SUCCESS on all nodes. Then, the view
building worker on each node will mark the view as built in
`system.built_views` (`view_building_worker::update_built_views()`).
Vnode-based keyspaces will use the "old" logic of view builder, which
will process the view and mark it as built.
Fixes scylladb/scylladb#26615
This fix should be backported to 2025.4.
Closes scylladb/scylladb#26657
* github.com:scylladb/scylladb:
test/alternator/test_tablets: add test for GSI backfill with tablets
test/alternator/test_tablets: add reproducer for GSI with tablets
alternator/executor: instantly mark view as built when creating it with base table
While there is a Docker interface for Python, using it would require
dealing with docker-in-docker issues etc. This uses plain subprocess and
stream parsing instead. It is meant to provide enough flexibility for all
our Docker mock server needs.
Other than patching Scylla sinks to implement new data_sink_impl::put(std::span<temporary_buffer>) overload, the PR changes transport write_response() method to stop using output_stream::write(scattered_message) because it's also gone.
Using newer seastar API, no need to backport
Closes scylladb/scylladb#26592
* github.com:scylladb/scylladb:
code: Fix indentation after previous patch
code: Switch to seastar API level 9
transport: Open-code invoke_with_counting into counting_data_sink::put
transport: Don't use scattered_message
utils: Implement memory_data_sink::put(net::packet)
The test should pass without the fix for scylladb/scylladb#26615,
because `executor::update_table()` uses
`service::prepare_new_view_announcement()`, which creates view building
tasks for the view.
But it's better to add this test.
Rewrite wait_for_first_completed to return only the first completed task and
to guarantee that all cancelled and finished tasks are awaited.
Use wait_for_first_completed to avoid falsely passing tests in the future and
issues like #26148.
Use gather_safely to await tasks and remove the warning that a coroutine was
not awaited.
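A minimal asyncio sketch of this behaviour is shown below. It is not the helper from the test framework: the function here only reuses the name for illustration, and its body is an assumption about what "return the first completed task and await the rest" could look like.

```python
import asyncio
from typing import Any, Coroutine, Iterable


async def wait_for_first_completed(coros: Iterable[Coroutine[Any, Any, Any]]) -> Any:
    """Return the result of the first task to finish; cancel and await the rest."""
    tasks = [asyncio.ensure_future(c) for c in coros]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    # Await every task so that cancellations are observed and no task or
    # exception is left unretrieved.
    await asyncio.gather(*tasks, return_exceptions=True)
    # Pick the earliest-listed task among those that finished.
    first = next(t for t in tasks if t in done)
    return first.result()


async def demo() -> str:
    async def fast() -> str:
        await asyncio.sleep(0.01)
        return "fast"

    async def slow() -> str:
        await asyncio.sleep(10)
        return "slow"

    return await wait_for_first_completed([fast(), slow()])
```

In `demo()`, the slow task is cancelled and awaited before the function returns, so nothing keeps running in the background after the caller moves on.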
Closes scylladb/scylladb#26435
Some environments have corrupted NUMA topology information, such as
some instance types on AWS EC2 with specific Linux kernel images.
In such environments, we cannot get HW information correctly from hwloc,
so we cannot proceed with the optimization in perftune.
To avoid a script error, check the NUMA topology information and skip
running perftune if the information is corrupted.
Related scylladb/seastar#2925
Closes scylladb/scylladb#26344
When a `CreateTable` request creates GSI/LSI together with the base table,
the base table is empty and we don't need to actually build the view.
In tablet-based keyspaces we can simply skip creating view building tasks
and mark the view build status as SUCCESS on all nodes. Then, the view
building worker on each node will mark the view as built in
`system.built_views` (`view_building_worker::update_built_views()`).
Vnode-based keyspaces will use the "old" logic of view builder, which
will process the view and mark it as built.
Fixes scylladb/scylladb#26615
`shared_ptr<abstract_write_response_handler>` instances are captured in the `lmutate` and `rmutate` lambdas of `send_to_live_endpoints()`. As a result, an `abstract_write_response_handler` object may outlive its removal from the `storage_proxy::_response_handlers` map -> `cancel_all_write_response_handlers()` doesn't actually wait for requests completion -> `sp::drain_on_shutdown()` doesn't guarantee all requests are drained -> `sp::stop_remote()` completes too early and `paxos_store` is destroyed while LWT local writes might still be in progress. In this PR we introduce a `write_handler_destroy_promise` to wait for such pending instances in `cancel_write_handlers()` and `cancel_all_write_response_handlers()` to prevent the `use-after-free`.
A better long-term solution might be to replace `shared_ptr` with `unique_ptr` for `abstract_write_response_handler` and use a separate gate to track the `lmutate/rmutate` lambdas. We do not actually need to wait for these lambdas to finish before sending a timeout or error response to the client, as we currently do in `~abstract_write_response_handler`.
Fixes scylladb/scylladb#26355
backport: need to be backported to 2025.4 since #26355 is reproduced on LWT over tablets
Closes scylladb/scylladb#26408
* github.com:scylladb/scylladb:
test_tablets_lwt: add test_lwt_shutdown
storage_proxy: wait for write handler destruction
storage_proxy: coroutinize cancel_write_handlers
storage_proxy: cancel_write_handlers: don't hold a strong pointer to handler
This patch removes the dependence of vector search module
on the cql3 module by moving the contents of cql3/type_json.hh
to types/json_utils.hh and removing the usage of cql3 primary_key
object in vector_store_client. We also make the needed adjustments
to files that were previously using the aforementioned type_json.hh
file.
This fixes the circular dependency cql3 <-> vector_search.
Closes scylladb/scylladb#26482
Add a simple test verifying our changes for the compatible CDC schema.
The test checks we can write to a table with CDC enabled after ALTER and
after node restart.
In the CDC log transformer, when augmenting a base mutation, use the CDC
log schema that is compatible with the base schema, if set.
Now that the base schema has a pointer to its CDC schema, we can use it
instead of getting the current schema from the db, which may not be
compatible with the base schema.
The compatible CDC schema may not be set if the cluster is not using
raft mode for schema. In this case, we maintain the previous behavior.
When creating a schema for a non-CDC table in the schema_applier, find
its CDC schema that we created previously in the same operation, if any,
and create the schema with a pointer to the CDC schema.
We use the fact that for a base table with CDC enabled, its CDC schema
is created or altered together in the same group0 operation.
Similarly, in schema_tables, when creating table schemas from the
schema tables, first create all schemas that don't have CDC enabled,
then create schemas that have CDC enabled by extending them with the
pointer to the CDC schema that we created before.
There are a few additional cases where we create schemas and need to
decide how to handle them.
When loading a schema from schema tables in the schema_loader we decide
not to set the CDC schema, because this schema is mostly used for tools
and it's not used for generating CDC mutations.
When transporting a schema by RPC in the migration manager, we don't
transport its CDC schema, and we always set it to null. Because we use
raft we expect this shouldn't have any effect, because the schema is
synchronized through raft and not through the RPC.
Previously in the schema applier we had two maps of schema_mutations,
for tables and for views. Now create another map for CDC tables by
extracting them from the non-views tables map.
We maintain the previous behavior by applying each operation that's done
on the tables map, to the CDC map as well.
Later we will want to handle CDC and non-CDC tables differently. We want
to be able to create all CDC schemas first, so when we create the
non-CDC tables we can create them with a pointer to their CDC schemas.
Add to the schema object a member that points to the CDC schema object
that is compatible with this schema, if any.
The compatible CDC schema is created and altered with its base schema in
the same group0 operation.
When generating CDC log mutations for some base mutation we want them to
be created using a compatible schema that has a CDC column corresponding
to each base column. This change will allow us to find the right CDC
schema given a base mutation.
We also update the relevant structures in the schema registry that are
related to learning about schemas and transporting schemas across
shards or nodes.
When transporting a schema as frozen_schema, we need to transport the
frozen cdc schema as well, and set it again when unfreezing and
reconstructing the schema.
When adding a schema to the registry, we need to ensure its CDC schema
is added to the registry as well.
Currently we always set the CDC schema to nullptr and maintain the
previous behavior. We will change it in a later commit. Until then, we
mark all places where CDC schema is passed clearly so we don't forget
it.
Remove the _base_info member from global_schema_ptr, and use the
base_info we have stored in the schema registry entry instead.
Currently when constructing a global_schema_ptr from a schema_ptr it
extracts and stores the base_info from the schema_ptr. Later it uses it
to reconstruct the schema_ptr, together with the frozen schema from the
schema registry entry.
But we can use the base_info that is already stored in the
schema registry entry.
Change the schema loader type in the schema_registry to return an
extended_frozen_schema instead of view_schema_and_base_info, and
remove view_schema_and_base_info which is not used anymore.
The casting between them is trivial.
The schema_registry_entry holds a frozen_schema and a base_info. The
base_info is extracted from the schema_ptr on load of a schema_ptr, and
it is used when unfreezing the schema.
But this is exactly what extended_frozen_schema is doing, so we can
just store an object of this type in the schema_registry_entry.
This makes the code simpler because the schema registry doesn't need to
be aware of the base_info.
Currently we construct a frozen schema with base info in a few places, and
the caller is responsible for constructing the frozen schema and extracting
the base info if it's a view table.
We change it to make it simpler and remove the burden from the caller.
The caller can simply pass the schema_ptr, and the constructor for
extended_frozen_schema will construct the frozen schema and extract
the additional info it needs. This will make it easier to add additional
fields, and reduces code duplication.
We also make temporary castings between extended_frozen_schema and
view_schema_and_base_info for the transition, which are trivial, until
they are combined to a single type.
This commit starts a series of refactoring commits of the frozen_schema
to reduce duplication and make it easier to extend.
Currently there are two essentially identical types,
frozen_schema_with_base_info and view_schema_and_base_info in the
schema_registry that hold a frozen_schema together with a base_info for
view schemas.
Their role is to pass around a frozen schema together with additional
info that is extracted from the schema and passed around with it when
transporting it across shards or nodes, and is needed for
reconstructing it, and it is not part of the schema mutations.
Our goal is to combine them to a single type that we will call
extended_frozen_schema.
The recursive call to alter_system_schema() was missing the await
keyword, which meant the coroutine was never actually executed and
the test wasn't doing what it was supposed to do.
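The effect of a missing `await` on a recursive coroutine call can be reproduced in a few lines. The function below is a hypothetical stand-in, not the test's alter_system_schema(): calling an `async def` function without awaiting it only creates a coroutine object, and its body never runs.

```python
import asyncio
from typing import List


async def alter_schema(depth: int, log: List[int]) -> None:
    log.append(depth)
    if depth > 0:
        # Dropping this `await` would create a coroutine object and discard it.
        await alter_schema(depth - 1, log)


def call_without_await() -> List[int]:
    log: List[int] = []
    coro = alter_schema(2, log)  # nothing has executed yet
    coro.close()                 # discard explicitly to avoid the "never awaited" warning
    return log


def call_with_await() -> List[int]:
    log: List[int] = []
    asyncio.run(alter_schema(2, log))
    return log
```

`call_without_await()` returns an empty log, while `call_with_await()` records every recursion level.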
Not backporting: Test fix only.
Closes scylladb/scylladb#26623
Add `serve` impl that does not mess with signals, and shutdown
that does not mess with threads. Also speed up standalone shutdown
to make boost tests less slow.
In clang 21, the -fextend-variable-liveness option was made
default [1] with -Og. It helps reduce "optimized out" problems while
debugging.
However, it conflicts [2] with coroutines.
To prevent problems during the upgrade to Clang 21, disable the option.
[1] 36af7345df
[2] https://github.com/llvm/llvm-project/issues/163007
Closes scylladb/scylladb#26573
Apply two main changes to the s3_client error handling
1. Add a loop to s3_client's `make_request` for the case when the retry strategy will not help since the request itself has to be updated. For example, authentication token expiration or timestamp on the request header
2. Refine the way we handle exceptions in the `chunked_download_source` background fiber, now we carry the original `exception_ptr` and also we wrap EVERY exception in `filler_exception` to prevent retry strategy trying to retry the request altogether
Fixes: https://github.com/scylladb/scylladb/issues/26483
Should be ported back to 2025.3 and 2025.4 to prevent deadlocks and failures in these versions
Closes scylladb/scylladb#26527
* github.com:scylladb/scylladb:
s3_client: tune logging level
s3_client: add logging
s3_client: improve exception handling for chunked downloads
s3_client: fix indentation
s3_client: add max for client level retries
s3_client: remove `s3_retry_strategy`
s3_client: support high-level request retries
s3_client: just reformat `make_request`
s3_client: unify `make_request` implementation
Load-and-stream is broken when running concurrently to the finalization step of tablet split.
Consider this:
1) split starts
2) split finalization executes barrier and succeeds
3) load-and-stream runs now, starts writing sstable (pre-split)
4) split finalization publishes changes to tablet metadata
5) load-and-stream finishes writing sstable
6) sstable cannot be loaded since it spans two tablets
Two possible fixes (maybe both):
1) load-and-stream waits for topology to quiesce
2) perform split compaction on sstable that spans both sibling tablets
This patch implements fix 1. By waiting for the topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance the coordinator is handling some topology operation like
split finalization.
Fixes https://github.com/scylladb/scylladb/issues/26455.
Closes scylladb/scylladb#26456
* github.com:scylladb/scylladb:
test: Add reproducer for l-a-s and split synchronization issue
sstables_loader: Synchronize tablet split and load-and-stream
In 380f243986 we added support for rack
lists in replication options. Drivers which are not prepared to parse
that (as of now, all of them), will not create metadata object for
that keyspace. This breaks, for example, the "copy to/from" cqlsh
command. Potentially other things too.
To fix that, keep the "replication" column in the old format, and
store numeric RF there, which corresponds to the number of
replicas. Accurate options in the new format are put in
"replication_v2".
We set replication_v2 in the schema only when it differs from the old
"replication" so that the new column is not set during upgrade,
otherwise downgrade would fail. Partition tombstone is added to ensure
that pre-alter replication_v2 value is deleted on alters which change
replication to a value which is the same as the post-alter
"replication" value.
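The relationship between the two columns can be sketched as below. The data shapes are assumptions made for illustration (a rack list is modelled as a Python list and the numeric RF as a string), and `split_replication` is a hypothetical helper, not Scylla code.

```python
from typing import Dict, Optional, Tuple


def split_replication(options: Dict[str, object]) -> Tuple[Dict[str, object], Optional[Dict[str, object]]]:
    """Return (legacy "replication", "replication_v2" or None)."""
    legacy: Dict[str, object] = {}
    for key, value in options.items():
        if isinstance(value, list):
            # A rack list is reduced to a numeric RF: the number of replicas.
            legacy[key] = str(len(value))
        else:
            legacy[key] = value
    # replication_v2 is set only when it differs from the old format.
    v2 = dict(options) if legacy != options else None
    return legacy, v2


with_racks = {"class": "NetworkTopologyStrategy", "dc1": ["r1", "r2", "r3"]}
numeric_only = {"class": "NetworkTopologyStrategy", "dc1": "3"}
```

With these inputs, `with_racks` yields a legacy map holding `"3"` for `dc1` plus a populated second value, while `numeric_only` yields no second value at all, which matches the stated goal of leaving the new column unset when nothing differs.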
Fixes #26415
Closes scylladb/scylladb#26429
Tablet merge of base tables is only safe if there is at most one replica in each rack. For more details on why it is the case please see scylladb/scylladb#17265. If the rf-rack-valid-keyspaces is turned on, this condition is satisfied, so allow it in that case.
Fixes: scylladb/scylladb#26273
Marked for backport to 2025.4 as MVs are getting un-experimentaled there.
Closes scylladb/scylladb#26278
* github.com:scylladb/scylladb:
test: mv: add a test for tablet merge
tablet_allocator, tests: remove allow_tablet_merge_with_views injection
tablet_allocator: allow merges in base tables if rf-rack-valid=true
Load-and-stream is broken when running concurrently to the
finalization step of tablet split.
Consider this:
1) split starts
2) split finalization executes barrier and succeeds
3) load-and-stream runs now, starts writing sstable (pre-split)
4) split finalization publishes changes to tablet metadata
5) load-and-stream finishes writing sstable
6) sstable cannot be loaded since it spans two tablets
Two possible fixes (maybe both):
1) load-and-stream waits for topology to quiesce
2) perform split compaction on sstable that spans both sibling tablets
This patch implements fix 1. By waiting for the topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance the coordinator is handling some topology operation like
split finalization.
Fixes #26455.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Currently, the data returned by `database::get_tables_metadata()` and
`database::get_token_metadata()` may not be consistent. Specifically,
the tables metadata may contain some tablet-based tables before their
tablet maps appear in the token metadata. This is going to be fixed
after issue scylladb/scylladb#24414 is closed, but for the time being
work around it by accessing the token metadata via
`table->effective_replication_map()` - that token metadata is guaranteed
to have the tablet map of the `table`.
Fixes: scylladb/scylladb#26403
Closes scylladb/scylladb#26588
shared_ptr<abstract_write_response_handler> instances are captured in
the lmutate/rmutate lambdas of send_to_live_endpoints(). As a result,
an abstract_write_response_handler object may outlive its removal from
the _response_handlers map. We use write_handler_destroy_promise to
wait for such pending instances in cancel_write_handlers() and
cancel_all_write_response_handlers() to prevent use-after-free.
A better long-term solution might be to replace shared_ptr with
unique_ptr for abstract_write_response_handler and use a separate gate
to track the lmutate/rmutate lambdas. We do not actually need to wait
for these lambdas to finish before sending a timeout or error response
to the client, as we currently do in ~abstract_write_response_handler.
Fixes scylladb/scylladb#26355
The cancel_write_handlers() method was assumed to be called in a thread
context, likely because it was first used from gossiper events, where a
thread context already existed. Later, this method was reused in
abort_view_writes() and abort_batch_writes(), where threads are created
on the fly and appear redundant.
The drain_on_shutdown() method also used a thread, justified by some
"delicate lifetime issues", but it is unclear what that actually means.
It seems that a straightforward co_await should work just fine.
A strong pointer was held for the duration of thread::yield(),
preventing abstract_write_response_handler destruction and possibly
delaying the sending of timeout or error responses to the client.
This commit removes the strong pointer. Instead, we compute the
next iterator before calling timeout_cb(), so if the handler is
destroyed inside timeout_cb(), we already have a valid next iterator.
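The iterator handling described above can be sketched in isolation. This is a minimal illustration and not the actual storage_proxy code: `handler_map` and `fire_all` are hypothetical names, and a `std::map` of callbacks stands in for the real handler container. It assumes a callback erases at most its own entry; erasing the entry that `next` points to would still be unsafe.

```cpp
#include <functional>
#include <iterator>
#include <map>
#include <vector>

// Hypothetical stand-in for the response handler map; not Scylla's real types.
using handler_map = std::map<int, std::function<void()>>;

// Calls every callback once. A callback may erase its own entry from the map.
// Returns the ids in the order they were visited.
inline std::vector<int> fire_all(handler_map& handlers) {
    std::vector<int> visited;
    for (auto it = handlers.begin(); it != handlers.end();) {
        auto next = std::next(it);  // taken before the callback can erase *it
        visited.push_back(it->first);
        auto cb = it->second;       // copy, so the callable outlives its map entry
        cb();                       // may erase the current entry
        it = next;                  // valid: std::map::erase only invalidates the erased element
    }
    return visited;
}
```

No strong reference to the entry is held across the call, so an entry that removes itself is released immediately rather than at the end of the loop iteration.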
UserIdentity is a map of two fields in GetRecords responses, which
always has the same value. It may be missing, or contain a constant
object with value `{"type": "Service", "principalId":
"dynamodb.amazonaws.com"}`. Currently, the latter is set only for
`REMOVE`s triggered by TTL.
This commit introduces two new CDC operation types: `service_row_delete`
and `service_partition_delete`, emitted in place of `row_delete` and
`partition_delete`. Alternator Streams treats them as regular `REMOVE`s,
but in addition adds the `userIdentity` field to the record.
This change may break existing Scylla libraries for reading raw CDC
tables, but we doubt that anybody has this use case.
Refs https://github.com/scylladb/scylladb/pull/26149
Refs https://github.com/scylladb/scylladb/pull/26121
Fixes https://github.com/scylladb/scylladb/issues/11523
Closes scylladb/scylladb#26460
This patch implements the changes required by the Vector Store authorization, as described in https://scylladb.atlassian.net/wiki/spaces/RND/pages/107085899/Vector+Store+Authentication+And+Authorization+To+ScyllaDB, that is:
- adding a new permission VECTOR_SEARCH_INDEXING, grantable only on ALL KEYSPACES
- allowing users with that permission to perform SELECT queries, but only on tables with a vector index
- increasing the number of scheduling groups by one to allow users to create a service level for a vector store user
- adjusting the tests and documentation
These changes are needed because the vector indexes are managed by an external service, Vector Store, which needs to read the tables to build the indexes in its memory. To follow the principle of least privilege, we want to limit the privileges of that service to a minimum. Hence the new permission, which allows SELECTs only on tables that have a vector index.
Fixes: VECTOR-201
Backport reasoning:
A backport to 2025.4 is required, because adding this only in 2026.1 can make upgrading clusters more difficult. For now, Scylla Cloud requires version 2025.4 to enable vector search and the permission is set by the orchestrator, so there is no chance that someone will try to add this permission during an upgrade. In 2026.1 it will be more difficult.
Closes scylladb/scylladb#25976
* github.com:scylladb/scylladb:
docs: adjust docs for VS auth changes
test: add tests for VECTOR_SEARCH_INDEXING permission
cql: allow VECTOR_SEARCH_INDEXING users to select
auth: add possibilty to check for any permission in set
auth: add a new permission VECTOR_SEARCH_INDEXING
Refactor the wrapping exception used in `chunked_download_source` to
prevent the retry strategy from reattempting failed requests. The new
implementation preserves the original `exception_ptr`, making the root
cause clearer and easier to diagnose.
It never worked as intended, so the credentials handling moves to the same place where we handle time skew, since we have to reauthenticate the request.
Add an option to retry S3 requests at the highest level, including
reinitializing headers and reauthenticating. This addresses cases
where retrying the same request fails, such as when the S3 server
rejects a timestamp older than 15 minutes.
Before this commit, when the underlying materialized view was created,
it didn't have the property `tombstone_gc` set to any value. That
was a bug and we fix it now.
Two reproducer tests are added for validation. They reproduce the problem
and don't pass before this commit.
Fixes scylladb/scylladb#26542
We move the code responsible for creating the schema for the underlying
materialized view of a secondary index from `index/` to `cql3/` so that
it's close to that responsible for performing `CREATE INDEX`. That's in
line with how other CQL statements are designed.
Note that the moved method is still a method of `secondary_index_manager`.
We'll make it a method of `create_index_statement` in the following
commit.
The main goal of this patch is to give more control over the creation
of the underlying view on an index to `create_index_statement.cc`.
That goal is in line with how the other statements are executed:
the schema is built in the cql3 module and only the ready schema_ptr
is passed further. That should also make the code cleaner and easier
to understand.
There are a few important things to note here:
* A call to `service::prepare_new_view_announcement` appears out of nowhere.
Aside from some validation checks and logging, that function does pretty
much the same as the pre-existing code we remove:
a. It creates Raft mutations based on the passed `view_ptr`.
b. It creates Raft mutations responsible for view building tasks.
c. It notifies about a new column family.
* We seemingly get rid of the code that creates view building tasks. That's not
true: we still do that via `service::prepare_new_view_announcement`.
That should explain why the change doesn't remove any relevant logic.
On the other hand, it might be more difficult to explain why moving the
code is correct. I'll touch on it below.
Before that, it may also be important to highlight that this commit only
affects the logic responsible for creating an index. There should be no
effect on any other part of how Scylla behaves.
---
Proving the correctness of the solution would take quite a lot of space,
so I'll only summarize it. It relies on a few things:
1. Two schema changes cannot happen in one operation. We allow for more
but only when those changes are dependent on each other and when
the additional ones are internal for Scylla, e.g. creating an index
leads to creating the underlying materialized view.
2. There are no entities or components that rely on indexes.
3. Each index is uniquely defined by the keyspace it belongs to
and the name of the index.
4. There is a bijection between rows in `system_schema.indexes`
and the currently existing indexes.
5. The name of an unnamed index depends on the name of the base table
and the names of the indexed columns. The name of an unnamed index
may have a number attached to it, but that number only depends on
the state of the schema at the time of creation of the index, and
it never changes later on. There are no other things the name of
an unnamed index depends on.
6. Scylla doesn't allow for changing any column in the base table
that has an index depending on it.
Based on that, we conclude that every existing index has exactly one
entry in `system_schema.indexes`, and the primary key of that entry
never changes.
The columns of `system_schema.indexes` that are not part of the primary
key are: `kind` and `options`. Both values are only decided at the time
of creation of an index, and currently there's no way to modify them.
That implies that there are only two events when an entry in the system
table can change: when creating an index and when dropping an index.
---
When we consider the previous place of the logic that this commit moves
to `cql3/statements/create_index_statement.cc`, it works like this:
1. We compare the sets of indexes defined on a specific table
(in the form of a structure called `index_metadata`) before and
after an operation.
2. We divide the entries into three sets: those present in both sets
and those present in only one of them.
3. We handle each of those three sets separately.
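The three-way split in the steps above amounts to ordinary set operations. The sketch below is an illustration only: it uses plain index names in a `std::set` in place of `index_metadata`, and `index_diff` and `diff_indexes` are hypothetical names rather than functions from the Scylla tree.

```cpp
#include <algorithm>
#include <iterator>
#include <set>
#include <string>

// Hypothetical result type: indexes present in both snapshots, only in the
// old one (dropped), and only in the new one (created).
struct index_diff {
    std::set<std::string> kept;
    std::set<std::string> dropped;
    std::set<std::string> created;
};

// Splits the index names of a table before and after an operation.
inline index_diff diff_indexes(const std::set<std::string>& before,
                               const std::set<std::string>& after) {
    index_diff d;
    std::set_intersection(before.begin(), before.end(), after.begin(), after.end(),
                          std::inserter(d.kept, d.kept.end()));
    std::set_difference(before.begin(), before.end(), after.begin(), after.end(),
                        std::inserter(d.dropped, d.dropped.end()));
    std::set_difference(after.begin(), after.end(), before.begin(), before.end(),
                        std::inserter(d.created, d.created.end()));
    return d;
}
```

Since an index can only be created or dropped, `kept` never needs per-entry handling in this model, which is the observation the reasoning below relies on.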
The structure `index_metadata` is a reflection of entries in
`system_schema.indexes`. It stores one more parameter -- `local` --
but its value depends on the other values of an entry, so we can ignore
it in this reasoning.
Because an index cannot be modified -- it can only be created or dropped
-- there are at most two non-empty sets: the set of new indexes and the
set of dropped indexes. Those sets are only non-empty during an operation
like `CREATE INDEX`, `DROP INDEX`, `DROP TABLE (base table)`,
`DROP KEYSPACE`. Note that it's impossible to drop an index by dropping
the underlying materialized view -- Scylla doesn't allow for that.
However, the code in `migration_manager.cc` we call
(`prepare_column_family_update_announcement`) and the code that we call
in `schema_tables.cc` (`make_update_table_mutations`) is only triggered
by *updates* related to the base table. In the context of `DROP TABLE`
or `DROP KEYSPACE`, we'd call `prepare_column_family_drop_announcement`
instead. In other words, we're only concerned with `CREATE INDEX` and
`DROP INDEX`.
---
A conclusion from this reasoning is that we only need to consider those
two situations when talking about correctness of this change. The impact
of this commit is that we may have potentially reordered mutations in the
resulting vector that will be applied to the Raft log.
The only mutations we may have reordered are the mutations responsible for
creating the underlying view and the mutations responsible for updating
columns in the base table. It's clear then that this commit brings no change
at all: we only give `cql3/statements/create_index_statement.cc` more
control over creating the underlying view.
---
We leave a remnant of the code in `db/schema_tables.cc` responsible
for dropping an index along with its underlying view. It would require
changing a bit more of the logic, and we don't need it for the rest
of this sequence of changes.
Refs scylladb/scylladb#16454
This is a follow-up of the previous fix: https://github.com/scylladb/scylladb/pull/26030
The test test_user_writes_rejection starts a 3-node cluster and
creates a large file on one of the nodes, to trigger the out-of-space
prevention mechanism, which should reject writes on that node.
It waits for the log message 'Setting critical disk utilization mode: true'
and then executes a write expecting the node to reject it.
Currently, the message is logged before the `_critical_disk_utilization`
variable is actually updated. This causes the test to fail sporadically
if it runs quickly enough.
The fix splits the logging into two steps:
1. "Asked to set critical disk utilization mode" - logged before any action
2. "Set critical disk utilization mode" - logged after `_critical_disk_utilization` has been updated
The tests are updated to wait for the second message.
Fixes https://github.com/scylladb/scylladb/issues/26004
Closes scylladb/scylladb#26392
Integrates GCP object storage as a working storage backend for scylla sstables as well as backup storage.
Adds an abstraction layer (currently designed very heavily around the s3 client interface and usage) to allow the "storage" etc. layers of sstable management to pick transparently between "s3" and "gs" providers.
This modifies the scylla config such that endpoints can optionally (through a "type" param) refer to a GS backend.
Similarly with storage_options.
Also adds some IO wrapping primitives to make it more feasible to place some logic at a mid level of the implementation stack (such as making networked storage files, ranged reading etc).
Test s3 fixture is replaced (where appropriate) with an `object_storage` fixture that multiplexes the test across both backends.
Unit tests are duplicated; the GS versions use a boost test fixture for GCS, which defaults to a local fake.
Fixes #25359
Fixes #26453
Closes scylladb/scylladb#26186
* github.com:scylladb/scylladb:
docs::dev::object_storage: Add some initial info on GS storage
docs/dev: Add mention of (nested) docker usage in testing.md
sstables::object_storage_client: Forward memory limit semaphore to GS instance
utils::gcp::object_storage: Add optional memory limits to up/download
sstables::object_storage_client: Add multi-upload support for GS
utils::gcp::storage: Add merge objects operation
test_backup/test_basic: Make tests multiplex both s3 and gs backends
test::cluster::conftest: Add support for multiple object storage backends
boost::gcs_storage_test: reindent
boost::gcs_storage_test: Convert to use fixture
tests::boost: Add GS object storage cases to mirror S3 ones
tests::lib::gcs_fixture: Add a reusable test fixture for real/fake GS/GCS
tests::lib::test_utils: Add overloads/helpers for reading and (temp) writing env
sstables::object_storage_client: Add google storage implementation
test_services: Allow testing with GS object storage parameters
utils::gcp::gcp_credentials: Add option to create uninitialized credentials
utils::gcp::object_storage: Make create_download_source return seekable_data_source
utils::gcp::object_storage: Add defensive copies of string_view params
utils::gcp::object_storage: Add missing retry backoff increate
utils::gcp::object_storage: Add timestamp to object listing
utils::gcp::object_storage: Add paging support to list_objects
object_storage_client: Add object_name wrapper type
utils::gcp::object_storage: Add optional abort_source
utils::rest::client: Add abort_source support
sstables: Use object_storage_client for remote storage
sstables::object_storage_client: Add abstraction layer for OS cliens (s3 initial)
s3::upload_progress: Promote to general util type
storage_options: Abstract s3 to "object_storage" and add gs as option
sstables::file_io_extension: Change "creator" callback to just data_source
utils::io-wrappers: Add ranged data_source
utils::io-wrappers: Add file wrapper type for seekable_source
utils::seekable_source: Add a seekable IO source type
object_storage_endpoint_param: Add gs storage as option
config: break out object_storage_endpoint_param preparing for multi storage
It's not enough to consider only the current group 0 members. In the
Raft-based recovery procedure, there can be nodes that haven't joined
the current group 0 yet, but they have belonged to a different group 0
and thus have a non-empty group 0 state ID.
We fix this issue in this commit by considering topology members
instead.
We don't consider ignored nodes as an optimization. When some nodes are
dead, the group 0 state ID handler won't have to wait until all these
nodes leave the cluster. It will only have to wait until all these nodes
are ignored, which happens at the beginning of the first
removenode/replace. As a result, tombstones of group 0 tables will be
purged much sooner.
We don't rename the `group0_members` variable to keep the change
minimal. There seems to be no precise and succinct name for the used set
of nodes anyway.
We use `std::ranges::join_view` in one place because:
- `std::ranges::concat` will become available in C++26,
- `boost::range::join` is not a good option, as there is an ongoing
effort to minimize external dependencies in Scylla.
llvm recently updated [1] their coroutine debugging instructions.
They now recommend looking up the variable __coro_frame in the coroutine
function rather than constructing the name of the coroutine frame type
from the ramp function plus __coro_frame_ty.
Since the latter method no longer works with Clang 21 (I did not check
why), and since the former method is blessed as being more compatible,
switch to the recommended method. Since it works with both Clang 20 and
Clang 21, it future proofs the script.
[1] 6e784afcb5
Closes scylladb/scylladb#26590
The `compaction_strategy_state` class holds strategy specific state via
a `std::variant` containing different state types. When a compaction
strategy performs compaction, it retrieves a reference to its state from
the `compaction_strategy_state` object. If the table's compaction
strategy is ALTERed while a compaction is in progress, the
`compaction_strategy_state` object gets replaced, destroying the old
state. This leaves the ongoing compaction holding a dangling reference,
resulting in a use after free.
Fix this by using `seastar::shared_ptr` for the state variant
alternatives (`leveled_compaction_strategy_state_ptr` and
`time_window_compaction_strategy_state_ptr`). The compaction strategies
now hold a copy of the shared_ptr, ensuring the state remains valid for
the duration of the compaction even if the strategy is altered.
The `compaction_strategy_state` itself is still passed by reference and
only the variant alternatives use shared_ptrs. This allows ongoing
compactions to retain ownership of the state independently of the
wrapper's lifetime.
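The ownership change can be illustrated with a small sketch. It is not the Scylla code: it uses `std::shared_ptr` in place of `seastar::shared_ptr`, and the struct and function names are hypothetical placeholders for the real state types.

```cpp
#include <memory>
#include <variant>

// Hypothetical placeholders for the per-strategy state types.
struct leveled_state { int level = 0; };
struct time_window_state { int windows = 0; };

using leveled_state_ptr = std::shared_ptr<leveled_state>;
using time_window_state_ptr = std::shared_ptr<time_window_state>;

// Hypothetical wrapper: the variant alternatives are shared pointers.
struct strategy_state {
    std::variant<leveled_state_ptr, time_window_state_ptr> value;
};

// A running compaction takes its own copy of the pointer instead of a
// reference into the wrapper, so replacing the wrapper cannot free the state.
inline leveled_state_ptr begin_leveled_compaction(const strategy_state& s) {
    return std::get<leveled_state_ptr>(s.value);
}
```

If the wrapper is later overwritten, as an ALTER would do, the copy returned by `begin_leveled_compaction` is the last owner and the state is destroyed only when the compaction drops it.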
The method `maybe_wait_for_sstable_count_reduction()`, when retrieving
the list of sstables for a possible compaction, holds a reference to the
compaction strategy. If the strategy is updated during execution, it can
cause a use after free issue. To prevent this, hold a copy of the
compaction strategy so it isn’t yanked away during the method’s
execution.
Fixes #25913
Issue probably started after 9d3755f276, so backport to 2025.4
Closes scylladb/scylladb#26593
* github.com:scylladb/scylladb:
compaction: fix use after free when strategy is altered during compaction
compaction/twcs: pass compaction_strategy_state to internal methods
compaction_manager: hold a copy to compaction strategy in maybe_wait_for_sstable_count_reduction
* tools/cqlsh ff3f572...f852b1f5 (2):
> Add LZ4 as a required package - so ScyllaDB Python driver could use LZ4 compression
> github actions: replace macos-13 with macos-15-intel
Closes scylladb/scylladb#26608
When requesting repair for tablets of a colocated table, the request
fails with an error. Improve the error message to show the table names
instead of table IDs, because the table names are more useful for users.
Fixes scylladb/scylladb#26567
Closes scylladb/scylladb#26568
This patch series introduces several tests that check the number of exceptions that happen during various replica operations. The goal is to have a set of tests that can catch situations where the number of exceptions per operation increases. This makes exception-throw regressions easier to catch.
The tests cover apply counter update and apply functionalities in the database layer.
There are more paths that can be checked, like various semaphore wait timeouts located deeper in the code. This set of tests does not cover all code paths.
Fixes #18164
This is an improvement. No backport needed.
Closes scylladb/scylladb#25992
* github.com:scylladb/scylladb:
test: cluster: test replica write timeout
database: parameterize apply_counter_update_delay_5s injector value
test: cluster: test replica exceptions - test rate limit exceptions
This PR adds per-table histograms to Alternator that record the sizes of the items involved in an operation, for each of the operations: `GetItem`, `PutItem`, `DeleteItem`, `UpdateItem`, `BatchGetItem`, `BatchWriteItem`. If read-before-write wasn't performed (i.e. it was not needed by the operation and the flag `alternator_force_read_before_write` was disabled), then we log the sizes of the items that are in the request. Also, `UpdateItem` logs the maximum of the update size and the existing item size. We'll change that in a follow-up PR.
Fixes: #25143
Closes scylladb/scylladb#25529
* github.com:scylladb/scylladb:
alternator: Add UpdateItem and BatchWriteItem response size metrics
alternator: Add PutItem and DeleteItem response size metrics
alternator: Add BatchGetItem response size metrics
alternator: Add GetItem response size metrics
alternator/test: Add more context to test_metrics.py asserts
The `compaction_strategy_state` class holds strategy specific state via
a `std::variant` containing different state types. When a compaction
strategy performs compaction, it retrieves a reference to its state from
the `compaction_strategy_state` object. If the table's compaction
strategy is ALTERed while a compaction is in progress, the
`compaction_strategy_state` object gets replaced, destroying the old
state. This leaves the ongoing compaction holding a dangling reference,
resulting in a use after free.
Fix this by using `seastar::shared_ptr` for the state variant
alternatives (`leveled_compaction_strategy_state_ptr` and
`time_window_compaction_strategy_state_ptr`). The compaction strategies
now hold a copy of the shared_ptr, ensuring the state remains valid for
the duration of the compaction even if the strategy is altered.
The `compaction_strategy_state` itself is still passed by reference and
only the variant alternatives use shared_ptrs. This allows ongoing
compactions to retain ownership of the state independently of the
wrapper's lifetime.
Fixes #25913
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
During TWCS compaction, multiple methods independently fetch the
compaction_strategy_state using get_state(). This can lead to
inconsistencies if the compaction strategy is ALTERed while the
compaction is in progress.
This patch fixes a part of this issue by passing down the state to the
lower level methods as parameters instead of fetching it repeatedly.
Refs #25913
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The method `maybe_wait_for_sstable_count_reduction()`, when retrieving
the list of sstables for a possible compaction, holds a reference to the
compaction strategy. If the strategy is updated during execution, it can
cause a use after free issue. To prevent this, hold a copy of the
compaction strategy so it isn’t yanked away during the method’s
execution.
Refs #26546
Refs #25913
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
This patch introduces test `test_replica_database_apply_timeout`.
It tests timeout on database write. The test uses error injection
that returns timeout error if the injection `database_apply_force_timeout`
is enabled.
Refs #18164
Parameterize `apply_counter_update_delay_5s` injector value. Instead of
sleeping 5s when the injection is active, read parameter value that
specifies sleep duration. To reflect these changes, it is renamed to
`apply_counter_update_delay_ms` and the sleep duration is specified in
milliseconds.
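As a rough sketch of what reading the sleep duration from a parameter can look like, the helper below parses a millisecond value from a string map. It does not use Scylla's error injection API: `injection_params`, `delay_from_params`, and the key name passed by the caller are all assumptions made for illustration.

```cpp
#include <chrono>
#include <cstddef>
#include <exception>
#include <map>
#include <optional>
#include <string>

// Hypothetical parameter bag attached to an enabled injection.
using injection_params = std::map<std::string, std::string>;

// Returns the delay stored under `key`, or nullopt if the parameter is
// missing, negative, or not a plain integer number of milliseconds.
inline std::optional<std::chrono::milliseconds>
delay_from_params(const injection_params& params, const std::string& key) {
    const auto it = params.find(key);
    if (it == params.end()) {
        return std::nullopt;
    }
    try {
        std::size_t parsed = 0;
        const long long ms = std::stoll(it->second, &parsed);
        if (parsed != it->second.size() || ms < 0) {
            return std::nullopt;  // trailing characters such as "5s" are rejected
        }
        return std::chrono::milliseconds(ms);
    } catch (const std::exception&) {
        return std::nullopt;      // not a number, or out of range
    }
}
```

A test can then choose a short delay where it only needs ordering and a long one where it needs a timeout, instead of always paying a fixed 5 s.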
Refs #18164
This patch introduces two tests for `replica::rate_limit_exception`.
One test is for write/apply limit, the other one for read/query limit.
The tests check the number of rate limit errors reported and the
number of cpp exceptions reported. If somebody adds an exception
throw on the rate limit paths, this test will catch it and fail.
Refs #18164
This PR contains various improvements in the recovery procedure
tests, mostly `test_raft_recovery_user_data`:
- decreasing the running time,
- some simplifications,
- making sure group 0 majority is lost when expected.
These are not critical test changes, so no need to backport.
Closes scylladb/scylladb#26442
* github.com:scylladb/scylladb:
test: assert that majority is lost in some tests of the recovery procedure
test: rest_client: add timeout support for read_barrier
test: test_raft_recovery_user_data: lose majority when killing one dc
test: test_raft_recovery_user_data: shutdown driver sessions
test: test_raft_recovery_user_data: use a separate driver connection for the write workload
test: test_raft_recovery_user_data: send ALTER KEYSPACE to any node
test: test_raft_recovery_user_data: bring failure_detector_timeout_in_ms back to 20 s
test: test_raft_recovery_user_data: speed up replace operations
test: stop/start servers concurrently in the recovery procedure tests
This is a minor refactoring aimed at reducing cognitive complexity of
`update_item_operation::apply`. The logic remains unchanged.
Closes scylladb/scylladb#25887
Clarified and expanded the documentation for the nodetool getendpoints command,
including detailed explanations of the --key and --key-components options.
Added examples demonstrating usage with simple and composite partition keys.
Closes scylladb/scylladb#26529
The test proceeds as follows:
- run a long DNS refresh process
- request hostname resolution with a short abort_source timer - the result
should be an empty list, because the request is aborted
The test sometimes finishes the long DNS refresh before the abort_source
fires, and then the result list is not empty.
There are two issues. First, as.reset() changes the abort_source timeout. The
patch adds a get() method to the abort_source_timeout class, so the
abort_source timeout is no longer changed. Second, a sleep may not be
reliable. The patch replaces the long sleep inside the DNS refresh lambda
with condition_variable handling, to properly signal the end of the DNS
refresh process.
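The difference between a timed sleep and explicit signalling can be sketched with the standard library. This is only an analogy: the test itself runs on Seastar and uses its condition_variable, whereas the sketch uses `std::thread` and `std::condition_variable`, and `refresh_gate` and `refresh_ends_only_after_release` are hypothetical names.

```cpp
#include <atomic>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical gate that a long-running refresh waits on instead of sleeping.
class refresh_gate {
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _released = false;
public:
    void wait() {
        std::unique_lock<std::mutex> lock(_mutex);
        _cv.wait(lock, [this] { return _released; });
    }
    void release() {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _released = true;
        }
        _cv.notify_all();
    }
};

// The "refresh" cannot finish before release(), however slow or fast the
// machine is, so checks made before release() never race with its completion.
inline bool refresh_ends_only_after_release() {
    refresh_gate gate;
    std::atomic<bool> finished{false};
    std::thread refresh([&] {
        gate.wait();
        finished = true;
    });
    const bool finished_early = finished.load();  // always false here
    gate.release();
    refresh.join();
    return !finished_early && finished.load();
}
```

With a sleep, the same check depends on the sleep being longer than everything the test does in the meantime; with the gate, the ordering is enforced by construction.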
Fixes: #26561
Fixes: VECTOR-268
It needs to be backported to 2025.4
Closes scylladb/scylladb#26566
In the new API the biggest change is to implement only the
data_sink_impl::put(span<temporary_buffer>) overload.
Encrypted file impl and sstables compress sink use the fallback_put() helper
that generates a chain of continuations, each holding a buffer.
The counting_data_sink in transport had mostly been patched to the correct
implementation by the previous patch; the change here is to replace the
vector argument with a span one.
Most other sinks just re-implement their put(vector<temporary_buffer>)
overload by iterating over the span and non-preemptively grabbing buffers
from it.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The former helper is implemented like this:
    future<> invoke_with_counting(fn) {
        if (not_needed)
            return fn();
        return futurize_invoke(something).then([fn] {
            return fn();
        }).finally(something_else);
    }
and all put() overloads are like
    future<> put(arg) {
        return invoke_with_counting([this, arg] {
            return lower_sink.put(arg);
        });
    }
The problem is that with seastar API level 9, the put() overload will
have to move the passed buffers into stable storage before preempting.
In its current implementation, when counting is needed,
invoke_with_counting links the lower_sink.put() invocation to the
futurize_invoke(something) future. Although "something" is
non-preempting and futurize_invoke() on it returns a ready future, in
debug mode ready_future.then() does preempt, and the API level 9 put()
contract will be violated.
To facilitate the switch to new API level, this patch rewrites one of
put() overloads to look like
    future<> put(arg) {
        if (not_needed) {
            return lower_sink.put(arg);
        }
        something;
        return lower_sink.put(arg).finally(something_else);
    }
Other put()-s will be removed by the next patch anyway, but this put() will
be patched and will call lower_sink.put() without preemption.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The API to put scattered_message into output_stream() is gone in seastar
API level 9, and transport is the only place in Scylla that still uses it.
The change is to put the response as a sequence of temporary_buffer-s.
This preserves the zero-copy-ness of the reply, but a few things need
care.
First, the response header frame needs to be put as a zero-copy buffer
too. Although output_stream() supports a semi-mixed mode, where zero-copy
buffers can follow the buffered writes, it won't apply here. The socket
is flushed in batched mode, so even if the first reply populates the
stream with data and flushes it, the next response may happen to start
putting the header frame before the delayed flush took place.
Second, because the socket is flushed in the batch-flush poller, the
temporary buffers that are put into it must hold the foreign_ptr with the
response object. With scattered message this was implemented with the
help of a deleter that was attached to the message; now the deleter is
shared between all buffers.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
It's going to be removed by next-after-next patch, but the next one
needs this overload implemented properly, so here it is.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The series adds an experimental flag for strongly consistent tables and extends the "CREATE KEYSPACE" DDL with a `consistency` option that allows specifying the consistency mode for the keyspace.
Closes scylladb/scylladb#26116
* github.com:scylladb/scylladb:
schema: Allow configuring consistency setting for a keyspace
db: experimental consistent-tablets option
This patchset improves the atomicity and clarity of schema application in
the presence of token metadata updates during schema changes. The primary
focus is to ensure that changes to tablet metadata are applied atomically
as part of the schema commit phase, rather than being replicated to all
cores afterward, which previously violated atomicity guarantees.
Key changes:
- Introduced pending_token_metadata to unify handling of new and existing metadata.
- Split token metadata replication into prepare and commit steps.
- Abstracted schema dependencies in storage_service to support pending schema visibility.
- Applied tablet metadata updates atomically within schema commit phase.
Backport: no, it's a new feature
Fixes: https://github.com/scylladb/scylladb/issues/24414
Closes scylladb/scylladb#25302
* github.com:scylladb/scylladb:
db: schema_applier: update tablet metadata atomically
db: replica: move tables_metadata locking to commit
storage_service: abstract schema dependecies during token metadata update
storage_service: split replicate_to_all_cores to steps
db: schema_applier: unify token_metadata loading
replica: schema_applier: obtain copy of token_metadata at the beginning of schema merge
service: fix dependencies during migration_manager startup
db: schema_applier: move pending_token_metadata to locator
db: always use _tablet_hint as condition for tablet metadata change
db: refactor new_token_metadata into pending_token_metadata
db: rename new_token_metadata to pending_token_metadata
db: schema_applier: move types storage init to merge_types func
db: schema_applier: make merge functions non-static members
db: remove unused proxy from create_keyspace_metadata
This commit bundle introduces metrics on item sizes for Alternator operations.
The new metrics are:
- `operation_size_kib op=UpdateItem`: Tracks the size of an `UpdateItem`
operation. This is calculated as the sum of the existing item's size
plus the estimated size of the updated fields.
- `operation_size_kib op=BatchWriteItem`: Tracks the total size of items
within a `BatchWriteItem` request, aggregated on a per-table basis. If
an item already exists, the logged size is the maximum of the old and
the new item size.
NOTE: Both metrics rely on read-before-write, so if the
`alternator_force_read_before_write` option is disabled, these metrics
may be incomplete and report inaccurate sizes.
This commit bundle introduces metrics on item sizes for Alternator
operations. Specifically, this commit adds `operation_size_kb`
histograms for sizes of items created or replaced by the `PutItem`
operation, and sizes of items deleted by `DeleteItem` requests. The
latter needs a read-before-write, so the metrics may be incomplete if
`alternator_force_read_before_write` is disabled.
This commit bundle introduces metrics on item sizes for Alternator
operations. Specifically, this commit adds an `operation_size_kb`
per-table histogram, which contains item sizes in BatchGetItem requests.
A size of a BatchGetItem is the sum of the sizes of all items in the
operation grouped by table. In other words, a single BatchGetItem, and
BatchWriteItem for that matter, updates the histograms for each table
that it has items in.
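The per-table grouping described above can be sketched as follows. This is an illustration only, not the Alternator implementation: `batch_item` and `batch_sizes_by_table` are hypothetical names, and sizes are kept in bytes without the conversion the histogram would apply.

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Hypothetical view of one item in a batch: table name and item size in bytes.
using batch_item = std::pair<std::string, std::size_t>;

// Sums item sizes per table, yielding one value per table touched by the
// batch. Each value would feed that table's histogram once.
inline std::map<std::string, std::size_t>
batch_sizes_by_table(const std::vector<batch_item>& items) {
    std::map<std::string, std::size_t> totals;
    for (const auto& [table, size] : items) {
        totals[table] += size;
    }
    return totals;
}
```

A batch that touches two tables therefore produces two histogram samples, one per table, rather than a single sample for the whole request.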
This commit bundle introduces metrics on item sizes for Alternator
operations. Specifically, this commit adds a per-table
`operation_size_kb` histogram, recording the sizes of the items
contained in GetItem responses.
Schema pulls should always be disabled when group 0 is used. However,
`migration_manager::disable_schema_pulls()` is never called during
a restart with `recovery_leader` set in the Raft-based recovery
procedure, which causes schema pulls to be re-enabled on all live nodes
(excluding the nodes replacing the dead nodes). Moreover, schema pulls
remain enabled on each node until the node is restarted, which could
be a very long time.
We fix this issue and add a regression test in this PR.
Fixes #26569
This is an important bug fix, so it should be backported to all branches
with the Raft-based recovery procedure (2025.2 and newer branches).
Closes scylladb/scylladb#26572
* github.com:scylladb/scylladb:
test: test_raft_recovery_entry_loss: fix the typo in the test case name
test: verify that schema pulls are disabled in the Raft-based recovery procedure
raft topology: disable schema pulls in the Raft-based recovery procedure
Refactor `make_request` to use a single core implementation that
handles authentication and issues the HTTP request. All overloads now
delegate to this unified method.
The test test_mv_tablets_replace verifies that merging tablets of both a
view and its base table is allowed if rf-rack-valid-keyspaces option is
enabled (and it is enabled by default in the test suite).
The `allow_tablet_merge_with_views` error injection was previously used
to allow merging tablets in a table which has materialized views
attached to it. Now, the error injection is not needed because this is
allowed under the rf-rack-valid condition, which is enabled by default
in tests.
Remove the error injection from the code and adjust the tests not to use
it.
Tablet merge of base tables is only safe if there is at most one replica
in each rack. For more details on why it is the case please see
scylladb/scylladb#17265. If the rf-rack-valid-keyspaces is turned on,
this condition is satisfied, so allow it in that case.
Fixes: scylladb/scylladb#26273
We want to add strongly consistent tables as an option. We will have
two kinds of strongly consistent tables: globally consistent and locally
consistent. The former means that requests from all DCs will be globally
linearisable, while the latter means that only requests to the same DC
will be linearisable. To allow configuring all the possibilities, the
patch adds a new parameter to the keyspace definition, "consistency",
that can be configured to be `eventual`, `global` or `local`. A
non-eventual setting is supported for tablets-enabled keyspaces only.
Since we want to start with implementing local consistency, configuring
global consistency will result in an error for now.
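The rules stated above can be summarized as a small validation routine. This is a Python sketch for illustration only: `validate_consistency` and its error messages are hypothetical, and the real check lives in the C++ keyspace-definition code.

```python
ALLOWED = ("eventual", "global", "local")


def validate_consistency(consistency: str, tablets_enabled: bool) -> None:
    """Raise ValueError if the keyspace 'consistency' setting is rejected."""
    if consistency not in ALLOWED:
        raise ValueError(f"unknown consistency: {consistency!r}")
    if consistency == "eventual":
        return  # always accepted
    if not tablets_enabled:
        # Non-eventual settings need a tablets-enabled keyspace.
        raise ValueError("strong consistency requires tablets")
    if consistency == "global":
        # Global consistency is not implemented yet.
        raise ValueError("global consistency is not supported yet")
```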
The C++ test `test_indexing_paging_and_aggregation` is one of the slowest tests in test/boost. The reason for its slowness is that it needs a table with more rows than SELECT's "DEFAULT_COUNT_PAGE_SIZE" which was hard-coded to 10,000, so the test needed to write and read tens of thousands of rows, and did it multiple times.
It turns out the code actually had an ad-hoc mechanism to override DEFAULT_COUNT_PAGE_SIZE in a C++ test, but both this mechanism and the test itself were so opaque I didn't find it until I fixed it in a different way: What I ended up doing in this pull request is the following (each step in a separate patch):
1. Rewrite this test in Python, in the test/cqlpy framework. This was straightforward, as this test only used CQL and not internal interfaces. The reason why this test wasn't written in Python in the first place is that it was written in 2019, a year before cqlpy existed. I added extensive comments to the new tests, and I finally understood what it was doing :-)
2. I replaced the ad-hoc C++-test-only mechanism of overriding DEFAULT_COUNT_PAGE_SIZE by a bona-fide configuration parameter, `select_internal_page_size`.
3. Finally, the Python test can temporarily lower `select_internal_page_size` and use a table with much fewer rows.
After this series, the test `test_indexing_paging_and_aggregation` (which is now in Python instead of C++) takes around half a second, 20 times faster than before. I expect the speedup to be even more dramatic for the debug build.
Closes scylladb/scylladb#25368
* github.com:scylladb/scylladb:
cql: make SELECT's "internal page size" configurable
secondary index: translate test_indexing_paging_and_aggregation to Python
Before, mutable_token_metadata_ptr containing tablet changes
was replicated to all cores in the post_commit phase, which violated
the atomicity guarantee of schema_applier. Now it's incorporated into
the per-shard commit phase.
It uses service::schema_getter abstraction introduced in earlier
commit to inject "pending" schema which is not yet visible to the
whole system.
The functions prepare_token_metadata_change and commit_token_metadata_change depend
on the current schema through calls to the database service. However, during an
atomic schema change, the current schema does not yet include the pending changes.
Despite that, we want to apply token metadata changes to those pending schema
elements as well.
Currently, this is achieved by postponing token metadata changes until after the rest
of the schema is committed, but this breaks atomicity. To allow incorporating the
prepare and commit phases into schema_applier, we need to abstract the schema
dependency. This will make it possible to provide, in following commits, an
implementation that includes visibility into pending changes, not just the currently
active schema.
Make use of the freshly introduced facility to disable
garbage-collection on a per-query basis for range scans. This is needed
so partitions that only contain garbage-collectible data are not missing
from the partition-list. When using SELECT * FROM MUTATION_FRAGMENTS(),
the user is expecting to see *all* data, even that which is dead and
garbage-collectible.
Include a test which reproduces the issue.
Allow disabling tombstone gc on a per-query basis for mutation queries.
This is achieved by a bool flag passed to mutation query variants like
`query_mutations_on_all_shards()` and `database::mutation_query()`,
which is then propagated down to compaction_mutation_state.
The future user (in the next patch) is the SELECT * FROM
MUTATION_FRAGMENTS() statement which wants to see dead partitions
(and rows) when scanning a table. Currently, due to garbage collections,
said statement can miss partitions which only contain
garbage-collectible tombstones.
It is confusing. For query compaction, it is initialized to `always_gc`,
for sstable compaction it is initialized to a lambda calling into
`can_gc()`. This makes understanding the purpose of this member very
confusing.
The real use of this member is to bridge
mutation_partition::compact_and_expire() with can_gc(). This patch
ditches the member and creates the lambda near the call sites instead,
just like the other params to `compact_and_expire()` already are.
can_gc() now also respects _tombstone_gc.is_gc_enabled() instead of just
blindly returning true when in query mode.
With this patch, whether tombstones are collected or not in query mode
is now consistent and controlled by the tombstone_gc_state.
Currently, to disable tombstone-gc on-demand completely, one has to pass
down a bool flag along with the already required tombstone_gc_state to
the code which does the compacting.
This is redundant and confusing, the tombstone_gc_state is supposed to
encapsulate all tombstone-gc related logic in a transparent way.
Add dedicated factory methods for no-gc and gc-all, to allow creating a
tombstone_gc_state which transparently gcs for all or no tombstones.
This commit adds tests to `test_streams.py` (i.e. Alternator Streams)
checking the following cases:
* putting an item with BatchWriteItem shouldn't emit a log if the old
item and the new item are identical,
* deleting an item with BatchWriteItem shouldn't emit a log if the item
doesn't exist,
* UpdateItem shouldn't emit a log if the old item and the new item are
identical.
These cases haven't been tested until this commit.
Refs https://github.com/scylladb/scylladb/issues/6918
Closes scylladb/scylladb#26396
* seastar 270476e7...bd74b3fa (20):
> memory: Decay large allocation warning threshold
> iotune: fix very long warm up duration on systems with high cpu count
> Add lib info to one line backtrace
> io: Count and export number of AIO retries
> io_queue: Destroy priority class data with scheduling group
> Merge 'Expell net::packet from output_stream API stack' from Pavel Emelyanov
code: Introduce new API level
iostream: Remove write()-s of packet/scattered_message from new API level
iostream: Convert output_stream::_zc_bufs to vector of buffers
code: Add data_sink_impl::put(std::span<temporary_buffer>) method
code: Prepare some data_sink_impl::do_put(temporary_buffer) methods
iostream: Introduce output_stream::write(span<temporary_buffer>) overload
packet: Add packet(std::span<temporary_buffer>) constructor
temporary_buffer: Add detach_front() helper
> cooking: update gnutls to 3.7.11
> file: Configure DMA alignment from block size
> util: adapt to fmt 12.0.0 API changes
> Merge 'Internalize reactor::posix_... API methods' from Pavel Emelyanov
reactor: Deprecate and internalize posix_connect()
reactor: Deprecate and internalize posix_listen()
> cooking: update fmt to modern version
> Merge 'Add prometheus bench, coroutinize prometheus' from Travis Downs
prometheus: coroutinize metrics writing
prometheus_test: add global label test
introduce metrics_perf bench
> operator co_await: use rvalue reference
> futurize::invoke: use std::invoke
> io_tester: Don't skip 0 position in sequential workflows
> io_queue: Use own logger for messages
> .clangd: tell the LSP about seastar's header style
> docker: Update to plucky
> Merge 'Convert timer test into seastar test (and a bit more)' from Pavel Emelyanov
test: Remove OK macro
test: Fix one failure check
test: Use boost checkers instead of BUG() macro
test: Fix indentation after previous patch
test: Convert timer_test into seastar test(s)
Closes scylladb/scylladb#26560
In some uses of SELECT, such as aggregation (sum() et al.), GROUP BY or
secondary index, it needs to perform internal scans. It uses an "internal
page size" which before this patch was always DEFAULT_COUNT_PAGE_SIZE = 10000.
There was an ad-hoc and undocumented way to override this default in C++
tests, using functions in test/lib/select_statement_utils.hh, but it
was so non-obvious that the test that most needed to override this
default - the very slow test test_indexing_paging_and_aggregation which
would have been much faster with a lower setting - never used it.
So in this patch we replace the ad-hoc configuration functions by a
bona-fide Scylla configuration option named "select_internal_page_size".
The few C++ tests that used the old configuration functions were
modified to use the new configuration parameters. The slow test
test_indexing_paging_and_aggregation still doesn't use the new
configuration to become faster - we'll do this in the next patch.
Another benefit of having this "internal page size" as a configuration
option is that one day a user might realize that the default choice
10,000 is bad for some reason (which I can't envision right now), so
having it configurable might come in handy.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
We do this at the end of `test_raft_recovery_entry_loss`. It's not worth
adding a separate regression test, as tests of the recovery procedure
are complicated and have a long running time. Also, we choose
`test_raft_recovery_entry_loss` out of all tests of the recovery
procedure because it does some schema changes.
Schema pulls should always be disabled when group 0 is used. However,
`migration_manager::disable_schema_pulls()` is never called during
a restart with `recovery_leader` set in the Raft-based recovery
procedure, which causes schema pulls to be re-enabled on all live nodes
(excluding the nodes replacing the dead nodes). Moreover, schema pulls
remain enabled on each node until the node is restarted, which could
be a very long time.
The old gossip-based recovery procedure doesn't have this problem
because we disable schema pulls after completing the upgrade-to-group0
procedure, which is a part of the old recovery procedure.
Fixes #26569
The Boost test test_indexing_paging_and_aggregation is one of the slowest
boost tests. But it's hard to understand why it needs to be so slow - the
C++ test code is opaque, and uncommented. The test didn't need to be in
C++ - it only uses CQL, not any internal interfaces - but it was written
in 2019, a year before test/cqlpy was created.
So before we can make this test faster, this patch translates it to
Python and adds a significant amount of comments. The new Python test is
functionally identical to the old C++ test - it is not (yet) made
smaller or faster. The new test takes a whopping 9 seconds to run on
my laptop (in dev build mode). We'll reduce that in the next patch.
As usual, the cqlpy test can also be tested on Cassandra, and
unsurprisingly, it passes.
Refs #16134 (which asks to translate more MV and SI tests to Python).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a struct `per_request_options` used to communicate between CDC and upper abstraction layers. We need this for better compatibility with DynamoDB Streams in Alternator (https://github.com/scylladb/scylladb/issues/6918) to change operation types of log rows. This patch also adds a way to conditionally forward the item read by LWT to CDC and use it as a preimage. For now, only Alternator uses this feature.
The main changes are:
- add a struct `cdc::per_request_options` to pass information between CDC and upper abstraction layers,
- add the struct to `cas_request::apply`'s signature,
- add a possibility to provide a preimage fetched by an upper abstraction layer (to propagate a row read by Alternator to CDC's preimage). This reduces the number of reads-before-write by 1 for some **Alternator** requests and it is always safe. It's possible to use this feature also in CQL.
No backport, it's a feature.
Refs https://github.com/scylladb/scylladb/issues/6918
Refs https://github.com/scylladb/scylladb/pull/26121
Closes scylladb/scylladb#26149
* github.com:scylladb/scylladb:
alternator, cdc: Re-use the row read by LWT as a CDC preimage
cdc: Support prefetched preimages
storage: Add cdc options to cas_request::apply
cdc, storage: Add a struct to pass per-mutation options to CDC
cdc: Move operations enum to the top of the namespace
Tiny code cleanup to improve readability without changing behavior.
Changes:
- remove unused variables and imports,
- remove redundant whitespaces, and a duplicated `public:` access
specifier,
- use `is_aws` function to check if running in AWS in
test/alternator/test_metrics.py,
- other trivial changes.
Closes scylladb/scylladb#26423
Unfortunately, the test became flaky and is blocking promotion. The
cause of the flakiness is not known yet but is unrelated to other items
currently queued on the `next` branch. The investigation continues on
GitHub issue scylladb/scylladb#26534.
In the meantime, skip the test to unblock other work.
Refs: scylladb/scylladb#26534
Closes scylladb/scylladb#26549
Into total and live. Currently only live (those with live content) are
counted. Report live and total separately, just like we do for rows. This
allows deducing the count of dead partitions as well, which is
particularly interesting for scans.
Closes scylladb/scylladb#26548
We need to avoid reloading schema early as it goes via
schema_applier which internally depends on storage_service
and on distributed_loader initializing all keyspaces.
Simply moving migration manager startup later in the code is not
easy as some services depend on it being initialized so we just
enable those feature listeners a bit later.
It never belonged to tables and views and its placement stems
from location of _tablet_hint handling code.
In the following commits we'll reference it in storage_service.cc.
It prepares pending_token_metadata to handle both new and copy
of existing metadata for consistent usage in later commit.
It also adds a shared_token_metadata getter so that we don't
need to get it from db.
This is mechanical change which simplifies the code. Schema_applier
class is an object which holds schema merging intermediate state
so it's fine that all schema merging functions have access to this state.
Introduce a counter metric to monitor instances where the background
filling fiber is blocked due to insufficient memory in the S3 client.
Closes scylladb/scylladb#26466
Propagates the row read by CAS to CDC's preimage to save one
read-before-write.
As of now, a preimage in Alternator Streams always contains the entire
item (see previous_item_read_command in executor.cc), so the resulting
preimage should stay the same. In other words, this change should be
transparent to users.
This commit adds support to pass a preimage selected by an upper layer
to CDC. The responsibility for the correctness of the preimage (i.e. the
selected columns, whether it's up to date, etc.) lies with the caller.
It may be improved in the future by validating the preimage, e.g. by
"slicing" the received preimage to the necessary columns.
The motivation behind this change was to reduce the number of
read-before-writes and avoid reading the row twice for Alternator
Streams in an increased compatibility mode with DynamoDB. This is to be
added in a following commit. Until then, this commit should be a no-op.
During Scylla startup, directories are created and verified in
`directories::do_verify_owner_and_mode()`. It is possible that while
retrieving file stats, a file might be removed, leading to Scylla
failing to boot.
This is particularly visible in `storage/test_out_of_space.py` tests,
which use FUSE to mount size-limited volumes. When a file that is open
by another process is removed, FUSE renames it to `.fuse_hidden*`.
In `directories::do_verify_owner_and_mode()`, the code performs a
`scan_dir` to list files and retrieves their stats to verify type, mode,
and ownership. If a file is removed while retrieving its stats, we see
errors such as:
```
Failed to get /scylladir/testlog/x86_64/dev/volumes/e0125c60-1e63-4330-bf6f-c0ea3e466919/scylla-0/hints/1/.fuse_hidden0000001800000005
```
This change makes `do_verify_owner_and_mode()` ignore files when
retrieving stats fails, avoiding spurious errors during verification.
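The idea of tolerating entries that vanish between listing and stat can be illustrated in Python. This is a sketch under assumptions, not the C++ code in `directories::do_verify_owner_and_mode()`: the helper name `stat_existing` is hypothetical, and treating every `OSError` as "skip this entry" is an assumption made for the example.

```python
import os


def stat_existing(paths):
    """Stat each path, skipping entries whose stat call fails.

    Returns (found, skipped): a dict of path -> os.stat_result for the
    entries that could be inspected, and a list of the paths ignored.
    """
    found, skipped = {}, []
    for path in paths:
        try:
            found[path] = os.lstat(path)
        except OSError:
            # The entry was removed (or became unreadable) after it was
            # listed; ignore it instead of failing the whole scan.
            skipped.append(path)
    return found, skipped
```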
Refs: https://github.com/scylladb/scylladb/issues/26314
Closes scylladb/scylladb#26535
Not supported currently as such tables have no memtables, cache or
sstables, so any select * from mutation_fragments() query will return
empty result.
Detect virtual tables and return their content with a distinct
'virtual-table' mutation_source designation.
Add new --input-format command line argument. Possible values are json
(current) and cql (new -- added in this patch).
When --input-format=cql (new default), the input-file is expected to
contain CQL INSERT, UPDATE or DELETE statements, separated by semicolon.
The input file can contain any number of statements, in any order. The
statements will be executed and applied to a memtable, which is then
flushed to create an sstable with the content generated from the
statement. The memtable's size is capped at 1MiB, if it reaches this
size, it is flushed and recreated. Consequently, multiple sstables can
be created from a single scylla-sstable write --input-format=cql
operation.
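To show why one input file can yield several sstables, here is a Python sketch of the flush-on-cap behaviour. It is a simplified model: `plan_flushes` is a hypothetical helper, and it assumes the size check happens after each statement is applied and that any remainder is flushed at the end.

```python
MEMTABLE_CAP = 1 << 20  # 1 MiB, the cap named in the text


def plan_flushes(statement_sizes, cap=MEMTABLE_CAP):
    """Return the memtable size at each flush, one entry per sstable."""
    flushes, current = [], 0
    for size in statement_sizes:
        current += size
        if current >= cap:
            # The memtable reached the cap: flush it and start a new one.
            flushes.append(current)
            current = 0
    if current:
        # Whatever is left at the end of the input is flushed too.
        flushes.append(current)
    return flushes
```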
We include more relevant information for debugging purposes:
the remaining bytes and the size. It might be useful to determine
where exactly an error occurred and help reason about it.
Closes scylladb/scylladb#26486
This patch makes three small mostly-cosmetic improvements to a test in
test/alternator/test_streams.py:
1. The test is renamed "test_streams_deleteitem_old_image_no_ck" to
emphasize its focus on the combination of deleteitem, old image,
and no ck. The "putitem" we had in the name was not relevant, and
the "old_image" was missing and important.
2. Moreover, using PutItem in this test just to set up the test scenario
mixed the bug which the test tries to reproduce with a different
only-recently-fixed bug (that PutItem also generated a spurious
"REMOVE" event). So I changed the use of PutItem by using UpdateItem,
to make this test independent of the other bug. Test independence is
important because it allows us - if we want - to backport a fix for
just one bug independently of the fix to the other bug.
3. Also improved the comment in front of the test to mention where we
already tested the with-ck case, and also to mention issue 26382
which this test reproduces (the xfail line also mentions it, but
the xfail line will be removed when the bug is fixed - but the
mention in the comment will remain - and should remain).
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#26526
This series of patches improves test vector_store_client_test stability. The primary issue with flaky connections was discovered while working on PR #26308.
Key Changes:
- Fixes premature connection closures in the mock server:
The mock HTTP server was not consuming request payloads, causing it to close connections immediately after a response. Subsequent tests attempting to reuse these closed connections would fail intermittently, leading to flakiness. The server has been updated to handle payloads correctly.
- Removes a retry workaround:
With the underlying connection issue resolved, the retry logic in the vector_store_client_test_ann_request test is no longer needed and has been removed.
- Mocks the DNS resolver in tests:
The vector_store_client_uri_update_to_invalid test has been corrected to mock DNS lookups, preventing it from making real network requests.
- Corrects request timeout handling:
A bug has been fixed where the request timeout was not being reset between consecutive requests.
- Unifies test timeouts:
Timeouts have been standardized across the test suite for consistency.
Fixes: #26468
It is recommended to backport this series to the 2025.4 branch. Since these changes only affect test code and do not alter any production logic, the backport is safe. Addressing this test flakiness will improve the stability of the CI pipeline and prevent it from blocking unrelated patches.
Closes scylladb/scylladb#26374
* github.com:scylladb/scylladb:
vector_search: Unify test timeouts
vector_search: Fix missing timeout reset
vector_search: Refactor ANN request test
vector_search: Fix flaky connection in tests
vector_search: Fix flaky test by mocking DNS queries
In the next patches a new input-format will be introduced, which can
produce multiple output format. To prepare for this, consolidate the
code which produces an sstable into a reusable lambda function.
Moves code around, reduces churn in next patches. Indentation is left
broken for easier review.
Make error messages more generic, so they are not specific to select.
Make it a template on the type of cql statement for the final check. To
avoid templating the whole thing, the function is split into two.
Parametrize the name of the allowed statement types in said check.
Prepares the method to be shared between query operation and write
operation (future change).
While at it, also change query param type to std::string_view to avoid
some copies.
This transformation enables an existing schema to be created as a table
in cql_test_env, to be used to read/write sstables belonging to said
schema.
Extract this into a method, to be shared by a future operation which
will also want to do this.
Augments the object storage document with config options etc for
using GS instead of S3.
TODO: add proper gsutil command line examples for manual managing of
GCP storage.
Adds optional memory semaphore to limit the mem buffer usage in sink/source.
Note that we don't bookkeep exact, to avoid deadlock issues in higher layer.
In upload, we overlease on first buffer put to ensure we can at least fill
the desired 8M of buffers. We try to adjust when going over, but if we
fail, we fail, but at least will initiate upload -> soon release memory.
On next put, we try to grab multiples of 8M again, and so forth. Thus
potentially causing waiting for resources, without ending up not uploading
at least one active sink.
For download (source), we try to get lease for as much as we want to read,
but if we fail, we adjust this down to 256k and download anyway. Since this
will typically be released immediately, we at least don't overrun for long,
and again, avoid fully stopping, throttling rate instead.
Adds an `object_storage` fixture with parametrization to iterate through
's3' and 'gs' backends.
For the former, will instantiate the `s3_server` backend (modified to better
handle being actual temp, function level server).
For the latter, will either give back a frontend if env vars indicating
"real" GS buckets and endpoints are used, or launch a docker image for
fake-gcs-server on a free port.
Please read the comment in the code about the management of server output,
as this is less than optimal atm, but I can't figure out the issue with it.
All returned fixture objects will respond to `address`, `bucket` properties,
as well as be able to create endpoint config objects for scylla.
To avoid having to async wait for creating credentials, allow lazy
init (in actual token renew) of credentials. This is not super
pleasant, since it means any error will be late, but it is required
more or less for the code paths into which we intend to place this.
Since, given the nature of object storage APIs, it is no more complicated to
provide a reasonable implementation of a seekable, limited, interface,
give this back, which in turn means upper layers can provide easy read-only file
interfaces. Hint hint.
Since both are bucket+prefix oriented, we can basically use same
options for both, only distinguished by actual protocol.
Abstract the types and the helper parse etc routines to handle either.
Use "gs" as term for GCS (Google Cloud Storage), since this is the
URL scheme used.
Because the concept of pushing reading range does not work for the wrapping
we do (i.e. encryption), there is no point having it here. We need to do
said range handling higher up.
Also, must allow multi-layered wrapping.
Extension of data_source, with the ability to
a.) Seek in any direction, i.e. move backwards. Thus not pure stream.
b.) Read a limited number of bytes.
The very transparent reason for the interface is to have a base
abstraction for providing a read-only file layer for networked
resources.
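The two abilities listed above, seeking in any direction and reading a bounded number of bytes, can be sketched as a tiny Python class. This is an in-memory stand-in for illustration only; the class and method names are hypothetical and do not mirror the C++ interface.

```python
class SeekableBytesSource:
    """In-memory stand-in for a seekable, length-limited data source."""

    def __init__(self, data: bytes):
        self._data = data
        self._pos = 0

    def seek(self, offset: int) -> None:
        # Absolute seek; moving backwards is allowed, so this is not a
        # pure forward-only stream.
        if not 0 <= offset <= len(self._data):
            raise ValueError("offset out of range")
        self._pos = offset

    def read(self, limit: int) -> bytes:
        # Return at most `limit` bytes and advance the position.
        chunk = self._data[self._pos:self._pos + limit]
        self._pos += len(chunk)
        return chunk
```

A read-only file layer could be built on top of such an interface by translating each positional read into a `seek` followed by a bounded `read`.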
Moves the config wrapper to own file (to reduce recompilation for modifying)
and refactors to handle extending this parameter to non-s3 endpoint configs.
We add missing `const`-qualifiers wherever possible in the module.
A few smaller changes were included as a bonus.
Backport: not needed. This is a cleanup.
Closes scylladb/scylladb#26485
* github.com:scylladb/scylladb:
index/secondary_index_manager: Take std::span instead of std::vector
index/secondary_index_manager: Add missing const qualifier
index/vector_index: Add missing const qualifiers
cql3/statements/index_prop_defs.cc: Remove unused include
cql3/statements/index_prop_defs.cc: Mark function as TU-local
cql3/statements/index_prop_defs: Mark methods as const-qualified
ZSTD_CDict needs a big contiguous allocation and there's no way around that.
The only thing to do is relax the warning appropriately.
Closes scylladb/scylladb#25393
So tombstones can be purged correctly based on the tombstone gc mode.
Currently, if repair-mode is used, tombstones are not purged at all,
which can lead to purged tombstones being re-replicated to replicas which
already purged them via read-repair.
This is not a correctness problem, as tombstones are not included in the
data query result or digest; these purgeable tombstones are only a nuisance
for read repair, where they can create extra differences between
replicas. Note that for the read repair to trigger, some difference
other than in purgeable tombstones has to exist, because as mentioned
above, these are not included in digests.
Fixes: scylladb/scylladb#24332
Closes scylladb/scylladb#26351
Batches that fail on the initial send are retried later, until they
succeed. These retries happen with CL=ALL, regardless of what the
original CL of the batch was. This is unnecessarily strict. We tried to
follow Cassandra here, but Cassandra has a big caveat in their use of
CL=ALL for batches. They accept saving just a hint for any/all of the
endpoints, so a batch which was just logged in hints is good enough for
them.
We do not plan on replicating this usage of hints at this time, so as a
middle ground, the CL is changed to EACH_QUORUM.
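To make the difference between the two consistency levels concrete, the Python sketch below computes how many replicas must acknowledge a write in each datacenter. The helper names are hypothetical; the quorum formula `rf // 2 + 1` is the usual majority definition.

```python
def quorum(rf: int) -> int:
    """Majority of `rf` replicas."""
    return rf // 2 + 1


def required_acks(cl: str, rf_per_dc: dict) -> dict:
    """Acknowledgements needed per DC for the given consistency level."""
    if cl == "ALL":
        # Every replica in every DC must respond.
        return dict(rf_per_dc)
    if cl == "EACH_QUORUM":
        # A majority of replicas in each DC is enough.
        return {dc: quorum(rf) for dc, rf in rf_per_dc.items()}
    raise ValueError(f"unsupported consistency level: {cl}")
```

With RF=3 in two datacenters, CL=ALL needs all six replicas, so a single unavailable replica keeps the retry failing, whereas EACH_QUORUM tolerates one unavailable replica per datacenter.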
Fixes: scylladb/scylladb#25432
Closes scylladb/scylladb#26304
It turns out that Boost assertions are thread-unsafe
(and can't be used from multiple threads concurrently).
This causes the test to fail with cryptic log corruptions sometimes.
Fix that by switching to thread-safe checks.
Fixes scylladb/scylladb#24982
Closes scylladb/scylladb#26472
The `describe_multi_item` function treated the last reference-captured
argument as the number of used RCU half units. The caller
`batch_get_item`, however, expected this parameter to hold an item size.
This RCU value was then passed to
`rcu_consumed_capacity_counter::get_half_units`, treating the
already-calculated RCU integer as if it were a size in bytes.
This caused a second conversion that undercounted the true RCU. During
conversion, the number of bytes is divided by `RCU_BLOCK_SIZE_LENGTH`
(=4KB), so the double conversion divided the number of bytes by 16 MB.
The fix removes the second conversion in `describe_multi_item` and
changes the API of `describe_multi_item`.
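The effect of the double conversion can be reproduced with a few lines of Python. This is a simplified model: `to_units` stands in for the real conversion and is assumed to be a plain ceiling division by the 4 KB block size, which ignores any half-unit details of `rcu_consumed_capacity_counter::get_half_units`.

```python
RCU_BLOCK_SIZE_LENGTH = 4096  # 4 KB, as stated in the text


def to_units(value: int) -> int:
    """Ceiling division by the block size (simplified stand-in)."""
    return -(-value // RCU_BLOCK_SIZE_LENGTH)


def buggy_units(size_bytes: int) -> int:
    # The already-converted value is converted again, so the byte count
    # is effectively divided by 4096 * 4096 = 16 MiB.
    return to_units(to_units(size_bytes))


def fixed_units(size_bytes: int) -> int:
    # Convert exactly once.
    return to_units(size_bytes)
```

In this model, a 32 MiB response is counted as 8192 units after the fix but only as 2 units with the double conversion.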
Fixes: https://github.com/scylladb/scylladb/pull/25847
Closes scylladb/scylladb#25842
The `vector_store_client_test` could be flaky because the request timeout
was not consistently reset in all code paths. This could lead to a
timeout from a previous operation firing prematurely and failing the
test.
The fix ensures `abort_source_timeout` is reset before each request.
The implementation is also simplified by changing
`abort_source_timeout::reset` so that it combines the reset and arm
operations into a single invocation.
Refactor the `vector_store_client_test_ann_request` test to use the
`vs_mock_server` class, unifying the structure of the test cases.
This change also removes retry logic that waited for the server to be ready.
This is no longer necessary because the handler now exists for all index names
and consumes the entire request payload, preventing connection closures.
Previously, the server did not handle requests for unconfigured
indexes, which caused the connection to close. This could lead to a
race condition where the client would attempt to reuse a closed
connection.
The vector store mock server was not reading the ANN request body,
which could cause it to prematurely close the connection.
This could lead to a race condition where the client attempts to reuse a
closed connection from its pool, resulting in a flaky test.
The fix is to always read the request body in the mock server.
The `vector_store_client_uri_update_to_invalid` test was flaky because
it performed real DNS lookups, making it dependent on the network
environment.
This commit replaces the live DNS queries with a mock to make the test
hermetic and prevent intermittent failures.
`vector_search_metrics_test` test did not call configure{vs},
as a consequence the test did real DNS queries, which made the test
flaky.
The refreshes counter increment has been moved before the call to the resolver.
In tests, the resolver is mocked leading to lack of increments in production code.
Without this change, there is no way to test DNS counter increments.
The change also simplifies the test making it more readable.
Expecting the group 0 read barrier to succeed with a timeout of 1s, just
after restarting 3 out of 5 voters, turned out to be flaky. In some
unlikely scenarios, such as multiple vote splits, the Raft leader
election could finish after the read barrier times out.
To deflake the test, we increase the timeout of Raft operations back to
300s for read barriers we expect to succeed.
Fixes #26457
Closes scylladb/scylladb#26489
Using the name regular as the incremental mode could be confusing, since
regular might be interpreted as the non-incremental repair. It is better
to use incremental directly.
Before:
- regular (standard incremental repair)
- full (full incremental repair)
- disabled (incremental repair disabled)
After:
- incremental (standard incremental repair)
- full (full incremental repair)
- disabled (incremental repair disabled)
Fixes #26503
Closes scylladb/scylladb#26504
Using `driver_connect()` after a cluster restart isn't enough to ensure
full CQL availability, but the test assumes that it is.
Fix that by making the test wait for CQL availability via `get_ready_cql()`.
Also, replace some manual usages of wait_for_cql_and_get_hosts with
`get_ready_cql()` too.
Fixes scylladb/scylladb#25362
Closes scylladb/scylladb#25366
db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground
This patch moves `discover_existing_staging_sstables()` to be executed
from main level, instead of running it on the background fiber.
This method needs to be run only once during startup to collect
existing staging sstables, so there is no need to do it in the
background. This change will increase debuggability of any further issues
related to it (like https://github.com/scylladb/scylladb/issues/26403).
Fixes https://github.com/scylladb/scylladb/issues/26417
The patch should be backported to 2025.4
Closes scylladb/scylladb#26446
* github.com:scylladb/scylladb:
db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground
db/view/view_building_worker: futurize and rename `start_background_fibers()`
There was a race between loop in `view_building_worker::run_view_building_state_observer()`
and a moment when a batch was finishing its work (`.finally()` callback
in `view_building_worker::batch::start()`).
State observer waits on `_vb_state_machine.event` CV and when it's
awoken, it takes group0 read apply mutex and updates its state. While
updating the state, the observer looks at `batch::state` field and
reacts to it accordingly.
On the other hand, when a batch finishes its work, it sets `state` field
to `batch_state::finished` and does a broadcast on
`_vb_state_machine.event` CV.
So if the batch executes the callback in `.finally()` while the
observer is updating its state, the observer may miss the event on the
CV and never notice that the batch has finished.
This patch fixes this by adding a `some_batch_finished` flag. Even if
the worker doesn't see an event on the CV, it will notice that the flag
was set and it will do the next iteration.
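The lost-wakeup pattern and the flag-based fix can be sketched in Python
with asyncio. This is an illustrative model only: the `Worker` class and
its method names are hypothetical, and the real code uses Seastar
primitives rather than `asyncio.Condition`.

```python
import asyncio


class Worker:
    """Hypothetical model of the observer/batch interaction."""

    def __init__(self):
        self.event = asyncio.Condition()
        self.some_batch_finished = False
        self.handled = 0  # number of finished batches the observer noticed

    async def batch_finished(self):
        # Models the batch's completion callback: record the fact in a
        # flag, then broadcast on the condition variable.
        async with self.event:
            self.some_batch_finished = True
            self.event.notify_all()

    async def observe_once(self):
        async with self.event:
            # wait_for() checks the flag before sleeping, so a broadcast
            # that happened while nobody was waiting is not lost.
            await self.event.wait_for(lambda: self.some_batch_finished)
            self.some_batch_finished = False
        self.handled += 1


async def main():
    worker = Worker()
    # The broadcast happens while nobody waits on the condition variable.
    await worker.batch_finished()
    # The observer still notices the finished batch thanks to the flag.
    await asyncio.wait_for(worker.observe_once(), timeout=1)
    return worker.handled


handled = asyncio.run(main())
```

With a bare `wait()` and no flag, `observe_once()` would block here until
the timeout, because the only broadcast was sent before it started waiting.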
Fixes scylladb/scylladb#26204
Closes scylladb/scylladb#26289
In f828fe0d59 ("setup: add the lazytime XFS version") we added the
lazytime mount option to /var/lib/scylla, but it was quickly reverted
(8f5e80e61a) as it caused a regression on CentOS 7.
We reinstate it now with a kernel version check. This will avoid
the lazytime mount option on CentOS 7, which is unsupported anyway.
The lazytime option avoids marking the inode as dirty if it's only for the
purpose of updating mtime/ctime. This won't help much while writing sstables
(since the write also updates extent information), but may help a little
with commitlog writes, since those are pure overwrites.
It likely won't help with the RWF_NOWAIT violations seen in [1], since
those are likely due to in-memory locking, not flushing dirty inodes
to disk.
Tested with an install to Ubuntu 24.04 LTS followed by a scylla_setup run.
The lazytime option was added to the .mount file and showed up in
the live mount.
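A kernel version check of this kind could look roughly like the Python
sketch below. The cutoff `MIN_LAZYTIME_KERNEL`, the `noatime` base option
and both function names are assumptions made for illustration; the commit
message does not state the actual threshold or option list.

```python
import re

# Assumption for illustration only: the commit does not state the cutoff.
MIN_LAZYTIME_KERNEL = (4, 0)  # hypothetical threshold


def kernel_version(release):
    """Parse (major, minor) out of a uname release such as '6.8.0-45-generic'."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return int(match.group(1)), int(match.group(2))


def xfs_mount_options(release, base_options=("noatime",)):
    """Return the mount option string, adding lazytime on new enough kernels."""
    options = list(base_options)
    if kernel_version(release) >= MIN_LAZYTIME_KERNEL:
        options.append("lazytime")
    return ",".join(options)
```

In a real setup script the release string would come from
`platform.release()` or `os.uname().release`.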
[1] https://github.com/scylladb/seastar/issues/2974
Closes scylladb/scylladb#26436
Fixes #26002
This will allow us to communicate with CDC from higher layers. We plan
to use it to reduce the number of read-before-writes with preimages by
passing the row selected in upper layers.
The test uses CQL tracing to check which files were read by a query.
This is flaky if the coordinator and the replica are different shards,
because the Python driver only waits for the coordinator, and not
for replicas, to finish writing their traces.
(So it might happen that the Python driver returns a result
with only coordinator events and no replica events).
Let's just dodge the issue by using --smp=1.
Fixes scylladb/scylladb#26432
Closes scylladb/scylladb#26434
We noticed during work on scylladb/seastar#2802 that on the i7i family
(and, as later proved, on the i4i family as well),
the disks report the physical sector size incorrectly
as 512 bytes, whilst we proved we can get much better write IOPS with
4096 bytes.
This is not the case on the AWS i3en family, where the reported 512-byte
physical sector size is also the size that achieves the best write IOPS.
This patch works around this issue by changing `scylla_io_setup` to parse
the instance type out of `/sys/devices/virtual/dmi/id/product_name`
and run iotune with the correct request size based on the instance type.
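The lookup described above can be sketched as follows. The table and the
function names are hypothetical and only encode the families mentioned in
this message; the real `scylla_io_setup` logic may differ.

```python
import re

# Families named in the text above; the dictionary itself is illustrative.
REQUEST_SIZE_BY_FAMILY = {"i7i": 4096, "i4i": 4096, "i3en": 512}
PRODUCT_NAME_PATH = "/sys/devices/virtual/dmi/id/product_name"


def instance_family(product_name):
    """Return the family part of a name like 'i7i.4xlarge', or None."""
    match = re.match(r"([a-z0-9]+)\.", product_name.strip().lower())
    return match.group(1) if match else None


def iotune_request_size(product_name, reported_sector_size=512):
    # Fall back to the size reported by the disk for unknown instance types.
    return REQUEST_SIZE_BY_FAMILY.get(instance_family(product_name),
                                      reported_sector_size)


def read_product_name(path=PRODUCT_NAME_PATH):
    # Only meaningful on a machine that exposes DMI data at this path.
    with open(path) as f:
        return f.read()
```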
Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
Closes scylladb/scylladb#25315
pass an appropriate query state for auth queries called from service
level cache reload. we use the function qos_query_state to select a
query_state based on caller context - for internal queries, we set a
very long timeout.
the service level cache reload is called from group0 reload. we want it
to have a long timeout instead of the default 5 seconds for auth
queries, because we don't have strict latency requirement on the one
hand, and on the other hand a timeout exception is undesired in the
group0 reload logic and can break group0 on the node.
Fixes https://github.com/scylladb/scylladb/issues/25290
backport possible to improve stability
Closes scylladb/scylladb#26180
* github.com:scylladb/scylladb:
service/qos: set long timeout for auth queries on SL cache update
auth: add query_state parameter to query functions
auth: refactor query_all_directly_granted
This patch moves `discover_existing_staging_sstables()` to be executed
from main level, instead of running it on the background fiber.
This method needs to be run only once during startup to collect
existing staging sstables, so there is no need to do it in the
background. This change will increase debuggability of any further issues
related to it (like scylladb/scylladb#26403).
Fixes scylladb/scylladb#26417
Next commit will move `discover_existing_staging_sstables()`
to the foreground, so to prepare for this we need to futurize
`start_background_fibers()` method and change its name to better reflect
its purpose.
`sl:driver` is expected to be used for new and control connections,
but other connections that run user load should not use it after
the user is authenticated.
Refs: scylladb/scylladb#24411
Before `sl:driver` was introduced, service levels were assigned as
follows:
1. New connections were processed in `main`.
2. After user authentication was completed, the connection's SL was
changed to the user's SL (or `sl:default` if the user had no SL).
This commit introduces `service_level_state` to `client_state` and
implements the following logic in `transport/server`:
1. If `sl:driver` is not present in the system (for example, it was
removed), service levels behave as described above.
2. If `sl:driver` is present, the flow is:
I. New connections use `sl:driver`.
II. After user authentication is completed, the connection's SL is
changed to the user's SL (or `sl:default`).
III. If a REGISTER (to events) request is handled, the client is
processing the control connection. We mark the client_state
to permanently use `sl:driver`.
The aforementioned state `2.III` is represented by
`_control_connection` flag in `client_state`.
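The selection rules listed above can be summarized in a small Python
model. `ClientState` and `effective_service_level` are hypothetical names
used only for this sketch; the real logic lives in C++ in
`transport/server` and `client_state`.

```python
from dataclasses import dataclass
from typing import Optional

DRIVER_SL = "sl:driver"
DEFAULT_SL = "sl:default"
MAIN_GROUP = "main"


@dataclass
class ClientState:
    """Hypothetical model; field names are illustrative."""
    authenticated: bool = False
    user_sl: Optional[str] = None      # SL assigned to the user, if any
    control_connection: bool = False   # set once REGISTER is handled


def effective_service_level(state, driver_sl_exists):
    if driver_sl_exists and state.control_connection:
        return DRIVER_SL               # case 2.III: stays on sl:driver
    if state.authenticated:
        return state.user_sl or DEFAULT_SL
    # Not authenticated yet: a new connection.
    return DRIVER_SL if driver_sl_exists else MAIN_GROUP
```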
Fixes: scylladb/scylladb#24411
Before this change, unauthorized connections stayed in the `main`
scheduling group. This is not ideal: in such a case `sl:default`
should be used instead, for consistency with the scenario
where a user is authenticated but has no service level assigned.
This commit adds a call to `update_scheduling_group` at the end of
connection creation for an unauthenticated user, to make sure the
service level is switched to `sl:default`.
Fixes: scylladb/scylladb#26040
Before this change, new connections were handled in a default
scheduling group (`main`), because before the user is authenticated
we do not know which service level should be used. With the new
`sl:driver` service level, creation of new connections can be moved to
`sl:driver`.
We switch the service level as early as possible, in `do_accepts`.
It is possible that `sl:driver` does not exist yet, for
instance in specific upgrade cases, or if it was removed. Therefore,
we also switch to `sl:driver` after a connection is accepted.
Refs: scylladb/scylladb#24411
Driver service level is a special service level that is created
automatically by the system. Therefore, it requires special handling
in DESC SCHEMA WITH INTERNALS, and these tests verify the special
behavior.
Refs: scylladb/scylladb#24411
This commit:
- Increases the number of allowed scheduling groups to allow the
creation of `sl:driver`.
- Adds the `DRIVER_SERVICE_LEVEL` feature, which prevents creating
`sl:driver` until all nodes have increased the number of
scheduling groups.
- Starts using `get_create_driver_service_level_mutations`
to unconditionally create `sl:driver` on
`raft_initialize_discovery_leader`. The purpose of this code
path is to ensure that `sl:driver` exists in new systems and tests.
- Starts using `migrate_to_driver_service_level` to create `sl:driver`
if it is not already present. The creation of `sl:driver` is
managed by `topology_coordinator`, similar to other system keyspace
updates, such as the `view_builder` migration. The purpose of this
code path is handling upgrades.
- Modifies related tests to pass after `sl:driver` is added.
Later in this patch series, `sl:driver` will be used by
`transport/server` to handle selected traffic, such as the driver's
schema and topology fetches.
Refs: scylladb/scylladb#24411
This commit implements `get_create_driver_service_level_mutations`
and `migrate_to_driver_service_level` in service_level_controller.
Both methods create `sl:driver` with shares=200 and store this fact
in `system.scylla_local`. Both methods will be used later in this
patch series for automatic creation of sl:driver.
Refs: scylladb/scylladb#24411
Later in this patch series, `sl:driver` will be added as a special
service level created automatically by the system. It needs special
handling in `DESC SCHEMA ...` to ensure that during backup restore:
1. CREATE SERVICE LEVEL does not fail if `sl:driver` already exists
2. If `sl:driver` exists, its configuration is fully restored (emit
ALTER SERVICE LEVEL).
3. If `sl:driver` was removed, the information is retained (emit
DROP SERVICE LEVEL instead of CREATE/ALTER).
Refs: scylladb/scylladb#24411
This adds a reference to sl_controller so that, later in this patch
series, topology_coordinator can manage creating `sl:driver` once
group0 is fully operational.
Refs: scylladb/scylladb#24411
This commit extends the system.scylla_local table with an additional
key/value pair that can be used later in this patch series to
record that `sl:driver` was already created. The purpose
of storing this information is to ensure that `sl:driver` is
not recreated after being intentionally removed.
A new mutation is included in `register_raft_pull_snapshot` to keep
`service_level_driver_created` in the state machine snapshot, which is
required for proper propagation of the value when a new node is added
to the cluster.
Refs: scylladb/scylladb#24411
Previously, tests used the hardcoded value 7 for the maximum number of
user service levels. This commit introduces a named variable that can
be shared across tests to avoid cases where this magic number goes
out of sync.
The voter handler caused `test_raft_recovery_user_data` to stop losing
group 0 majority when expected. We make sure this won't happen again
in this commit.
We don't change `test_raft_recovery_entry_lose` because it has some
checks that would fail with group 0 majority (schema versions would
match).
Note that it's possible to timeout the read barrier quickly without the
`timeout` parameter. See e.g. `test_cannot_add_new_node` in
`test_raft_no_quorum.py`. We don't take this approach here because we
don't want to change the default Raft parameters in the recovery
procedure tests.
After introducing the voter handler, the test stopped losing group 0
majority when expected because the killed dc contained 2 out of 5
voters. We fix it in this commit. The fix relies on the voter handler
not doing unnecessary work. The first dc should keep its voters and
majority.
The test was functional even though majority wasn't lost when expected.
Stopping the recovery leader before restarting it with `recovery_leader`
caused majority loss in the old group 0. Hence, there is no need to
backport this commit.
Shutting down `ccluster_all_nodes` in the previous commit is necessary
to avoid flakiness. It turns out that leaked driver sessions can impact
another run of the test case (with different parameterization). Here,
without shutting down `ccluster_all_nodes`, we could observe the DDL
requests from `start_writes` fail in the second test case run
(where `remove_dead_nodes_with == "replace"`) like this:
```
> await cql.run_async(f"USE {ks_name}")
E cassandra.cluster.NoHostAvailable: ('Unable to complete the
operation against any hosts', {<Host: 127.46.35.70:9042 dc1>:
ConnectionException('Host has been marked down or removed'),
<Host: 127.46.35.71:9042 dc1>: ConnectionException('Host has
been marked down or removed'), <Host: 127.46.35.3:9042 dc1>:
ConnectionException('Host has been marked down or removed'),
<Host: 127.46.35.25:9042>: ConnectionException('Host has
been marked down or removed')})
```
We could also see errors like this on the driver:
```
cassandra.InvalidRequest: Error from server: code=2200 [Invalid query]
message="Keyspace 'test_1759763911381_oktks' does not exist"
```
It turned out that `test_1759763911381_oktks` was created in the first
test case run (where `remove_dead_nodes_with == "remove"`), and somehow
the driver session created in the second test case run was still using
this keyspace in some way. The DDL requests were failing on the Scylla
side with the error above, and after some retries, the driver marked
nodes as down. I didn't try to investigate what exactly the driver was
doing.
In this commit, we shut down other driver sessions used in this test.
They didn't cause problems so far, but we'd better use the Python driver
correctly and be safe.
It's simpler than pausing the workload for the `cql` reconnection.
Moreover, the removed `start_writes` call required group 0 majority for
(redundant) CREATE KEYSPACE IF NOT EXISTS and CREATE TABLE IF NOT EXISTS
statements. The test shouldn't have group 0 majority at that point,
which is fixed in one of the following commits.
Using a separate driver connection also allows us to call
`finish_writes()` a bit later, after the `cql` reconnection.
It looks like decreasing `failure_detector_timeout_in_ms` doesn't make
the shutdown faster anymore.
We had some changes related to requests during shutdown like #24499
and #24714. They are probably the reason.
The current description is not accurate: the function doesn't throw
an exception if there's an invalid materialized view. Instead, it
simply logs the keyspaces that violate the requirement.
Furthermore, the experimental feature `views-with-tablets` is no longer
necessary for considering a materialized view as valid. It was dropped
in scylladb/scylladb@b409e85c20. The
replacement for it is the cluster feature `VIEWS_WITH_TABLETS`.
Fixes scylladb/scylladb#26420
Closes scylladb/scylladb#26421
This patch adds tests for:
- tablet migration during view building
- tablet merge during view building.
Those tests were missing from the original testing plan.
We want to backport it to 2025.4 to ensure the release is bug-free.
Closes scylladb/scylladb#26414
* github.com:scylladb/scylladb:
test/cluster/test_view_building_coordinator: add test for tablet merge
test/cluster/test_view_building_coordinator: add test for tablet migration
Seastar httpd recommended that users stop using the contiguous request.content string and read the body they need from the request's input_stream instead. However, the "official" deprecation of request content was only made recently.
This PR patches the REST API server to turn this feature on and patches the few handlers that work with request bodies to read them from the request stream.
Using newer seastar API, no need to backport
Closes scylladb/scylladb#26418
* github.com:scylladb/scylladb:
api: Switch to request content streaming
api: Fix indentation after previous patch
api: Coroutinize set_relabel_config handler
api: Coroutinize set_error_injection handler
This dependency reference is carried into column_family handlers block to make get_built_views handler work. However, the handler in question should live in view_builder block, because it works with v.b. data. This PR moves the handler there, while at it, coroutinizes it, and removes the no longer needed sys.ks. reference from column_family.
API dependencies cleanup work, no need to backport
Closes scylladb/scylladb#26381
* github.com:scylladb/scylladb:
api: Fix indentation after previous patch
api: Coroutinize get_built_indexes handler code
api: Remove system_keyspace ref from column_family API block
api: Move get_built_indexes from column_family to view_builder
If mis-used, the script says
error: unrecognized option: ..., see ./scripts/pull_github_pr.sh -h for usage
but if using the suggested -h option it prints just the same.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#26378
The PR #26154 dropped the `-fvisibility=hidden` compiler flag and
replaced it with `-fvisibility-inlines-hidden` as the former caused
issues in how the `noncopyable_function::operator bool` method executed
leading to incorrect return values. Apply the same fix to cmake.
Fixes #26391
Closes scylladb/scylladb#26431
There are three handlers that need to be patched all at once, together with
the server itself being marked with set_content_streaming.
For the two simple handlers, just get the content string with the
read_entire_stream_contiguous helper. This is what the httpd server did
anyway.
The "start_restore" handler used the contiguous contents to parse json
using the rjson utility. This handler is patched to use
read_entire_stream(), which returns a vector of temporary buffers. The
rjson parser has a helper to parse from that vector, so the change is
also an optimization.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Without the invoke_on_all lambda, for simplicity
Also keep indentation "broken" for the ease of review
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
In the Raft-based recovery procedure, we create a new group 0 and add
live nodes to it one by one. This means that for some time there are
nodes which belong to the topology, but not to the new group 0. The
voter handler running on the recovery leader incorrectly considers these
nodes while choosing voters.
The consequences:
- misleading logs, for example, "making servers {<ID of a non-member>}
voters", where the non-member won't become a voter anyway,
- increased chance of majority loss during the recovery procedure, for
example, all 3 nodes that first joined the new group 0 are in the same
dc and rack, but only one of them becomes a voter because the voter
handler tries to make non-members in other dcs/racks voters.
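The intended behavior can be illustrated with a small Python sketch in
which only nodes that already belong to the new group 0 are considered as
voter candidates. The function name and the node identifiers are
hypothetical; the real voter handler also weighs dcs and racks.

```python
def voter_candidates(topology_nodes, group0_members):
    """Keep only the nodes that can actually become voters."""
    members = set(group0_members)
    # Preserve topology order, drop nodes that have not joined group 0 yet.
    return [node for node in topology_nodes if node in members]


topology = ["n1", "n2", "n3", "n4", "n5"]
joined_so_far = {"n1", "n2", "n3"}
candidates = voter_candidates(topology, joined_so_far)
```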
Fixes #26321
Closes scylladb/scylladb#26327
Some code wants its TLS sockets to close immediately without sending BYE
message and waiting for the response. Recent seastar update changed the
way this functionality is requested (scylladb/seastar#2986)
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closes scylladb/scylladb#26253
In one of the constructors of `named_value`, the `allowed_values`
argument isn't used.
(This means that if some config entry uses this constructor,
the values aren't validated on the config layer,
and might give some lower layer a bad surprise).
Fix that.
Fixes scylladb/scylladb#26371
Closes scylladb/scylladb#26196
BYPASS CACHE is implemented for `bti_index_reader` by
giving it its own private `cached_file` wrappers over
Partitions.db and Rows.db, instead of passing it
the shared `cached_file` owned by the sstable.
But due to an oversight, the private `cached_file`s aren't
constructed on top of the raw Partitions.db and Rows.db
files, but on top of `cached_file_impl` wrappers around
those files. Which means that BYPASS CACHE doesn't
actually do its job.
Tests based on `scylla_index_page_cache_*` metrics
and on CQL tracing still see the reads from the private
files as "cache misses", but those misses are served
from the shared cached files anyway, so the tests don't see
the problem. In this commit we extend `test_bti_index.py`
with a check that looks at reactor's `io_queue` metrics
instead, and catches the problem.
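The layering mistake can be modeled with a toy Python example. `RawFile`
and `CachedFile` are simplified, hypothetical stand-ins for the real
classes; the `reads` counter plays the role of the `io_queue` metrics.

```python
class RawFile:
    def __init__(self, pages):
        self._pages = pages
        self.reads = 0  # counts reads that reach the "disk"

    def read(self, page_no):
        self.reads += 1
        return self._pages[page_no]


class CachedFile:
    """Caches pages read from an underlying file-like object."""

    def __init__(self, underlying):
        self._underlying = underlying
        self._cache = {}
        self.misses = 0

    def read(self, page_no):
        if page_no not in self._cache:
            self.misses += 1
            self._cache[page_no] = self._underlying.read(page_no)
        return self._cache[page_no]


raw = RawFile({0: b"index page"})
shared = CachedFile(raw)
shared.read(0)                      # warm the shared cache: one disk read

buggy_private = CachedFile(shared)  # stacked on the shared cache (the bug)
buggy_private.read(0)
reads_with_bug = raw.reads          # the read was served by the shared cache

fixed_private = CachedFile(raw)     # stacked on the raw file (the fix)
fixed_private.read(0)
reads_with_fix = raw.reads          # the read reached the disk
```

The private wrapper reports a miss in both cases, which is why only a
counter at the raw file level tells the two setups apart.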
Fixes scylladb/scylladb#26372
Closes scylladb/scylladb#26373
This change extends the CQL replication options syntax so the replication factor can be stated as a list of rack names.
For example: { 'mydatacenter': [ 'myrack1', 'myrack2', 'myrack4' ] }
Rack-list based RF can coexist with the old numerical RF, even in the same keyspace for different DCs.
Specifying the rack list also allows adding replicas on the specified racks (increasing the replication factor), or removing replicas from certain racks (by omitting them from the current datacenter rack-list). This will allow us to keep the keyspace rf-rack-valid, maintaining guarantees, while allowing adding/removing racks. In particular, this will allow us to add a new DC, which happens by incrementally increasing RF in that DC to cover existing racks.
Migration from numerical RF to rack-list is not supported yet. Migration from rack-list to numerical RF is not planned to be supported.
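As an illustration of how the two option forms can coexist, the Python
sketch below derives a replica count from either a numerical RF or a rack
list. The helper name `replica_count` is hypothetical, and the sketch
assumes one replica per listed rack.

```python
def replica_count(option):
    """Number of replicas implied by one per-DC replication option."""
    if isinstance(option, list):
        return len(option)  # assumption: one replica per listed rack
    return int(option)      # numeric RF, given as a string or a number


replication = {
    "dc1": "3",
    "mydatacenter": ["myrack1", "myrack2", "myrack4"],
}
counts = {dc: replica_count(opt) for dc, opt in replication.items()}
```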
New feature, no backport required.
Co-authored with @bhalevy
Fixes https://github.com/scylladb/scylladb/issues/25269
Fixes https://github.com/scylladb/scylladb/issues/23525
Closes scylladb/scylladb#26358
* github.com:scylladb/scylladb:
tablets: load_balancer: Recognize that tablets are confined to racks when computing desired tablet count
locator: Make hasher for endpoint_dc_rack globally accessible
test: tablets: Add test for replica allocation on rack list changes
test: lib: topology_builder: generate unique rack names
test: Add tests for rack list RF
doc: Document rack-list replication factor
topology_coordinator: Restore formatting
topology_coordinator: Cancel keyspace alter on broader set of errors
topology_coordinator: Make keyspace alter process options through as_ks_metadata_update()
cql3: ks_prop_defs: Preserve old options
cql3: ks_prop_defs: Introduce flattened()
locator: Recognize rack list RF as valid in assert_rf_rack_valid_keyspace()
tablet_allocator: Respect binding replicas to racks
locator: network_topology_strategy: Respect rack list when reallocating tablets
cql3: ks_prop_defs: Fail with more information when options are not in expected format
locator, cql3: Support rack lists in replication options
cql3: Fail early on vnode/tablet flavor alter
cql3: Extract convert_property_map() out of Cql.g
schema: Use definition from the header instead of open-coding it
locator: Abstract obtaining the number of replicas from replication_strategy_config_option
cql3, locator: Use type aliases for option maps
locator: Add debug logging
locator: Pass topology to replication strategy constructor
abstract_replication_strategy, network_topology_strategy: add replication_factor_data class
Materialized views are currently in the experimental phase and using them
in tablet-based keyspaces requires starting Scylla with an experimental feature,
`views-with-tablets`. Any attempts to create a materialized view or secondary
index when it's not enabled will fail with an appropriate error.
After considerable effort, we're drawing close to bringing views out of the
experimental phase, and the experimental feature will no longer be needed.
However, materialized views in tablet-based keyspaces will still be restricted,
and creating them will only be possible after enabling the configuration option
`rf_rack_valid_keyspaces`. That's what we do in this PR.
In this patch, we adjust existing tests in the tree to work with the new
restriction. That shouldn't have been necessary because we've already seemingly
adjusted all of them to work with the configuration option, but some tests hid
well. We fix that mistake now.
After that, we introduce the new restriction. What's more, when starting Scylla,
we verify that there is no materialized view that would violate the contract.
If there are some that do, we list them, notify the user, and refuse to start.
High-level implementation strategy:
1. Name the restrictions in the form of a function.
2. Adjust existing tests.
3. Restrict materialized views by both the experimental feature
and the configuration option. Add validation test.
4. Drop the requirement for the experimental feature. Adjust the added test
and add a new one.
5. Update the user documentation.
Fixes scylladb/scylladb#23030
Backport: 2025.4, as we are aiming to support materialized views for tablets from that version.
Closes scylladb/scylladb#25802
* github.com:scylladb/scylladb:
view: Stop requiring experimental feature
db/view: Verify valid configuration for tablet-based views
db/view: Require rf_rack_valid_keyspaces when creating view
test/cluster/random_failures: Skip creating secondary indexes
test/cluster/mv: Mark test_mv_rf_change as skipped
test/cluster: Adjust MV tests to RF-rack-validity
test/boost/schema_loader_test.cc: Explicitly enable rf_rack_valid_keyspaces
db/view: Name requirement for views with tablets
The querier object is a confusing one. Based on its name it should be in the query/ module and it is already in the query namespace. The query namespace is used for symbols which span the coordinator and replica, or that are mostly coordinator side. The querier is mainly in this namespace due to its similar name and because at the time it was introduced, namespace replica didn't exist yet. But this is a mistake which confuses people.
The querier is actually a completely replica-side logic, implementing the caching of the readers on the replica. Move it to the replica module and namespace to make this more clear.
Code cleanup, no backport.
Closes scylladb/scylladb#26280
* github.com:scylladb/scylladb:
replica: move querier code to replica namespace
root,replica: mv querier to replica/
TemporaryHashes.db is a temporary sstable component used during ms
sstable writes. It's different from other sstable components in that
it's not included in the TOC. Because of this, it has a special case in
the logic that deletes unfinished sstables on boot.
(After Scylla dies in the middle of a sstable write).
But there's a bug in that special case,
which causes Scylla to forget to delete other components from the same unfinished sstable.
The code intends only to delete the TemporaryHashes.db file from the
`_state->generations_found` multimap, but it accidentally also deletes
the file's sibling components from the multimap. Fix that.
Also, extend a related test so that it would catch the problem before the fix.
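One way this class of bug can arise is by erasing a whole key from a
multimap when only a single entry was meant to go. The Python sketch
below models the multimap as a dict of lists; both function names are
hypothetical and this is not the actual `sstable_directory` code.

```python
def forget_component_buggy(found, generation, component):
    # Erasing by key drops every component of the generation,
    # so the siblings are never scheduled for deletion.
    found.pop(generation, None)


def forget_component_fixed(found, generation, component):
    # Erase only the matching entry; siblings stay tracked for deletion.
    found[generation] = [c for c in found[generation] if c != component]


buggy = {1: ["Data.db", "Index.db", "TemporaryHashes.db"]}
forget_component_buggy(buggy, 1, "TemporaryHashes.db")

fixed = {1: ["Data.db", "Index.db", "TemporaryHashes.db"]}
forget_component_fixed(fixed, 1, "TemporaryHashes.db")
```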
Fixes scylladb/scylladb#26393
Bugfix, needs backport to 2025.4.
Closes scylladb/scylladb#26394
* github.com:scylladb/scylladb:
sstables/sstable_directory: don't forget to delete other components when deleting TemporaryHashes.db
test/boost/database_test: fix two no-op distributed loader tests
The reason for this seastar update is fixing #26190 - a service
level bug caused by a problem in scheduling group in seastar
implementation (seastar#2992).
* ./seastar 9c07020a...270476e7 (10):
> core: restore seastar_logger namespace in try_systemwide_memory_barrier
> Merge 'coroutines: support coroutines that copy their captures into the coroutine frame' from Avi Kivity
coroutines: advertise lambda-capture-by-value and test it
future: invoke continuation functions as temporaries
future: handle lvalue references in future continuations early
> resource: Tune up some allocate_io_queues() arguments
> Merge 'Add perf test hooks' from Travis Downs
perf_tests:add tests to verify pre-run hooks
per_tests: add pre-run hooks
perf-tests.md: update on measurement overhead
perf_tests_perf: a few more test variations
remove vestigial register_test method
> Add `touch` command to `rl` file processing
> Merge 'execution_stage: update stage name on scheduling_group rename' from Andrzej Jackowski
test: add sg_rename_recreate_with_the_same_name
test: add test_renaming_execution_stage in metric_test
test: add test_execution_stage_rename
execution_stage: update stage name on scheduling_group rename
execution_stage: reorganize per_group_stage_type
execution_stage: add concrete_execution_stage_base
execution_stage: move metrics setup to a separate method
> iotune: Fix warmup calculation bug and botched rebase
> Add missing `#pragma once` to ascii.rl
> iotune: Ignore measurements during warmup period
Fixes: https://github.com/scylladb/scylladb/issues/26190
Closes scylladb/scylladb#26388
TemporaryHashes.db is a temporary sstable component used during ms
sstable writes. It's different from other sstable components in that
it's not included in the TOC. Because of this, it has a special case in
the logic that deletes unfinished sstables on boot.
(After Scylla dies in the middle of a sstable write).
But there's a bug in that special case,
which causes Scylla to forget to delete other components from the same unfinished sstable.
The code intends only to delete the TemporaryHashes.db file from the
`_state->generations_found` multimap, but it accidentally also deletes
the file's sibling components from the multimap. Fix that.
Fixes scylladb/scylladb#26393
There are two tests which effectively check nothing.
They intend to check that distributed loader removes "leftover" sstable
files. So they create some incomplete sstables, run the test env
on the directory, and check that the files disappeared.
But the test env completely clears the test directory before
the distributed loader looks at the files, so the tests succeed trivially.
Fix that by adding a config knob to the test env which instructs it
not to clear the directory before the test.
We adjust the documentation to include the new
VECTOR_SEARCH_INDEXING permission and its usage
and also to reflect the changes in the maximal
amount of service levels.
This commit adds tests to verify the expected
behavior of the VECTOR_SEARCH_INDEXING permission,
that is, allowing GRANTing this permission only on
ALL KEYSPACES and allowing SELECT queries only on tables
with vector indexes when the user has this permission.
This patch allows users with the VECTOR_SEARCH_INDEXING permission
to perform SELECT queries on tables that have a vector index.
This is needed for the Vector Store service, which
reads the vector-indexed tables, but does not require
the full SELECT permission.
This commit adds a new version of command_desc struct
that contains a set of permissions instead of a singular
permission. When this struct is passed to ensure/check_has_permission,
we check if the user has any of the included permissions on the resource.
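The "any of the included permissions" check can be sketched in Python as
a set intersection. The class and function names are hypothetical, and
the sketch ignores the resource hierarchy that the real authorization
code resolves.

```python
from dataclasses import dataclass
from typing import FrozenSet


@dataclass(frozen=True)
class CommandDescWithPermissionSet:
    """Hypothetical Python analogue of the new struct."""
    permissions: FrozenSet[str]
    resource: str


def has_any_permission(granted, cmd):
    """granted maps a resource to the set of permissions held on it."""
    return bool(granted.get(cmd.resource, set()) & cmd.permissions)


select_cmd = CommandDescWithPermissionSet(
    permissions=frozenset({"SELECT", "VECTOR_SEARCH_INDEXING"}),
    resource="ALL KEYSPACES",
)
vector_store_grants = {"ALL KEYSPACES": {"VECTOR_SEARCH_INDEXING"}}
unrelated_grants = {"ALL KEYSPACES": {"MODIFY"}}
```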
This patch adds a new permission: VECTOR_SEARCH_INDEXING,
that is grantable only for ALL KEYSPACES. It will allow selecting
from tables with vector search indexes. It is meant to be used
by the Vector Store service to allow it to build indexes without
having full SELECT permissions on the tables.
This commit extends the TABLE_LOAD_STATS RPC with data about the tablet
replica sizes and effective disk capacity.
Effective disk capacity of a node is computed as a sum of the sizes of
all tablet replicas on a node and available disk space.
This is the first change in the size based load balancing series.
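The effective capacity formula stated above, as a short Python sketch
(the function name and the byte values are illustrative):

```python
def effective_capacity(tablet_replica_sizes, available_disk_space):
    """Sum of all tablet replica sizes on the node plus its free space."""
    return sum(tablet_replica_sizes) + available_disk_space


GiB = 1024 ** 3
capacity = effective_capacity([10 * GiB, 5 * GiB], 85 * GiB)
```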
Closes scylladb/scylladb#26035
Some tools commands have links to online documentation in their help output. These links were left behind in the source-available change, they still point to the old opensource docs. Furthermore, the links in the scylla-sstable help output always point to the latest stable release's documentation, instead of the appropriate one for the branch the tool was built from. Fix both of these.
Fixes: scylladb/scylladb#26320
Broken documentation link fix for the tool help output, needs backport to all live source-available versions.
Closes scylladb/scylladb#26322
* github.com:scylladb/scylladb:
tools/scylla-sstable: fix doc links
release: adjust doc_link() for the post source-available world
tools/scylla-nodetool: remove trailing " from doc urls
This reference was only needed to facilitate get_built_indexes handler
to work. Now it's gone and the sys.ks. reference is no longer needed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The handler effectively works with the view_builder and should be
registered in the block that has this service captured.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
The old logic assumes that replicas are spread across the whole DC when
determining how many tablets we need to have at least 10 tablets per
shard. If replicas are actually confined to a subset of racks, that
will come up with a too high count and overshoot actual per-shard
count in this rack.
Similar problem happens for scaling-down of tablet count, when we try
to keep per-shard tablet count below the goal. It should be tracked
per-rack rather than per-DC, since racks can differ in how loaded they
are by RF if it's a rack-list.
Encode the dc identifier into each rack name so each dc will have its
own unique racks.
Just for easier distinction in logs.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
There are several problems with how ALTER execution works with tablets.
1) Currently, option processing bypasses
ks_prop_defs::prepare_options(), and will pass them directly to
keyspace_metadata. This deviates from the vnode path, causing
discrepancy in logic. But also there will be some non-trivial options
post-processing added there - numeric RF will be replaced with a
rack list. We should preserve it in the tablet path which alters
the keyspace, otherwise it will fail when trying to construct
network_topology_strategy.
2) Option merging happens on the flat version of the map, which won't
work correctly with extended map which contains lists. We want
the new list to replace the old list or numeric RF, not get its items
merged. For example:
We want:
{'dc1': 3} + {'dc1': ['rack1', 'rack2']} = {'dc1': ['rack1', 'rack2']}
If we merge flattened options, we would get incorrect flattened options:
{'dc1': 3,
'dc1:0': 'rack1',
'dc1:1': 'rack2'}
3) We lose atomicity of update. Validation and merging which happens on the CQL
coordinator is done in a different group0 transaction context than mutation
generation inside topology coordinator later.
Fixes https://github.com/scylladb/scylladb/issues/25269
In 2d9b8f2, the semantics of ALTER were changed for tablet-based keyspaces
so that "replication" assignment acts like +=, where replication
options are merged with the old options.
This merging is currently performed in the CQL statement level on
options map, before passing to topology coordinator. This will change
in a later commit, so move merging here. Merging options at the flattened
level will not be correct because it doesn't recognize nested
collections, like rack lists.
We want:
{'dc1': 3} + {'dc1': ['rack1', 'rack2']} = {'dc1': ['rack1', 'rack2']}
If we merge flattened options, we would get incorrect flattened options:
{'dc1': 3,
'dc1:0': 'rack1',
'dc1:1': 'rack2'}
Which cannot be parsed back into ks_prop_defs on the topology coordinator.
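The difference between the two merge levels can be illustrated with a small sketch. The helper name is hypothetical and plain dicts stand in for the real option maps; it only demonstrates the replace-per-key semantics described above.

```python
def merge_replication_options(old, new):
    """Merge at the extended (unflattened) level: each new value replaces the old one whole."""
    merged = dict(old)
    merged.update(new)
    return merged


# Desired: the rack list replaces the numeric RF.
good = merge_replication_options({"dc1": "3"}, {"dc1": ["rack1", "rack2"]})

# Merging the flattened forms instead keeps the stale numeric entry
# next to the list items, which is the incorrect result shown above.
old_flat = {"dc1": "3"}
new_flat = {"dc1:0": "rack1", "dc1:1": "rack2"}
bad = merge_replication_options(old_flat, new_flat)

print(good)
print(bad)
```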
Refs https://github.com/scylladb/scylladb/pull/20208#issuecomment-3174728061
Refs #25549
Before, we would throw a vague sstring_out_of_range from substr() when
the name doesn't have a nested key separated by ":", e.g. "durable_writes"
instead of "durable_writes:durable_writes".
Allows per-DC replication factor to be either a string, holding a
numerical value, or a list of strings, holding a list of rack names.
The rack list is not respected yet by the tablet allocator, this is
achieved in a subsequent commit.
This changes the format of options stored in the flattened map
in system_schema.keyspaces#replication. Values which are rack lists,
are converted into multiple entries, with the list index appended to
the key with ':' as the separator:
For example, this extended map:
{
'dc1': '3',
'dc2': ['rack1', 'rack2']
}
is stored as a flattened map:
{
'dc1': '3',
'dc2:0': 'rack1',
'dc2:1': 'rack2'
}
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Signed-off-by: Tomasz Grabiec <tgrabiec@scylladb.com>
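The flattened storage format described above can be sketched as a pair of helpers. This is only an illustration of the key layout: `flatten` and `unflatten` are hypothetical names, and the reverse direction assumes that a key ending in `:<digits>` always denotes a list item.

```python
def flatten(options):
    """Turn rack lists into 'key:index' entries; scalar values are kept as-is."""
    flat = {}
    for key, value in options.items():
        if isinstance(value, list):
            for i, item in enumerate(value):
                flat[f"{key}:{i}"] = item
        else:
            flat[key] = value
    return flat


def unflatten(flat):
    """Rebuild the extended map (assumed inverse of flatten)."""
    options = {}
    lists = {}
    for key, value in flat.items():
        name, sep, index = key.rpartition(":")
        if sep and index.isdigit():
            lists.setdefault(name, {})[int(index)] = value
        else:
            options[key] = value
    for name, items in lists.items():
        options[name] = [items[i] for i in sorted(items)]
    return options


extended = {"dc1": "3", "dc2": ["rack1", "rack2"]}
flat = flatten(extended)
print(flat)
print(unflatten(flat) == extended)
```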
When walking the free-list of a pool or a span, the small-object code
casts the dereferenced `free_object*` to `void*`. This is unnecessary,
just use the `next` field of the `free_object` to look up the next free
object. I think this monkey business with `void*` was done to speed up
walking the free-list, but recently we've seen small-object --summarize
fail in CI, and it could be related.
Fixes: #25733
Closes scylladb/scylladb#26339
Before, the `nodetool getendpoints` expected the key as one string separated by : (for example 1:val:ue). This caused errors if any part of the key had a colon because it was unclear whether a colon was a separator or part of the key.
This change adds a new API endpoint, `/storage_service/natural_endpoints/v2/{keyspace}`, which accepts composite partition keys as multiple key_component query parameters (e.g., ?key_component=1&key_component=val:ue). The `nodetool getendpoints` command was updated to support a new `--key-components` option, allowing users to pass key components as an array. The client and test infrastructure were extended to support multiple values for a query parameter, and tests were added to verify correct behavior with composite keys.
The previous method of passing partition keys as colon-separated strings is preserved for backward compatibility.
Backport is not required, since this change relies on recent Seastar updates
Fixes #16596
Closes scylladb/scylladb#26169
* github.com:scylladb/scylladb:
docs: document --key-components option for getendpoints
test/nodetool/test_getendpoints: add coverage for --key-components param in getendpoints
nodetool: Introduce new option --key-components to specify compound partition keys as array
rest_api/test_storage_service: add v2 natural_endpoints test for composite key with multiple components
api/storage_service: add GET 'natural_endpoints' v2 to support composite keys with ':'
rest_api_mock: support duplicate query parameters
test/rest_api: support multiple query values per key in RestApiSession.send()
nodetool: add support of new seastar query_parameters_type to scylla_rest_client
In test_two_tablets_concurrent_repair_and_migration_repair_writer_level
safe_rolling_restart returns ready cql. However, get_all_tablet_replicas
uses the cql reference from manager that isn't ready. Wait for cql.
Fixes: #26328
Closes scylladb/scylladb#26349
sstable::compute_shards_for_this_sstable() has a temporary of type
std::vector<dht::token_range> (aka dht::partition_range_vector), which
allocates a contiguous 300k when loading an sstable from disk. This
causes large allocation warnings (it doesn't really stress the allocator
since this typically happens during startup, but best to clear the warning
anyway).
Fix this by changing the container to be chunked_vector. It is passed
to dht::ring_position_range_vector_sharder, but since we're the only
user, we can change that class to accept the new type.
Fixes #24198.
Closes scylladb/scylladb#26353
Some tests expect this error. Later, prepare_options() will be changed
in a way which would fail to accept new options in such case before
vnode/tablet flavor change is detected, tripping the tests.
It will become more complex when options will contain rack lists.
It's a good change regardless, as it reduces duplication and makes
parsing uniform. We already diverged to use stoi / stol / stoul.
The change in create_keyspace_statement.cc to add a catch clause is
needed because get_replication_factor() now throws
configuration_exception on parsing errors instead of
std::invalid_argument, so the existing catch clause in the outer scope
is not effective. That loop is trying to interpret all options as RF
to run some validations. Not all options are RF, and those are
supposed to be ignored.
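The pattern described above, where a validation loop tries every option as an RF and ignores the ones that do not parse, might look like this in outline. All names here are Python stand-ins: `ConfigurationException` mirrors configuration_exception and `ValueError` plays the role of std::invalid_argument.

```python
class ConfigurationException(Exception):
    """Stand-in for configuration_exception."""


def get_replication_factor(value):
    # Parsing errors are reported as ConfigurationException rather than the
    # lower-level ValueError, so callers must catch the new type.
    try:
        return int(value)
    except ValueError as e:
        raise ConfigurationException(f"invalid replication factor: {value!r}") from e


def collect_replication_factors(options):
    """Interpret every option as an RF; options that are not RFs are skipped."""
    rfs = {}
    for name, value in options.items():
        try:
            rfs[name] = get_replication_factor(value)
        except ConfigurationException:
            continue  # not an RF option, ignore it
    return rfs


print(collect_replication_factors({"class": "NetworkTopologyStrategy", "dc1": "3"}))
```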
In preparation for changing their structure.
1) std::map<sstring, sstring> -> replication_strategy_config_options
Parsed options. Values will become std::variant<sstring, rack_list>
2) std::map<sstring, sstring> -> property_definitions::map_type
Flattened map of options, as stored in system tables.
Adds a parameterized test to verify that multiple --key-components arguments
are handled correctly by nodetool's getendpoints command. Ensures the
constructed REST request includes all key_component values in the expected format.
Allows getendpoints to accept components of partition key using the --key-components option.
Key components are passed as an array and sent to the new /natural_endpoints/v2/{keyspace} endpoint.
Adds a test case for the `/storage_service/natural_endpoints/v2/{keyspace}` endpoint,
verifying that it correctly resolves natural endpoints for a composite partition key
passed as multiple `key_component` query parameters.
The original `/storage_service/natural_endpoints` endpoint uses colon-separated strings for composite keys,
which causes ambiguity when key components contain colons.
This commit adds a new `/storage_service/natural_endpoints/v2/{keyspace}` endpoint that accepts partition key components
via repeated `key_component` query parameters to avoid this issue.
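The ambiguity, and how repeated `key_component` parameters avoid it, can be demonstrated with the Python standard library. The keyspace name `ks` and table name `tbl` are placeholders for this sketch.

```python
from urllib.parse import parse_qs, urlencode

components = ["1", "val:ue"]

# Legacy form: joining with ':' loses the component boundaries.
legacy_key = ":".join(components)
legacy_split = legacy_key.split(":")

# v2 form: one key_component parameter per component.
query = urlencode([("cf", "tbl")] + [("key_component", c) for c in components])
url = f"/storage_service/natural_endpoints/v2/ks?{query}"
decoded = parse_qs(query)["key_component"]

print(legacy_split)
print(url)
print(decoded)
```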
Previously, only the last value of a repeated query parameter was captured,
which could cause inaccurate request matching in tests. This update ensures
that all values are preserved by storing duplicates as lists in the `params` dict.
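A minimal sketch of that storage scheme follows, assuming the mock receives a raw query string. `collect_params` is a hypothetical helper, not the mock's actual code.

```python
from urllib.parse import parse_qsl


def collect_params(query_string):
    """Build a params dict where a repeated key maps to a list of all its values."""
    params = {}
    for key, value in parse_qsl(query_string, keep_blank_values=True):
        if key not in params:
            params[key] = value
        elif isinstance(params[key], list):
            params[key].append(value)
        else:
            # Second occurrence: promote the scalar to a list.
            params[key] = [params[key], value]
    return params


print(collect_params("cf=tbl&key_component=1&key_component=val%3Aue"))
```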
Previously, the send() method in RestApiSession only supported one value per query parameter key.
This patch updates it to support passing lists of values, allowing the same key to appear multiple
times in the query string (e.g. ?key=value1&key=value2).
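On the sending side, the standard library already supports this expansion through `urlencode(..., doseq=True)`. The sketch below shows the idea; `build_query` is a hypothetical helper and not necessarily how `send()` is implemented.

```python
from urllib.parse import urlencode


def build_query(params):
    """Encode params; a list value yields one key=value pair per element."""
    return urlencode(params, doseq=True)


print(build_query({"key": ["value1", "value2"], "cf": "tbl"}))
```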
Applying lazy evaluation to the BTI encoding of clustering keys
was probably a bad default.
The possible benefits are dubious (because it's quite likely that the laziness
won't allow us to avoid that much work), but the overhead needed to
implement the laziness is large and immediate.
In this patch we get rid of the laziness.
We rewrite lazy_comparable_bytes_from_clustering_position
and lazy_comparable_bytes_from_ring_position
so that they perform the key translation eagerly,
all components to a single bytes_ostream in one synchronous call.
perf_bti_key_translation (microbenchmark added in this series, 1 iteration is 100 translations of a clustering key with 8 cells of int32_type):
```
Before:
test iterations median mad min max allocs tasks inst cycles
lcb_mismatch_test.lcb_mismatch 9233 109.930us 0.000ns 109.930us 109.930us 4356.000 0.000 2615394.3 614709.6
After:
test iterations median mad min max allocs tasks inst cycles
lcb_mismatch_test.lcb_mismatch 50952 19.487us 0.000ns 19.487us 19.487us 198.000 0.000 603120.1 109042.9
```
Enhancement, backport not required.
Closes scylladb/scylladb#26302
* github.com:scylladb/scylladb:
sstables/trie: BTI-translate the entire partition key at once
sstables/trie: avoid an unnecessary allocation of std::generator in last_block_offset()
sstables/trie: perform the BTI-encoding of position_in_partition eagerly
types/comparable_bytes: add comparable_bytes_from_compound
test/perf: add perf_bti_key_translation
We modify the requirements for using materialized views in tablet-based
keyspaces. Before, it was necessary to enable the configuration option
`rf_rack_valid_keyspaces`, have the cluster feature `VIEWS_WITH_TABLETS`
enabled, and use the experimental feature `views-with-tablets`.
We drop the last requirement.
We adjust code to that change and provide a new validation test.
We also update the user documentation to reflect the changes.
Fixes scylladb/scylladb#23030
Creating a materialized view or a secondary index in a tablet-based
keyspace requires that the user enable two options:
* experimental feature `views-with-tablets`,
* configuration option `rf_rack_valid_keyspaces`.
Because the latter has only become a necessity recently (in this series),
it's possible that there are already existing materialized views that
violate it.
We add a new check at start-up that iterates over existing views and
makes sure that that is not the case. Otherwise, Scylla notifies the user
of the problem.
We extend the requirements for being able to create materialized views
and secondary indexes in tablet-based keyspaces. It's now necessary to
enable the configuration option `rf_rack_valid_keyspaces`. This is
a stepping stone towards bringing materialized views and secondary
indexes with tablets out of the experimental phase.
We add a validation test to verify the changes.
Refs scylladb/scylladb#23030
Materialized views are going to require the configuration option
`rf_rack_valid_keyspaces` when being created in tablet-based keyspaces.
Since random-failure tests still haven't been adjusted to work with it,
and because it's not trivial, we skip the cases when we end up creating
or dropping an index.
The test will not work with `rf_rack_valid_keyspaces`. Since the option
is going to become a requirement for using views with tablets, the test
will need to be rewritten to take that into consideration. Since that
adjustment doesn't seem trivial, we mark the test as skipped for the
time being.
Currently, replica::tablet_map_to_mutation generates a mutation having a row per tablet.
With enough tablets (tens of thousands) in the table we observe reactor stalls when freezing / unfreezing such large mutations, as seen in https://github.com/scylladb/scylladb/pull/18095#issuecomment-2029246954, and I assume we would see similar stalls when converting those mutations into canonical_mutation and back, as they are similar to frozen_mutation and a bit more expensive, since they also save the column mappings.
This series takes a different approach than allowing freeze to yield.
`tablet_map_to_mutation` is changed to `tablet_map_to_mutations`, able to generate multiple split mutations that, when squashed together, are equivalent to the previously large mutation. Those mutations are fed into a `process_mutation` callback function, provided by the caller, which may add those mutations to a vector for further processing, and/or process them inline by freezing or making a canonical mutation.
In addition, splitting the large mutations would also prevent hitting the commitlog maximum mutation size.
Closes scylladb/scylladb#18162
* github.com:scylladb/scylladb:
schema_tables: convert_schema_to_mutations: simplify check for system keyspace
tablets: read_tablet_mutations: use unfreeze_and_split_gently
storage_service: merge_topology_snapshot: freeze snp.mutations gently
mutation: async_utils: add unfreeze_and_split_gently
mutation: add for_each_split_mutation
tablets: tablet_map_to_mutations: maybe split tablets mutation
tablets: tablet_map_to_mutations: accept process_func
perf-tablets: change default tables and tablets-per-table
perf-tablets: abort on unhandled exception
Some of the new tests covering materialized views explicitly disabled
the configuration option `rf_rack_valid_keyspaces`. It's going to become
a new requirement for views with tablets, so we adjust those tests and
enable the option. There is one exception, the test:
`cluster/mv/test_mv_topology_change.py::test_mv_rf_change`
We handle it separately in the following commit.
Currently, the function unfreezes each schema mutation partition
and then checks if it's for a system keyspace.
This isn't really needed since we can check the partition key
using the frozen_mutation, skip it if the partition is for a system keyspace.
Note that the constructed partition_key just copies
the frozen partition_key_view, without copying or deserializing the
actual key contents.
Also, reserve `results` capacity using the queried
partitions' size to prevent reallocations of the results
vector.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Split the tablets mutations by number of rows, based on
`min_tablets_in_mutation` (currently calibrated to 1024),
similar to the splitting done in
`storage_service::merge_topology_snapshot`.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
We don't need to store all snp.mutations in a vector
and then freeze the whole vector. They can be frozen
one at a time and collected into a vector, while
maybe yielding between each mutation to prevent
stalls.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
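The freeze-one-at-a-time idea can be illustrated with asyncio as an analogy for the reactor. `freeze` is a placeholder callable, and `asyncio.sleep(0)` yields unconditionally, whereas the commit describes only maybe yielding.

```python
import asyncio


async def freeze_gently(mutations, freeze):
    """Freeze mutations one by one, yielding to the event loop between items."""
    frozen = []
    for m in mutations:
        frozen.append(freeze(m))
        await asyncio.sleep(0)  # let other tasks run; avoids one long stall
    return frozen


result = asyncio.run(freeze_gently([1, 2, 3], lambda m: f"frozen({m})"))
print(result)
```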
Unfreeze the frozen_mutation, possibly splitting it
based on max_rows. The process_mutation function
is called for each split mutation.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
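Splitting by a row limit and handing each piece to a callback can be sketched as follows. A list of rows stands in for a mutation, and `for_each_split` is a hypothetical name; the limit of 1024 is borrowed from `min_tablets_in_mutation` mentioned earlier.

```python
def for_each_split(rows, max_rows, process_mutation):
    """Call process_mutation on consecutive slices of at most max_rows rows."""
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    for start in range(0, len(rows), max_rows):
        process_mutation(rows[start:start + max_rows])


rows = list(range(2500))
pieces = []
for_each_split(rows, 1024, pieces.append)

# The pieces, squashed back together, equal the original rows.
squashed = [row for piece in pieces for row in piece]
print([len(p) for p in pieces], squashed == rows)
```

Because the caller only ever sees one piece at a time, it can freeze or convert each piece immediately instead of holding a vector of all split mutations.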
Allows processing of the split mutations one at a time.
This can reduce memory footprint as the caller
won't have to store a vector of the split mutations
and then convert it (e.g. freeze the mutations
or convert them to canonical mutations).
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Split the generated tablets mutation if we run out of task
quota to prevent stalls, both when preparing the mutations
and later on when freezing/unfreezing them or converting
them to canonical_mutation and back.
Note that this will convert large mutations to long
vectors of mutations. A followup change is considered
to convert std::vectors of mutations to chunked_vector
to prevent large allocations.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
Prepare for generating several mutations for the
tablet_map by calling process_func for each generated mutation.
This allows the caller to directly freeze those mutations
one at a time into a vector of frozen mutations or similarly
convert them into canonical mutations.
Next patch will split large tablet mutations to prevent stalls.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
tablets-per-table must be a power of 2, so round up 10000 to 16K.
also, reduce number of tables to have a total of about 100K
tablets, otherwise we hit the maximum commitlog mutation size
limit in save_tablet_metadata.
Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
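The rounding mentioned above (10000 up to 16K) is the usual next-power-of-two computation. A small sketch, with a hypothetical function name:

```python
def round_up_to_power_of_two(n):
    """Smallest power of two that is >= n (n must be at least 1)."""
    if n < 1:
        raise ValueError("n must be >= 1")
    return 1 << (n - 1).bit_length()


print(round_up_to_power_of_two(10000))
```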
The directory utils/ is supposed to contain general-purpose utility
classes and functions, which are either already used across the project,
or are designed to be used across the project.
This patch moves 8 files out of utils/:
utils/advanced_rpc_compressor.hh
utils/advanced_rpc_compressor.cc
utils/advanced_rpc_compressor_protocol.hh
utils/stream_compressor.hh
utils/stream_compressor.cc
utils/dict_trainer.cc
utils/dict_trainer.hh
utils/shared_dict.hh
These 8 files together implement the compression feature of RPC.
None of them are used by any other Scylla component (e.g., sstables have
a different compression), or are ready to be used by another component,
so this patch moves all of them into message/, where RPC is implemented.
Theoretically, we may want in the future to use this cluster of classes
for some other component, but even then, we shouldn't just have these
files individually in utils/ - these are not useful stand-alone
utilities. One cannot use "shared_dict.hh" assuming it is some sort of
general-purpose shared hash table or something - it is completely
specific to compression and zstd, and specifically to its use in those
other classes.
Beyond moving these 8 files, this patch also contains changes to:
1. Fix includes to the 5 moved header files (.hh).
2. Fix configure.py, utils/CMakeLists.txt and message/CMakeLists.txt
for the three moved source files (.cc).
3. In the moved files, change from the "utils::" namespace, to the
"netw::" namespace used by RPC. Also needed to change a bunch
of callers for the new namespace. Also, had to add "utils::"
explicitly in several places which previously assumed the
current namespace is "utils::".
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#25149
This PR migrates limits tests from dtest to this repository.
One reason is that there is an ongoing effort to migrate tests from dtest to here.
Debug logs are enabled on `test_max_cells` for `lsa-timing` logger, to have more information about memory reclaim operation times and memory chunk sizes. This will allow analysis of their value distributions, which can be helpful with debugging if the issue reoccurs.
Also, scylladb keeps sql files with metrics which, with some modifications, can be used to track metrics over time for some tests. This would show if there are pauses and spikes or the test performance is more or less consistent over time.
scylla-dtest PR that removes migrated tests:
[limits_test.py: remove tests already ported to scylladb repo #6232](https://github.com/scylladb/scylla-dtest/pull/6232)
Fixes #25097
This is a migration of existing tests to this repository. No need for backport.
Closes scylladb/scylladb#26077
* github.com:scylladb/scylladb:
test: dtest: limits_test.py: test_max_cells log level
test: dtest: limits_test.py: make the tests work
test: dtest: test_limits.py: remove test that are not being migrated
test: dtest: copy unmodified limits_test.py
Corrected spelling mistakes, typos, and minor wording issues to improve
the developer documentation.
No backport: There is no functional change, and the doc is mostly
relevant to master, so it doesn't need to be backported.
Closes scylladb/scylladb#26332
The doc links in scylla-sstable help output are static, so they always
point to the documentation of the latest stable release, not to the
documentation of the release the tool binary is from. On top of that,
the links point to old open-source documentation, which is now EOL.
Fix both problems: point link at the new source-available documentation
pages and make them version aware.
The test cases in the file aren't run via an existing interface like
`do_with_cql_env`, but they rely on a more direct approach -- calling
one of the schema loader tools. Because of that, they manage the
`db::config` object on their own and don't enable the configuration
option `rf_rack_valid_keyspaces`.
That hasn't been a problem so far since the test doesn't attempt to
create RF-rack-invalid keyspaces anyway. However, in an upcoming commit,
we're going to further restrict views with tablets and require that the
option is enabled.
To prepare for that, we enable the option in all test cases. It's only
necessary in a small subset of them, but it won't hurt to enforce it
everywhere, so let's do that.
Refs scylladb/scylladb#23958
We add a named requirement, a function, for materialized views with tablets.
It decides whether we can create views and secondary indexes in a given
keyspace. It's a stepping stone towards modifying the requirements for it.
This way, we keep the code in one place, so it's not possible to forget
to modify it somewhere. It also makes it more organized and concise.
Set `lsa-timing` logger log level to `debug`. This will help with
the analysis of the whole spectrum of memory reclaim operation
times and memory sizes.
Refs #25097
Copy limits_test.py from scylla-dtest to test/cluster/dtest/limits_test.py.
Add license header.
Disable it for `debug`, `dev`, and `release` mode.
Refs #25097
The query namespace is used for symbols which span the coordinator and
replica, or that are mostly coordinator side. The querier is mainly in
this namespace due to its similar name, but this is a mistake which
confuses people. Now that the code was moved to replica/, also fix the
namespace to be namespace replica.
The querier object is a confusing one. Based on its name it should be in
the query/ module and it is already in the query namespace. But this is
actually a completely replica-side logic, implementing the caching of
the readers on the replica. Move it to the replica module to make this
more clear.
Delaying the BTI encoding of partition keys is a good idea,
because most of the time they don't have to be encoded.
Usually the token alone is enough for indexing purposes.
But for the translation of the `partition_key` part itself,
there's no good reason to make it lazy,
especially after we made the translation of clustering keys
eager in a previous commit. Let's get rid of the `std::generator`
and convert all cells of the partition key in one go.
Applying lazy evaluation to the BTI encoding of clustering keys
was probably a bad default.
The benefits are dubious (because it's quite likely that the laziness
won't allow us to avoid that much work), but the overhead needed to
implement the laziness is large and immediate.
In this patch we get rid of the laziness.
We rewrite lazy_comparable_bytes_from_clustering_position
so that it performs the translation eagerly,
all components to a single bytes_ostream.
Note: the name *lazy*_comparable_bytes_from_clustering_position
stays, because the interface is still lazy.
perf_bti_key_translation:
Before:
test iterations median mad min max allocs tasks inst cycles
lcb_mismatch_test.lcb_mismatch 9233 109.930us 0.000ns 109.930us 109.930us 4356.000 0.000 2615394.3 614709.6
After:
test iterations median mad min max allocs tasks inst cycles
lcb_mismatch_test.lcb_mismatch 50952 19.487us 0.000ns 19.487us 19.487us 198.000 0.000 603120.1 109042.9
Add a function which converts compound types (keys and key prefixes)
to BTI encoding.
It's almost the same as the existing `lazy_comparable_bytes_from_compound`
(in bti_key_translation.cc), except it eagerly serializes key components
to a bytes_ostream instead of lazily yielding them from a generator.
We will remove `lazy_comparable_bytes_from_compound` in a later commit.
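The lazy-versus-eager distinction can be illustrated in miniature. The component encoding below is a placeholder (4-byte big-endian integers), NOT the real BTI byte-comparable format, and the function names are made up for the sketch.

```python
import io


def encode_component(value: int) -> bytes:
    # Placeholder encoding, not the real BTI format.
    return value.to_bytes(4, "big", signed=True)


def lazy_encode(components):
    """Lazy variant: yields one encoded fragment at a time from a generator."""
    for c in components:
        yield encode_component(c)


def eager_encode(components) -> bytes:
    """Eager variant: writes every component into a single buffer in one call."""
    out = io.BytesIO()
    for c in components:
        out.write(encode_component(c))
    return out.getvalue()


key = [1, 2, 3]
print(b"".join(lazy_encode(key)) == eager_encode(key))
```

Both variants produce the same bytes; the eager one simply avoids the per-fragment generator machinery.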
pass an appropriate query state for auth queries called from service
level cache reload. we use the function qos_query_state to select a
query_state based on caller context - for internal queries, we set a
very long timeout.
the service level cache reload is called from group0 reload. we want it
to have a long timeout instead of the default 5 seconds for auth
queries, because we don't have strict latency requirement on the one
hand, and on the other hand a timeout exception is undesired in the
group0 reload logic and can break group0 on the node.
Fixes scylladb/scylladb#25290
add a query_state parameter to several auth functions that execute
internal queries. currently the queries use the
internal_distributed_query_state() query state, and we maintain this as
default, but we want also to be able to pass a query state from the
caller.
in particular, the auth queries currently use a timeout of 5 seconds,
and we will want to set a different timeout when executed in some
different context.
rewrite query_all_directly_granted to use execute_internal instead of
query_internal in a style that is more consistent with the rest of the
module.
This will also be useful for a later change because execute_internal
accepts an additional parameter of query_state.
2025-09-25 16:37:04 +02:00
599 changed files with 25682 additions and 8507 deletions
@@ -12,7 +12,7 @@ Please use the [issue tracker](https://github.com/scylladb/scylla/issues/) to re
## Contributing code to Scylla
Before you can contribute code to Scylla for the first time, you should sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send the signed form cla@scylladb.com. You can then submit your changes as patches to the [scylladb-dev mailing list](https://groups.google.com/forum/#!forum/scylladb-dev) or as a pull request to the [Scylla project on github](https://github.com/scylladb/scylla).
Before you can contribute code to Scylla for the first time, you should sign the [Contributor License Agreement](https://www.scylladb.com/open-source/contributor-agreement/) and send the signed form to cla@scylladb.com. You can then submit your changes as patches to the [scylladb-dev mailing list](https://groups.google.com/forum/#!forum/scylladb-dev) or as a pull request to the [Scylla project on github](https://github.com/scylladb/scylla).
If you need help formatting or sending patches, [check out these instructions](https://github.com/scylladb/scylla/wiki/Formatting-and-sending-patches).
The Scylla C++ source code uses the [Seastar coding style](https://github.com/scylladb/seastar/blob/master/coding-style.md) so please adhere to that in your patches. Note that Scylla code is written with `using namespace seastar`, so should not explicitly add the `seastar::` prefix to Seastar symbols. You will usually not need to add `using namespace seastar` to new source files, because most Scylla header files have `#include "seastarx.hh"`, which does this.
and other sanity checks. It has no optimizations, which allows for debugging with tools like
@@ -361,7 +361,7 @@ avoid that the gold linker can be told to create an index with
More info at https://gcc.gnu.org/wiki/DebugFission.
Both options can be enable by passing `--split-dwarf` to configure.py.
Both options can be enabled by passing `--split-dwarf` to configure.py.
Note that distcc is *not* compatible with it, but icecream
(https://github.com/icecc/icecream) is.
@@ -370,7 +370,7 @@ Note that distcc is *not* compatible with it, but icecream
Sometimes Scylla development is closely tied with a feature being developed in Seastar. It can be useful to compile Scylla with a particular check-out of Seastar.
One way to do this it to create a local remote for the Seastar submodule in the Scylla repository:
One way to do this is to create a local remote for the Seastar submodule in the Scylla repository:
slogger.warn("alternator_warn_authorization=true: {} for user {}, client address {}", e.what(), user, client_address);
}
throw std::move(e);
} else {
if (warn_authorization) {
slogger.warn("If you set alternator_enforce_authorization=true the following will be enforced: {} for user {}, client address {}", e.what(), user, client_address);
seastar::metrics::make_histogram("batch_item_count_histogram", seastar::metrics::description("Histogram of the number of items in a batch request"), labels,
seastar::metrics::description("Counts number of misses of cached expressions"), labels)(expression_label("ProjectionExpression")).aggregate(aggregate_labels).set_skip_when_empty()
});
// Only register the following metrics for the global metrics, not per-table
co_return api_error::unknown_operation("UpdateTimeToLive not yet supported. Experimental support is available if the 'alternator-ttl' experimental feature is enabled on all nodes.");
"summary":"This method returns the N endpoints that are responsible for storing the specified key i.e for replication. the endpoint responsible for this key",
"type":"array",
"items":{
"type":"string"
},
"nickname":"get_natural_endpoints_v2",
"produces":[
"application/json"
],
"parameters":[
{
"name":"keyspace",
"description":"The keyspace to query about.",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"path"
},
{
"name":"cf",
"description":"Column family name.",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"key_component",
"description":"Each component of the key for which we need to find the endpoint (e.g. ?key_component=part1&key_component=part2).",
"description":"Load the sstables and stream to the primary replica node within the scope, if one is specified. If not, stream to the global primary replica.",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
@@ -984,7 +1055,7 @@
]
},
{
"path":"/storage_service/cleanup_all",
"path":"/storage_service/cleanup_all/",
"operations":[
{
"method":"POST",
@@ -994,6 +1065,30 @@
"produces":[
"application/json"
],
"parameters":[
{
"name":"global",
"description":"true if cleanup of entire cluster is requested",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/mark_node_as_clean",
"operations":[
{
"method":"POST",
"summary":"Mark the node as clean. After that the node will not be considered as needing cleanup during automatic cleanup which is triggered by some topology operations",
"type":"void",
"nickname":"reset_cleanup_needed",
"produces":[
"application/json"
],
"parameters":[]
}
]
@@ -1100,6 +1195,14 @@
"allowMultiple":false,
"type":"string",
"paramType":"query"
},
{
"name":"drop_unfixable_sstables",
"description":"When set to true, drop unfixable sstables. Applies only to scrub mode SEGREGATE.",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
@@ -1519,6 +1622,30 @@
}
]
},
{
"path":"/storage_service/exclude_node",
"operations":[
{
"method":"POST",
"summary":"Marks the node as permanently down (excluded).",
"type":"void",
"nickname":"exclude_node",
"produces":[
"application/json"
],
"parameters":[
{
"name":"hosts",
"description":"Comma-separated list of host ids to exclude",
"required":true,
"allowMultiple":false,
"type":"string",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/removal_status",
"operations":[
@@ -2924,7 +3051,7 @@
},
{
"name":"incremental_mode",
"description":"Set the incremental repair mode. Can be 'disabled', 'regular', or 'full'. 'regular': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to regular.",
"description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to incremental mode.",
"description":"Set to \"true\" to flush all memtables and force tombstone garbage collection to check only the sstables being compacted (false by default). The memtable, commitlog and other uncompacted sstables will not be checked during tombstone garbage collection.",
throw exceptions::invalid_request_exception(format("Cannot create CDC log for a table {}.{}, because the keyspace uses tablets, and not all nodes support the CDC with tablets feature.",
throw exceptions::invalid_request_exception(format("Cannot add column {} because a column with the same name was dropped too recently. Please retry after {} seconds",
utils::get_local_injector().inject("rest_api_keyspace_scrub_abort",[]{throw compaction_aborted_exception("","","scrub compaction found invalid data");});
exceptions::invalid_request_exception("'replication_factor' tag is not allowed when executing ALTER KEYSPACE with tablets, please list the DCs explicitly"));
throw exceptions::configuration_exception(sstring("Missing sub-option '")+compression_parameters::SSTABLE_COMPRESSION+"' for the '"+KW_COMPRESSION+"' option.");
throw exceptions::invalid_request_exception(format("index names shouldn't be more than {:d} characters long (got \"{}\")",schema::NAME_LENGTH,_index_name.c_str()));
throw exceptions::invalid_request_exception(format("Cannot use the 'counter' type for table {}.{}: Counters are not yet supported with tablets",keyspace(),cf_name));
// Expand: the user switched from another strategy that specified a 'replication_factor'
// and didn't provide any additional options.
rf=it->second;
rf=std::get<sstring>(it->second);
}
}
if(rf&&uses_tablets&&is_alter){
throw exceptions::invalid_request_exception("'replication_factor' tag is not allowed when executing ALTER KEYSPACE with tablets, please list the DCs explicitly");
throw exceptions::syntax_exception(seastar::format("Invalid map value '{}' for key '{}'. It should be a simple string.",std::get<list_type>(x.second),x.first));
co_await coroutine::return_exception(exceptions::invalid_request_exception(fmt::format("Use of ANN OF in an ORDER BY clause requires a LIMIT that is not greater than {}. LIMIT was {}",max_ann_query_limit,limit)));