This change introduces a targeted test that simulates the gossiper race
condition observed during node decommissioning. The test delays gossip
state application and host ID lookup to reliably reproduce the scenario
where `gossiper::get_host_id()` is called on a removed endpoint,
potentially triggering an abort in `apply_new_states`.
There is a specific error injection added to widen the race window, in
order to increase the likelihood of hitting the race condition. The
error injection is designed to delay the application of gossip state
updates, for the specific node that is being decommissioned. This should
then result in the server abort in the gossiper.
Refs: scylladb/scylladb#25621Fixes: scylladb/scylladb#25721
Backport: The test is primarily for an issue found in 2025.1, so it
needs to be backported to all the 2025.x branches.
Closesscylladb/scylladb#25685
Write requests cannot be safely retried if some replicas respond with
accepts and others with rejects. In this case, the coordinator is
uncertain about the outcome of the LWT: a subsequent LWT may either
complete the Paxos round (if a quorum observed the accept) or overwrite it
(if a quorum did not). If the original LWT was actually completed by
later rounds and the coordinator retried it, the write could be applied
twice, potentially overwriting effects of other LWTs that slipped in
between. Read requests do not have this problem, so they
can be safely retried.
Before this commit, handler->accept_proposal was called with
timeout_if_partially_accepted := true. This caused both read and write
requests to throw an "uncertainty" timeout to the user in the case
of the contention described above. After this commit, we throw an
"uncertainty" timeout only for write requests, while read requests
are instead retried in the loop in sp::cas.
Closesscylladb/scylladb#25602
Trying to run the test with more than one shard results in a failure
when generating sharding metadata:
```
ERROR 2025-08-27 16:00:17,551 [shard 0:main] table - Memtable flush failed due to: std::runtime_error (Failed to generate sharding metadata for /tmp/scylla-c9fa42fe/ks/cf-2938a030834e11f0a561ffa33feb022d/me-3gt6_12wh_1gifk2ijgeu1ovc1m5-big-Data.db). Aborting
```
Let's require that the test be run with a single shard.
Closesscylladb/scylladb#25703
Currently, run will execute twice pytest without modifying the path of the
JUnit XML report. This leads that the second execution of the pytest
will override the report. This PR fixing this issue so both reports will
be stored.
Closesscylladb/scylladb#25726
Following up on 6129411a5e
improve test_vnode_keyspace_describe_ring be verifying that the
endpoints listed by describe_ring match those returned by the
`natural_endpoints` api (for random tokens).
The latter are calculated using an independent code path
directly from the effective_replication_map.
* test exists currently only on master, no backport required
Closesscylladb/scylladb#25610
* github.com:scylladb/scylladb:
test/cluster/test_repair: test_vnode_keyspace_describe_ring: verify that describe_ring results agree with natural_endpoints
test/pylib/rest_client: add natural_endpoints function
This PR builds on the byte comparable support introduced in #23541 to add byte comparable support for all the collection types.
This implementation adheres to the byte-comparable format specification in https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md
Refs https://github.com/scylladb/scylladb/issues/19407
New feature - backport not required.
Closesscylladb/scylladb#25603
* github.com:scylladb/scylladb:
types/comparable_bytes: add compatibility testcases for collection types
types/comparable_bytes: update compatibility testcase to support collection types
types/comparable_bytes: support empty type
types/comparable_bytes: support reversed types
types/comparable_bytes: support vector cql3 type
types/comparable_bytes: support tuple and UDT cql3 type
types/comparable_bytes: support map cql3 type
types/comparable_bytes: support set and list cql3 types
types/comparable_bytes: introduce encode/decode_component
types/comparable_bytes: introduce to_comparable_bytes/from_comparable_bytes
This patch fixes an error-path bug in the base-64 decoding code in
utils/base64.cc, which among other things is used in Alternator to decode
blobs in JSON requests.
The base-64 decoding code has a lookup table, which was wrongly sized 255
bytes, but needed to be 256 bytes. This meant that if the byte 255 (0xFF)
was included in an invalid base-64 string, instead of detecting that this
is an invalid byte (since the only valid bytes in a base-64 string are
A-Z,a-z,0-9,+,/ and =), the code would either think it's valid with a
nonsense 6-bit part, or even crash on an out-of-bounds read.
Besides the trivial fix, this patch also includes a reproducing test,
which tries to write a blob as a supposedly base-64 encoded string with
a 0xFF byte in it. The test fails before this patch (the write succeeds,
unexpectedly), and passes after this patch (the write fails as
expected). The test also passes on DynamoDB.
Fixes#25701
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#25705
This commit adds missing fields to GetRecords responses: `awsRegion` and
`eventVersion`. We also considered changing `eventSource` from
`scylladb:alternator` to `aws:dynamodb` and setting `SizeBytes` subfield
inside the `dynamodb` field.
We set `awsRegion` to the datacenter's name of the node that received
the request. This is in line with the AWS documentation, except that
Scylla has no direct equivalent of a region, so we use the datacenter's
name, which is analogous to DynamoDB's concept of region.
The field `eventVersion` determines the structure of a Record. It is
updated whenever the structure changes. We think that adding a field
`userIdentity` bumped the version from `1.0` to `1.1`. Currently, Scylla
doesn't support this field (#11523), hence we use the older 1.0 version.
We have decided to leave `eventSource` as is, since it's easy to modify
it in case of problems to `aws:dynamodb` used by DynamoDB.
Not setting `SizeBytes` subfield inside the `dynamodb` field was
dictated by the lack of apparent use cases. The documentation is unclear
about how `SizeBytes` is calculated and after experimenting a little
bit, I haven't found an obvious pattern.
Fixes: #6931Closesscylladb/scylladb#24903
When a scaling out is delayed or fails, it is crucial to ensure that clusters remain operational
and recoverable even under extreme conditions. To achieve this, the following proactive measures
are implemented:
- reject writes
- includes: inserts, updates, deletes, counter updates, hints, read+repair and lwt writes
- applicable to: user tables, views, CDC log, audit, cql tracing
- stop running compactions/repairs and prevent from starting new ones
- reject incoming tablet migrations
The aforementioned mechanisms are automatically enabled when node's disk utilization reaches
the critical level (default: 98%) and disabled when the utilization drop below the threshold.
Apart from that, the series add tests that require mounted volumes to simulate out of space.
The paths to the volumes can be provided using the a pytest argument, i.e. `--space-limited-dirs`.
When not provided, tests are skipped.
Test scenarios:
1. Start a cluster and write data until one of the nodes reaches 90% of the disk utilization
2. Perform an **operation** that would take the nodes over 100%
3. The nodes should not exceed the critical disk utilization (98% by default)
4. Scale out the cluster by adding one node per rack
5. Retry or wait for the **operation** from step 2
The **operation** is: writing data, running compactions, building materialized views, running repair,
migrating tablets (caused by RF change, decommission).
The test is successful, if no nodes run out of space, the **operation** from step 2 is
aborted/paused/timed out and the **operation** from step 5 is successful.
`perf-simple-query --smp 1 -m 1G` results obtained for fixed 400MHz frequency:
Read path (before)
```
instructions_per_op:
mean= 39661.51 standard-deviation=34.53
median= 39655.39 median-absolute-deviation=23.33
maximum=39708.71 minimum=39622.61
```
Read path (after)
```
instructions_per_op:
mean= 39691.68 standard-deviation=34.54
median= 39683.14 median-absolute-deviation=11.94
maximum=39749.32 minimum=39656.63
```
Write path (before):
```
instructions_per_op:
mean= 50942.86 standard-deviation=97.69
median= 50974.11 median-absolute-deviation=34.25
maximum=51019.23 minimum=50771.60
```
Write path (after):
```
instructions_per_op:
mean= 51000.15 standard-deviation=115.04
median= 51043.93 median-absolute-deviation=52.19
maximum=51065.81 minimum=50795.00
```
Fixes: https://github.com/scylladb/scylladb/issues/14067
Refs: https://github.com/scylladb/scylladb/issues/2871
No backport, as it is a new feature.
Closesscylladb/scylladb#23917
* github.com:scylladb/scylladb:
tests/cluster: Add new storage tests
test/scylla_cluster: Override workdir when passed via cmdline
streaming: Reject incoming migrations
storage_service: extend locator::load_stats to collect per-node critical disk utilization flag
repair_service: Add a facility to disable the service
compaction_manager: Subscribe to out of space controller
compaction_manager: Replace enabled/disabled states with running state
database: Add critical_disk_utilization mode database can be moved to
disk_space_monitor: add subscription API for threshold-based disk space monitoring
docs: Add feature documentation
config: Add critical_disk_utilization_level option
replica/exceptions: Add a new custom replica exception
Fixes#25709
If we have large allocations, spanning more than one segment, and
the internal segment references from lead to secondary are the
only thing keeping a segment alive, the implicit drop in
discard_unused_segments and orphan_all can cause a recursive call
to discard_unused_segments, which in turn can lead to vector
corruption/crash, or even double free of segment (iterator confusion).
Need to separate the modification of the vector (_segments) from
actual releasing of objects. Using temporaries is the easiest
solution.
To further reduce recursion, we can also do an early clear of
segment dependencies in callbacks from segment release (cf release).
Closesscylladb/scylladb#25719
This patch introduces `view_building_coordinator`, a single entity within whole cluster responsible for building tablet-based views.
The view building coordinator takes slightly different approach than the existing node-local view builder. The whole process is split into smaller view building tasks, one per each tablet replica of the base table.
The coordinator builds one base table at a time and it can choose another when all views of currently processing base table are built.
The tasks are started by setting `STARTED` state and they are executed by node-local view building worker. The tasks are scheduled in a way, that each shard processes only one tablet at a time (multiple tasks can be started for a shard on a node because a table can have multiple views but then all tasks have the same base table and tablet (last_token)). Once the coordinator starts the tasks, it sends `work_on_view_building_tasks` RPC to start the tasks and receive their results.
This RPC is resilient to RPC failure or raft leader change, meaning if one RPC call started a batch of tasks but then failed (for instance the raft leader was changed and caller aborted waiting for the response), next RPC call will attach itself to the already started batch.
The coordinator plugs into handling tablet operations (migration/resize/RF change) and adjusts its tasks accordingly. At the start of each tablet operation, the coordinator aborts necessary view building tasks to prevent https://github.com/scylladb/scylladb/issues/21564. Then, new adjusted tasks are created at the end of the operation.
If the operation fails at any moment, aborted tasks are rollback.
The view building coordinator can also handle staging sstables using process_staging view building tasks. We do this because we don't want to start generating view updates from a staging sstable prematurely, before the writes are directed to the new replica (https://github.com/scylladb/scylladb/issues/19149).
For detailed description check: `docs/dev/view-building-coordinator.md`
Fixes https://github.com/scylladb/scylladb/issues/22288
Fixes https://github.com/scylladb/scylladb/issues/19149
Fixes https://github.com/scylladb/scylladb/issues/21564
Fixes https://github.com/scylladb/scylladb/issues/17603
Fixes https://github.com/scylladb/scylladb/issues/22586
Fixes https://github.com/scylladb/scylladb/issues/18826
Fixes https://github.com/scylladb/scylladb/issues/23930
---
This PR is reimplementation of https://github.com/scylladb/scylladb/pull/21942Closesscylladb/scylladb#23760
* github.com:scylladb/scylladb:
test/cluster: add view build status tests
test/cluster: add view building coordinator tests
utils/error_injection: allow to abort `injection_handler::wait_for_message()`
test: adjust existing tests
utils/error_injection: add injection with `sleep_abortable()`
db/view/view_builder: ignore `no_such_keyspace` exception
docs/dev: add view building coordinator documentation
db/view/view_building_worker: work on `process_staging` tasks
db/view/view_building_worker: register staging sstable to view building coordinator when needed
db/view/view_building_worker: discover staging sstables
db/view/view_building_worker: add method to register staging sstable
db/view/view_update_generator: add method to process staging sstables instantly
db/view/view_update_generator: extract generating updates from staging sstables to a method
db/view/view_update_generator: ignore tablet-based sstables
db/view/view_building_coordinator: update view build status on node join/left
db/view/view_building_coordinator: handle tablet operations
db/view: add view building task mutation builder
service/topology_coordinator: run view building coordinator
db/view: introduce `view_building_coordinator`
db/view/view_building_worker: update built views locally
db/view: introduce `view_building_worker`
db/view: extract common view building functionalities
db/view: prepare to create abstract `view_consumer`
message/messaging_service: add `work_on_view_building_tasks` RPC
service/topology_coordinator: make `term_changed_error` public
db/schema_tables: create/cleanup tasks when an index is created/dropped
service/migration_manager: cleanup view building state on drop keyspace
service/migration_manager: cleanup view building state on drop view
service/migration_manager: create view building tasks on create view
test/boost: enable proxy remote in some tests
service/migration_manager: pass `storage_proxy` to `prepare_keyspace_drop_announcement()`
service/migration_manager: coroutinize `prepare_new_view_announcement()`
service/storage_proxy: expose references to `system_keyspace` and `view_building_state_machine`
service: reload `view_building_state_machine` on group0 apply()
service/vb_coordinator: add currently processing base
db/system_keyspace: move `get_scylla_local_mutation()` up
db/system_keyspace: add `view_building_tasks` table
db/view: add view_building_state and views_state
db/system_keyspace: add method to get view build status map
db/view: extract `system.view_build_status_v2` cql statements to system_keyspace
db/system_keyspace: move `internal_system_query_state()` function earlier
db/view: ignore tablet-based views in `view_builder`
gms/feature_service: add VIEW_BUILDING_COORDINATOR feature
The storage submodule contains tests that require mounted volumes
to be executed. The volumes are created automatically with the
`volumes_factory` fixture.
The tests in this suite are executed with the custom launcher
`unshare -mr pytest`
Test scenarios (when one node reaches critical disk utilization):
1. Reject user table writes
2. Disable/Enabled compaction
3. Reject split compactions
4. New split compactions not triggered
5. Abort tablet repair
6. Disable/Enabled incoming tablet migrations
7. Restart a node while a tablet split is triggered
Currently, workdir is set in ScyllaCluster constructor and it does
not take into accout that the value could be overridden via cmdline
arguments. When this happens, then some data (logs, configs) are
stored under one path and other (data) is stored under a different.
The patch allows overriding the value when passed via cmdline arguments
leading to all files being stored under the same path.
When database operates in the critical disk utilization mode, all
mutation writes including inserts, updates, deletes, counter updates,
hints, read+repair, lwt writes) to user tables and other associated
with them tables like views, CDC log, audit are rejected, with a clear
error exception returned.
The mode is meant to be used with the disk space monitor in order
to prevent any user writes when node's disk utilization is too high.
This patch adds compatibility testcases for the following cql3 types :
set, list, map, tuple, vector and reversed types.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The `abstract_type::from_string()` method used to parse the input data
doesn't support collections yet. So the collection testdata will be
passed as JSON strings to the testcase. This patch updates the testcase
to adapt to this workaround.
Also, extended the testcase to verify that Scylla's implementation can
successfully decode the byte comparable output encoded by Cassandra.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
A reversed type is first encoded using the underlying type and then all
the bits are flipped to ensure that the lexicographical sort order is
reversed. During decode, the bytes are flipped first and then decoded
using the underlying type.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The CQL vector type encoding is similar to the lists, where each element
is transformed into a byte-comparable format and prefixed with a
component marker. The sequence is terminated with a terminator marker to
indicate the end of the collection.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The CQL tuple and UDT types share the same internal implementation and
therefore use the same byte comparable encoding. The encoding is similar
to lists, where each element is transformed into a byte-comparable
format and prefixed with a component marker. The sequence is terminated
with a terminator marker to indicate the end of the collection.
TODO: Add duplicate test items to maps, lists and sets
For maps, add more entries that share keys
ex map1 : key1 : value1, key2 : value2
map2 : key1 : value4
map3 : key2 : value5 etc
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The CQL map type is encoded as a sequence of key-value pairs. Each key
and each value is individually prefixed with a component marker, and the
sequence is terminated with a terminator marker to indicate the end of
the collection.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The CQL set and list types are encoded as a sequence of elements, where
each element is transformed into a byte-comparable format and prefixed
with a component marker. The sequence is terminated with a terminator
marker to indicate the end of the collection.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The components of a collection, such as an element from a list, set, or
vector; a key or value from a map; or a field from a tuple, share the
same encode and decode logic. During encode, the component is transformed
into the byte comparable format and is prefixed with the `NEXT_COMPONENT`
marker. During decode, the component is transformed back into its
serialized form and is prefixed with the serialized size.
A null component is encoded as a single `NEXT_COMPONENT_NULL` marker and
during decode, a `-1` is written to the serialized output.
This commit introduces few helper methods that implement the above
mentioned encode and decode logics.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Introduce the `subscribe` method to disk_space_monitor, allowing clients to
register callbacks triggered when disk utilization crosses a configurable
threshold.
The API supports flexible trigger options, including notifications on threshold
crossing and direction (above/below). This enables more granular and efficient
disk space monitoring for consumers.
Move management over effective service levels from `service_level_controller`
to a new dedicated type -- `auth_integration`.
Before these changes, it was possible for the service level controller to try
to access `auth::service` after it was deinitialized. For instance, it could
happen when reloading the cache. That HAS happened as described in the following
issue: scylladb/scylladb#24792.
Although the problem might have been mitigated or even resolved in
scylladb/scylladb@10214e13bd, it's not clear
how the service will be used in the future. It's best to prevent similar bugs
than trying to fix them later on.
The logic responsible for preventing to access an uninitialized `auth::service`
was also either non-existent, complex, or non-sufficient.
To prevent accessing `auth::service` by the service level controller, we extract
the relevant portion of the code to a separate entity -- `auth_integration`.
It's an internal helper type whose sole purpose is to manage effective service
levels.
Thanks to that, we were able to nest the lifetime of `auth_integration` within
the lifetime of `auth::service`. It's now impossible to attempt to dereference
it while it's uninitialized.
If a bug related to an invalid access is spotted again, though, it might also
be easier to debug it now.
There should be no visible change to the users of the interface of the service
level controller. We strived to make the patch minimal, and the only affected
part of the logic should be related to how `auth::service` is accessed.
The relevant portion of the initialization and deinitialization flow:
(a) Before the changes:
1. Initialize `service_level_controller`. Pass a reference to an uninitialized
`auth::service` to it.
2. Initialize other services.
3. Initialize and start `auth::service`.
4. (work)
5. Stop and deinitialize `auth::service`.
6. Deinitialize other services.
7. Deinitialize `service_level_controller`.
(b) After the changes:
1. Initialize `service_level_controller`. Pass a reference to an uninitialized
`auth::service` to it. (*)
2. Initialize other services.
3. Initialize and start `auth::service`.
4. Initialize `auth_integration`. Register it in `service_level_controller`.
5. (work)
6. Unregister `auth_integration` in `service_level_controller` and deinitialize
it.
7. Stop and deinitialize `auth::service`.
8. Deinitialize other services.
9. Deinitialize `service_level_controller`.
(*):
The reference to `auth::service` in `service_level_controller` is still
necessary. We need to access the service when dropping a distributed
service level.
Although it would be best to cut that link between the service level
controller and `auth::service` too, effectively separating the entities,
it would require more work, so we leave it as-is for now.
It shouldn't prove problematic as far as accessing an uninitialized service
goes. Trying to drop a service level at the point when we're de-initializing
auth should be impossible.
For more context, see the function `drop_distributed_service_level` in
`service_level_controller`.
A trivial test has been included in the PR. Although its value is questionable
as we only try to reload the service level cache at a specific moment, it's
probably the best we can deliver to provide a reproducer of the issue this patch
is resolving.
Fixesscylladb/scylladb#24792
Backport: The impact of the bug was minimal as it only affected the shutdown.
However, since CI is failing because of it, let's backport the change to all
supported versions.
Closesscylladb/scylladb#25478
* github.com:scylladb/scylladb:
service/qos: Move effective SL cache to auth_integration
service/qos: Add auth::service to auth_integration
service/qos: Reload effective SL cache conditionally
service/qos: Add gate to auth_integration
service/qos: Introduce auth_integration
- Disable tablets in `test_migration_on_existing_raft_topology`.
Because views on tablets are experimental now, we can safely
assume that view building coordinator will start with view build status
on raft.
- Add error injection to pause view building on worker.
Used to pause view building process, there is analogous error injection
in view_builder.
- Do a read barrier in `test_view_in_system_tables`
Increases test stability by making sure that the node sees up-to-date
group0 state and `system.built_views` is synced.
- Wait for view is build in some tests
Increases tests stability by making sure that the view is built.
- Remove xfail marker from `test_tablet_streaming_with_unbuilt_view`
This series fix https://github.com/scylladb/scylladb/issues/21564
and this test should work now.
Change return type of `check_needs_view_update_path()`. Instead of
retrning bool which tells whether to use staging directory (and register
to `view_update_generator`) or use normal directory.
Now the function returns enum with possible values:
- `normal_directory` - use normal directory for the sstable
- `staging_directly_to_generator` - use staging directory and register
to `view_update_generator`
- `staging_managed_by_vbc` - use staging directory but don't register it
to `view_update_generator` but create view building tasks for
later
The third option is new, it's used when the table has any view which is
in building process currrently. In this case, registering it to `view_update_generator`
prematurely may lead to base-view inconsistency
(for example when a replica is in a pending state).
After a few next patches, creating/dropping a view in tablet keyspace
will require a remote proxy to obtain references to system keyspace
and view building state.
Because of this, remote proxy needs to be explicitly enabled in boost
tests which create views.
The state may be also reloaded on `topology_change` or `mixed_change`
because topology coordinator may change view building tasks during
tablet operations.
When a vector index is created in Scylla, it is initially built using a full scan of the database. After that, it stays up to date by tracking changes through CDC, which should be automatically enabled when the vector index is created.
When a user attempts to enable Vector Search (VS), the system checks whether Change Data Capture (CDC) is enabled and properly configured:
1. CDC is not enabled
- CDC is automatically enabled with the minimum required TTL (Time-to-Live) for VS (24 hours) and the delta mode set to 'full' or post-image is enabled.
- If the user later tries to reduce the CDC TTL below 24 hours or set delta mode to 'keys' with post-image disabled, the action fails.
- Error message: Clearly states that CDC TTL must be at least 24 hours and delta mode must be set to 'full' or post-image must be enabled for VS to function.
2. CDC is already enabled
- If CDC TTL is ≥ 24 hours and delta mode is set to 'full' or post-image is enabled: VS is enabled successfully.
- If CDC TTL is < 24 hours or delta mode is set to 'keys' with post-image disabled: The VS enabling process fails.
- Error message: Informs the user that CDC TTL must be at least 24 hours, delta mode must be set to 'full' or post-image must be enabled, and provides a link to documentation on how to update the TTL, delta mode, and post-image.
When a user attempts to disable CDC when VS is enabled, the action will fail and the user will be informed by error message that clearly states that VS needs to be disabled (vector indexes have to be dropped) first.
Full setup requirements and steps will be detailed in the documentation of Vector Search.
Co-authored-by: @smoczy123
Fixes: VECTOR-27
Fixes: VECTOR-25
Closesscylladb/scylladb#25179
* github.com:scylladb/scylladb:
test/cqlpy: ensure Vector Search CDC options
test/boost: adjust CDC boost tests for Vector Search
test/cql: add Vector Search CDC enable/disable test
cdc, vector_index: provide minimal option setup for Vector Search
test/cqlpy: adjust describe table tests with CDC for Vector Search
describe, cdc: adjust describe for cdc log tables
cdc: enable CDC log when vector index is created
test/cqlpy: run vector_index tests only on vnodes
vector_index: check if vector index exists in schema
The new service, `auth_integration`, has taken over the responsibility
over managing effective service levels from `service_level_controller`.
However, before these changes, it still accessed `auth::service` via
the service level controller. Let's change that.
Note that we also remove a check that `auth::service` has been
initialized. It's not necessary anymore because the lifetime of
`auth_integration` is strictly nested within the lifetime of `auth::service`.
In actuality, `service_level_controller` should lose its reference to
`auth::service` completely. All of the management over effective service
levels has already been moved to `auth_integration`. However, the
referernce is still needed when dropping a distributed service level
because we need to update the corresponding attribute for relevant
roles.
That should not lead to invalid accesses, though. Dropping a service level
should not be possible when `auth::service` is not initialized.
Since `service_level_controller` outlives `auth_integration`, it may
happen that we try to access it when it has already been deinitialized.
To prevent that, we only try to reload or clear the effective service
level cache when the object is still alive.
These changes solve an existing problem with an invalid memory access.
For more context, see issue scylladb/scylladb#24792.
We provide a reproducer test that consistently fails before these
changes but passes after them.
Fixesscylladb/scylladb#24792
We introduce a new type, `auth_integration`, that will be used internally
by `service_level_controller`. Its purpose is to take over the responsibility
over managing effective service levels.
The main problem of the current implementation of service level controller
is its dependency on `auth::service` whose lifetime is strictly nested
within the lifetime of service level controller. That may and already have
led to invalid memory accesses; for an example, see issue
scylladb/scylladb#24792.
Our strategy is to split service level controller into smaller parts and
ensure that we access `auth::service` only when it's valid to do so.
This commit is the first step towards that.
We don't change anything in the logic yet, just add the new type. Further
adjustments will be made in following commits.
This series adds support for a DynamoDB-compatible Write Capacity Unit (WCU) calculation in Alternator by introducing an optional forced read-before-write mechanism.
Alternator's model differs from DynamoDB, and as a result, some write operations may report lower WCU usage compared to what DynamoDB would report. While this is acceptable in many cases, there are scenarios where users may require accurate WCU reporting that aligns more closely with DynamoDB's behavior.
To address this, a new configuration option, alternator_force_read_before_write, is introduced. When enabled, Alternator will perform a read before executing PutItem, UpdateItem, and DeleteItem operations. This allows it to take the existing item size into account when computing the WCU. BatchWriteItem support is also extended to use this mechanism. Because BatchWriteItem does not support returning old items directly, several internal changes were made to support reading previous item sizes with minimal overhead. Reads are performed at consistency level LOCAL_ONE for efficiency, and the WCU calculation is now done in multiple stages to accurately account for item size differences.
In addition to the implementation changes, test coverage was added to validate the new behavior. These tests confirm that WCU is calculated based on the larger of the old and new items when read-before-write is active, including for BatchWriteItem.
This feature comes with performance overhead and is therefore disabled by default. It can be enabled at runtime via the system.config table and should be used only when precise WCU tracking is necessary.
**New feature, no need to backport**
Closesscylladb/scylladb#24436
* github.com:scylladb/scylladb:
alternator/test_returnconsumedcapacity.py: Test forced read before write
alternator/executor.cc: DynamoDB WCU calculation in BatchWriteItem using read-before-write
executor.cc: get_previous_item with consistency level
executor: Extend API of put_or_delete_item
alternator/executor.cc: Accurate WCU for put, update, delete
config: add alternator_force_read_before_write
Although RF-rack-valid keyspaces are not universally enforced
yet (they're governed by the configuration option
`rf_rack_valid_keyspaces`), we'd like to encourage the user to
abide by the restriction.
To that end, we're introducing a warning when creating or
altering a keyspace. If the configuration option is disabled,
but the user is trying to create an RF-rack-invalid keyspace,
they'll receive a warning.
If the option is turned off, we will also log all of the
RF-rack-invalid keyspaces at start-up.
We provide validation tests.
Fixesscylladb/scylladb#23330
Backport: we'd like to encourage the user to abide by the restriction
even when they don't enforce it to make it easier in the future to
adjust the schema when there's no way to disable it anymore. Because
of that, we'd like to backport it to all relevant versions, starting with 2025.1.
Closesscylladb/scylladb#24785
* github.com:scylladb/scylladb:
main: Log RF-rack-invalid keyspaces at startup
cql3/statements: Fix indentation
cql3: Warn when creating RF-rack-invalid keyspace
test_pinned_cl_segment_doesnt_resurrect_data was not moved in #24946 from
scylla-dtest to this repo, because it's marked as xfail (#14879), but
actually the issue is fixed and there is no reason to keep the test in
scylla-dtest.
Also remove unused imports.
Closesscylladb/scylladb#25592
Consider this:
1) merge finishes, wakes up fiber to merge compaction groups
2) drop table happens, which in turn invokes truncate underneath
3) merge fiber stops old groups
4) truncate disables compaction on all groups, but the ones stopped
5) truncate performs a check that compaction has been disabled on
all groups, including the ones stopped
6) the check fails because groups being stopped didn't have compaction
explicitly disabled on them
To fix it, the check on step 6 will ignore groups that have been
stopped, since those are not eligible for having compaction explicitly
disabled on them. The compaction check is there, so ongoing compaction
will not propagate data being truncated, but here it happens in the
context of drop table which doesn't leave anything behind. Also, a
group stopped is somewhat equivalent to compaction disabled on it,
since the procedure to stop a group stops all ongoing compaction
and eventually removes its state from compaction manager.
Fixes#25551.
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
Closesscylladb/scylladb#25563
After changing the type of the `recovery_leader` config option from
`sstring` to `UUID` in #25032, setting `recovery_leader` to an empty
string became an incorrect way to unset it. The following error started
to appear in the recovery procedure tests:
```
init - marshaling error: UUID string size mismatch: '' : recovery_leader
```
We unset `recovery_leader` properly in this PR. To do it, we introduce
a simple way to remove config options in tests.
Backport is unneeded. This error was harmless, and Scylla ignored
`recovery_leader` after logging the error as expected by the tests.
Closesscylladb/scylladb#25365
* github.com:scylladb/scylladb:
test: properly unset recovery_leader in the recovery procedure tests
test: manager_client: allow removing a config option
test: manager_client: add docstring to server_update_config
Some cluster tests use `cluster_con` when they need a different load
balancing policy or auth provider. However, no test uses a port
other than 9042 or enables SSL, but all tests must pass `9042, False`
because these parameters don't have default values. This makes the code
more verbose. Also, it's quite obvious that 9042 stands for port, but
it's not obvious what `False` is related to, so there is a need to check
the definition of `cluster_con` while reading any test that uses it.
No reason to backport, it's only a minor refactoring.
Closesscylladb/scylladb#25516
Before this change, `saslauthd_authenticator` prevented dropping
roles. The current documentation instructs users to `Ensure Scylla has
the same users and roles as listed in the LDAP directory`. Therefore,
ScyllaDB should allow dropping roles so administrators can remove
obsolete roles from both LDAP and ScyllaDB.
The code change is minimal — dropping a role is a no-op, similar to the
existing no-op implementations for successful `create` and `alter`
operations.
`saslauthd_authenticator_test` is updated to verify that dropping
a role doesn't throw anymore.
Fixes: scylladb/scylladb#25571Closesscylladb/scylladb#25574
When the configuration option `rf_rack_valid_keyspaces` is enabled and there
is an RF-rack-invalid keyspace, starting a node fails. However, when the
configuration option is disabled, but there still is a keyspace that violates
the condition, we'd like Scylla to print a warning informing the user about
the fact. That's what happens in this commit.
We provide a validation test.
Although RF-rack-valid keyspaces are not universally enforced
yet (they're governed by the configuration option
`rf_rack_valid_keyspaces`), we'd like to encourage the user to
abide by the restriction.
To that end, we're introducing a warning when creating or
altering a keyspace. If the configuration option is disabled,
but the user is trying to create an RF-rack-invalid keyspace,
they'll receive a warning.
We provide a validation test.
There is the stash item REPEATED_FILES for directory items which used to cut
recursion. But if multiple tests from one directory added to ./test.py
commandline this solution prevents handling non-first tests well because
it was already collected for the first one. Change behavior to not store
all repeated files in the stash but just files which are in the process
of repetition. Rename the stash item to REPEATING_FILES to reflect this
change.
Closesscylladb/scylladb#25611