Before this change, executing `DESCRIBE MATERIALIZED VIEW` on the underlying
materialized view of a secondary index would produce a `CREATE INDEX` statement.
It was not only confusing, but it also prevented the user from learning about
the definition of the view; the only way to do so was to query system tables.
We change that behavior and produce a `CREATE MATERIALIZED VIEW` statement
instead. The statement is printed as a comment to implicitly convey that
the user should not attempt to execute it to restore the view. A short comment
is provided to make it clearer.
Before this commit:
```
cqlsh> CREATE TABLE ks.t(p int PRIMARY KEY, v int);
cqlsh> CREATE INDEX i ON ks.t(v);
cqlsh> DESCRIBE MATERIALIZED VIEW ks.i;
CREATE INDEX i ON ks.t(v);
```
After this commit:
```
cqlsh> CREATE TABLE ks.t(p int PRIMARY KEY, v int);
cqlsh> CREATE INDEX i ON ks.t(v);
cqlsh> DESCRIBE MATERIALIZED VIEW ks.i;
/* Do NOT execute this statement! It's only for informational purposes.
This materialized view is the underlying materialized view of a secondary
index. It can be restored via restoring the index.
CREATE MATERIALIZED VIEW ks.i_index [...];
*/
```
Note that describing the base table has not been affected and still works
as follows:
```
cqlsh> CREATE TABLE ks.t(p int PRIMARY KEY, v int);
cqlsh> CREATE INDEX i ON ks.t(v);
cqlsh> DESCRIBE TABLE ks.t;
CREATE TABLE ks.t (
p int,
v int,
PRIMARY KEY (p)
) WITH bloom_filter_fp_chance = 0.01
AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
AND comment = ''
AND compaction = {'class': 'IncrementalCompactionStrategy'}
AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
AND crc_check_chance = 1
AND default_time_to_live = 0
AND gc_grace_seconds = 864000
AND max_index_interval = 2048
AND memtable_flush_period_in_ms = 0
AND min_index_interval = 128
AND speculative_retry = '99.0PERCENTILE'
AND tombstone_gc = {'mode': 'timeout', 'propagation_delay_in_seconds': '3600'};
CREATE INDEX i ON ks.t(v);
```
We also provide two reproducers of scylladb/scylladb#24610.
Fixes scylladb/scylladb#24610
Closes scylladb/scylladb#25697
Determine the progress of compaction tasks that have
children.
The progress of a compaction task is calculated using the default
get_progress method. If the expected_total_workload method is
implemented, the default progress is computed as:
(sum of child task progresses) / (expected total workload)
If expected_total_workload is not defined, progress is estimated based
on children progresses. However, in this case, the total progress may
increase over time as the task executes.
All compaction tasks, except for reshape tasks, implement the
expected_children_number method. To compute expected_total_workload,
iterate over all SSTables covered by the task and sum their sizes. Note
that expected_total_workload is just an approximation and the real workload
may differ if the SSTable set for the keyspace/table/compaction group changes.
Reshape tasks are an exception, as their scope is determined during
execution. Hence, for these tasks expected_total_workload isn't defined
and their progress (both total and completed) is determined based
on currently created children.
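For illustration, a minimal C++ sketch of the calculation described above; the names are hypothetical and do not reflect the actual task-manager API:
```
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <optional>
#include <vector>

struct task_progress {
    double completed = 0;
    double total = 0;
};

// Approximation of the workload: the sum of the sizes of the sstables covered
// by the task. The real workload may differ if the sstable set changes.
double expected_total_workload(const std::vector<uint64_t>& sstable_sizes) {
    return std::accumulate(sstable_sizes.begin(), sstable_sizes.end(), 0.0);
}

task_progress combined_progress(const std::vector<task_progress>& children,
                                std::optional<double> expected_total) {
    task_progress p;
    for (const auto& c : children) {
        p.completed += c.completed;
        p.total += c.total;
    }
    if (expected_total) {
        // Fixed denominator: the reported fraction stays monotonic.
        p.total = std::max(*expected_total, p.completed);
    }
    // Without expected_total (e.g. reshape), p.total grows as children are
    // created, so the reported total may increase while the task executes.
    return p;
}
```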
Fixes: https://github.com/scylladb/scylladb/issues/8392.
Fixes: https://github.com/scylladb/scylladb/issues/6406.
Fixes: https://github.com/scylladb/scylladb/issues/7845.
New feature, no backport needed
Closes scylladb/scylladb#15158
* github.com:scylladb/scylladb:
test: add compaction task progress test
compaction: set progress unit for compaction tasks
compaction: find expected workload for reshard tasks
compaction: find expected workload for global cleanup compaction tasks
compaction: find expected workload for global major compaction tasks
compaction: find expected workload for keyspace compaction tasks
compaction: find expected workload for shard compaction tasks
compaction: find expected workload for table compaction tasks
compaction: return empty progress when compaction_size isn't set
compaction: update compaction_data::compaction_size at once
tasks: do not check expected workload for done task
When creating a new keyspace, the replication factor must be stated.
For example:
`CREATE KEYSPACE ks WITH REPLICATION = { 'class': 'NetworkTopologyStrategy', 'replication_factor': 3 };`
This patch changes that as follows: if no replication factor is
specified, the default replication factor is used.
The default replication factor is equal to the number of racks that
are not arbiter-only, i.e. racks that have at least one non-arbiter node.
The following syntax is now valid:
`CREATE KEYSPACE ks WITH REPLICATION = { 'class': 'NetworkTopologyStrategy' };`
`CREATE KEYSPACE ks WITH REPLICATION = { };`
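A minimal sketch of the rule, assuming per-node rack and arbiter information is available; the names are illustrative, not the actual ScyllaDB code:
```
#include <set>
#include <string>
#include <vector>

struct node_info {
    std::string rack;
    bool arbiter_only;   // arbiter (zero-token) nodes don't count towards the default
};

// Default replication factor = number of racks with at least one non-arbiter node.
int default_replication_factor(const std::vector<node_info>& nodes) {
    std::set<std::string> racks_with_data_nodes;
    for (const auto& n : nodes) {
        if (!n.arbiter_only) {
            racks_with_data_nodes.insert(n.rack);
        }
    }
    return static_cast<int>(racks_with_data_nodes.size());
}
```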
Fixes #16028
Backport is not needed. This is an enhancement for future releases.
Closes scylladb/scylladb#25570
* github.com:scylladb/scylladb:
docs/cql: update documentation for default replication factor
test/cqlpy: add keyspace creation default replication factor tests
cql3: add default replication factor to `create_keyspace_statement`
Fixes #25683
Once a table drop is complete, there should be no reason to retain
truncation records for it, as any replay should skip mutations
anyway (no CF), and even if we somehow resurrect a dropped table,
this replay-resurrected data is the least of our problems anyway.
Adds a prune phase to the startup drop_truncation_rp_records run,
which skips updating and instead deletes records for non-existent
tables (which should also patch any existing servers with lingering data).
Also does an explicit delete of records on an actual table DROP, to
ensure we don't grow this table more than needed, even on nodes with
long uptimes.
Small unit test included.
Closes scylladb/scylladb#25699
In do_apply_state_locally, a race condition can occur if a task is
suspended at a preemption point while the node entry is not locked.
During this time, the host may be removed from _endpoint_state_map.
When the task resumes, this can lead to inserting an entry with an
empty host ID into the map, causing various errors, including a node
crash.
This change adds a check after locking the map entry: if a gossip ACK update
does not contain a host ID, we verify that an entry for that endpoint
still exists in the gossiper's _endpoint_state_map.
Fixes scylladb/scylladb#25702
Fixes scylladb/scylladb#25621
Ref scylladb/scylla-enterprise#5613
Closes scylladb/scylladb#25727
Adds infrastructure and client for interaction with GCP object storage services.
Note: this is just a client object usable for creating, listing, deleting and up/downloading of objects to/from said storage service. It makes no attempt at actually inserting it into the sstable storage flow. That can come later.
This PR breaks out GCP auth and some general REST call functionality into shared routines. Not all code is 100% reused, but at least some.
A test is added, though it could be more comprehensive (feel free to suggest a test vector).
Test can run in either local mock server mode (default), or against actual GCP.
See `test/boost/gcp_object_storage_test.cc` for explanation on the config environment vars.
The default is to run the test against a temporary docker daemon.
Closes scylladb/scylladb#24629
* github.com:scylladb/scylladb:
test::boost::gcp_object_storage_test: Initial unit tests for GCP obj storage
proc-utils: Re-export waiting types from seastar
proc-utils: Inherit environment from current process
utils::gcp::object_storage: Add client for GCP object storage
utils::http: Add optional external credentials to dns_connection_factory init
utils::rest: Break out request wrapper and send logic
encryption::gcp_host: Use shared gcp credentials + REST helpers
utils::gcp: Move/add gcp credentials management to shared file
utils::rest::client: Add formatter for seastar::http::reply
utils::rest::client: Add helper routines for simple REST calls
utils::http: Make shared system trust certificates public
Normally, when we create a table, MV, etc., we apply `cf_prop_defs` to the
schema builder via the function `cf_prop_defs::apply_to_builder`. Unfortunately,
that didn't happen when creating CDC log tables, and so we might have missed
some of the properties that would normally be set to some value, even if the
default one.
One particular example of that phenomenon was `tombstone_gc`. For better or
worse, it's not a "standalone property" of a table, but rather part of
`extensions`. [Somewhat related issue: scylladb/scylladb#9722]
That could, and in fact did, cause trouble. Consider this scenario:
1. A CDC log table is created.
2. The table does NOT have any value of `tombstone_gc` set.
3. The user edits the table via `ALTER TABLE`. That statement treats the log
table just like any other one (at least as far as the relevant portion of the
logic is concerned). Among other things, it uses
`cf_prop_defs::apply_to_builder`, and as a result, the `tombstone_gc`
property is set to some value:
* the default one if the user doesn't specify it in the statement,
* a custom one if they do.
Why is that a problem?
First of all, it's confusing. When we perform a schema backup and a table uses
CDC, we include an ALTER statement for its corresponding CDC log table (for more
context, see issue scylladb/scylladb#18467 or commit
scylladb/scylladb@f12edbdd95).
There are two consequences for the user here:
1. If the log table had NOT been altered ever since it was created, the
statement will miss the `tombstone_gc` property as if it couldn't be set for
it at all. That's confusing!
2. If the log table HAD in fact been altered after its creation, the statement
will include the `tombstone_gc` property. That's even more confusing (why was
it not present the first time, but it is now?).
The `tombstone_gc` property should always be set to avoid confusion and
problematic edge cases in tests and to simply be consistent with how other
schema entities work.
The solution we employ is that we always set the property to the default
value. That includes the case when we reattach the log table to the base;
consider the following scenario:
1. Create a table with CDC enabled.
2. Detach the log table by performing `ALTER TABLE ... WITH cdc = {'enabled': false}`.
3. Change the `tombstone_gc` property of the log table.
4. Reattach the log table to the base, analogously to step 2 (but with `'enabled': true`).
The expected result would be that the new value of `tombstone_gc` would be
preserved after reattaching the log table. However, that's not what will
happen. We decide to stay consistent with how other properties of a log
table behave, and we reset them after every reattachment. We might change that
in the future: see issue scylladb/scylladb#25523.
Two reproducer tests of scylladb/scylladb#25187 are included in the changes.
Backport: The problem is not critical, so it may not be necessary to backport the changes.
That's to be discussed.
Closes scylladb/scylladb#25521
* github.com:scylladb/scylladb:
cdc: Set tombstone_gc when creating log table
tombstone_gc: Add overload of get_default_tombstone_gc_mode
tombstone_gc: Rename get_default_tombstonesonte_gc_mode
Consider the following scenario:
- A tablet is migrated away from a shard
- The tablet cleanup stage closes the storage group's async_gate
- Dropping the table runs truncate, which attempts to disable compaction on the tablet whose gate is closed. This fails, because table::parallel_foreach_compaction_group() ultimately calls storage_group_manager::parallel_foreach_storage_group(), which will not disable compaction if it can't hold the storage group's gate
- Truncate calls table::discard_sstables(), which checks whether compaction has been disabled, and because it hasn't, it runs on_internal_error() with "compaction not disabled on table ks.cf during TRUNCATE", which causes a crash
Fixes: #25706
This needs to be backported to all supported versions with tablets
Closes scylladb/scylladb#25708
* github.com:scylladb/scylladb:
test: reproducer and test for drop with concurrent cleanup
truncate: check for closed storage group's gate in discard_sstables
Allows testing using either local mock server (installed or using docker),
or real GCP project (not tested as of writing this).
v2: Try podman if docker unavail
v3: Ensure we check log output on fake-gcs, because when using podman, the
published port will be connectible even though the actual server is not
up yet.
v4: Use ephemeral port forward in docker/podman to allow us to run parallel
instances. Also adjust credentials and port finding in test.
v5: Re-ensure no parallel tests for this: We seem to time out in podman
trying to fetch image for X parallel tests
v6: Remove the ephemeral port stuff. Because of course this does not work
with our podman-in-podman. Do brute-force port speculation instead.
v7: Up timeout for server start to allow docker pull.
v8: Fix string check error
v9: Add explicit docker image version
Executing a vector search (SELECT with ANN OF ordering) query with `TRACING ON` enabled
caused a node to crash due to a null pointer dereference.
This occurred because a vector index does not have an associated view
table, making its `_view_schema` member null. The implementation
attempted to enable tracing on this null view schema, leading to the
crash.
The fix adds a null check for `_view_schema` before attempting to
enable tracing on the view (index) table.
A regression test is included to prevent this from happening again.
Fixes: VECTOR-179
Closes scylladb/scylladb#25500
This change introduces a targeted test that simulates the gossiper race
condition observed during node decommissioning. The test delays gossip
state application and host ID lookup to reliably reproduce the scenario
where `gossiper::get_host_id()` is called on a removed endpoint,
potentially triggering an abort in `apply_new_states`.
There is a specific error injection added to widen the race window, in
order to increase the likelihood of hitting the race condition. The
error injection is designed to delay the application of gossip state
updates, for the specific node that is being decommissioned. This should
then result in the server abort in the gossiper.
Refs: scylladb/scylladb#25621
Fixes: scylladb/scylladb#25721
Backport: The test is primarily for an issue found in 2025.1, so it
needs to be backported to all the 2025.x branches.
Closes scylladb/scylladb#25685
Write requests cannot be safely retried if some replicas respond with
accepts and others with rejects. In this case, the coordinator is
uncertain about the outcome of the LWT: a subsequent LWT may either
complete the Paxos round (if a quorum observed the accept) or overwrite it
(if a quorum did not). If the original LWT was actually completed by
later rounds and the coordinator retried it, the write could be applied
twice, potentially overwriting effects of other LWTs that slipped in
between. Read requests do not have this problem, so they
can be safely retried.
Before this commit, handler->accept_proposal was called with
timeout_if_partially_accepted := true. This caused both read and write
requests to throw an "uncertainty" timeout to the user in the case
of the contention described above. After this commit, we throw an
"uncertainty" timeout only for write requests, while read requests
are instead retried in the loop in sp::cas.
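For illustration only, a hedged sketch of that decision (not the actual storage_proxy code; the names are hypothetical):
```
#include <stdexcept>

struct cas_uncertainty_timeout : std::runtime_error {
    using std::runtime_error::runtime_error;
};

enum class cas_kind { read, write };

// Called when a Paxos round was only partially accepted. Returns true if the
// caller's retry loop may safely retry the round.
bool handle_partially_accepted(cas_kind kind) {
    if (kind == cas_kind::write) {
        // Retrying could apply the write twice, so surface the uncertainty.
        throw cas_uncertainty_timeout("partially accepted Paxos round");
    }
    return true;  // reads have no side effects to duplicate; retry in the CAS loop
}
```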
Closes scylladb/scylladb#25602
Trying to run the test with more than one shard results in a failure
when generating sharding metadata:
```
ERROR 2025-08-27 16:00:17,551 [shard 0:main] table - Memtable flush failed due to: std::runtime_error (Failed to generate sharding metadata for /tmp/scylla-c9fa42fe/ks/cf-2938a030834e11f0a561ffa33feb022d/me-3gt6_12wh_1gifk2ijgeu1ovc1m5-big-Data.db). Aborting
```
Let's require that the test be run with a single shard.
Closes scylladb/scylladb#25703
Currently, run executes pytest twice without modifying the path of the
JUnit XML report. As a result, the second pytest execution overrides
the report. This PR fixes the issue so that both reports are stored.
Closes scylladb/scylladb#25726
Following up on 6129411a5e,
improve test_vnode_keyspace_describe_ring by verifying that the
endpoints listed by describe_ring match those returned by the
`natural_endpoints` API (for random tokens).
The latter are calculated using an independent code path
directly from the effective_replication_map.
* test exists currently only on master, no backport required
Closes scylladb/scylladb#25610
* github.com:scylladb/scylladb:
test/cluster/test_repair: test_vnode_keyspace_describe_ring: verify that describe_ring results agree with natural_endpoints
test/pylib/rest_client: add natural_endpoints function
This PR builds on the byte comparable support introduced in #23541 to add byte comparable support for all the collection types.
This implementation adheres to the byte-comparable format specification in https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md
Refs https://github.com/scylladb/scylladb/issues/19407
New feature - backport not required.
Closes scylladb/scylladb#25603
* github.com:scylladb/scylladb:
types/comparable_bytes: add compatibility testcases for collection types
types/comparable_bytes: update compatibility testcase to support collection types
types/comparable_bytes: support empty type
types/comparable_bytes: support reversed types
types/comparable_bytes: support vector cql3 type
types/comparable_bytes: support tuple and UDT cql3 type
types/comparable_bytes: support map cql3 type
types/comparable_bytes: support set and list cql3 types
types/comparable_bytes: introduce encode/decode_component
types/comparable_bytes: introduce to_comparable_bytes/from_comparable_bytes
This patch fixes an error-path bug in the base-64 decoding code in
utils/base64.cc, which among other things is used in Alternator to decode
blobs in JSON requests.
The base-64 decoding code has a lookup table, which was wrongly sized 255
bytes, but needed to be 256 bytes. This meant that if the byte 255 (0xFF)
was included in an invalid base-64 string, instead of detecting that this
is an invalid byte (since the only valid bytes in a base-64 string are
A-Z,a-z,0-9,+,/ and =), the code would either think it's valid with a
nonsense 6-bit part, or even crash on an out-of-bounds read.
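To illustrate the bug class (this is not the actual utils/base64.cc code): a decode table indexed by an 8-bit byte must have 256 entries, otherwise byte 0xFF indexes past the end:
```
#include <array>
#include <cstdint>

constexpr std::array<int8_t, 256> make_decode_table() {
    std::array<int8_t, 256> t{};
    for (auto& e : t) {
        e = -1;                              // -1 marks an invalid base-64 byte
    }
    const char alphabet[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    for (int i = 0; i < 64; ++i) {
        t[static_cast<uint8_t>(alphabet[i])] = static_cast<int8_t>(i);
    }
    return t;
}

inline bool is_valid_base64_byte(uint8_t b) {
    static constexpr auto table = make_decode_table();
    return b == '=' || table[b] >= 0;        // with 256 entries, 0xFF maps to "invalid"
}
```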
Besides the trivial fix, this patch also includes a reproducing test,
which tries to write a blob as a supposedly base-64 encoded string with
a 0xFF byte in it. The test fails before this patch (the write succeeds,
unexpectedly), and passes after this patch (the write fails as
expected). The test also passes on DynamoDB.
Fixes #25701
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closes scylladb/scylladb#25705
This commit adds missing fields to GetRecords responses: `awsRegion` and
`eventVersion`. We also considered changing `eventSource` from
`scylladb:alternator` to `aws:dynamodb` and setting `SizeBytes` subfield
inside the `dynamodb` field.
We set `awsRegion` to the name of the datacenter of the node that received
the request. This is in line with the AWS documentation, except that
Scylla has no direct equivalent of a region, so we use the datacenter's
name, which is analogous to DynamoDB's concept of region.
The field `eventVersion` determines the structure of a Record. It is
updated whenever the structure changes. We think that adding a field
`userIdentity` bumped the version from `1.0` to `1.1`. Currently, Scylla
doesn't support this field (#11523), hence we use the older 1.0 version.
We have decided to leave `eventSource` as is, since it's easy to change
it to `aws:dynamodb` (the value used by DynamoDB) should problems arise.
Not setting `SizeBytes` subfield inside the `dynamodb` field was
dictated by the lack of apparent use cases. The documentation is unclear
about how `SizeBytes` is calculated and after experimenting a little
bit, I haven't found an obvious pattern.
Fixes: #6931
Closes scylladb/scylladb#24903
When scaling out is delayed or fails, it is crucial to ensure that clusters remain operational
and recoverable even under extreme conditions. To achieve this, the following proactive measures
are implemented:
- reject writes
- includes: inserts, updates, deletes, counter updates, hints, read+repair and lwt writes
- applicable to: user tables, views, CDC log, audit, cql tracing
- stop running compactions/repairs and prevent new ones from starting
- reject incoming tablet migrations
The aforementioned mechanisms are automatically enabled when a node's disk utilization reaches
the critical level (default: 98%) and disabled when the utilization drops below the threshold.
Apart from that, the series adds tests that require mounted volumes to simulate running out of space.
The paths to the volumes can be provided using a pytest argument, i.e. `--space-limited-dirs`.
When not provided, tests are skipped.
Test scenarios:
1. Start a cluster and write data until one of the nodes reaches 90% of the disk utilization
2. Perform an **operation** that would take the nodes over 100%
3. The nodes should not exceed the critical disk utilization (98% by default)
4. Scale out the cluster by adding one node per rack
5. Retry or wait for the **operation** from step 2
The **operation** is: writing data, running compactions, building materialized views, running repair,
migrating tablets (caused by RF change, decommission).
The test is successful if no node runs out of space, the **operation** from step 2 is
aborted/paused/timed out, and the **operation** from step 5 succeeds.
`perf-simple-query --smp 1 -m 1G` results obtained for fixed 400MHz frequency:
Read path (before)
```
instructions_per_op:
mean= 39661.51 standard-deviation=34.53
median= 39655.39 median-absolute-deviation=23.33
maximum=39708.71 minimum=39622.61
```
Read path (after)
```
instructions_per_op:
mean= 39691.68 standard-deviation=34.54
median= 39683.14 median-absolute-deviation=11.94
maximum=39749.32 minimum=39656.63
```
Write path (before):
```
instructions_per_op:
mean= 50942.86 standard-deviation=97.69
median= 50974.11 median-absolute-deviation=34.25
maximum=51019.23 minimum=50771.60
```
Write path (after):
```
instructions_per_op:
mean= 51000.15 standard-deviation=115.04
median= 51043.93 median-absolute-deviation=52.19
maximum=51065.81 minimum=50795.00
```
Fixes: https://github.com/scylladb/scylladb/issues/14067
Refs: https://github.com/scylladb/scylladb/issues/2871
No backport, as it is a new feature.
Closes scylladb/scylladb#23917
* github.com:scylladb/scylladb:
tests/cluster: Add new storage tests
test/scylla_cluster: Override workdir when passed via cmdline
streaming: Reject incoming migrations
storage_service: extend locator::load_stats to collect per-node critical disk utilization flag
repair_service: Add a facility to disable the service
compaction_manager: Subscribe to out of space controller
compaction_manager: Replace enabled/disabled states with running state
database: Add critical_disk_utilization mode database can be moved to
disk_space_monitor: add subscription API for threshold-based disk space monitoring
docs: Add feature documentation
config: Add critical_disk_utilization_level option
replica/exceptions: Add a new custom replica exception
Fixes #25709
If we have large allocations, spanning more than one segment, and
the internal segment references from the lead segment to secondary segments
are the only thing keeping a segment alive, the implicit drop in
discard_unused_segments and orphan_all can cause a recursive call
to discard_unused_segments, which in turn can lead to vector
corruption/crash, or even a double free of a segment (iterator confusion).
Need to separate the modification of the vector (_segments) from
actual releasing of objects. Using temporaries is the easiest
solution.
To further reduce recursion, we can also do an early clear of
segment dependencies in callbacks from segment release (cf release).
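A minimal sketch of the "use temporaries" pattern described above (illustrative, not the actual LSA code):
```
#include <utility>
#include <vector>

// Move the elements out of the container before releasing them, so that
// destructors which recursively re-enter the cleanup path see a consistent
// (already emptied) container instead of one being mutated under them.
template <typename T>
void release_all(std::vector<T>& segments) {
    std::vector<T> to_release;
    to_release.swap(segments);   // the _segments-like vector is now empty and stable
    to_release.clear();          // actual releasing happens here, outside `segments`
}
```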
Closes scylladb/scylladb#25719
This patch introduces `view_building_coordinator`, a single entity within the whole cluster responsible for building tablet-based views.
The view building coordinator takes a slightly different approach than the existing node-local view builder. The whole process is split into smaller view building tasks, one per tablet replica of the base table.
The coordinator builds one base table at a time and can choose another once all views of the currently processed base table are built.
The tasks are started by setting the `STARTED` state and are executed by the node-local view building worker. The tasks are scheduled in such a way that each shard processes only one tablet at a time (multiple tasks can be started for a shard on a node because a table can have multiple views, but then all tasks have the same base table and tablet (last_token)). Once the coordinator starts the tasks, it sends the `work_on_view_building_tasks` RPC to start the tasks and receive their results.
This RPC is resilient to RPC failure or raft leader change, meaning that if one RPC call started a batch of tasks but then failed (for instance, the raft leader changed and the caller aborted waiting for the response), the next RPC call will attach itself to the already started batch.
The coordinator plugs into handling tablet operations (migration/resize/RF change) and adjusts its tasks accordingly. At the start of each tablet operation, the coordinator aborts necessary view building tasks to prevent https://github.com/scylladb/scylladb/issues/21564. Then, new adjusted tasks are created at the end of the operation.
If the operation fails at any moment, the aborted tasks are rolled back.
The view building coordinator can also handle staging sstables using process_staging view building tasks. We do this because we don't want to start generating view updates from a staging sstable prematurely, before the writes are directed to the new replica (https://github.com/scylladb/scylladb/issues/19149).
For detailed description check: `docs/dev/view-building-coordinator.md`
Fixes https://github.com/scylladb/scylladb/issues/22288
Fixes https://github.com/scylladb/scylladb/issues/19149
Fixes https://github.com/scylladb/scylladb/issues/21564
Fixes https://github.com/scylladb/scylladb/issues/17603
Fixes https://github.com/scylladb/scylladb/issues/22586
Fixes https://github.com/scylladb/scylladb/issues/18826
Fixes https://github.com/scylladb/scylladb/issues/23930
---
This PR is a reimplementation of https://github.com/scylladb/scylladb/pull/21942
Closes scylladb/scylladb#23760
* github.com:scylladb/scylladb:
test/cluster: add view build status tests
test/cluster: add view building coordinator tests
utils/error_injection: allow to abort `injection_handler::wait_for_message()`
test: adjust existing tests
utils/error_injection: add injection with `sleep_abortable()`
db/view/view_builder: ignore `no_such_keyspace` exception
docs/dev: add view building coordinator documentation
db/view/view_building_worker: work on `process_staging` tasks
db/view/view_building_worker: register staging sstable to view building coordinator when needed
db/view/view_building_worker: discover staging sstables
db/view/view_building_worker: add method to register staging sstable
db/view/view_update_generator: add method to process staging sstables instantly
db/view/view_update_generator: extract generating updates from staging sstables to a method
db/view/view_update_generator: ignore tablet-based sstables
db/view/view_building_coordinator: update view build status on node join/left
db/view/view_building_coordinator: handle tablet operations
db/view: add view building task mutation builder
service/topology_coordinator: run view building coordinator
db/view: introduce `view_building_coordinator`
db/view/view_building_worker: update built views locally
db/view: introduce `view_building_worker`
db/view: extract common view building functionalities
db/view: prepare to create abstract `view_consumer`
message/messaging_service: add `work_on_view_building_tasks` RPC
service/topology_coordinator: make `term_changed_error` public
db/schema_tables: create/cleanup tasks when an index is created/dropped
service/migration_manager: cleanup view building state on drop keyspace
service/migration_manager: cleanup view building state on drop view
service/migration_manager: create view building tasks on create view
test/boost: enable proxy remote in some tests
service/migration_manager: pass `storage_proxy` to `prepare_keyspace_drop_announcement()`
service/migration_manager: coroutinize `prepare_new_view_announcement()`
service/storage_proxy: expose references to `system_keyspace` and `view_building_state_machine`
service: reload `view_building_state_machine` on group0 apply()
service/vb_coordinator: add currently processing base
db/system_keyspace: move `get_scylla_local_mutation()` up
db/system_keyspace: add `view_building_tasks` table
db/view: add view_building_state and views_state
db/system_keyspace: add method to get view build status map
db/view: extract `system.view_build_status_v2` cql statements to system_keyspace
db/system_keyspace: move `internal_system_query_state()` function earlier
db/view: ignore tablet-based views in `view_builder`
gms/feature_service: add VIEW_BUILDING_COORDINATOR feature
The storage submodule contains tests that require mounted volumes
to be executed. The volumes are created automatically with the
`volumes_factory` fixture.
The tests in this suite are executed with the custom launcher
`unshare -mr pytest`
Test scenarios (when one node reaches critical disk utilization):
1. Reject user table writes
2. Disable/Enable compaction
3. Reject split compactions
4. New split compactions not triggered
5. Abort tablet repair
6. Disable/Enable incoming tablet migrations
7. Restart a node while a tablet split is triggered
Currently, workdir is set in the ScyllaCluster constructor and does
not take into account that the value could be overridden via cmdline
arguments. When this happens, some data (logs, configs) is
stored under one path and other data (the data directory) under a different one.
The patch allows overriding the value when it is passed via cmdline arguments,
leading to all files being stored under the same path.
When the database operates in the critical disk utilization mode, all
mutation writes (including inserts, updates, deletes, counter updates,
hints, read+repair, and LWT writes) to user tables and to the tables
associated with them, such as views, the CDC log, and audit, are
rejected, with a clear error exception returned.
The mode is meant to be used with the disk space monitor in order
to prevent any user writes when a node's disk utilization is too high.
This patch adds compatibility testcases for the following cql3 types:
set, list, map, tuple, vector and reversed types.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The `abstract_type::from_string()` method used to parse the input data
doesn't support collections yet. So the collection testdata will be
passed as JSON strings to the testcase. This patch updates the testcase
to adapt to this workaround.
Also, extended the testcase to verify that Scylla's implementation can
successfully decode the byte comparable output encoded by Cassandra.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
A reversed type is first encoded using the underlying type and then all
the bits are flipped to ensure that the lexicographical sort order is
reversed. During decode, the bytes are flipped first and then decoded
using the underlying type.
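A small illustrative sketch of that transform (not the actual implementation):
```
#include <cstdint>
#include <vector>

// Encode with the underlying type first, then invert every byte so that the
// lexicographical sort order is reversed. Decoding flips the bits back first.
std::vector<uint8_t> reverse_comparable(std::vector<uint8_t> underlying_encoding) {
    for (auto& b : underlying_encoding) {
        b = static_cast<uint8_t>(~b);
    }
    return underlying_encoding;
}
```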
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The CQL vector type encoding is similar to the lists, where each element
is transformed into a byte-comparable format and prefixed with a
component marker. The sequence is terminated with a terminator marker to
indicate the end of the collection.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The CQL tuple and UDT types share the same internal implementation and
therefore use the same byte comparable encoding. The encoding is similar
to lists, where each element is transformed into a byte-comparable
format and prefixed with a component marker. The sequence is terminated
with a terminator marker to indicate the end of the collection.
TODO: Add duplicate test items to maps, lists and sets
For maps, add more entries that share keys
ex map1 : key1 : value1, key2 : value2
map2 : key1 : value4
map3 : key2 : value5 etc
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The CQL map type is encoded as a sequence of key-value pairs. Each key
and each value is individually prefixed with a component marker, and the
sequence is terminated with a terminator marker to indicate the end of
the collection.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The CQL set and list types are encoded as a sequence of elements, where
each element is transformed into a byte-comparable format and prefixed
with a component marker. The sequence is terminated with a terminator
marker to indicate the end of the collection.
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
The components of a collection, such as an element from a list, set, or
vector; a key or value from a map; or a field from a tuple, share the
same encode and decode logic. During encode, the component is transformed
into the byte comparable format and is prefixed with the `NEXT_COMPONENT`
marker. During decode, the component is transformed back into its
serialized form and is prefixed with the serialized size.
A null component is encoded as a single `NEXT_COMPONENT_NULL` marker and
during decode, a `-1` is written to the serialized output.
This commit introduces a few helper methods that implement the
encode and decode logic described above.
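A simplified sketch of the structure described above; it only shows the marker/prefix layout, ignores the escaping applied to the component bytes themselves, and the marker values are placeholders rather than the spec's actual byte values:
```
#include <cstdint>
#include <optional>
#include <vector>

// Placeholder marker values for illustration only.
constexpr uint8_t NEXT_COMPONENT = 0x40;
constexpr uint8_t NEXT_COMPONENT_NULL = 0x3f;
constexpr uint8_t TERMINATOR = 0x38;

// Encode one collection component: a null component becomes a single
// NEXT_COMPONENT_NULL marker; otherwise the component's byte-comparable
// encoding is prefixed with NEXT_COMPONENT.
void encode_component(std::vector<uint8_t>& out,
                      const std::optional<std::vector<uint8_t>>& comparable_bytes) {
    if (!comparable_bytes) {
        out.push_back(NEXT_COMPONENT_NULL);
        return;
    }
    out.push_back(NEXT_COMPONENT);
    out.insert(out.end(), comparable_bytes->begin(), comparable_bytes->end());
}

// A list/set is its components in order, followed by the terminator marker.
std::vector<uint8_t> encode_collection(
        const std::vector<std::optional<std::vector<uint8_t>>>& elements) {
    std::vector<uint8_t> out;
    for (const auto& e : elements) {
        encode_component(out, e);
    }
    out.push_back(TERMINATOR);
    return out;
}
```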
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Introduce the `subscribe` method to disk_space_monitor, allowing clients to
register callbacks triggered when disk utilization crosses a configurable
threshold.
The API supports flexible trigger options, including notifications on threshold
crossing and direction (above/below). This enables more granular and efficient
disk space monitoring for consumers.
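A minimal sketch of what such a subscription API could look like; the names and signatures are illustrative, not the actual disk_space_monitor interface:
```
#include <functional>
#include <utility>
#include <vector>

enum class crossing { above, below };

struct disk_space_monitor_sketch {
    using callback = std::function<void(float utilization)>;

    struct subscription {
        float threshold;     // e.g. 0.98 for the critical level
        crossing direction;  // notify when utilization crosses above or below
        callback cb;
    };

    void subscribe(float threshold, crossing direction, callback cb) {
        _subs.push_back({threshold, direction, std::move(cb)});
    }

    // Called for every new utilization sample; fires callbacks whose threshold
    // was crossed in the requested direction.
    void on_sample(float prev, float curr) {
        for (auto& s : _subs) {
            bool crossed_up = prev < s.threshold && curr >= s.threshold;
            bool crossed_down = prev >= s.threshold && curr < s.threshold;
            if ((s.direction == crossing::above && crossed_up) ||
                (s.direction == crossing::below && crossed_down)) {
                s.cb(curr);
            }
        }
    }

    std::vector<subscription> _subs;
};
```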
Move management over effective service levels from `service_level_controller`
to a new dedicated type -- `auth_integration`.
Before these changes, it was possible for the service level controller to try
to access `auth::service` after it was deinitialized. For instance, it could
happen when reloading the cache. That HAS happened as described in the following
issue: scylladb/scylladb#24792.
Although the problem might have been mitigated or even resolved in
scylladb/scylladb@10214e13bd, it's not clear
how the service will be used in the future. It's best to prevent similar bugs
than trying to fix them later on.
The logic responsible for preventing access to an uninitialized `auth::service`
was also either non-existent, complex, or insufficient.
To prevent accessing `auth::service` by the service level controller, we extract
the relevant portion of the code to a separate entity -- `auth_integration`.
It's an internal helper type whose sole purpose is to manage effective service
levels.
Thanks to that, we were able to nest the lifetime of `auth_integration` within
the lifetime of `auth::service`. It's now impossible to attempt to dereference
it while it's uninitialized.
If a bug related to an invalid access is spotted again, though, it might also
be easier to debug it now.
There should be no visible change to the users of the interface of the service
level controller. We strived to make the patch minimal, and the only affected
part of the logic should be related to how `auth::service` is accessed.
The relevant portion of the initialization and deinitialization flow:
(a) Before the changes:
1. Initialize `service_level_controller`. Pass a reference to an uninitialized
`auth::service` to it.
2. Initialize other services.
3. Initialize and start `auth::service`.
4. (work)
5. Stop and deinitialize `auth::service`.
6. Deinitialize other services.
7. Deinitialize `service_level_controller`.
(b) After the changes:
1. Initialize `service_level_controller`. Pass a reference to an uninitialized
`auth::service` to it. (*)
2. Initialize other services.
3. Initialize and start `auth::service`.
4. Initialize `auth_integration`. Register it in `service_level_controller`.
5. (work)
6. Unregister `auth_integration` in `service_level_controller` and deinitialize
it.
7. Stop and deinitialize `auth::service`.
8. Deinitialize other services.
9. Deinitialize `service_level_controller`.
(*):
The reference to `auth::service` in `service_level_controller` is still
necessary. We need to access the service when dropping a distributed
service level.
Although it would be best to cut that link between the service level
controller and `auth::service` too, effectively separating the entities,
it would require more work, so we leave it as-is for now.
It shouldn't prove problematic as far as accessing an uninitialized service
goes. Trying to drop a service level at the point when we're de-initializing
auth should be impossible.
For more context, see the function `drop_distributed_service_level` in
`service_level_controller`.
A trivial test has been included in the PR. Although its value is questionable
as we only try to reload the service level cache at a specific moment, it's
probably the best we can deliver to provide a reproducer of the issue this patch
is resolving.
Fixes scylladb/scylladb#24792
Backport: The impact of the bug was minimal as it only affected the shutdown.
However, since CI is failing because of it, let's backport the change to all
supported versions.
Closes scylladb/scylladb#25478
* github.com:scylladb/scylladb:
service/qos: Move effective SL cache to auth_integration
service/qos: Add auth::service to auth_integration
service/qos: Reload effective SL cache conditionally
service/qos: Add gate to auth_integration
service/qos: Introduce auth_integration
Add test cases for create keyspace default replication factor.
It is expected that the default replication factor is equal to the
number of racks containing at least one non-zero-token node
in the test suite.
Refs: #16028
Normally, when we create a table, MV, etc., we apply `cf_prop_defs` to the
schema builder via the function `cf_prop_defs::apply_to_builder`. Unfortunately,
that didn't happen when creating CDC log tables, and so we might have missed
some of the properties that would normally be set to some value, even if the
default one.
One particular example of that phenomenon was `tombstone_gc`. For better or
worse, it's not a "standalone property" of a table, but rather part of
`extensions`. [Somewhat related issue: scylladb/scylladb#9722]
That could, and in fact did, cause trouble. Consider this scenario:
1. A CDC log table is created.
2. The table does NOT have any value of `tombstone_gc` set.
3. The user edits the table via `ALTER TABLE`. That statement treats the log
table just like any other one (at least as far as the relevant portion of the
logic is concerned). Among other things, it uses
`cf_prop_defs::apply_to_builder`, and as a result, the `tombstone_gc`
property is set to some value:
* the default one if the user doesn't specify it in the statement,
* a custom one if they do.
Why is that a problem?
First of all, it's confusing. When we perform a schema backup and a table uses
CDC, we include an ALTER statement for its corresponding CDC log table (for more
context, see issue scylladb/scylladb#18467 or commit
scylladb/scylladb@f12edbdd95).
There are two consequences for the user here:
1. If the log table had NOT been altered ever since it was created, the
statement will miss the `tombstone_gc` property as if it couldn't be set for
it at all. That's confusing!
2. If the log table HAD in fact been altered after its creation, the statement
will include the `tombstone_gc` property. That's even more confusing (why was
it not present the first time, but it is now?).
The `tombstone_gc` property should always be set to avoid confusion and
problematic edge cases in tests and to simply be consistent with how other
schema entities work.
The solution we employ is that we always set the property to the default
value. That includes the case when we reattach the log table to the base;
consider the following scenario:
1. Create a table with CDC enabled.
2. Detach the log table by performing `ALTER TABLE ... WITH cdc = {'enabled': false}`.
3. Change the `tombstone_gc` property of the log table.
4. Reattach the log table to the base, analogously to step 2 (but with `'enabled': true`).
The expected result would be that the new value of `tombstone_gc` would be
preserved after reattaching the log table. However, that's not what will
happen. We decide to stay consistent with how other properties of a log
table behave, and we reset them after every reattachment. We might change that
in the future: see issue scylladb/scylladb#25523.
Two reproducer tests of scylladb/scylladb#25187 are included in the changes.
Fixes scylladb/scylladb#25187
- Disable tablets in `test_migration_on_existing_raft_topology`.
Because views on tablets are experimental now, we can safely
assume that the view building coordinator will start with view build status
on raft.
- Add error injection to pause view building on worker.
Used to pause the view building process; there is an analogous error injection
in view_builder.
- Do a read barrier in `test_view_in_system_tables`
Increases test stability by making sure that the node sees up-to-date
group0 state and `system.built_views` is synced.
- Wait for the view to be built in some tests
Increases test stability by making sure that the view is built.
- Remove xfail marker from `test_tablet_streaming_with_unbuilt_view`
This series fixes https://github.com/scylladb/scylladb/issues/21564
and this test should work now.
Change the return type of `check_needs_view_update_path()`. Instead of
returning a bool that tells whether to use the staging directory (and register
with `view_update_generator`) or the normal directory,
the function now returns an enum with the following possible values:
- `normal_directory` - use the normal directory for the sstable
- `staging_directly_to_generator` - use the staging directory and register
with `view_update_generator`
- `staging_managed_by_vbc` - use the staging directory but don't register it
with `view_update_generator`; instead, create view building tasks to be
processed later
The third option is new; it's used when the table has any view that is
currently being built. In this case, registering the sstable with `view_update_generator`
prematurely may lead to base-view inconsistency
(for example, when a replica is in a pending state).
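A hedged sketch of the enum shape described above; the actual name and spelling in the tree may differ:
```
enum class view_update_path_hint {
    normal_directory,               // write the sstable to the normal directory
    staging_directly_to_generator,  // staging dir; register with view_update_generator
    staging_managed_by_vbc,         // staging dir; the view building coordinator
                                    // creates tasks to process it later
};
```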
After the next few patches, creating/dropping a view in a tablet keyspace
will require the remote proxy to obtain references to the system keyspace
and the view building state.
Because of this, the remote proxy needs to be explicitly enabled in boost
tests which create views.