Commit Graph

9458 Commits

Author SHA1 Message Date
Dawid Mędrek
d2c5268196 cql3: Produce CREATE MATERIALIZED VIEW statement when describing MV of index
Before this change, executing `DESCRIBE MATERIALIZED VIEW` on the underlying
materialized view of a secondary index would produce a `CREATE INDEX` statement.
It was not only confusing, but it also prevented from learning about
the definition of the view. The only way to do so was to query system tables.

We change that behavior and produce a `CREATE MATERIALIZED VIEW` statement
instead. The statement is printed as a comment to implicitly convey that
the user should not attempt to execute it to restore the view. A short comment
is provided to make it clearer.

Before this commit:

```
cqlsh> CREATE TABLE ks.t(p int PRIMARY KEY, v int);
cqlsh> CREATE INDEX i ON ks.t(v);
cqlsh> DESCRIBE MATERIALIZED VIEW ks.i;

CREATE INDEX i ON ks.t(v);
```

After this commit:

```
cqlsh> CREATE TABLE ks.t(p int PRIMARY KEY, v int);
cqlsh> CREATE INDEX i ON ks.t(v);
cqlsh> DESCRIBE MATERIALIZED VIEW ks.i;

/* Do NOT execute this statement! It's only for informational purposes.
   This materialized view is the underlying materialized view of a secondary
   index. It can be restored via restoring the index.

CREATE MATERIALIZED VIEW ks.i_index [...];

*/
```

Note that describing the base table has not been affected and still works
as follows:

```
cqlsh> CREATE TABLE ks.t(p int PRIMARY KEY, v int);
cqlsh> CREATE INDEX i ON ks.t(v);
cqlsh> DESCRIBE TABLE ks.t;

CREATE TABLE ks.t (
    p int,
    v int,
    PRIMARY KEY (p)
) WITH bloom_filter_fp_chance = 0.01
    AND caching = {'keys': 'ALL', 'rows_per_partition': 'ALL'}
    AND comment = ''
    AND compaction = {'class': 'IncrementalCompactionStrategy'}
    AND compression = {'sstable_compression': 'org.apache.cassandra.io.compress.LZ4Compressor'}
    AND crc_check_chance = 1
    AND default_time_to_live = 0
    AND gc_grace_seconds = 864000
    AND max_index_interval = 2048
    AND memtable_flush_period_in_ms = 0
    AND min_index_interval = 128
    AND speculative_retry = '99.0PERCENTILE'
    AND tombstone_gc = {'mode': 'timeout', 'propagation_delay_in_seconds': '3600'};

CREATE INDEX i ON ks.t(v);
```

We also provide two reproducers of scylladb/scylladb#24610.

Fixes scylladb/scylladb#24610

Closes scylladb/scylladb#25697
2025-09-03 15:21:37 +02:00
Botond Dénes
6116f9e11b Merge 'Compaction tasks progress' from Aleksandra Martyniuk
Determine the progress of compaction tasks that have
children.

The progress of a compaction task is calculated using the default
get_progress method. If the expected_total_workload method is
implemented, the default progress is computed as:
(sum of child task progresses) / (expected total workload)

If expected_total_workload is not defined, progress is estimated based
on children progresses. However, in this case, the total progress may
increase over time as the task executes.

All compaction tasks, except for reshape tasks, implement the
expected_children_number method. To compute expected_total_workload,
iterate over all SSTables covered by the task and sum their sizes. Note
that expected_total_workload is just an approximation and the real workload
may differ if SStables set for the keyspace/table/compaction group changes.

Reshape tasks are an exception, as their scope is determined during
execution. Hence, for these tasks expected_total_workload isn't defined
and their progress (both total and completed) is determined based
on currently created children.

Fixes: https://github.com/scylladb/scylladb/issues/8392.
Fixes: https://github.com/scylladb/scylladb/issues/6406.
Fixes: https://github.com/scylladb/scylladb/issues/7845.

New feature, no backport needed

Closes scylladb/scylladb#15158

* github.com:scylladb/scylladb:
  test: add compaction task progress test
  compaction: set progress unit for compaction tasks
  compaction: find expected workload for reshard tasks
  compaction: find expected workload for global cleanup compaction tasks
  compaction: find expected workload for global major compaction tasks
  compaction: find expected workload for keyspace compaction tasks
  compaction: find expected workload for shard compaction tasks
  compaction: find expected workload for table compaction tasks
  compaction: return empty progress when compaction_size isn't set
  compaction: update compaction_data::compaction_size at once
  tasks: do not check expected workload for done task
2025-09-03 13:23:42 +03:00
Pavel Emelyanov
b0aa2d61d9 Merge 'cql3: add default replication factor to create_keyspace_statement' from Dario Mirovic
When creating a new keyspace, replication factor must be stated.
For example:
`CREATE KEYSPACE ks WITH REPLICATION { 'class': 'NetworkTopologyStrategy', 'replication_factor': 3 };`

This patch changes it in the following way - if there is no
replication factor specified, use default replication factor.
Default replication factor is equal to the number of racks that
are not arbiter-only, i.e. racks that have at least one non-arbiter node.
The following syntax is now valid:
`CREATE KEYSPACE ks WITH REPLICATION { 'class': 'NetworkTopologyStrategy' };`
`CREATE KEYSPACE ks WITH REPLICATION { };`

Fixes #16028

Backport is not needed. This is an enhancement for future releases.

Closes scylladb/scylladb#25570

* github.com:scylladb/scylladb:
  docs/cql: update documentation for default replication factor
  test/cqlpy: add keyspace creation default replication factor tests
  cql3: add default replication factor to `create_keyspace_statement`
2025-09-03 12:31:53 +03:00
Radosław Cybulski
7b3d42f83e Remove unused boost macro definitions
Closes scylladb/scylladb#25742
2025-09-03 10:06:33 +03:00
Calle Wilund
bc20861afb system_keyspace: Prune dropped tables from truncation on start/drop
Fixes #25683

Once a table drop is complete, there should be no reason to retain
truncation records for it, as any replay should skip mutations
anyway (no CF), and iff we somehow resurrect a dropped table,
this replay-resurrected data is the least problem anyway.

Adds a prune phase to the startup drop_truncation_rp_records run,
which ignores updating, and instead deletes records for non-existant
tables (which should patch any existing servers with lingering data
as well).

Also does an explicit delete of records on actual table DROP, to
ensure we don't grow this table more than needed even in long
uptime nodes.

Small unit test included.

Closes scylladb/scylladb#25699
2025-09-03 07:25:34 +03:00
Sergey Zolotukhin
13392a40d4 gossiper: check for a race condition in do_apply_state_locally
In do_apply_state_locally, a race condition can occur if a task is
suspended at a preemption point while the node entry is not locked.
During this time, the host may be removed from _endpoint_state_map.
When the task resumes, this can lead to inserting an entry with an
empty host ID into the map, causing various errors, including a node
crash.

This change adds a check after locking the map entry: if a gossip ACK update
does not contain a host ID, we verify that an entry with that host ID
still exists in the gossiper’s _endpoint_state_map.

Fixes scylladb/scylladb#25702
Fixes scylladb/scylladb#25621
Ref scylladb/scylla-enterprise#5613

Closes scylladb/scylladb#25727
2025-09-02 20:44:21 +02:00
Avi Kivity
7ed261fc52 Merge 'Inital GCP object storage support' from Calle Wilund
Adds infrastructure and client for interaction with GCP object storage services.

Note: this is just a client object usable for creating, listing, deleting and up/downloading of objects to/from said storage service. It makes no attempt at actually inserting it into the sstable storage flow. That can come later.

This PR breaks out GCP auth and some general REST call functionality into shared routines. Not all code is 100% reused, but at least some.

Test is added, though could be more comprehensive (feel free to suggest a test vector).
Test can run in either local mock server mode (default), or against actual GCP.
See `test/boost/gcp_object_storage_test.cc` for explanation on the config environment vars.
Default is to run the test against a temporary docker deamon.

Closes scylladb/scylladb#24629

* github.com:scylladb/scylladb:
  test::boost::gcp_object_storage_test: Initial unit tests for GCP obj storage
  proc-utils: Re-export waiting types from seastar
  proc-utils: Inherit environment from current process
  utils::gcp::object_storage: Add client for GCP object storage
  utils::http: Add optional external credentials to dns_connection_factory init
  utils::rest: Break out request wrapper and send logic
  encryption::gcp_host: Use shared gcp credentials + REST helpers
  utils::gcp: Move/add gcp credentials management to shared file
  utils::rest::client: Add formatter for seastar::http::reply
  utils::rest::client: Add helper routines for simple REST calls
  utils::http: Make shared system trust certificates public
2025-09-02 14:38:09 +03:00
Piotr Dulikowski
762d9ef68f Merge 'cdc: Set tombstone_gc when creating log table' from Dawid Mędrek
Normally, when we create a table, MV, etc., we apply `cf_prop_defs` to the
schema builder via the function `cf_prop_defs::apply_to_builder`. Unfortunately,
that didn't happen when creating CDC log tables, and so we might have missed
some of the properties that would normally be set to some value, even if the
default one.

One particular example of that phenomenon was `tombstone_gc`. For better or
worse, it's not a "standalone property" of a table, but rather part of
`extensions`. [Somewhat related issue: scylladb/scylladb#9722]

That may have and did cause trouble. Consider this scenario:

1. A CDC log table is created.
2. The table does NOT have any value of `tombstone_gc` set.
3. The user edits the table via `ALTER TABLE`. That statement treats the log
   table just like any other one (at least as far as the relevant portion of the
   logic is concerned). Among other things, it uses
   `cf_prop_defs::apply_to_builder`, and as a result, the `tombstone_gc`
   property is set to some value:
   * the default one if the user doesn't specify it in the statement,
   * a custom one if they do.

Why is that a problem?

First of all, it's confusing. When we perform a schema backup and a table uses
CDC, we include an ALTER statement for its corresponding CDC log table (for more
context, see issue scylladb/scylladb#18467 or commit
scylladb/scylladb@f12edbdd95).

There are two consequences for the user here:
1. If the log table had NOT been altered ever since it was created, the
   statement will miss the `tombstone_gc` property as if it couldn't be set for
   it at all. That's confusing!
2. If the log table HAD in fact been altered after its creation, the statement
   will include the `tombstone_gc` property. That's even more confusing (why was
   it not present the first time, but it is now?).

The `tombstone_gc` property should always be set to avoid confusion and
problematic edge cases in tests and to simply be consistent with how other
schema entities work.

The solution we employ is that we always set the property to the default
value. That includes the case when we reattach the log table to the base;
consider the following scenario:

1. Create a table with CDC enabled.
2. Detach the log table by performing `ALTER TABLE ... WITH cdc = {'enabled': false}`.
3. Change the `tombstone_gc` property of the log table.
4. Reattach the log table to the base in the same way as in step 2.

The expected result would be that the new value of `tombstone_gc` would be
preserved after reattaching the log table. However, that's not what will
happen. We decide to stay consistent with how other properties of a log
table behave, and we reset them after every reattachment. We might change that
in the future: see issue scylladb/scylladb#25523.

Two reproducer tests of scylladb/scylladb#25187 are included in the changes.

Backport: The problem is not critical, so it may not be necessary to backport the changes.
That's to be discussed.

Closes scylladb/scylladb#25521

* github.com:scylladb/scylladb:
  cdc: Set tombstone_gc when creating log table
  tombstone_gc: Add overload of get_default_tombstone_gc_mode
  tombstone_gc: Rename get_default_tombstonesonte_gc_mode
2025-09-02 10:20:11 +02:00
Tomasz Grabiec
a7f10b585e Merge 'drop table: fix crash on drop table with concurrent cleanup' from Ferenc Szili
Consider the following scenario:

- A tablet is migrated away from a shard
- The tablet cleanup stage closes the storage group's async_gate
- A drop table runs truncate which attempts to disable compaction on the tablet with its gate closed. This fails, because table::parallel_foreach_compaction_group() ultimately calls storage_group_manager::parallel_foreach_storage_group() which will not disable compaction if it can't hold the storage group's gate
- Truncate calls table::discard_sstables() which checks if the compaction has been disabled, and because it hasn't, it then runs on_internal_error() with "compaction not disabled on table ks.cf during TRUNCATE" which causes a crash

Fixes: #25706

This needs to be backported to all supported versions with tablets

Closes scylladb/scylladb#25708

* github.com:scylladb/scylladb:
  test: reproducer and test for drop with concurrent cleanup
  truncate: check for closed storage group's gate in discard_sstables
2025-09-02 00:02:14 +02:00
Calle Wilund
21adfd8a60 test::boost::gcp_object_storage_test: Initial unit tests for GCP obj storage
Allows testing using either local mock server (installed or using docker),
or real GCP project (not tested as of writing this).

v2: Try podman if docker unavail
v3: Ensure we check log output on fake-gcs, because when using podman, the
    published port will be connectible even though the actual server is not
    up yet.
v4: Use ephermal port forward in docker/podman to allow us running parallel
    instances. Also adjust credentials and port finding in test.
v5: Re-ensure no parallel tests for this: We seem to time out in podman
    trying to fetch image for X parallel tests
v6: Remove the ephermal port stuff. Because of course this does not work
    with our podman-in-podman. Do brute-force port speculation instead.
v7: Up timeout for server start to allow docker pull.
v8: Fix string check error
v9: Add explicit docker image version
2025-09-01 18:14:20 +00:00
Calle Wilund
5ead6ec420 proc-utils: Re-export waiting types from seastar
Just to make directly accessible from wrapper type
2025-09-01 18:03:44 +00:00
Calle Wilund
8169327553 proc-utils: Inherit environment from current process
In most cases, when launching a process from tests, we will want to
inherit our own env. Add option (default true) to do so.
2025-09-01 18:03:44 +00:00
Karol Nowacki
3086d15999 cql3: Fix crash on ANN OF query when TRACING ON is enabled
Executing a vector search (SELECT with ANN OF ordering) query with `TRACING ON` enabled
caused a node to crash due to a null pointer dereference.

This occurred because a vector index does not have an associated view
table, making its `_view_schema` member null. The implementation
attempted to enable tracing on this null view schema, leading to the
crash.

The fix adds a null check for `_view_schema` before attempting to
enable tracing on the view (index) table.

A regression test is included to prevent this from happening again.

Fixes: VECTOR-179

Closes scylladb/scylladb#25500
2025-09-01 17:26:54 +03:00
Artsiom Mishuta
5910ad3c6d test.py: apply the nightly label on test_topology_recovery_basic
This test is for the old gossip-based recovery procedure, which is an almost obsolete feature that won't change anymore.

Closes scylladb/scylladb#25694
2025-09-01 14:16:29 +02:00
Emil Maskovsky
5dac4b38fb test/gossiper: add reproducible test for race condition during node decommission
This change introduces a targeted test that simulates the gossiper race
condition observed during node decommissioning. The test delays gossip
state application and host ID lookup to reliably reproduce the scenario
where `gossiper::get_host_id()` is called on a removed endpoint,
potentially triggering an abort in `apply_new_states`.

There is a specific error injection added to widen the race window, in
order to increase the likelihood of hitting the race condition. The
error injection is designed to delay the application of gossip state
updates, for the specific node that is being decommissioned. This should
then result in the server abort in the gossiper.

Refs: scylladb/scylladb#25621
Fixes: scylladb/scylladb#25721

Backport: The test is primarily for an issue found in 2025.1, so it
needs to be backported to all the 2025.x branches.

Closes scylladb/scylladb#25685
2025-09-01 13:59:47 +02:00
Petr Gusev
2e757d6de4 cas: pass timeout_if_partially_accepted := write to accept_proposal()
Write requests cannot be safely retried if some replicas respond with
accepts and others with rejects. In this case, the coordinator is
uncertain about the outcome of the LWT: a subsequent LWT may either
complete the Paxos round (if a quorum observed the accept) or overwrite it
(if a quorum did not). If the original LWT was actually completed by
later rounds and the coordinator retried it, the write could be applied
twice, potentially overwriting effects of other LWTs that slipped in
between. Read requests do not have this problem, so they
can be safely retried.

Before this commit, handler->accept_proposal was called with
timeout_if_partially_accepted := true. This caused both read and write
requests to throw an "uncertainty" timeout to the user in the case
of the contention described above. After this commit, we throw an
"uncertainty" timeout only for write requests, while read requests
are instead retried in the loop in sp::cas.

Closes scylladb/scylladb#25602
2025-09-01 14:31:04 +03:00
Dawid Mędrek
fc50e9d0a4 test/perf: Require smp=1 in perf_cache_eviction
Trying to run the test with more than one shard results in a failure
when generating sharding metadata:

```
ERROR 2025-08-27 16:00:17,551 [shard  0:main] table - Memtable flush failed due to: std::runtime_error (Failed to generate sharding metadata for /tmp/scylla-c9fa42fe/ks/cf-2938a030834e11f0a561ffa33feb022d/me-3gt6_12wh_1gifk2ijgeu1ovc1m5-big-Data.db). Aborting
```

Let's require that the test be run with a single shard.

Closes scylladb/scylladb#25703
2025-09-01 08:59:35 +03:00
Andrei Chekun
e55c8a9936 test.py: modify run to use different junit output filenames
Currently, run will execute twice pytest without modifying the path of the
JUnit XML report. This leads that the second execution of the pytest
will override the report. This PR fixing this issue so both reports will
be stored.

Closes scylladb/scylladb#25726
2025-09-01 08:56:48 +03:00
Avi Kivity
dfc7957a73 Merge 'test/cluster/test_repair: test_vnode_keyspace_describe_ring: verify that describe_ring results agree with natural_endpoints' from Benny Halevy
Following up on 6129411a5e
improve test_vnode_keyspace_describe_ring be verifying that the
endpoints listed by describe_ring match those returned by the
`natural_endpoints` api (for random tokens).
The latter are calculated using an independent code path
directly from the effective_replication_map.

* test exists currently only on master, no backport required

Closes scylladb/scylladb#25610

* github.com:scylladb/scylladb:
  test/cluster/test_repair: test_vnode_keyspace_describe_ring: verify that describe_ring results agree with natural_endpoints
  test/pylib/rest_client: add natural_endpoints function
2025-08-31 20:36:15 +03:00
Avi Kivity
bae66cc0d8 Merge 'types: add byte-comparable format support for collections' from Lakshmi Narayanan Sreethar
This PR builds on the byte comparable support introduced in #23541 to add byte comparable support for all the collection types.

This implementation adheres to the byte-comparable format specification in https://github.com/apache/cassandra/blob/trunk/src/java/org/apache/cassandra/utils/bytecomparable/ByteComparable.md

Refs https://github.com/scylladb/scylladb/issues/19407

New feature - backport not required.

Closes scylladb/scylladb#25603

* github.com:scylladb/scylladb:
  types/comparable_bytes: add compatibility testcases for collection types
  types/comparable_bytes: update compatibility testcase to support collection types
  types/comparable_bytes: support empty type
  types/comparable_bytes: support reversed types
  types/comparable_bytes: support vector cql3 type
  types/comparable_bytes: support tuple and UDT cql3 type
  types/comparable_bytes: support map cql3 type
  types/comparable_bytes: support set and list cql3 types
  types/comparable_bytes: introduce encode/decode_component
  types/comparable_bytes: introduce to_comparable_bytes/from_comparable_bytes
2025-08-31 15:53:27 +03:00
Nadav Har'El
ff91027eac utils, alternator: fix detection of invalid base-64
This patch fixes an error-path bug in the base-64 decoding code in
utils/base64.cc, which among other things is used in Alternator to decode
blobs in JSON requests.

The base-64 decoding code has a lookup table, which was wrongly sized 255
bytes, but needed to be 256 bytes. This meant that if the byte 255 (0xFF)
was included in an invalid base-64 string, instead of detecting that this
is an invalid byte (since the only valid bytes in a base-64 string are
A-Z,a-z,0-9,+,/ and =), the code would either think it's valid with a
nonsense 6-bit part, or even crash on an out-of-bounds read.

Besides the trivial fix, this patch also includes a reproducing test,
which tries to write a blob as a supposedly base-64 encoded string with
a 0xFF byte in it. The test fails before this patch (the write succeeds,
unexpectedly), and passes after this patch (the write fails as
expected). The test also passes on DynamoDB.

Fixes #25701

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25705
2025-08-31 15:38:01 +03:00
Piotr Wieczorek
5add43e15c alternator: streams: Address minor incompatibilities with DynamoDB in GetRecords response.
This commit adds missing fields to GetRecords responses: `awsRegion` and
`eventVersion`. We also considered changing `eventSource` from
`scylladb:alternator` to `aws:dynamodb` and setting `SizeBytes` subfield
inside the `dynamodb` field.

We set `awsRegion` to the datacenter's name of the node that received
the request. This is in line with the AWS documentation, except that
Scylla has no direct equivalent of a region, so we use the datacenter's
name, which is analogous to DynamoDB's concept of region.

The field `eventVersion` determines the structure of a Record. It is
updated whenever the structure changes. We think that adding a field
`userIdentity` bumped the version from `1.0` to `1.1`. Currently, Scylla
doesn't support this field (#11523), hence we use the older 1.0 version.

We have decided to leave `eventSource` as is, since it's easy to modify
it in case of problems to `aws:dynamodb` used by DynamoDB.

Not setting `SizeBytes` subfield inside the `dynamodb` field was
dictated by the lack of apparent use cases. The documentation is unclear
about how `SizeBytes` is calculated and after experimenting a little
bit, I haven't found an obvious pattern.

Fixes: #6931

Closes scylladb/scylladb#24903
2025-08-31 14:55:47 +03:00
Avi Kivity
bc5773f777 Merge 'Add out of space prevention mechanisms' from Łukasz Paszkowski
When a scaling out is delayed or fails, it is crucial to ensure that clusters remain operational
and recoverable even under extreme conditions. To achieve this, the following proactive measures
are implemented:
- reject writes
      - includes: inserts, updates, deletes, counter updates, hints, read+repair and lwt writes
      - applicable to: user tables, views, CDC log, audit, cql tracing
- stop running compactions/repairs and prevent from starting new ones
- reject incoming tablet migrations

The aforementioned mechanisms are automatically enabled when node's disk utilization reaches
the critical level (default: 98%) and disabled when the utilization drop below the threshold.

Apart from that, the series add tests that require mounted volumes to simulate out of space.
The paths to the volumes can be provided using the a pytest argument, i.e.  `--space-limited-dirs`.
When not provided, tests are skipped.

Test scenarios:

1. Start a cluster and write data until one of the nodes reaches 90% of the disk utilization
2. Perform an **operation** that would take the nodes over 100%
3. The nodes should not exceed the critical disk utilization (98% by default)
4. Scale out the cluster by adding one node per rack
5. Retry or wait for the **operation** from step 2

The **operation** is: writing data, running compactions, building materialized views, running repair,
migrating tablets (caused by RF change, decommission).

The test is successful, if no nodes run out of space, the **operation** from step 2 is
aborted/paused/timed out and the **operation** from step 5 is successful.

`perf-simple-query --smp 1 -m 1G` results obtained for fixed 400MHz frequency:

Read path (before)

```
instructions_per_op:
	mean=   39661.51 standard-deviation=34.53
	median= 39655.39 median-absolute-deviation=23.33
	maximum=39708.71 minimum=39622.61
```

Read path (after)

```
instructions_per_op:
	mean=   39691.68 standard-deviation=34.54
	median= 39683.14 median-absolute-deviation=11.94
	maximum=39749.32 minimum=39656.63
```

Write path (before):

```
instructions_per_op:
	mean=   50942.86 standard-deviation=97.69
	median= 50974.11 median-absolute-deviation=34.25
	maximum=51019.23 minimum=50771.60
```

Write path (after):

```
instructions_per_op:
	mean=   51000.15 standard-deviation=115.04
	median= 51043.93 median-absolute-deviation=52.19
	maximum=51065.81 minimum=50795.00
```

Fixes: https://github.com/scylladb/scylladb/issues/14067
Refs: https://github.com/scylladb/scylladb/issues/2871

No backport, as it is a new feature.

Closes scylladb/scylladb#23917

* github.com:scylladb/scylladb:
  tests/cluster: Add new storage tests
  test/scylla_cluster: Override workdir when passed via cmdline
  streaming: Reject incoming migrations
  storage_service: extend locator::load_stats to collect per-node critical disk utilization flag
  repair_service: Add a facility to disable the service
  compaction_manager: Subscribe to out of space controller
  compaction_manager: Replace enabled/disabled states with running state
  database: Add critical_disk_utilization mode database can be moved to
  disk_space_monitor: add subscription API for threshold-based disk space monitoring
  docs: Add feature documentation
  config: Add critical_disk_utilization_level option
  replica/exceptions: Add a new custom replica exception
2025-08-30 18:47:57 +03:00
Calle Wilund
cc9eb321a1 commitlog: Ensure segment deletion is re-entrant
Fixes #25709

If we have large allocations, spanning more than one segment, and
the internal segment references from lead to secondary are the
only thing keeping a segment alive, the implicit drop in
discard_unused_segments and orphan_all can cause a recursive call
to discard_unused_segments, which in turn can lead to vector
corruption/crash, or even double free of segment (iterator confusion).

Need to separate the modification of the vector (_segments) from
actual releasing of objects. Using temporaries is the easiest
solution.

To further reduce recursion, we can also do an early clear of
segment dependencies in callbacks from segment release (cf release).

Closes scylladb/scylladb#25719
2025-08-30 08:24:57 +02:00
Piotr Dulikowski
7ccb50514d Merge 'Introduce view building coordinator' from Michał Jadwiszczak
This patch introduces `view_building_coordinator`, a single entity within whole cluster responsible for building tablet-based views.

The view building coordinator takes slightly different approach than the existing node-local view builder. The whole process is split into smaller view building tasks, one per each tablet replica of the base table.
The coordinator builds one base table at a time and it can choose another when all views of currently processing base table are built.
The tasks are started by setting `STARTED` state and they are executed by node-local view building worker. The tasks are scheduled in a way, that each shard processes only one tablet at a time (multiple tasks can be started for a shard on a node because a table can have multiple views but then all tasks have the same base table and tablet (last_token)). Once the coordinator starts the tasks, it sends `work_on_view_building_tasks` RPC to start the tasks and receive their results.
This RPC is resilient to RPC failure or raft leader change, meaning if one RPC call started a batch of tasks but then failed (for instance the raft leader was changed and caller aborted waiting for the response), next RPC call will attach itself to the already started batch.

The coordinator plugs into handling tablet operations (migration/resize/RF change) and adjusts its tasks accordingly. At the start of each tablet operation, the coordinator aborts necessary view building tasks to prevent https://github.com/scylladb/scylladb/issues/21564. Then, new adjusted tasks are created at the end of the operation.
If the operation fails at any moment, aborted tasks are rollback.

The view building coordinator can also handle staging sstables using process_staging view building tasks. We do this because we don't want to start generating view updates from a staging sstable prematurely, before the writes are directed to the new replica (https://github.com/scylladb/scylladb/issues/19149).

For detailed description check: `docs/dev/view-building-coordinator.md`

Fixes https://github.com/scylladb/scylladb/issues/22288
Fixes https://github.com/scylladb/scylladb/issues/19149
Fixes https://github.com/scylladb/scylladb/issues/21564
Fixes https://github.com/scylladb/scylladb/issues/17603
Fixes https://github.com/scylladb/scylladb/issues/22586
Fixes https://github.com/scylladb/scylladb/issues/18826
Fixes https://github.com/scylladb/scylladb/issues/23930

---

This PR is reimplementation of https://github.com/scylladb/scylladb/pull/21942

Closes scylladb/scylladb#23760

* github.com:scylladb/scylladb:
  test/cluster: add view build status tests
  test/cluster: add view building coordinator tests
  utils/error_injection: allow to abort `injection_handler::wait_for_message()`
  test: adjust existing tests
  utils/error_injection: add injection with `sleep_abortable()`
  db/view/view_builder: ignore `no_such_keyspace` exception
  docs/dev: add view building coordinator documentation
  db/view/view_building_worker: work on `process_staging` tasks
  db/view/view_building_worker: register staging sstable to view building coordinator when needed
  db/view/view_building_worker: discover staging sstables
  db/view/view_building_worker: add method to register staging sstable
  db/view/view_update_generator: add method to process staging sstables instantly
  db/view/view_update_generator: extract generating updates from staging sstables to a method
  db/view/view_update_generator: ignore tablet-based sstables
  db/view/view_building_coordinator: update view build status on node join/left
  db/view/view_building_coordinator: handle tablet operations
  db/view: add view building task mutation builder
  service/topology_coordinator: run view building coordinator
  db/view: introduce `view_building_coordinator`
  db/view/view_building_worker: update built views locally
  db/view: introduce `view_building_worker`
  db/view: extract common view building functionalities
  db/view: prepare to create abstract `view_consumer`
  message/messaging_service: add `work_on_view_building_tasks` RPC
  service/topology_coordinator: make `term_changed_error` public
  db/schema_tables: create/cleanup tasks when an index is created/dropped
  service/migration_manager: cleanup view building state on drop keyspace
  service/migration_manager: cleanup view building state on drop view
  service/migration_manager: create view building tasks on create view
  test/boost: enable proxy remote in some tests
  service/migration_manager: pass `storage_proxy` to `prepare_keyspace_drop_announcement()`
  service/migration_manager: coroutinize `prepare_new_view_announcement()`
  service/storage_proxy: expose references to `system_keyspace` and `view_building_state_machine`
  service: reload `view_building_state_machine` on group0 apply()
  service/vb_coordinator: add currently processing base
  db/system_keyspace: move `get_scylla_local_mutation()` up
  db/system_keyspace: add `view_building_tasks` table
  db/view: add view_building_state and views_state
  db/system_keyspace: add method to get view build status map
  db/view: extract `system.view_build_status_v2` cql statements to system_keyspace
  db/system_keyspace: move `internal_system_query_state()` function earlier
  db/view: ignore tablet-based views in `view_builder`
  gms/feature_service: add VIEW_BUILDING_COORDINATOR feature
2025-08-29 17:28:44 +02:00
Łukasz Paszkowski
e34deea50e tests/cluster: Add new storage tests
The storage submodule contains tests that require mounted volumes
to be executed. The volumes are created automatically with the
`volumes_factory` fixture.

The tests in this suite are executed with the custom launcher
`unshare -mr pytest`

Test scenarios (when one node reaches critical disk utilization):
1. Reject user table writes
2. Disable/Enabled compaction
3. Reject split compactions
4. New split compactions not triggered
5. Abort tablet repair
6. Disable/Enabled incoming tablet migrations
7. Restart a node while a tablet split is triggered
2025-08-29 14:56:13 +02:00
Łukasz Paszkowski
4bb5696a5d test/scylla_cluster: Override workdir when passed via cmdline
Currently, workdir is set in ScyllaCluster constructor and it does
not take into accout that the value could be overridden via cmdline
arguments. When this happens, then some data (logs, configs) are
stored under one path and other (data) is stored under a different.

The patch allows overriding the value when passed via cmdline arguments
leading to all files being stored under the same path.
2025-08-29 14:56:13 +02:00
Łukasz Paszkowski
9539e80e54 compaction_manager: Subscribe to out of space controller 2025-08-29 14:56:07 +02:00
Łukasz Paszkowski
3d03b88719 database: Add critical_disk_utilization mode database can be moved to
When database operates in the critical disk utilization mode, all
mutation writes including inserts, updates, deletes, counter updates,
hints, read+repair, lwt writes) to user tables and other associated
with them tables like views, CDC log, audit are rejected, with a clear
error exception returned.

The mode is meant to be used with the disk space monitor in order
to prevent any user writes when node's disk utilization is too high.
2025-08-29 13:46:45 +02:00
Lakshmi Narayanan Sreethar
ce0c29e024 types/comparable_bytes: add compatibility testcases for collection types
This patch adds compatibility testcases for the following cql3 types :
set, list, map, tuple, vector and reversed types.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar
4547f6f188 types/comparable_bytes: update compatibility testcase to support collection types
The `abstract_type::from_string()` method used to parse the input data
doesn't support collections yet. So the collection testdata will be
passed as JSON strings to the testcase. This patch updates the testcase
to adapt to this workaround.

Also, extended the testcase to verify that Scylla's implementation can
successfully decode the byte comparable output encoded by Cassandra.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar
0997b3533c types/comparable_bytes: support empty type
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar
b799101a09 types/comparable_bytes: support reversed types
A reversed type is first encoded using the underlying type and then all
the bits are flipped to ensure that the lexicographical sort order is
reversed. During decode, the bytes are flipped first and then decoded
using the underlying type.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar
6c2a3e2c51 types/comparable_bytes: support vector cql3 type
The CQL vector type encoding is similar to the lists, where each element
is transformed into a byte-comparable format and prefixed with a
component marker. The sequence is terminated with a terminator marker to
indicate the end of the collection.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar
1ccfe522f1 types/comparable_bytes: support tuple and UDT cql3 type
The CQL tuple and UDT types share the same internal implementation and
therefore use the same byte comparable encoding. The encoding is similar
to lists, where each element is transformed into a byte-comparable
format and prefixed with a component marker. The sequence is terminated
with a terminator marker to indicate the end of the collection.

TODO: Add duplicate test items to maps, lists and sets
      For maps, add more entries that share keys
      ex map1 : key1 : value1, key2 : value2
         map2 : key1 : value4
         map3 : key2 : value5 etc

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar
ca38c15a97 types/comparable_bytes: support map cql3 type
The CQL map type is encoded as a sequence of key-value pairs. Each key
and each value is individually prefixed with a component marker, and the
sequence is terminated with a terminator marker to indicate the end of
the collection.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar
4d5e5f0c84 types/comparable_bytes: support set and list cql3 types
The CQL set and list types are encoded as a sequence of elements, where
each element is transformed into a byte-comparable format and prefixed
with a component marker. The sequence is terminated with a terminator
marker to indicate the end of the collection.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-08-29 12:26:22 +05:30
Lakshmi Narayanan Sreethar
8e46e8be01 types/comparable_bytes: introduce encode/decode_component
The components of a collection, such as an element from a list, set, or
vector; a key or value from a map; or a field from a tuple, share the
same encode and decode logic. During encode, the component is transformed
into the byte comparable format and is prefixed with the `NEXT_COMPONENT`
marker. During decode, the component is transformed back into its
serialized form and is prefixed with the serialized size.

A null component is encoded as a single `NEXT_COMPONENT_NULL` marker and
during decode, a `-1` is written to the serialized output.

This commit introduces few helper methods that implement the above
mentioned encode and decode logics.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
2025-08-29 12:26:21 +05:30
Łukasz Paszkowski
3e740d25b5 disk_space_monitor: add subscription API for threshold-based disk space monitoring
Introduce the `subscribe` method to disk_space_monitor, allowing clients to
register callbacks triggered when disk utilization crosses a configurable
threshold.

The API supports flexible trigger options, including notifications on threshold
crossing and direction (above/below). This enables more granular and efficient
disk space monitoring for consumers.
2025-08-28 18:06:37 +02:00
Ferenc Szili
1b8a44af75 test: reproducer and test for drop with concurrent cleanup
This change adds a reproducer and test for issue #25706
2025-08-28 16:51:36 +02:00
Avi Kivity
46193f5e79 Merge 'service/qos: Modularize service level controller to avoid invalid access to auth::service' from Dawid Mędrek
Move management over effective service levels from `service_level_controller`
to a new dedicated type -- `auth_integration`.

Before these changes, it was possible for the service level controller to try
to access `auth::service` after it was deinitialized. For instance, it could
happen when reloading the cache. That HAS happened as described in the following
issue: scylladb/scylladb#24792.

Although the problem might have been mitigated or even resolved in
scylladb/scylladb@10214e13bd, it's not clear
how the service will be used in the future. It's best to prevent similar bugs
than trying to fix them later on.

The logic responsible for preventing to access an uninitialized `auth::service`
was also either non-existent, complex, or non-sufficient.

To prevent accessing `auth::service` by the service level controller, we extract
the relevant portion of the code to a separate entity -- `auth_integration`.
It's an internal helper type whose sole purpose is to manage effective service
levels.

Thanks to that, we were able to nest the lifetime of `auth_integration` within
the lifetime of `auth::service`. It's now impossible to attempt to dereference
it while it's uninitialized.

If a bug related to an invalid access is spotted again, though, it might also
be easier to debug it now.

There should be no visible change to the users of the interface of the service
level controller. We strived to make the patch minimal, and the only affected
part of the logic should be related to how `auth::service` is accessed.

The relevant portion of the initialization and deinitialization flow:

(a) Before the changes:

1. Initialize `service_level_controller`. Pass a reference to an uninitialized
   `auth::service` to it.
2. Initialize other services.
3. Initialize and start `auth::service`.
4. (work)
5. Stop and deinitialize `auth::service`.
6. Deinitialize other services.
7. Deinitialize `service_level_controller`.

(b) After the changes:

1. Initialize `service_level_controller`. Pass a reference to an uninitialized
   `auth::service` to it. (*)
2. Initialize other services.
3. Initialize and start `auth::service`.
4. Initialize `auth_integration`. Register it in `service_level_controller`.
5. (work)
6. Unregister `auth_integration` in `service_level_controller` and deinitialize
   it.
7. Stop and deinitialize `auth::service`.
8. Deinitialize other services.
9. Deinitialize `service_level_controller`.

(*):
    The reference to `auth::service` in `service_level_controller` is still
    necessary. We need to access the service when dropping a distributed
    service level.

    Although it would be best to cut that link between the service level
    controller and `auth::service` too, effectively separating the entities,
    it would require more work, so we leave it as-is for now.

    It shouldn't prove problematic as far as accessing an uninitialized service
    goes. Trying to drop a service level at the point when we're de-initializing
    auth should be impossible.

    For more context, see the function `drop_distributed_service_level` in
    `service_level_controller`.

A trivial test has been included in the PR. Although its value is questionable
as we only try to reload the service level cache at a specific moment, it's
probably the best we can deliver to provide a reproducer of the issue this patch
is resolving.

Fixes scylladb/scylladb#24792

Backport: The impact of the bug was minimal as it only affected the shutdown.
However, since CI is failing because of it, let's backport the change to all
supported versions.

Closes scylladb/scylladb#25478

* github.com:scylladb/scylladb:
  service/qos: Move effective SL cache to auth_integration
  service/qos: Add auth::service to auth_integration
  service/qos: Reload effective SL cache conditionally
  service/qos: Add gate to auth_integration
  service/qos: Introduce auth_integration
2025-08-28 13:42:55 +03:00
Aleksandra Martyniuk
773ae73704 test: add compaction task progress test 2025-08-28 12:10:13 +02:00
Dario Mirovic
fd84da7a50 test/cqlpy: add keyspace creation default replication factor tests
Add test cases for create keyspace default replication factor.
It is expected that the default replication factor is equal to the
number of racks containing at least some non-zero-token nodes
in the test suite.

Refs: #16028
2025-08-28 01:42:34 +02:00
Dawid Mędrek
646f8bc4cd cdc: Set tombstone_gc when creating log table
Normally, when we create a table, MV, etc., we apply `cf_prop_defs` to the
schema builder via the function `cf_prop_defs::apply_to_builder`. Unfortunately,
that didn't happen when creating CDC log tables, and so we might have missed
some of the properties that would normally be set to some value, even if the
default one.

One particular example of that phenomenon was `tombstone_gc`. For better or
worse, it's not a "standalone property" of a table, but rather part of
`extensions`. [Somewhat related issue: scylladb/scylladb#9722]

That may have and did cause trouble. Consider this scenario:

1. A CDC log table is created.
2. The table does NOT have any value of `tombstone_gc` set.
3. The user edits the table via `ALTER TABLE`. That statement treats the log
   table just like any other one (at least as far as the relevant portion of the
   logic is concerned). Among other things, it uses
   `cf_prop_defs::apply_to_builder`, and as a result, the `tombstone_gc`
   property is set to some value:
   * the default one if the user doesn't specify it in the statement,
   * a custom one if they do.

Why is that a problem?

First of all, it's confusing. When we perform a schema backup and a table uses
CDC, we include an ALTER statement for its corresponding CDC log table (for more
context, see issue scylladb/scylladb#18467 or commit
scylladb/scylladb@f12edbdd95).

There are two consequences for the user here:
1. If the log table had NOT been altered ever since it was created, the
   statement will miss the `tombstone_gc` property as if it couldn't be set for
   it at all. That's confusing!
2. If the log table HAD in fact been altered after its creation, the statement
   will include the `tombstone_gc` property. That's even more confusing (why was
   it not present the first time, but it is now?).

The `tombstone_gc` property should always be set to avoid confusion and
problematic edge cases in tests and to simply be consistent with how other
schema entities work.

The solution we employ is that we always set the property to the default
value. That includes the case when we reattach the log table to the base;
consider the following scenario:

1. Create a table with CDC enabled.
2. Detach the log table by performing `ALTER TABLE ... WITH cdc = {'enabled': false}`.
3. Change the `tombstone_gc` property of the log table.
4. Reattach the log table to the base in the same way as in step 2.

The expected result would be that the new value of `tombstone_gc` would be
preserved after reattaching the log table. However, that's not what will
happen. We decide to stay consistent with how other properties of a log
table behave, and we reset them after every reattachment. We might change that
in the future: see issue scylladb/scylladb#25523.

Two reproducer tests of scylladb/scylladb#25187 are included in the changes.

Fixes scylladb/scylladb#25187
2025-08-27 13:18:41 +02:00
Michał Jadwiszczak
39db90a535 test/cluster: add view build status tests 2025-08-27 10:23:04 +02:00
Michał Jadwiszczak
f7ebc7b054 test/cluster: add view building coordinator tests 2025-08-27 10:23:04 +02:00
Michał Jadwiszczak
cf138da853 test: adjust existing tests
- Disable tablets in `test_migration_on_existing_raft_topology`.
Because views on tablets are experimental now, we can safely
assume that view building coordinator will start with view build status
on raft.
- Add error injection to pause view building on worker.
Used to pause view building process, there is analogous error injection
in view_builder.
- Do a read barrier in `test_view_in_system_tables`
Increases test stability by making sure that the node sees up-to-date
group0 state and `system.built_views` is synced.
- Wait for view is build in some tests
Increases tests stability by making sure that the view is built.
- Remove xfail marker from `test_tablet_streaming_with_unbuilt_view`
This series fix https://github.com/scylladb/scylladb/issues/21564
and this test should work now.
2025-08-27 10:23:04 +02:00
Michał Jadwiszczak
233f4dcee3 db/view/view_building_worker: register staging sstable to view building coordinator when needed
Change return type of `check_needs_view_update_path()`. Instead of
retrning bool which tells whether to use staging directory (and register
to `view_update_generator`) or use normal directory.

Now the function returns enum with possible values:
- `normal_directory` - use normal directory for the sstable
- `staging_directly_to_generator` - use staging directory and register
      to `view_update_generator`
- `staging_managed_by_vbc` - use staging directory but don't register it
      to `view_update_generator` but create view building tasks for
      later

The third option is new, it's used when the table has any view which is
in building process currrently. In this case, registering it to `view_update_generator`
prematurely may lead to base-view inconsistency
(for example when a replica is in a pending state).
2025-08-27 10:23:03 +02:00
Michał Jadwiszczak
6e3e287a39 db/schema_tables: create/cleanup tasks when an index is created/dropped
Similarly as in previous commits, create view building tasks when an
index is created and cleanup view building status when it's dropped.
2025-08-27 08:55:47 +02:00
Michał Jadwiszczak
19651b4978 test/boost: enable proxy remote in some tests
After a few next patches, creating/dropping a view in tablet keyspace
will require a remote proxy to obtain references to system keyspace
and view building state.
Because of this, remote proxy needs to be explicitly enabled in boost
tests which create views.
2025-08-27 08:55:47 +02:00