The secondary index mechanism is currently used to determine the target column.
This mechanism works incorrectly for vector indexes with filtering because
it returns the last specified column as the target (vectors) column.
However, the syntax for a vector index requires the first column to be the target:
```
CREATE CUSTOM INDEX ON t(vectors, users) USING 'vector_index';
```
This discrepancy eventually leads to the following exception when performing an
ANN search on a vector index with filtering columns:
````
ANN ordering by vector requires the column to be indexed using 'vector_index'
````
This commit fixes the issue by introducing dedicated logic for vector indexes
to correctly identify the target(vectors) column.
Fixes: SCYLLADB-635
Closesscylladb/scylladb#28740
Split input sstable(s) into multiple output sstables based on the provided
token boundaries. The input sstable(s) are divided according to the specified
split tokens, creating one output sstable per token range.
Fixes: SCYLLADB-10
Closesscylladb/scylladb#28741
The query (and in certain modes the write) operations uses virtual table facility inside `cql_test_env`. The schema of the sstable is created as a table in `cql_test_env`. This involves registering all UDTs with the keyspace, so they are available for lookups.
This was done with a flat loop over all column types, but this is not enough. UDTs might be nested in other types, like collections. One has to do a traversal of the type tree and register every UDT on the way.
This PR changes the flat loop to a recursive traversal of the type tree. The query operation now works with UDTs, no matter how deeply nested they are.
Backport: Implements missing functionality of a tool, no backport.
Closesscylladb/scylladb#28798
* github.com:scylladb/scylladb:
tools/scylla-sstable: create_table_in_cql_env(): register UDTs recursively
tools/scylla-sstable: generalize dump_if_user_type
tools/scylla-sstable: move dump_if_user_type() definition
3f7ee3ce5d introduced system.batchlog_v2, with a schema designed to speed up batchlog replays and make post-replay cleanups much more effective.
It did not introduce a cluster feature for the new table, because it is node local table, so the cluster can switch to the new table gradually, one node at a time.
However, https://github.com/scylladb/scylladb/issues/27886 showed that the switching causes timeouts during upgrades, in mixed clusters. Furthermore, switching to the new table unconditionally on upgrades nodes, means that on rollback, the batches saved into the v2 table are lost.
This PR introduces re-introduces v1 (`system.batchlog`) support and guards the use of the v2 table with a cluster feature, so mixed clusters keep using v1 and thus be rollback-compatible.
The re-introduced v1 support doesn't support post-replay cleanups for simplicity. The cleanup in v1 was never particularly effective anyway and we ended up disabling it for heavy batchlog users, so I don't think the lack of support for cleanup is a problem.
Fixes: https://github.com/scylladb/scylladb/issues/27886
Needs backport to 2026.1, to fix upgrades for clusters using batches
Closesscylladb/scylladb#28736
* github.com:scylladb/scylladb:
test/boost/batchlog_manager_test: add tests for v1 batchlog
test/boost/batchlog_manager_test: make prepare_batches() work with both v1 and v2
test/boost/batchlog_manager_test: fix indentation
test/boost/batchlog_manager_test: extract prepare_batches() method
test/lib/cql_assertions: is_rows(): add dump parameter
tools/scylla-sstable: extract query result printers
tools/scylla-sstable: add std::ostream& arg to query result printers
repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log
db/batchlog_manager: re-add v1 support
db/batchlog_manager: return all_replayed from process_batch()
db/batchlog_manager: process_bath() fix indentation
db/batchlog_manager: make batch() a standalone function
db/batchlog_manager: make structs stats public
db/batchlog_manager: allocate limiter on the stack
db/batchlog_manager: add feature_service dependency
gms/feature_service: add batchlog_v2 feature
The test is currently flaky with `reuse_ip = True`. The issue is that the
test retries replace before the first replace is rolled back and the
first replacing node is removed from gossip. The second replacing node
can see the entry of the first replacing node in gossip. This entry has
a newer generation than the entry of the node being replaced, and both
replacing nodes have the same IP as the node being replaced. Therefore,
the second replacing node incorrectly considers this entry as the entry
of the node being replaced. This entry is missing rack and DC, so the
second replace fails with
```
ERROR 2026-02-24 21:19:03,420 [shard 0:main] init - Startup failed:
std::runtime_error (Cannot replace node
8762a9d2-3b30-4e66-83a1-98d16c5dd007/127.61.127.1 with a node on
a different data center or rack.
Current location=UNKNOWN_DC/UNKNOWN_RACK, new location=dc1/rack2)
```
Fixes SCYLLADB-805
Closesscylladb/scylladb#28829
remove the test since it's not relevant anymore, it's not testing what
it's supposed to test and it's unstable.
the purpose of the test was to reproduce an issue in the legacy view
builder where a view starts to build at token T2 and then all tokens
[T1, end) with T1<T2 migrate to another node while it's still building,
exposing an issue when the view builder wraparounds the token ring.
this is not relevant anymore because now view building with tablets is
done via the view building coordinator for tablets, and all views start
to build from the first token with no wraparound.
besides, the test is unstable due to relying too much on specific
timing, which was useful for investigating and fixing the original issue
but not anymore.
Fixes SCYLLADB-842
Closesscylladb/scylladb#28842
The issue was fixed by commit cc03f5c89d
("cql3: support literals and bind variables in selectors"), so the
xfail marker is no longer needed.
Closesscylladb/scylladb#28776
The test uses create_dataset helper duplicating the existing code that does the same. This PR patches basic tests to use standard facilities.
Also the PR simplifies the 3-level nested loops used to combine several sets of restoration parameters by using itertools.product facility.
Continuation of #28600.
Cleaning tests, not backporting
Closesscylladb/scylladb#28608
* github.com:scylladb/scylladb:
test/object_store: Use itertools.product() for deeply nested loops
test/object_store: Replace dataset creation usage with standard methods
test/object_store: Shift indentation right for test_restore_with_streaming_scopes
The perf-simple-query tests were not restricted on CPU count,
so on a 96-CPU machine, they would run on 96 CPUs, and time out
in debug mode.
All restrict memory usage and add --overprovisioned so that
pinning is disabled. Apply that to all tests.
Closesscylladb/scylladb#28821
The helper is very simple yet generic -- it takes a snapshot of a keyspace on all servers and collects the resulting sstables from workdirs. Re-using it in all test cases saves some lines of code. Also, the method is "sequential", making it "parallel" reduces the waiting time a bit.
Will help generalizing existing backup/restore tests to support clustered snapshot/backup/restore API (see #28525) later.
Cleaning up tests, not backporting.
Closesscylladb/scylladb#28660
* github.com:scylladb/scylladb:
test/backup: Run keyspace flush and snapshot taking API in parallel
test/backup: Re-use take_snapshot() helper in do_abort_restore()
test/backup: Move take_snapshot() helper up
The PR removes most of the code that assumes that group0 and raft topology is not enabled. It also makes sure that joining a cluster in no raft mode or upgrading a node in a cluster that not yet uses raft topology to this version will fail.
Refs #15422
No backport needed since this removes functionality.
Closesscylladb/scylladb#28514
* https://github.com/scylladb/scylladb:
group0: fix indentation after previous patch
raft_group0: simplify get_group0_upgrade_state function since no upgrade can happen any more
raft_group0: move service::group0_upgrade_state to use fmt::formatter instead of iostream
raft_group0: remove unused code from raft_group0
node_ops: remove topology over node ops code
topology: fix indentation after the previous patch
topology: drop topology_change_enabled parameter from raft_group0 code
storage_service: remove unused handle_state_* functions
gossiper: drop wait_for_gossip_to_settle and deprecate correspondent option
storage_service: fix indentation after the last patch
storage_service: remove gossiper bootstrapping code
storage_service: drop get_group_server_if_raft_topolgy_enabled
storage_service: drop is_topology_coordinator_enabled and its uses
storage_service: drop run_with_api_lock_in_gossiper_mode_only
topology: remove code that assumes raft_topology_change_enabled() may return false
test: schema_change_test: make test_schema_digest_does_not_change_with_disabled_features tests run in raft mode
test: schema_change_test: drop schema tests relevant for no raft mode only
topology: remove upgrade to raft topology code
group0: remove upgrade to group0 code
group0: refuse to boot if a cluster is still is not in a raft topology mode
storage_service: refuse to join a cluster in legacy mode
The concurrency semaphore gates uninitialized connections across all
do_accepts loops, but was initialized to a fixed value regardless of
how many listeners exist. With multiple listeners competing for the
same units, each effectively gets less than the configured concurrency.
Initialize the semaphore to concurrency - 1 and signal 1 per listen()
call, so total capacity is concurrency - 1 + nr_listeners. This
guarantees each listener's accept loop can have at least one unit
available.
It mainly fixes problem when setting uninitialized_connections_semaphore_cpu_concurrency
config value to 1 would result in not being able to process connections, as only 1 out of 2
listeners got the semaphore.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-762
Backport: no, it's a minor problem
Closesscylladb/scylladb#28747
* github.com:scylladb/scylladb:
test: add test_uninitialized_conns_semaphore
generic_server: fix waiters count in shed log
generic_server: scale connection concurrency semaphore by listener count
The futurization refactoring in 9d3755f276 ("replica: Futurize
retrieval of sstable sets in compaction_group_view") changed
maybe_wait_for_sstable_count_reduction() from a single predicated
wait:
```
co_await cstate.compaction_done.wait([..] {
return num_runs_for_compaction() <= threshold
|| !can_perform_regular_compaction(t);
});
```
to a while loop with a predicated wait:
```
while (can_perform_regular_compaction(t)
&& co_await num_runs_for_compaction() > threshold) {
co_await cstate.compaction_done.wait([this, &t] {
return !can_perform_regular_compaction(t);
});
}
```
This was necessary because num_runs_for_compaction() became a
coroutine (returns future<size_t>) and can no longer be called
inside a condition_variable predicate (which must be synchronous).
However, the inner wait's predicate — !can_perform_regular_compaction(t)
— only returns true when compaction is disabled or the table is being
removed. During normal operation, every signal() from compaction_done
wakes the waiter, the predicate returns false, and the waiter
immediately goes back to sleep without ever re-checking the outer
while loop's num_runs_for_compaction() condition.
This causes memtable flushes to hang forever in
maybe_wait_for_sstable_count_reduction() whenever the sstable run
count exceeds the threshold, because completed compactions signal
compaction_done but the signal is swallowed by the predicate.
Fix by replacing the predicated wait with a bare wait(), so that
any signal (including from completed compactions) causes the outer
while loop to re-evaluate num_runs_for_compaction().
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-610Closesscylladb/scylladb#28801
This series implements a new per-row TTL feature for CQL. The per-row TTL feature was requested in issue #13000. It is a feature that does not exist in Cassandra, and was inspired by DynamoDB's TTL feature - and under the hood uses the same implementation that we used in Alternator to implement this DynamoDB feature.
The new per-row TTL feature is completely separate from CQL's existing per-write (and per-cell) TTL, and both will be available to users.
In the per-row TTL feature, one column in the table is designated as the "TTL" column, and its value for a row is the expiration time for that row. The TTL column can be designated at table creation time, e.g.:
```cql
CREATE TABLE tab (
id int PRIMARY KEY,
t text,
expiration timestamp TTL
);
```
Or after the table already exists with:
```cql
ALTER TABLE tab TTL expiration
```
Expiration can also be disabled, with:
```cql
ALTER TABLE tab TTL NULL
```
The new per-row TTL feature has two features that users have been asking for:
1. A user can change the value of just the TTL column - without rewriting the entire row - to change the expiration time of the entire row.
2. When an expired row is finally deleted, a CDC event about this deletion appears in the CDC log (if CDC is enabled), including - if a preimage is enabled - the content of the deleted row.
To achieve the second goal (CDC events), a row is not guaranteed to disappear at exactly its expiration time (as CQL's original TTL feature guarantees). Rather, the row is deleted some time later, depending on `alternator_ttl_period_in_seconds`; Until the actual deletion, the row is still readable (and even writable). But we are guaranteed that when the row is finally deleted, the CDC event will come too.
The implementation uses the same background thread used by Alternator to periodically scan for expired items and delete them.
The expiration thread keeps the same metrics as it did for Alternator:
* `scylla_expiration_scan_passes`
* `scylla_expiration_scan_table`
* `scylla_expiration_items_deleted`
* `scylla_expiration_secondary_ranges_scanned`
The series begins with a few small preparation patches, followed by the main part of the feature (which isn't big, since we are just enabling the pre-existing Alternator expiration machinary for CQL) and finally 30 tests (single-node and multi-node tests) and documentation.
This series is a new feature, so traditionally would not be backported. However, I wouldn't be surprised if we will be requested to backport it so that customers will not need to wait for a new major release.
Fixes#13000Closesscylladb/scylladb#28320
* github.com:scylladb/scylladb:
test/cqlpy: verify that a column can't be both STATIC and PRIMARY KEY
docs/cql: document the new CQL per-row TTL feature
test/cluster: tests for the new CQL per-row TTL feature
test/cqlpy: tests for the new CQL per-row TTL feature
test: set low alternator_ttl_period_in_seconds in CQL tests
cql ttl: fix ALTER TABLE to disable TTL if column is dropped
cql ttl: add setting/unsetting of TTL column to ALTER TABLE
cql ttl: add TTL column support to CREATE TABLE and DESC TABLE
ttl: add CQL support to Alternator's TTL expiration service
alternator ttl: move TTL_TAG_KEY to a header file
alternator ttl: remove unnecessary check of feature flag
cql: add "cql_row_ttl" cluster feature
alternator: fix error message if UpdateTimeToLive is not supported
Switch vector dimension handling to fixed-width `uint32_t` type,
update parsing/validation, and add boundary tests.
The dimension is parsed as `unsigned long` at first which is guaranteed
to be **at least** 32-bit long, which is safe to downcast to `uint32_t`.
Move `MAX_VECTOR_DIMENSION` from `cql3_type::raw_vector` to `cql3_type`
to ensure public visibility for checks outside the class.
Add tests to verify the type boundaries.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-223
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Co-authored-by: Dawid Pawlik <dawid.pawlik@scylladb.com>
Closesscylladb/scylladb#28762
The test test_build_view_with_large_row creates a materialized view and
expects the view to be built with a timeout of 5 seconds. It was
observed to fail because the timeout is too short on slow machines.
Increase the timeout to 60 seconds to make the test less flaky on slow
machines. Similarly for the other tests in the file that have a timeout
for view build, increase the timeout to 60 seconds to be consistent and
safer.
Fixes SCYLLADB-769
Closesscylladb/scylladb#28817
Audit tests have been slow. They rely on wait_for function. This function first sleeps for the duration of the time step specified, and then calls the given function. The audit tests need 0.02-0.03 seconds for the given function, but the operation lasts around 1.02-1.03 seconds, since step is 1 second.
This patch modifies wait_for dtest function so it first executes the given function, and afterwards calls time.sleep(step). This reduces time needed for the given function from 1.03 to 0.03 seconds.
Total audit tests suite speedup is 3x. On the developer machine the time is reduced from 13+ minutes to 4 minutes.
This patch also improves performance of some alternator tests that use the same wait_for dtest function.
`wait_for` in dtest framework has default time step reduced to make the environment more responsive and test execution faster.
Refs SCYLLADB-573
This is a performance improvement of testing framework. No need to backport.
Closesscylladb/scylladb#28590
* github.com:scylladb/scylladb:
dtest: shorten default sleep step in wait_for
dtest: wait_for speedup
While adding the new syntax "TTL" to CREATE TABLE, I noticed that the
parser actually allows a column to be defined as "STATIC PRIMARY KEY".
So I add here a small test to verify that this is not really allowed:
The syntax "c int STATIC PRIMARY KEY" is accepted, but then rejected
by a later check. The syntax "c int PRIMARY KEY STATIC" is rejected
as a syntax error.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The previous patch added single-node functional tests (in test/cqlpy)
for everything which was possible to test on a single node. In this
patch we add four tests that we couldn't test on a single node, using
the test/cluster test framework:
1. Test that the TTL expiration work - both the scanning threads and
the actual deletion work on all nodes - happens on the "streaming"
scheduling group.
2. Test that even if one of the cluster's nodes is down, still all
the items get expired - another node "takes over" the dead node's
work.
3. Test that rolling upgrade works as designed for the CQL per-row TTL
feature: Before every single node in the cluster is upgraded to
support this feature, a TTL column cannot be enabled on a table.
And as soon as the last node of the cluster is upgraded, the TTL
feature begins to work completely (you don't need to reboot all
the nodes again).
4. Test that expiration works correctly on a multi-DC setup. The test
doesn't check the efficiency of this process - i.e., that today each
DC scans part of the data, reading with LOCAL_QUORUM, and writing
the deletions across the entire cluster. Rather, the test only
verifies the correctness - that expired rows do get deleted -
for the usual case the data across the DCs is consistent.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch contains 27 functional tests (in the test/cqlpy framework)
for the new CQL per-row TTL feature. The tests cover the TTL column
configuration statements (CREATE TABLE, ALTER TABLE) as well as the
actual item expiration or non-expiration depending on the value of
the expiration-time column - and also CDC events generated on expiration
and the metrics generated by the expiration process.
These tests were written together with the code, as in "test-driven
development", so they aim to cover every corner case considered during
the development, and they reproduce every bug and misstep seen during
the development process. As a result, they hopefully achieve very high
code coverage - but since we don't have a working code-coverage tool,
I can't report any specific code coverage numbers.
These tests check everything which we can check on single-node cluster.
The next patch will add additional multi-node tests for things we can't
check here with a single node - such as the scheduling group used by the
distributed work, the effect of dead nodes on the TTL functionality, and
the process of rolling upgrade.
The tests in this patch do NOT try to stress the background expiration
scanning threads, or to check how they handle topology changes, large
amounts of data or clusters spanning multiple DCs. These tests also don't
test the performance impact of these scanning threads. Because the
expiration scanning thread is identical to the one already used by
Alternator TTL, we assume that many of these aspects were already tested
for Alternator TTL and did not change when the same implementation is
used for the new CQL feature.
All new tests pass on ScyllaDB. Because the per-row TTL feature is
a new ScyllaDB feature that does not exist on Cassandra, all these
tests are skipped on Cassandra.
Because some of these tests involve waiting for expiration, they can't
be very quick. Still, because we set alternator_ttl_period_in_seconds
to 0.5 seconds in the test framework, all 27 tests running sequentially
finish in roughly 6 seconds total, which we consider acceptable.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In test/alternator/run we set alternator_ttl_period_in_seconds to a very
low number (0.5 seconds) to allow TTL tests to expire items very quickly
and finish quickly.
Until now, we didn't need to do this for CQL tests, because they weren't
using this Alternator-only feature. Now that CQL uses the same expiration
feature with its original configuration parameter, we need to set it in
CQL tests too.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This series closes a gap in how CQL request and response sizes are reported.
Previously, request_size and response_size were tracked as simple counters,
providing only cumulative totals per shard. This made it difficult to understand
the distribution of message sizes and identify potential issues with very large
or very small requests.
After this series, the CQL transport reports detailed histogram metrics showing
the distribution of request and response sizes. These histograms are tracked
per-instance, per-type (per ops), and per-scheduling-group, providing
much better visibility into CQL traffic patterns.
The histograms are collected for QUERY, EXECUTE, and BATCH operations, which are
the primary data path operations where message size distribution is most relevant.
This data can help identify:
- Clients sending unexpectedly large requests
- Operations with oversized result sets
- Scheduling group differences in traffic patterns
To support this, the series extends the approx_exponential_histogram template to
handle accurate sum, adds a bytes_histogram type alias optimized for byte-range measurements (1KB to 1GB).
The existing per-shard counter metrics are maintained for backward compatibility.
Metrics example:
```
scylla_transport_cql_request_bytes{kind="BATCH",scheduling_group_name="sl:default",shard="0"} 129808
scylla_transport_cql_request_bytes{kind="EXECUTE",scheduling_group_name="sl:default",shard="0"} 227409
scylla_transport_cql_request_bytes{kind="PREPARE",scheduling_group_name="sl:default",shard="0"} 631
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:default",shard="0"} 2809
scylla_transport_cql_request_bytes{kind="QUERY",scheduling_group_name="sl:driver",shard="0"} 4079
scylla_transport_cql_request_bytes{kind="REGISTER",scheduling_group_name="sl:default",shard="0"} 98
scylla_transport_cql_request_bytes{kind="STARTUP",scheduling_group_name="sl:driver",shard="0"} 432
scylla_transport_cql_request_histogram_bytes_sum{kind="QUERY",scheduling_group_name="sl:driver"} 4079
scylla_transport_cql_request_histogram_bytes_count{kind="QUERY",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1024.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2048.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4096.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8192.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16384.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="32768.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="65536.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="131072.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="262144.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="524288.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1048576.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="2097152.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="4194304.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="8388608.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="16777216.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="33554432.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="67108864.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="134217728.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="268435456.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="536870912.000000",scheduling_group_name="sl:driver"} 57
scylla_transport_cql_request_histogram_bytes_bucket{kind="QUERY",le="1073741824.000000",scheduling_group_name="sl:driver"} 57
```
**The field sees it as an important issue**
Fixes#14850Closesscylladb/scylladb#28419
* github.com:scylladb/scylladb:
test/boost/estimated_histogram_test.cc: Switch to real Sum
transport/server: to bytes_histogram
approx_exponential_histogram: Add sum() method for accurate value tracking
utils/estimated_histogram.hh: Add bytes_histogram
When a permit is preemptively aborted, store the corresponding
exception in permit's member: `reader_permit::impl::_ex`.
This makes preemptively-aborted permits consistently report aborted()
and prevents them from being treated as eligible for inactive
registration in `register_inactive_read()`, avoiding assertion
failures on unexpected permit state.
Closesscylladb/scylladb#28591
Refs: SCYLLADB-193
Adds a "snapshot_table" topology operation and associated data structure/table columns to support dispatching a snapshot operation as a topo coordinator op.
Logic is similar, and thus broken out and semi-shared with, truncation.
Also adds optional tablet metadata to manifest, listing all tablets present in a given snapshot, as well as
tablet sstable ownership, repair status, and token ranges.
As per description in SCYLLADB-193, the alternative snapshot mechanism is in
a separate namespace under 'tablets', which while dubious is the desired destination.
The API is accessed via `nodetool cluster snapshot`, which more or less mirrors `nodetool snapshot`, but using topo op.
TTL is added to message propagation as a separate patch here, since it is not (yet) used from API (or nodetool).
Requires a syntax for both API and command line.
Closesscylladb/scylladb#28525
* github.com:scylladb/scylladb:
topology::snapshot: Add expiry (ttl) to RPC/topo op
test_snapshot_with_tablets: Extend test to check manifest content
table::manifest: Add tablet info to manifest.json
test::test_snapshot_with_tablets: Add small test for topo coordinated snapshot
scylla-nodetool: Add "cluster snapshot" command
api::storage_service: Add tablets/snapshots command for cluster level snapshot
db::snapshot-ctl: Add method to do snapshot using topo coordinator
storage_proxy: Add snapshot_keyspace method
topology_coordinator: Add handler for snapshot_tables
storage_proxy: Add handler for SNAPSHOT_WITH_TABLETS
messaging_service: Add SNAPSHOT_WITH_TABLETS verb
feature_service: Add SNAPSHOT_AS_TOPOLOGY_OPERATION feature
topology_mutation: Add setter for snapshot part of row
system_keyspace::topology_requests_entry: Add snapshot info to table
topology_state_machine: Add snapshot_tables operation
topology_coordinator: Break out logic from handle_truncate_table
storage_proxy: Break out logic from request_truncate_with_tablets
test/object_store: Remove create_ks_and_cf() helper
test/object_store: Replace create_ks_and_cf() usage with standard methods
test/object_store: Shift indentation right for test cases
Migrate cluster tests directory to be handled by pytest. This is the next step in process of unification of the tests and migration to the pytest.
With this PR cluster test will be executed with the full path to the file instead of `suite/test` paradigm.
Backport is not needed because it framework enhancement.
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-46Closesscylladb/scylladb#27618
* github.com:scylladb/scylladb:
test.py: remove setsid from the framework
test.py: rename suite.yaml to test_config.yaml
test.py: add cluster tests to be executed by pytest
test.py: add random seed for topology tests reproducibility
test.py: add explicit default values to pytest options
test.py: replace SCYLLA env var with build_mode fixture
Recently we suffered a regression on how Alternator TTL behaves when a node goes down when tablets are used.
Usually, expiration of data in a particular tablet are handled by this tablet's "primary replica". However, if that node is down, we want another node to perform these expiration until the primary replica goes back online. We created a function `tablet_map::get_secondary_replica()` to select that "other node". We don't care too much what the "secondary replica" means, but we do care that it's different from the primary replica - if it's the same the expiration of that tablet will never be done.
It turns out that recently, in commits 817fdad and d88036d, the implementation of get_primary_replica() changed without a corresponding change to get_secondary_replica(). After those changes, the two functions are mismatched, and sometimes return the same node for both primary and secondary replica.
Unfortunately, although we had a dtest for the handling of a dead node in Alternator TTL, it failed to reproduce this bug, so this regression was missed - nothing else besides Alternator TTL ever used the get_secondary_replica() function.
So this series, in addition to fixing the bug, we add two tests that reproduce this bug (fail before the fix, pass with the fix):
1. A unit test that checks that get_secondary_replica() always returns a different node from get_primary_replica()
2. A cluster test based on the original dtest, which does reproduce this bug in Alternator TTL where some of the data was never expired (but only failed in release build, for an unknown reason).
Fixes SCYLLADB-777.
Closesscylladb/scylladb#28771
* github.com:scylladb/scylladb:
test: add unit test for tablet_map::get_secondary_replica()
test, alternator: add test for TTL expiration with a node down
locator: fix get_secondary_replica() to match get_primary_replica()
The path removes the code protected by !raft_topology_change_enabled()
since it is no longer reachable. Drop test_lwt_for_tablets_is_not_supported_without_raft
since not raft mode is no longer supported.
It is not enough to go over all column types and register the UDTs. UDTs
might be nested in other types, like collections. One has to do a
traversal of the type tree and register every UDT on the way. That is
what this patch does.
This function is used by the query and write operations, which should
now both work with nested UDTs.
Add a test which fails before and passes after this patch.
Audit tests have been slow. They rely on wait_for function.
This function first sleeps for the duration of the time step
specified, and then calls the given function. The audit tests
need 0.02-0.03 seconds for the given function, but the operation
lasts around 1.02-1.03 seconds, since step is 1 second.
This patch modifies wait_for dtest function so it first executes
the given function, and afterwards calls time.sleep(step). This
reduces time needed for the given function from 1.03 to 0.03 seconds.
Total audit tests suite speedup is 3x. On the developer machine
the time is reduced from 13+ minutes to 4 minutes.
This patch also improves performance of some alternator tests that
use the same wait_for dtest function.
Refs SCYLLADB-573
The current way of checking the boost's stdout can have a race
condition when pytest will try to read the file before it was really
flushed. So this PR should eliminate this possibility.
Closesscylladb/scylladb#28783
This pr adds await for each one of the tasks to wait for the MV schema to be added successfully
and then to start the server shutdown
With this change we dont need will not get the shutdown races.
Closesscylladb/scylladb#28774
`test_autoretrain_dict` sporadically fails because the default
compression algorithm was changed after the test was written.
`9ffa62a986815709d0a09c705d2d0caf64776249` was an attempt to fix it by
changing the compression configuration during node startup. However,
the configuration change had an incorrect YAML format and was
ignored by ScyllaDB. This commit fixes it.
Fixes: scylladb/scylladb#28204Closesscylladb/scylladb#28746
For a while, we have seen coroutine related tests (those that use the
coroutine_task fixture) fail occasionally, because no coroutine frame is
found. Multiple attempts were made to make this problem self-diagnosing
and dump enough information to be able to debug this post-mortem. To no
avail so far. A lot of time was invested into this this benign issue:
See the long discussion at https://github.com/scylladb/scylladb/issues/22501.
It is not known if the bug is in gdb, or the gdb script trying to find
the coroutine frame. In any case, both are only used for debugging, so
we can tolerate occasional failures -- we are forced to do so when
working with gdb anyway.
Instead of piling on more effor there, just skip these tests when the
problem occurs. This solves the CI flakyness.
Fixes: #22501Closesscylladb/scylladb#28745
Add --continue-after-error true to perf-cql-raw and perf-alternator
tests, and --stop-on-error false to perf-simple-query test, so that
tests don't abort on the first error.
Reason for this is that tests are flaky with example failure:
Perf test failed: std::runtime_error (server returned ERROR to EXECUTE)
When CPU is starved on CI we can return timeouts and/or other errors.
The change should make tests more robust on the expense of smaller test
scope. But those tests were written mostly to test startup sequence
as it differs from Scylla's starup.
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-759Closesscylladb/scylladb#28767
With previous architecture, scylla servers were handled by test.py and
if pytest fails, test.py was responsible for stopping scylla processes.
Now with only pytest handling, there is no such mechanism, that's why
I'm removing the setsid, so when the parent pytest process closes it
will automatically close all child including any started process during
testing. This will allow to not leave any scylla process in case pytest
was killed.
Set TOPOLOGY_RANDOM_FAILURES_TEST_SHUFFLE_SEED environment variable
during pytest configuration to enable to ensure that all xdist workers will
discover the same scope of the tests. This is a known limitation of the
xdist plugi where the discovered tests should be consistenta across
master and workers.
Add explicit default values to pytest command line options to prevent
issues when running tests with pytest's parallel execution where
options are not present on upper conftest, so they're just not set at all.
Replace direct usage of SCYLLA environment variable with the build_mode
pytest fixture and path_to helper function. This makes tests more
flexible and consistent with the test framework. Also this allows to use
tests with xdist, where environment variable can be left in the master
process and will not be set in the workers
Add using the fixture to get the scylla binary from the suite, this will
align with getting relocatable Scylla exe.