The secondary index mechanism is currently used to determine the target column.
This mechanism works incorrectly for vector indexes with filtering because
it returns the last specified column as the target (vectors) column.
However, the syntax for a vector index requires the first column to be the target:
```
CREATE CUSTOM INDEX ON t(vectors, users) USING 'vector_index';
```
This discrepancy eventually leads to the following exception when performing an
ANN search on a vector index with filtering columns:
```
ANN ordering by vector requires the column to be indexed using 'vector_index'
```
This commit fixes the issue by introducing dedicated logic for vector indexes
to correctly identify the target (vectors) column.
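A minimal sketch of the kind of dedicated selection logic described above (the struct and helper names are illustrative, not the actual ScyllaDB code):
```cpp
#include <string>
#include <vector>

// Hypothetical, simplified view of the index column list as written in
// CREATE CUSTOM INDEX ON t(col1, col2, ...); not the actual ScyllaDB classes.
struct index_columns {
    std::vector<std::string> columns;
    bool is_vector_index = false; // USING 'vector_index'
};

// For a vector index the target (vectors) column is the *first* listed
// column; the remaining columns are filtering columns. Other secondary
// indexes keep the previous behaviour of taking the last column.
const std::string& target_column(const index_columns& idx) {
    return idx.is_vector_index ? idx.columns.front() : idx.columns.back();
}
```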
Fixes: SCYLLADB-635
Closes scylladb/scylladb#28740
Split input sstable(s) into multiple output sstables based on the provided
token boundaries. The input sstable(s) are divided according to the specified
split tokens, creating one output sstable per token range.
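A minimal sketch of the bucketing idea, assuming the split tokens are sorted (types and names are illustrative, not the scylla-sstable implementation):
```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Given sorted split tokens t1 < t2 < ... < tn, each partition goes into one
// of n+1 output sstables. The exact boundary convention (inclusive vs.
// exclusive) is an assumption here, not taken from the actual tool.
std::size_t output_sstable_index(int64_t token,
                                 const std::vector<int64_t>& split_tokens) {
    auto it = std::upper_bound(split_tokens.begin(), split_tokens.end(), token);
    return static_cast<std::size_t>(it - split_tokens.begin());
}
```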
Fixes: SCYLLADB-10
Closes scylladb/scylladb#28741
The query (and, in certain modes, the write) operations use the virtual table facility inside `cql_test_env`. The schema of the sstable is created as a table in `cql_test_env`. This involves registering all UDTs with the keyspace, so they are available for lookups.
This was done with a flat loop over all column types, but that is not enough: UDTs might be nested in other types, like collections. One has to traverse the type tree and register every UDT along the way.
This PR changes the flat loop to a recursive traversal of the type tree. The query operation now works with UDTs, no matter how deeply nested they are.
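A minimal sketch of the recursive traversal, using a hypothetical simplified type tree rather than the real ScyllaDB type classes:
```cpp
#include <functional>
#include <vector>

// Hypothetical, simplified view of a CQL type: a node may itself be a UDT and
// may wrap other types (collection element types, tuple/UDT field types).
// The real ScyllaDB type hierarchy (abstract_type and friends) is richer;
// this only illustrates the traversal.
struct type_node {
    bool is_udt = false;
    std::vector<const type_node*> subtypes;
};

// Register every UDT reachable from t, not just t itself. A flat loop over a
// table's column types misses UDTs nested inside collections or tuples.
void register_udts_recursively(const type_node& t,
                               const std::function<void(const type_node&)>& register_udt) {
    if (t.is_udt) {
        register_udt(t);
    }
    for (const auto* sub : t.subtypes) {
        register_udts_recursively(*sub, register_udt);
    }
}
```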
Backport: Implements missing functionality of a tool, no backport.
Closes scylladb/scylladb#28798
* github.com:scylladb/scylladb:
tools/scylla-sstable: create_table_in_cql_env(): register UDTs recursively
tools/scylla-sstable: generalize dump_if_user_type
tools/scylla-sstable: move dump_if_user_type() definition
3f7ee3ce5d introduced system.batchlog_v2, with a schema designed to speed up batchlog replays and make post-replay cleanups much more effective.
It did not introduce a cluster feature for the new table, because it is a node-local table, so the cluster can switch to the new table gradually, one node at a time.
However, https://github.com/scylladb/scylladb/issues/27886 showed that the switch causes timeouts during upgrades in mixed clusters. Furthermore, switching to the new table unconditionally on upgraded nodes means that on rollback, the batches saved into the v2 table are lost.
This PR re-introduces v1 (`system.batchlog`) support and guards the use of the v2 table with a cluster feature, so mixed clusters keep using v1 and thus remain rollback-compatible.
The re-introduced v1 support doesn't support post-replay cleanups for simplicity. The cleanup in v1 was never particularly effective anyway and we ended up disabling it for heavy batchlog users, so I don't think the lack of support for cleanup is a problem.
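A sketch of how the cluster-feature guard can gate the table choice (the wiring is an assumption, not the actual batchlog_manager code):
```cpp
#include <string_view>

// The flag would come from gms::feature_service (the batchlog_v2 cluster
// feature added in this series); the wiring below is an assumption, not the
// actual batchlog_manager code.
std::string_view batchlog_table(bool batchlog_v2_enabled_cluster_wide) {
    // Until every node advertises the feature, keep using the legacy v1
    // table, so batches survive a rollback to the previous version.
    return batchlog_v2_enabled_cluster_wide ? "batchlog_v2" : "batchlog";
}
```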
Fixes: https://github.com/scylladb/scylladb/issues/27886
Needs backport to 2026.1, to fix upgrades for clusters using batches
Closes scylladb/scylladb#28736
* github.com:scylladb/scylladb:
test/boost/batchlog_manager_test: add tests for v1 batchlog
test/boost/batchlog_manager_test: make prepare_batches() work with both v1 and v2
test/boost/batchlog_manager_test: fix indentation
test/boost/batchlog_manager_test: extract prepare_batches() method
test/lib/cql_assertions: is_rows(): add dump parameter
tools/scylla-sstable: extract query result printers
tools/scylla-sstable: add std::ostream& arg to query result printers
repair/row_level: repair_flush_hints_batchlog_handler(): add all_replayed to finish log
db/batchlog_manager: re-add v1 support
db/batchlog_manager: return all_replayed from process_batch()
db/batchlog_manager: process_bath() fix indentation
db/batchlog_manager: make batch() a standalone function
db/batchlog_manager: make structs stats public
db/batchlog_manager: allocate limiter on the stack
db/batchlog_manager: add feature_service dependency
gms/feature_service: add batchlog_v2 feature
The test is currently flaky with `reuse_ip = True`. The issue is that the
test retries replace before the first replace is rolled back and the
first replacing node is removed from gossip. The second replacing node
can see the entry of the first replacing node in gossip. This entry has
a newer generation than the entry of the node being replaced, and both
replacing nodes have the same IP as the node being replaced. Therefore,
the second replacing node incorrectly considers this entry as the entry
of the node being replaced. This entry is missing rack and DC, so the
second replace fails with
```
ERROR 2026-02-24 21:19:03,420 [shard 0:main] init - Startup failed:
std::runtime_error (Cannot replace node
8762a9d2-3b30-4e66-83a1-98d16c5dd007/127.61.127.1 with a node on
a different data center or rack.
Current location=UNKNOWN_DC/UNKNOWN_RACK, new location=dc1/rack2)
```
Fixes SCYLLADB-805
Closes scylladb/scylladb#28829
Following becb48b586, it seems we have a regression in the CI trigger logic.
The Verify Org Membership step used gh api /orgs/scylladb/members/$AUTHOR
with GITHUB_TOKEN to check if the user is an org member. However,
GITHUB_TOKEN does not have read:org scope, so the API call fails for all
users — even actual scylladb org members — causing CI triggers to be
silently skipped.
Replace the API call with the author_association field from the GitHub
event payload, which is set by GitHub itself and requires no special
token permissions. This allows any scylladb org member (MEMBER or OWNER)
to trigger CI via comment, regardless of whether they authored the PR.
Closes scylladb/scylladb#28837
remove the test since it's not relevant anymore: it's not testing what
it's supposed to test, and it's unstable.
the purpose of the test was to reproduce an issue in the legacy view
builder where a view starts to build at token T2 and then all tokens
[T1, end) with T1<T2 migrate to another node while it's still building,
exposing an issue when the view builder wraps around the token ring.
this is not relevant anymore because now view building with tablets is
done via the view building coordinator for tablets, and all views start
to build from the first token with no wraparound.
besides, the test is unstable due to relying too much on specific
timing, which was useful for investigating and fixing the original issue
but not anymore.
Fixes SCYLLADB-842
Closes scylladb/scylladb#28842
The issue was fixed by commit cc03f5c89d
("cql3: support literals and bind variables in selectors"), so the
xfail marker is no longer needed.
Closes scylladb/scylladb#28776
The test uses a create_dataset helper that duplicates existing code doing the same thing. This PR patches the basic tests to use the standard facilities.
The PR also simplifies the 3-level nested loops used to combine several sets of restoration parameters by using the itertools.product facility.
Continuation of #28600.
Cleaning tests, not backporting
Closes scylladb/scylladb#28608
* github.com:scylladb/scylladb:
test/object_store: Use itertools.product() for deeply nested loops
test/object_store: Replace dataset creation usage with standard methods
test/object_store: Shift indentation right for test_restore_with_streaming_scopes
The perf-simple-query tests were not restricted on CPU count,
so on a 96-CPU machine, they would run on 96 CPUs, and time out
in debug mode.
Also restrict memory usage and add --overprovisioned so that
pinning is disabled. Apply that to all tests.
Closes scylladb/scylladb#28821
The helper is very simple yet generic -- it takes a snapshot of a keyspace on all servers and collects the resulting sstables from workdirs. Re-using it in all test cases saves some lines of code. Also, the method was "sequential"; making it "parallel" reduces the waiting time a bit.
This will help generalize existing backup/restore tests to support the clustered snapshot/backup/restore API (see #28525) later.
Cleaning up tests, not backporting.
Closes scylladb/scylladb#28660
* github.com:scylladb/scylladb:
test/backup: Run keyspace flush and snapshot taking API in parallel
test/backup: Re-use take_snapshot() helper in do_abort_restore()
test/backup: Move take_snapshot() helper up
Split the chained inject_parameter().value_or() expression into two
separate named variables for clarity.
Use condition_variable::when() instead of wait(). when() is the
coroutine-native API, designed specifically for co_await contexts —
it avoids the heap allocation of a promise_waiter that wait() uses,
and instead uses a stack-based awaiter.
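A before/after sketch in a Seastar coroutine context (simplified, not the actual call site):
```cpp
#include <seastar/core/condition-variable.hh>
#include <seastar/core/coroutine.hh>

seastar::condition_variable cv;

seastar::future<> wait_for_signal() {
    // Before: wait() allocates a promise-based waiter on the heap.
    // co_await cv.wait();

    // After: when() returns a stack-based awaiter designed for co_await.
    co_await cv.when();
}
```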
Closes scylladb/scylladb#28824
The PR removes most of the code that assumes group0 and raft topology are not enabled. It also makes sure that joining a cluster in no-raft mode, or upgrading a node to this version in a cluster that does not yet use raft topology, will fail.
Refs #15422
No backport needed since this removes functionality.
Closes scylladb/scylladb#28514
* https://github.com/scylladb/scylladb:
group0: fix indentation after previous patch
raft_group0: simplify get_group0_upgrade_state function since no upgrade can happen any more
raft_group0: move service::group0_upgrade_state to use fmt::formatter instead of iostream
raft_group0: remove unused code from raft_group0
node_ops: remove topology over node ops code
topology: fix indentation after the previous patch
topology: drop topology_change_enabled parameter from raft_group0 code
storage_service: remove unused handle_state_* functions
gossiper: drop wait_for_gossip_to_settle and deprecate correspondent option
storage_service: fix indentation after the last patch
storage_service: remove gossiper bootstrapping code
storage_service: drop get_group_server_if_raft_topolgy_enabled
storage_service: drop is_topology_coordinator_enabled and its uses
storage_service: drop run_with_api_lock_in_gossiper_mode_only
topology: remove code that assumes raft_topology_change_enabled() may return false
test: schema_change_test: make test_schema_digest_does_not_change_with_disabled_features tests run in raft mode
test: schema_change_test: drop schema tests relevant for no raft mode only
topology: remove upgrade to raft topology code
group0: remove upgrade to group0 code
group0: refuse to boot if a cluster is still is not in a raft topology mode
storage_service: refuse to join a cluster in legacy mode
This commit adds the upgrade guide for version 2026.1.
According to the new upgrade policy, the user can now upgrade to the major version (2026.1)
from any previous minor version.
So instead of adding a separate guide from 2025.4 to 2026.1, we need a guide from 2025.x to 2026.1.
In addition, this commit:
- Updates the upgrade policy to reflect the above change.
- Removes the upgrade guides for the previous version.
Fixes https://github.com/scylladb/scylladb/issues/28533
Fixes https://github.com/scylladb/scylladb/issues/28532
Closes scylladb/scylladb#28789
assert() and SCYLLA_ASSERT() are evil (Refs #7871) because they can cause the entire Scylla cluster to crash mysteriously instead of cleanly failing the specific request that encountered a serious problem, such as a failed prerequisite.
In this two-patch series, in the first patch we introduce a new macro throwing_assert(), a convenient drop-in replacement for SCYLLA_ASSERT() but which has all the benefits of on_internal_error() instead of the dangers of SCYLLA_ASSERT().
In the second patch we use the new macro to replace every call to SCYLLA_ASSERT() in Alternator with throwing_assert().
Here is an example from the second patch to demonstrate the power of this approach: The Alternator code uses the attrs_column() function to retrieve the ":attrs" column of a schema. Since every Alternator table always has an ":attrs" column in its schema, we felt safe to SCYLLA_ASSERT() that this column exists. However, imagine that one day because of a bug, one Alternator table is missing this column. Or maybe not a bug - maybe a malicious user on a shared cluster found a way to deliberately delete this column (e.g, with a CQL command!) and this check fails. Before this patch, the entire Scylla node will crash. If the same request is sent to all nodes - the entire cluster will crash. The user might not even know which request caused this crash. In contrast, after this patch, the specific operation - e.g., PutItem - will get an exception. Only this operation, and nothing else, will be aborted, and the user who sent this request will even get an "Internal Server Error" with the assertion-failure message, alerting them that this specific query is causing problems, while other queries might work normally.
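For illustration, a minimal sketch of what such a macro could look like, built on on_internal_error(); the actual definition in the first patch may differ:
```cpp
#include <fmt/format.h>
#include <string_view>

// Placeholder for ScyllaDB's on_internal_error(), which logs the message and
// (unless configured to abort) throws, failing only the current request.
void on_internal_error(std::string_view reason);

// Sketch of a drop-in SCYLLA_ASSERT() replacement: on failure, report an
// internal error instead of aborting the whole process.
#define throwing_assert(cond)                                        \
    do {                                                             \
        if (!(cond)) {                                               \
            on_internal_error(                                       \
                fmt::format("assertion failed: {}", #cond));         \
        }                                                            \
    } while (0)
```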
There's no need to backport this patch - unless it becomes annoying that other branches don't have the throwing_assert() function and we want it to ease other backports.
Fixes #28308.
Closes scylladb/scylladb#28445
* github.com:scylladb/scylladb:
alternator: replace SCYLLA_ASSERT with throwing_assert
utils: introduce throwing_assert(), a safe replacement for assert
In nonroot installations, the install.sh script was hardcoding the
api_ui_dir and api_doc_dir paths to /opt/scylladb/ in scylla.yaml,
even though the actual files were installed to a different location
(typically ~/scylladb). This caused REST API endpoints like
/api-doc/failure_detector/ to fail with "transfer closed with
outstanding read data remaining" error because Scylla couldn't find
the API documentation files at the configured paths.
Fix this by using the $prefix variable instead of hardcoded
/opt/scylladb/ paths. This ensures that:
- In regular installations: $prefix = /opt/scylladb (no change)
- In nonroot installations: $prefix = ~/scylladb (paths now correct)
Fixes: SCYLLADB-721
Backport: The hardcoded paths in install.sh have been present since
the nonroot installation feature was introduced, making REST API
endpoints non-functional in all nonroot installations across all
live versions of Scylla.
Closes scylladb/scylladb#28805
The concurrency semaphore gates uninitialized connections across all
do_accepts loops, but was initialized to a fixed value regardless of
how many listeners exist. With multiple listeners competing for the
same units, each effectively gets less than the configured concurrency.
Initialize the semaphore to concurrency - 1 and signal 1 per listen()
call, so total capacity is concurrency - 1 + nr_listeners. This
guarantees each listener's accept loop can have at least one unit
available.
It mainly fixes a problem where setting the uninitialized_connections_semaphore_cpu_concurrency
config value to 1 would result in not being able to process connections, as only 1 out of 2
listeners got the semaphore.
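A sketch of the arithmetic, with a plain seastar::semaphore standing in for the uninitialized-connections semaphore (simplified, the real generic_server code differs):
```cpp
#include <seastar/core/semaphore.hh>

// With configured concurrency C and N listeners, total capacity becomes
// (C - 1) + N, so even with C == 1 every listener's accept loop can hold
// at least one unit.
struct server_sketch {
    seastar::semaphore _uninitialized_conns;

    explicit server_sketch(size_t concurrency)
        : _uninitialized_conns(concurrency - 1) {}

    void listen() {
        // one extra unit per listen() call
        _uninitialized_conns.signal(1);
        // ... start the do_accepts() loop for this listener ...
    }
};
```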
Fixes https://scylladb.atlassian.net/browse/SCYLLADB-762
Backport: no, it's a minor problem
Closes scylladb/scylladb#28747
* github.com:scylladb/scylladb:
test: add test_uninitialized_conns_semaphore
generic_server: fix waiters count in shed log
generic_server: scale connection concurrency semaphore by listener count
The futurization refactoring in 9d3755f276 ("replica: Futurize
retrieval of sstable sets in compaction_group_view") changed
maybe_wait_for_sstable_count_reduction() from a single predicated
wait:
```
co_await cstate.compaction_done.wait([..] {
return num_runs_for_compaction() <= threshold
|| !can_perform_regular_compaction(t);
});
```
to a while loop with a predicated wait:
```
while (can_perform_regular_compaction(t)
&& co_await num_runs_for_compaction() > threshold) {
co_await cstate.compaction_done.wait([this, &t] {
return !can_perform_regular_compaction(t);
});
}
```
This was necessary because num_runs_for_compaction() became a
coroutine (returns future<size_t>) and can no longer be called
inside a condition_variable predicate (which must be synchronous).
However, the inner wait's predicate — !can_perform_regular_compaction(t)
— only returns true when compaction is disabled or the table is being
removed. During normal operation, every signal() from compaction_done
wakes the waiter, the predicate returns false, and the waiter
immediately goes back to sleep without ever re-checking the outer
while loop's num_runs_for_compaction() condition.
This causes memtable flushes to hang forever in
maybe_wait_for_sstable_count_reduction() whenever the sstable run
count exceeds the threshold, because completed compactions signal
compaction_done but the signal is swallowed by the predicate.
Fix by replacing the predicated wait with a bare wait(), so that
any signal (including from completed compactions) causes the outer
while loop to re-evaluate num_runs_for_compaction().
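The fixed loop, roughly (simplified from the actual maybe_wait_for_sstable_count_reduction()):
```cpp
while (can_perform_regular_compaction(t)
       && co_await num_runs_for_compaction() > threshold) {
    // Bare wait(): any signal() on compaction_done, including one from a
    // completed compaction, wakes the waiter and lets the outer loop
    // re-evaluate num_runs_for_compaction().
    co_await cstate.compaction_done.wait();
}
```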
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-610
Closes scylladb/scylladb#28801
This series implements a new per-row TTL feature for CQL. The per-row TTL feature was requested in issue #13000. It is a feature that does not exist in Cassandra, and was inspired by DynamoDB's TTL feature - and under the hood uses the same implementation that we used in Alternator to implement this DynamoDB feature.
The new per-row TTL feature is completely separate from CQL's existing per-write (and per-cell) TTL, and both will be available to users.
In the per-row TTL feature, one column in the table is designated as the "TTL" column, and its value for a row is the expiration time for that row. The TTL column can be designated at table creation time, e.g.:
```cql
CREATE TABLE tab (
id int PRIMARY KEY,
t text,
expiration timestamp TTL
);
```
Or after the table already exists with:
```cql
ALTER TABLE tab TTL expiration
```
Expiration can also be disabled, with:
```cql
ALTER TABLE tab TTL NULL
```
The new per-row TTL feature has two features that users have been asking for:
1. A user can change the value of just the TTL column - without rewriting the entire row - to change the expiration time of the entire row.
2. When an expired row is finally deleted, a CDC event about this deletion appears in the CDC log (if CDC is enabled), including - if a preimage is enabled - the content of the deleted row.
To achieve the second goal (CDC events), a row is not guaranteed to disappear at exactly its expiration time (as CQL's original TTL feature guarantees). Rather, the row is deleted some time later, depending on `alternator_ttl_period_in_seconds`; until the actual deletion, the row is still readable (and even writable). But we are guaranteed that when the row is finally deleted, the CDC event will come too.
The implementation uses the same background thread used by Alternator to periodically scan for expired items and delete them.
The expiration thread keeps the same metrics as it did for Alternator:
* `scylla_expiration_scan_passes`
* `scylla_expiration_scan_table`
* `scylla_expiration_items_deleted`
* `scylla_expiration_secondary_ranges_scanned`
The series begins with a few small preparation patches, followed by the main part of the feature (which isn't big, since we are just enabling the pre-existing Alternator expiration machinery for CQL) and finally 30 tests (single-node and multi-node tests) and documentation.
This series is a new feature, so traditionally it would not be backported. However, I wouldn't be surprised if we are asked to backport it so that customers will not need to wait for a new major release.
Fixes #13000
Closes scylladb/scylladb#28320
* github.com:scylladb/scylladb:
test/cqlpy: verify that a column can't be both STATIC and PRIMARY KEY
docs/cql: document the new CQL per-row TTL feature
test/cluster: tests for the new CQL per-row TTL feature
test/cqlpy: tests for the new CQL per-row TTL feature
test: set low alternator_ttl_period_in_seconds in CQL tests
cql ttl: fix ALTER TABLE to disable TTL if column is dropped
cql ttl: add setting/unsetting of TTL column to ALTER TABLE
cql ttl: add TTL column support to CREATE TABLE and DESC TABLE
ttl: add CQL support to Alternator's TTL expiration service
alternator ttl: move TTL_TAG_KEY to a header file
alternator ttl: remove unnecessary check of feature flag
cql: add "cql_row_ttl" cluster feature
alternator: fix error message if UpdateTimeToLive is not supported
Switch vector dimension handling to fixed-width `uint32_t` type,
update parsing/validation, and add boundary tests.
The dimension is parsed as `unsigned long` at first, which is guaranteed
to be **at least** 32 bits wide, and is safe to downcast to `uint32_t`.
Move `MAX_VECTOR_DIMENSION` from `cql3_type::raw_vector` to `cql3_type`
to ensure public visibility for checks outside the class.
Add tests to verify the type boundaries.
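A sketch of the parse-then-narrow step under the constraints described above (names are illustrative, not the actual parser code):
```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Illustrative only: parse the textual dimension as unsigned long (at least
// 32 bits wide), validate it against the maximum, then narrow to the
// fixed-width type used internally.
uint32_t parse_vector_dimension(const std::string& s, unsigned long max_dimension) {
    unsigned long dim = std::stoul(s);
    if (dim > max_dimension) {
        throw std::invalid_argument("vector dimension out of range");
    }
    return static_cast<uint32_t>(dim); // safe after the bounds check
}
```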
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-223
Signed-off-by: Yaniv Kaul <yaniv.kaul@scylladb.com>
Co-authored-by: Dawid Pawlik <dawid.pawlik@scylladb.com>
Closes scylladb/scylladb#28762
The test test_build_view_with_large_row creates a materialized view and
expects the view to be built with a timeout of 5 seconds. It was
observed to fail because the timeout is too short on slow machines.
Increase the timeout to 60 seconds to make the test less flaky on slow
machines. Similarly for the other tests in the file that have a timeout
for view build, increase the timeout to 60 seconds to be consistent and
safer.
Fixes SCYLLADB-769
Closes scylladb/scylladb#28817
We have observed a bug that caused Scylla to crash due to metrics double
registration. This bug is really difficult to reproduce and was seen
only once in the wild. We think that it may be caused by a request
in-flight keeping a reference to the stats object, making it not
deregister when the index is dropped, which causes a double registration
when we recreate the index, however we are not 100% sure.
This patch makes it so the metrics always get deregistered when we drop the index, which should fix the double registration bug.
Fixes: #27252
Closes scylladb/scylladb#28655
Audit tests have been slow. They rely on the wait_for function. This function first sleeps for the duration of the specified time step, and then calls the given function. The audit tests need 0.02-0.03 seconds for the given function, but the operation lasts around 1.02-1.03 seconds, since the step is 1 second.
This patch modifies the wait_for dtest function so it first executes the given function, and only afterwards calls time.sleep(step). This reduces the time needed per call from about 1.03 to 0.03 seconds.
The total audit test suite speedup is 3x. On the developer machine the time is reduced from 13+ minutes to 4 minutes.
This patch also improves performance of some alternator tests that use the same wait_for dtest function.
`wait_for` in the dtest framework has its default time step reduced to make the environment more responsive and test execution faster.
Refs SCYLLADB-573
This is a performance improvement of testing framework. No need to backport.
Closes scylladb/scylladb#28590
* github.com:scylladb/scylladb:
dtest: shorten default sleep step in wait_for
dtest: wait_for speedup
This workflow calls the reusable backport-with-jira workflow from
scylladb/github-automation to enable automatic backport PR creation with
Jira sub-issue integration.
The workflow triggers on:
- Push to master/next-*/branch-* branches (for promotion events)
- PR labeled with backport/X.X pattern (for manual backport requests)
- PR closed/merged on version branches (for chain backport processing)
Features enabled by calling the shared workflow:
- Creates Jira sub-issues under the main issue for each backport version
- Sorts versions descending (highest first: 2025.4 -> 2025.3 -> 2025.2)
- Cherry-picks from previous version branch to avoid repeated conflicts
- On Jira API failure: adds comment to main issue, applies 'jira-sub-issue-creation-failed' label, continues with PR
Closes scylladb/scylladb#28804
While adding the new syntax "TTL" to CREATE TABLE, I noticed that the
parser actually allows a column to be defined as "STATIC PRIMARY KEY".
So I add here a small test to verify that this is not really allowed:
The syntax "c int STATIC PRIMARY KEY" is accepted, but then rejected
by a later check. The syntax "c int PRIMARY KEY STATIC" is rejected
as a syntax error.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Add user-facing documentation for the new CQL per-row TTL feature,
in docs/cql/cql-extensions.md.
Also mention (and link) the new alternative TTL feature in a few
relevant documents about the old (per-write) TTL, about CDC,
and about the CREATE TABLE and ALTER TABLE commands.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The previous patch added single-node functional tests (in test/cqlpy)
for everything which was possible to test on a single node. In this
patch we add four tests that we couldn't test on a single node, using
the test/cluster test framework:
1. Test that the TTL expiration work - both the scanning threads and
the actual deletion work on all nodes - happens on the "streaming"
scheduling group.
2. Test that even if one of the cluster's nodes is down, still all
the items get expired - another node "takes over" the dead node's
work.
3. Test that rolling upgrade works as designed for the CQL per-row TTL
feature: Before every single node in the cluster is upgraded to
support this feature, a TTL column cannot be enabled on a table.
And as soon as the last node of the cluster is upgraded, the TTL
feature begins to work completely (you don't need to reboot all
the nodes again).
4. Test that expiration works correctly on a multi-DC setup. The test
doesn't check the efficiency of this process - i.e., that today each
DC scans part of the data, reading with LOCAL_QUORUM, and writing
the deletions across the entire cluster. Rather, the test only
verifies the correctness - that expired rows do get deleted -
for the usual case where the data across the DCs is consistent.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch contains 27 functional tests (in the test/cqlpy framework)
for the new CQL per-row TTL feature. The tests cover the TTL column
configuration statements (CREATE TABLE, ALTER TABLE) as well as the
actual item expiration or non-expiration depending on the value of
the expiration-time column - and also CDC events generated on expiration
and the metrics generated by the expiration process.
These tests were written together with the code, as in "test-driven
development", so they aim to cover every corner case considered during
the development, and they reproduce every bug and misstep seen during
the development process. As a result, they hopefully achieve very high
code coverage - but since we don't have a working code-coverage tool,
I can't report any specific code coverage numbers.
These tests check everything which we can check on single-node cluster.
The next patch will add additional multi-node tests for things we can't
check here with a single node - such as the scheduling group used by the
distributed work, the effect of dead nodes on the TTL functionality, and
the process of rolling upgrade.
The tests in this patch do NOT try to stress the background expiration
scanning threads, or to check how they handle topology changes, large
amounts of data or clusters spanning multiple DCs. These tests also don't
test the performance impact of these scanning threads. Because the
expiration scanning thread is identical to the one already used by
Alternator TTL, we assume that many of these aspects were already tested
for Alternator TTL and did not change when the same implementation is
used for the new CQL feature.
All new tests pass on ScyllaDB. Because the per-row TTL feature is
a new ScyllaDB feature that does not exist on Cassandra, all these
tests are skipped on Cassandra.
Because some of these tests involve waiting for expiration, they can't
be very quick. Still, because we set alternator_ttl_period_in_seconds
to 0.5 seconds in the test framework, all 27 tests running sequentially
finish in roughly 6 seconds total, which we consider acceptable.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
In test/alternator/run we set alternator_ttl_period_in_seconds to a very
low number (0.5 seconds) to allow TTL tests to expire items very quickly
and finish quickly.
Until now, we didn't need to do this for CQL tests, because they weren't
using this Alternator-only feature. Now that CQL uses the same expiration
feature with its original configuration parameter, we need to set it in
CQL tests too.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
If "ALTER TABLE tab DROP x" is done to delete column x, and column x was
the designated TTL column, then the per-row TTL feature should be disabled
on this table.
If we don't do this, the expiration scanner will continue to scan the
table trying to read the dropped column - which will be wasteful or
worse.
A test for this case is also included in test/cqlpy/test_ttl_row.py
in a later patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The previous patch added the ability in CREATE TABLE to designate one
of the regular columns as a "TTL column", to be used by the per-row TTL
feature (Refs #13000). In this patch we add to ALTER TABLE the ability
to enable per-row TTL on an existing table with a given column as the
TTL column:
ALTER TABLE tab TTL colname
and also the ability to disable per-row TTL with
ALTER TABLE tab TTL NULL
As in CREATE TABLE, the designated TTL column must be a regular column
(it can't be a primary key column or a static column), and must have
one of the types timestamp, bigint or int.
You can't enable per-row TTL if already enabled, or disable it if
already disabled. To change the TTL column on an existing table,
you must first disable TTL, and then re-enable it with the new column.
A large collection of functional tests (in test/cqlpy), for every detail
of this patch, will come in a later patch in this series.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch enables the per-row TTL feature in CQL (Refs #13000).
This patch allows the user to create a new table with one of its columns
designated as the TTL column with a syntax like:
CREATE TABLE tab (
id int PRIMARY KEY,
t text,
expiration timestamp TTL
);
The column marked "TTL" must have the "timestamp", "bigint" or "int"
types (the choice of these types was explained in the previous patch),
and there can only be one such column. We decided not to allow a column
to be both a primary key column and a TTL column - although it would
have worked (it's supported in Alternator), I considered this non-useful
and confusing, and decided not to allow it in CQL. A TTL column also
can't be a static column.
We save the information of which column is the TTL column in a tag which
is read by the "expiration service" - originally a part of Alternator's
TTL implementation. After the previous patch, the expiration service is
running and knows how to understand CQL tables, so the CQL per-row TTL
feature will start to work.
This patch also implements DESC TABLE, printing the word "TTL" in the
right place of the output.
This patch doesn't yet implement ALTER TABLE that should allow enabling
or disabling the TTL column setting on an existing table - we'll do that
in the next patch.
A large collection of functional tests (in test/cqlpy), for every detail
of this feature will be added in a later patch.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
The Alternator TTL feature uses an "expiration service", a background
thread on each shard which periodically scans for expired items and
deletes them. When writing the expiration service, we already
anticipated that the day will come that we'll want to use it for CQL
too. Well, now that we want to use it for CQL, we only need to make
two changes:
1. Before this patch, the expiration service was only started if
Alternator was enabled. Now we need to start it unconditionally,
as both Alternator and CQL will need to use it.
The performance impact of the new background threads, when not
needed, should be minimal: These threads will wake up every
alternator_ttl_period_in_seconds (by default - once a day) and
just check if any table has per-row TTL enabled, and if not, do
nothing.
2. Before this patch, the expiration-time column had to be of type
"decimal" - a variable-precision floating-point type. This made
sense in Alternator - where all numbers are of this type, but CQL
offers better and more efficient types for this purpose. In this
patch we add support for two additional types for the expiration
time column: The "timestamp" type (which uses millisecond precision,
which our implementation truncates to whole seconds) and for the
"bigint" type storing a number of seconds since the UNIX epoch.
We also support the smaller "int" type for compatibility with
existing data, but it is not recommended because a signed
32-bit integer counting time from 1970 will break in 2038.
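A sketch of the per-type conversion described above (illustrative only; the actual expiration-service code in alternator/ttl.cc differs):
```cpp
#include <cstdint>

// Expiration is compared in whole seconds since the UNIX epoch: "timestamp"
// values carry milliseconds and are truncated, "bigint" and "int" values are
// taken as seconds directly.
int64_t expiration_from_timestamp_ms(int64_t ms) { return ms / 1000; }
int64_t expiration_from_bigint(int64_t secs)     { return secs; }
// A signed 32-bit seconds counter runs out in 2038, hence "int" is not
// recommended for new tables.
int64_t expiration_from_int(int32_t secs)        { return secs; }
```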
After this patch, the expiration service supports CQL tables, but there
is nothing yet that can enable it on CQL tables - i.e., nothing that
sets the appropriate tag on the table to tell the expiration service
which column is the expiration-time column. We'll add new syntax to
do this in the next patch.
At the moment, we leave the expiration service implementation in
its existing location - alternator/ttl.cc. This is despite the fact
that we now start it and use it also for CQL. For better modularity,
we should probably later move the expiration service implementation
to a separate module (directory).
Similarly, the expiration service's period is still configured via
alternator_ttl_period_in_seconds, which is now a misnomer because it
also affects CQL. Later we can rename this configuration parameter,
or alternatively, consider different scan periods for different tables
and table types, and have separate configuration for Alternator TTL
and CQL per-row TTL.
The metrics kept by the expiration service are the same metrics existing
for Alternator TTL, and fortunately do not have the name "alternator" in
their name:
* scylla_expiration_scan_passes
* scylla_expiration_scan_table
* scylla_expiration_items_deleted
* scylla_expiration_secondary_ranges_scanned
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
TTL_TAG_KEY stores the name of the tag in which we store the name of the
table's expiration-time column, for Alternator's TTL feature.
We already need this name in two source files, and soon we'll need it
in more files - as we want to use the same implementation also for
a new per-row TTL feature in CQL. So it's time to move the declaration
of this variable to a new header file - alternator/ttl_tag.hh.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Every node that supports the Alternator TTL feature should start its
background expiration-checking thread, *without* checking if other
nodes support this feature. This patch removes the unnecessary check.
Indeed, until all other nodes enable this feature, the background thread
will have nothing to do. But when finally all nodes have this feature -
we need this thread to already be on - without requiring another reboot
of all nodes to start this thread.
In practice, this change won't change anything on modern installations
because this feature is already three years old and always enabled on
modern clusters. But I don't want to repeat the same mistake for the new
CQL per-row TTL feature, so better fix it in Alternator too.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
This patch adds a new cluster feature "CQL_ROW_TTL", for the new CQL
per-row TTL feature.
With this patch, this node reports supporting this feature, but the CQL
per-row TTL feature can only be used once all the nodes in the cluster
supports the feature. In other words, user requests to enable per-row TTL
on a table should check this feature flag (on the whole cluster) before
proceeding.
This is needed because the implementation of the per-row-TTL expiration
requires the cooperation of all nodes to participate in scanning for
expired items, so the feature can't be trusted until all nodes participate
in it.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>