Compare commits

580 Commits

Author SHA1 Message Date
Patryk Jędrzejczak
be942e9a4f test: test_remove_garbage_group0_members: wait for token ring and group0 consistency before removenode
The removenode initiator could have an outdated token ring (still considering
the node removed by the previous removenode a token owner) and unexpectedly
reject the operation.

Fix that by waiting for token ring and group0 consistency before removenode.
Note that the test already checks that consistency, but only for one node,
which is different from the removenode initiator.

This test has been removed in master together with the code being tested
(the gossip-based topology). Hence, the fix is submitted directly to 2026.1.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1103

Backport to all supported branches (other than 2026.1), as the test can fail
there.

Closes scylladb/scylladb#29108

(cherry picked from commit 1398a55d16)

Closes scylladb/scylladb#29205
2026-03-24 16:09:02 +01:00
Pavel Emelyanov
e212762ab7 database: Rate limit all tokens from a range
The limiter scans ranges to decide whether or not to rate-limit the
query. However, when considering each range only the front one's token
is accounted. This looks like a misprint.

The limiter was introduced in cc9a2ad41f
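
The difference between the two accounting shapes can be sketched as follows. This is a simplified Python model, not the ScyllaDB C++ code: `TokenRange`, `account_front_only`, and `account_each` are hypothetical names, and each range is assumed to contribute a single start token.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TokenRange:
    # Hypothetical: one representative token per range.
    start_token: int


def account_front_only(ranges, hits):
    # Buggy shape: every iteration charges the first range's token.
    for _ in ranges:
        hits[ranges[0].start_token] += 1


def account_each(ranges, hits):
    # Intended shape: every range charges its own token.
    for r in ranges:
        hits[r.start_token] += 1


ranges = [TokenRange(10), TokenRange(20), TokenRange(30)]
buggy, fixed = Counter(), Counter()
account_front_only(ranges, buggy)
account_each(ranges, fixed)
```

In this model the buggy variant charges token 10 three times, while the fixed variant charges each of the three tokens once.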

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#29050

(cherry picked from commit 8b1ca6dcd6)

Closes scylladb/scylladb#29107

Closes scylladb/scylladb#29194
2026-03-24 16:04:01 +02:00
Botond Dénes
a41d1ec711 Merge 'doc: fix the installation section' from Anna Stuchlik
This PR fixes the Installation page:

- Replaces `http` with `https` in the download command.
- Replaces the Open Source example from the Installation section for CentOS (we overlooked this example before).

Fixes https://github.com/scylladb/scylladb/issues/29087

This update affects all supported versions and should be backported as a bug fix.

Closes scylladb/scylladb#29088

* github.com:scylladb/scylladb:
  doc: remove the Open Source Example from Installation
  doc: replace http with https in the installation instructions

(cherry picked from commit e8b37d1a89)

Closes scylladb/scylladb#29135

Closes scylladb/scylladb#29192
2026-03-23 23:50:15 +02:00
Yaron Kaikov
a4b4c4c0a8 .github/workflows/trigger-scylla-ci: fix heredoc injection in trigger-scylla-ci workflow
Move all ${{ }} expression interpolations into env: blocks so they are
passed as environment variables instead of being expanded directly into
shell scripts. This prevents an attacker from escaping the heredoc in
the Validate Comment Trigger step and executing arbitrary commands on
the runner.

The Verify Org Membership step is hardened in the same way for
defense-in-depth.
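
The class of bug can be reproduced outside GitHub Actions. The sketch below is an illustration in Python, not the workflow itself: it assumes a POSIX `sh` is available, and `PAYLOAD`, `run_interpolated`, and `run_with_env` are hypothetical names. Pasting untrusted text into the script source lets it terminate the heredoc, while passing it through an environment variable keeps it as data.

```python
import os
import subprocess

# Hypothetical attacker-controlled comment that closes the heredoc early.
PAYLOAD = "hello\nEOF\necho INJECTED\ncat <<EOF\nbye"


def run_interpolated(comment: str) -> str:
    # Unsafe: the untrusted text becomes part of the script source,
    # which is analogous to expanding an expression directly into a run: step.
    script = f"cat <<EOF\n{comment}\nEOF\n"
    result = subprocess.run(["sh", "-c", script],
                            capture_output=True, text=True, check=True)
    return result.stdout


def run_with_env(comment: str) -> str:
    # Safer: the script text is constant; the untrusted text arrives
    # through the environment and is expanded only as heredoc content.
    script = "cat <<EOF\n$COMMENT\nEOF\n"
    env = dict(os.environ, COMMENT=comment)
    result = subprocess.run(["sh", "-c", script], env=env,
                            capture_output=True, text=True, check=True)
    return result.stdout


unsafe_out = run_interpolated(PAYLOAD)
safe_out = run_with_env(PAYLOAD)
```

In the interpolated variant the shell executes `echo INJECTED`; in the environment-variable variant the same text is printed verbatim, because the heredoc delimiter is recognized when the script is parsed, before variables are expanded.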

Refs: GHSA-9pmq-v59g-8fxp
Fixes: SCYLLADB-954

Closes scylladb/scylladb#28935

(cherry picked from commit 977bdd6260)

Closes scylladb/scylladb#28947
2026-03-20 11:00:38 +02:00
Botond Dénes
155d12f4c9 mutation/collection_mutation: don't copy the serialized collection
serialize_collection_mutation() copies the serialized collection into
the returned collection_mutation object. Change to move to avoid the
copy.

Fixes: SCYLLADB-1041

Closes scylladb/scylladb#29010

(cherry picked from commit 15cfa5beeb)

Closes scylladb/scylladb#29024

Closes scylladb/scylladb#29037
2026-03-20 11:00:11 +02:00
Aleksandra Martyniuk
e78426c5d4 nodetool: cluster repair: do not fail if a table was dropped
nodetool cluster repair without additional params repairs all tablet
keyspaces in a cluster. Currently, if a table is dropped while
the command is running, all tables are repaired but the command finishes
with a failure.

Modify nodetool cluster repair. If a table wasn't specified
(i.e. all tables are repaired), the command finishes successfully
even if a table was dropped.

If a table was specified and it does not exist (e.g. because it was
dropped before the repair was requested), then the behavior remains
unchanged.

Fixes: SCYLLADB-568.

Closes scylladb/scylladb#28739

(cherry picked from commit 2e68f48068)

Closes scylladb/scylladb#29006

Closes scylladb/scylladb#29038
2026-03-20 10:59:26 +02:00
Anna Stuchlik
11248e5cef doc: update the warning about shared dictionary training
This commit updates the inadequate warning on the Advanced Internode (RPC) Compression page.

The warning is replaced with a note about how training data is encrypted.

Fixes https://github.com/scylladb/scylladb/issues/29109

Closes scylladb/scylladb#29111

(cherry picked from commit 88b98fac3a)

Closes scylladb/scylladb#29119

Closes scylladb/scylladb#29139
2026-03-20 10:58:59 +02:00
Avi Kivity
1f7dca0225 Merge 'Fix bad performance for densely populated partition index pages' from Tomasz Grabiec
This applies to small partition workload where index pages have high partition count, and the index doesn't fit in cache. It was observed that the count can be in the order of hundreds. In such a workload pages undergo constant population, LSA compaction, and LSA eviction, which has severe impact on CPU utilization.

Refs https://scylladb.atlassian.net/browse/SCYLLADB-620

This PR reduces the impact by several changes:

  - reducing memory footprint in the partition index. Assuming partition key size is 16 bytes, the cost dropped from 96 bytes to 36 bytes per partition.

  - flattening the object graph and amortizing storage. Storing entries directly in the vector. Storing all key values in a single managed_bytes. Making index_entry a trivial struct.

  - index entries and key storage are now trivially moveable, and batched inside vector storage
    so LSA migration can use memcpy(), which amortizes the cost per key. This reduces the cost of LSA segment compaction.

 - LSA eviction is now pretty much constant time for the whole page
   regardless of the number of entries, because elements are trivial and batched inside vectors.
   Page eviction cost dropped from 50 us to 1 us.
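
The idea of flattening the object graph can be illustrated with a small Python model. It is only a sketch of the layout principle (all keys in one contiguous buffer, entries reduced to fixed-size offsets), not the C++ `index_entry` or `managed_bytes` code; `FlatKeyPage` is a hypothetical name.

```python
from array import array


class FlatKeyPage:
    """Keys stored back to back in one buffer; entries are plain offsets."""

    def __init__(self, keys):
        self._offsets = array("I", [0])  # offsets[i]..offsets[i+1] bounds key i
        buf = bytearray()
        for k in keys:
            buf += k
            self._offsets.append(len(buf))
        self._buf = bytes(buf)  # single allocation holding every key

    def __len__(self):
        return len(self._offsets) - 1

    def key(self, i):
        return self._buf[self._offsets[i]:self._offsets[i + 1]]


page = FlatKeyPage([b"pk-0001", b"pk-02", b"pk-000003"])
```

With this layout, moving or dropping a page touches two flat buffers regardless of how many keys it holds, which is the property the list above relies on.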

Performance evaluated with:

   scylla perf-simple-query -c1 -m200M --partitions=1000000

Before:

```
7774.96 tps (166.0 allocs/op, 521.7 logallocs/op,  54.0 tasks/op,  802428 insns/op,  430457 cycles/op,        0 errors)
7511.08 tps (166.1 allocs/op, 527.2 logallocs/op,  54.0 tasks/op,  804185 insns/op,  430752 cycles/op,        0 errors)
7740.44 tps (166.3 allocs/op, 526.2 logallocs/op,  54.2 tasks/op,  805347 insns/op,  432117 cycles/op,        0 errors)
7818.72 tps (165.2 allocs/op, 517.6 logallocs/op,  53.7 tasks/op,  794965 insns/op,  427751 cycles/op,        0 errors)
7865.49 tps (165.1 allocs/op, 513.3 logallocs/op,  53.6 tasks/op,  788898 insns/op,  425171 cycles/op,        0 errors)
```

After (+318%):

```
32492.40 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109236 insns/op,  103203 cycles/op,        0 errors)
32591.99 tps (130.4 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  108947 insns/op,  102889 cycles/op,        0 errors)
32514.52 tps (130.6 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109118 insns/op,  103219 cycles/op,        0 errors)
32491.14 tps (130.6 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109349 insns/op,  103272 cycles/op,        0 errors)
32582.90 tps (130.5 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109269 insns/op,  102872 cycles/op,        0 errors)
32479.43 tps (130.6 allocs/op,  12.8 logallocs/op,  36.0 tasks/op,  109313 insns/op,  103242 cycles/op,        0 errors)
32418.48 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109201 insns/op,  103301 cycles/op,        0 errors)
31394.14 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109267 insns/op,  103301 cycles/op,        0 errors)
32298.55 tps (130.7 allocs/op,  12.8 logallocs/op,  36.1 tasks/op,  109323 insns/op,  103551 cycles/op,        0 errors)
```
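
The +318% figure is consistent with comparing the mean throughput of the two runs above. A minimal check, assuming the percentage was derived from the mean tps (the PR does not state the method; `improvement_percent` is a hypothetical helper):

```python
from statistics import mean

# tps values copied from the "Before" and "After" listings above.
before_tps = [7774.96, 7511.08, 7740.44, 7818.72, 7865.49]
after_tps = [32492.40, 32591.99, 32514.52, 32491.14, 32582.90,
             32479.43, 32418.48, 31394.14, 32298.55]


def improvement_percent(before, after):
    # Relative change of the mean throughput, in percent.
    return (mean(after) / mean(before) - 1.0) * 100.0


gain = improvement_percent(before_tps, after_tps)
print(f"+{gain:.0f}%")  # prints "+318%"
```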

When the workload is miss-only, with both row cache and index cache disabled (no cache maintenance cost):

  perf-simple-query -c1 -m200M --duration 6000 --partitions=100000 --enable-index-cache=0 --enable-cache=0

Before:

```
9124.57 tps (146.2 allocs/op, 789.0 logallocs/op,  45.3 tasks/op,  889320 insns/op,  357937 cycles/op,        0 errors)
9437.23 tps (146.1 allocs/op, 789.3 logallocs/op,  45.3 tasks/op,  889613 insns/op,  357782 cycles/op,        0 errors)
9455.65 tps (146.0 allocs/op, 787.4 logallocs/op,  45.2 tasks/op,  887606 insns/op,  357167 cycles/op,        0 errors)
9451.22 tps (146.0 allocs/op, 787.4 logallocs/op,  45.3 tasks/op,  887627 insns/op,  357357 cycles/op,        0 errors)
9429.50 tps (146.0 allocs/op, 787.4 logallocs/op,  45.3 tasks/op,  887761 insns/op,  358148 cycles/op,        0 errors)
9430.29 tps (146.1 allocs/op, 788.2 logallocs/op,  45.3 tasks/op,  888501 insns/op,  357679 cycles/op,        0 errors)
9454.08 tps (146.0 allocs/op, 787.3 logallocs/op,  45.3 tasks/op,  887545 insns/op,  357132 cycles/op,        0 errors)
```

After (+55%):

```
14484.84 tps (150.7 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  396164 insns/op,  229490 cycles/op,        0 errors)
14526.21 tps (150.8 allocs/op,   6.5 logallocs/op,  44.8 tasks/op,  396401 insns/op,  228824 cycles/op,        0 errors)
14567.53 tps (150.7 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  396319 insns/op,  228701 cycles/op,        0 errors)
14545.63 tps (150.6 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  395889 insns/op,  228493 cycles/op,        0 errors)
14626.06 tps (150.5 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  395254 insns/op,  227891 cycles/op,        0 errors)
14593.74 tps (150.5 allocs/op,   6.5 logallocs/op,  44.7 tasks/op,  395480 insns/op,  227993 cycles/op,        0 errors)
14538.10 tps (150.8 allocs/op,   6.5 logallocs/op,  44.8 tasks/op,  397035 insns/op,  228831 cycles/op,        0 errors)
14527.18 tps (150.8 allocs/op,   6.5 logallocs/op,  44.8 tasks/op,  396992 insns/op,  228839 cycles/op,        0 errors)
```

Same as above, but with summary ratio increased from 0.0005 to 0.005 (smaller pages):

Before:

```
33906.70 tps (146.1 allocs/op,  83.6 logallocs/op,  45.1 tasks/op,  170553 insns/op,   98104 cycles/op,        0 errors)
32696.16 tps (146.0 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170369 insns/op,   98405 cycles/op,        0 errors)
33889.05 tps (146.1 allocs/op,  83.6 logallocs/op,  45.1 tasks/op,  170551 insns/op,   98135 cycles/op,        0 errors)
33893.24 tps (146.1 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170488 insns/op,   98168 cycles/op,        0 errors)
33836.73 tps (146.1 allocs/op,  83.6 logallocs/op,  45.1 tasks/op,  170528 insns/op,   98226 cycles/op,        0 errors)
33897.61 tps (146.0 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170428 insns/op,   98081 cycles/op,        0 errors)
33834.73 tps (146.1 allocs/op,  83.5 logallocs/op,  45.1 tasks/op,  170438 insns/op,   98178 cycles/op,        0 errors)
33776.31 tps (146.3 allocs/op,  83.9 logallocs/op,  45.2 tasks/op,  170958 insns/op,   98418 cycles/op,        0 errors)
33808.08 tps (146.3 allocs/op,  83.9 logallocs/op,  45.2 tasks/op,  170940 insns/op,   98388 cycles/op,        0 errors)
```

After (+18%):

```
40081.51 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121047 insns/op,   82231 cycles/op,        0 errors)
40005.85 tps (148.6 allocs/op,   4.4 logallocs/op,  45.2 tasks/op,  121327 insns/op,   82545 cycles/op,        0 errors)
39816.75 tps (148.3 allocs/op,   4.4 logallocs/op,  45.1 tasks/op,  121067 insns/op,   82419 cycles/op,        0 errors)
39953.11 tps (148.1 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121027 insns/op,   82258 cycles/op,        0 errors)
40073.96 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121006 insns/op,   82313 cycles/op,        0 errors)
39882.25 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  120925 insns/op,   82320 cycles/op,        0 errors)
39916.08 tps (148.3 allocs/op,   4.4 logallocs/op,  45.1 tasks/op,  121054 insns/op,   82393 cycles/op,        0 errors)
39786.30 tps (148.2 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121027 insns/op,   82465 cycles/op,        0 errors)
38662.45 tps (148.3 allocs/op,   4.4 logallocs/op,  45.0 tasks/op,  121108 insns/op,   82312 cycles/op,        0 errors)
39849.42 tps (148.3 allocs/op,   4.4 logallocs/op,  45.1 tasks/op,  121098 insns/op,   82447 cycles/op,        0 errors)
```

Closes scylladb/scylladb#28603

* github.com:scylladb/scylladb:
  sstables: mx: index_reader: Optimize parsing for no promoted index case
  vint: Use std::countl_zero()
  test: sstable_partition_index_cache_test: Validate scenario of pages with sparse promoted index placement
  sstables: mx: index_reader: Amortize partition key storage
  managed_bytes: Hoist write_fragmented() to common header
  utils: managed_vector: Use std::uninitialized_move() to move objects
  sstables: mx: index_reader: Keep promoted_index info next to index_entry
  sstables: mx: index_reader: Extract partition_index_page::clear_gently()
  sstables: mx: index_reader: Shave-off 16 bytes from index_entry by using raw_token
  sstables: mx: index_reader: Reduce allocation_section overhead during index page parsing by batching allocation
  sstables: mx: index_reader: Keep index_entry directly in the vector
  dht: Introduce raw_token
  test: perf_simple_query: Add 'sstable-format' command-line option
  test: perf_simple_query: Add 'sstable-summary-ratio' command-line option
  test: perf-simple-query: Add option to disable index cache
  test: cql_test_env: Respect enable-index-cache config

(cherry picked from commit 5e7fb08bf3)

Closes scylladb/scylladb#29136

Closes scylladb/scylladb#29140
2026-03-20 10:58:26 +02:00
Piotr Dulikowski
6b9aa303d8 Merge '[Backport 2026.1] mv: allow skipping view updates when a collection is unmodified' from Scylladb[bot]
mv: allow skipping view updates when a collection is unmodified
When we generate view updates, we check whether we can skip the
entire view update if all columns selected by the view are unmodified.
However, for collection columns, we only check if they were unset
before and after the update.
In this patch we add a check for the actual collection contents.
We perform this check for both virtual and non-virtual selections.
When the column is only a virtual column in the view, it would be
enough to check the liveness of each collection cell, however for
that we'd need to deserialize the entire collection anyway, which
should be effectively as expensive as comparing all of its bytes.
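
The check described above can be modeled as follows. This is a Python sketch rather than the C++ view-update code: rows are assumed to map column names to serialized bytes (`None` when unset), and `can_skip_view_update` is a hypothetical stand-in for the real function.

```python
from typing import Iterable, Mapping, Optional

# Hypothetical row model: column name -> serialized value, None if unset.
Row = Mapping[str, Optional[bytes]]


def can_skip_view_update(before: Row, after: Row,
                         selected_columns: Iterable[str]) -> bool:
    for col in selected_columns:
        old, new = before.get(col), after.get(col)
        # Check that existed before the patch: set vs. unset.
        if (old is None) != (new is None):
            return False
        # Check added by the patch: compare the serialized contents.
        if old is not None and old != new:
            return False
    return True


base = {"tags": b"\x01a\x01b"}
same = {"tags": b"\x01a\x01b"}
changed = {"tags": b"\x01a\x01c"}
```

Comparing the serialized bytes avoids deserializing the collection, which is the trade-off the message mentions for virtual columns.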

Fixes: SCYLLADB-996

- (cherry picked from commit 01ddc17ab9)

Parent PR: #28839

Closes scylladb/scylladb#28977

* github.com:scylladb/scylladb:
  Merge 'mv: allow skipping view updates when a collection is unmodified' from Wojciech Mitros
  mv: remove dead code in view_updates::can_skip_view_updates

Closes scylladb/scylladb#29094
2026-03-18 10:41:50 +01:00
Patryk Jędrzejczak
3863dfbc0a test: test_raft_no_quorum: decrease group0_raft_op_timeout_in_ms after quorum loss
`test_raft_no_quorum.py::test_cannot_add_new_node` is currently flaky in dev
mode. The bootstrap of the first node can fail due to `add_entry()` timing
out (with the 1s timeout set by the test case).

Other test cases in this test file could fail in the same way as well, so we
need a general fix. We don't want to increase the timeout in dev mode, as it
would slow down the test. The solution is to keep the timeout unchanged, but
set it only after quorum is lost. This prevents unexpected timeouts of group0
operations with almost no impact on the test running time.

A note about the new `update_group0_raft_op_timeout` function: waiting for
the log seems to be necessary only for
`test_quorum_lost_during_node_join_response_handler`, but let's do it
for all test cases just in case (including `test_can_restart` that shouldn't
be flaky currently).

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-913

Closes scylladb/scylladb#28998

(cherry picked from commit 526e5986fe)

Closes scylladb/scylladb#29068

Closes scylladb/scylladb#29097
2026-03-18 10:15:34 +01:00
Tomasz Grabiec
0c786045ff Merge 'service: assert that tables updated via group0 use schema commitlog' from Aleksandra Martyniuk
Set enable_schema_commitlog for each group0 table.

Assert that group0 tables use schema commitlog in ensure_group0_schema
(for each command).

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-914.

Needs backport to all live releases as all are vulnerable

Closes scylladb/scylladb#28876

* github.com:scylladb/scylladb:
  test: add test_group0_tables_use_schema_commitlog
  db: service: remove group0 tables from schema commitlog schema initializer
  service: ensure that tables updated via group0 use schema commitlog
  db: schema: remove set_is_group0_table param

(cherry picked from commit b90fe19a42)

Closes scylladb/scylladb#28916

Closes scylladb/scylladb#28986
2026-03-17 17:29:36 +01:00
Jenkins Promoter
a6f58c154b Update pgo profiles - aarch64
2026-03-15 04:44:34 +02:00
Piotr Dulikowski
00269ca839 Merge '[Backport 2025.4] vector_search: test: fix HTTPS client test flakiness' from Scylladb[bot]
The default 100ms timeout for client readiness in tests is too
aggressive. In some test environments, this is not enough time for
client creation, which involves address resolution and TLS certificate
reading, leading to flaky tests.

This commit increases the default client creation timeout to 10 seconds.
This makes the tests more robust, especially in slower execution
environments, and prevents similar flakiness in other test cases.

Fixes: VECTOR-547
Fixes: SCYLLADB-802
Fixes: SCYLLADB-825
Fixes: SCYLLADB-826

Backport to 2025.4 and 2026.1, as the same problem occurs on these branches and can potentially make the CI flaky there as well.

- (cherry picked from commit bf369326d6)

Parent PR: #28879

Closes scylladb/scylladb#28895

* github.com:scylladb/scylladb:
  vector_search: test: include ANN error in assertion
  vector_search: test: fix HTTPS client test flakiness
2026-03-12 10:23:35 +01:00
Karol Nowacki
560553f654 vector_search: test: include ANN error in assertion
When the test fails, the assertion message does not include
the error from the ANN request.

This change enhances the assertion to include the specific ANN error,
making it easier to diagnose test failures.
2026-03-11 10:17:57 +01:00
Karol Nowacki
9ba8d85c39 vector_search: test: fix HTTPS client test flakiness
The default 100ms timeout for client readiness in tests is too
aggressive. In some test environments, this is not enough time for
client creation, which involves address resolution and TLS certificate
reading, leading to flaky tests.

This commit increases the default client creation timeout to 10 seconds.
This makes the tests more robust, especially in slower execution
environments, and prevents similar flakiness in other test cases.

Fixes: VECTOR-547, SCYLLADB-802
2026-03-11 10:17:36 +01:00
Patryk Jędrzejczak
9152a8d111 test: test_full_shutdown_during_replace: retry replace after the replacing node is removed from gossip
The test is currently flaky with `reuse_ip = True`. The issue is that the
test retries replace before the first replace is rolled back and the
first replacing node is removed from gossip. The second replacing node
can see the entry of the first replacing node in gossip. This entry has
a newer generation than the entry of the node being replaced, and both
replacing nodes have the same IP as the node being replaced. Therefore,
the second replacing node incorrectly considers this entry as the entry
of the node being replaced. This entry is missing rack and DC, so the
second replace fails with
```
ERROR 2026-02-24 21:19:03,420 [shard 0:main] init - Startup failed:
std::runtime_error (Cannot replace node
8762a9d2-3b30-4e66-83a1-98d16c5dd007/127.61.127.1 with a node on
a different data center or rack.
Current location=UNKNOWN_DC/UNKNOWN_RACK, new location=dc1/rack2)
```

Fixes SCYLLADB-805

Closes scylladb/scylladb#28829

(cherry picked from commit ba7f314cdc)

Closes scylladb/scylladb#28953
2026-03-10 16:48:05 +01:00
Anna Stuchlik
b0bb0a3731 doc: fix the unified installer instructions
This commit updates the documentation for the unified installer.

- The Open Source example is replaced with version 2025.1 (Source Available, currently supported, LTS).
- The info about CentOS 7 is removed (no longer supported).
- Java 8 is removed.
- The example for cassandra-stress is removed (as it was already removed on other installation pages).

Fixes https://github.com/scylladb/scylladb/issues/28150

Closes scylladb/scylladb#28152

(cherry picked from commit 855c503c63)

Closes scylladb/scylladb#28910

Closes scylladb/scylladb#28927
2026-03-09 21:40:55 +02:00
Grzegorz Burzyński
b4807abbc4 packaging: add systemctl command to dependencies
scylladb/scylla container image doesn't include systemctl binary, while it
is used by perftune.py script shipped within the same image.

Scylla Operator runs this script to tune Scylla nodes/containers,
expecting all its dependencies to be available in the container's PATH.
Without systemctl, the script fails on systems that run irqbalance
(e.g., on EKS nodes) as the script tries to reconfigure irqbalance and
restart it via systemctl afterwards.

Fixes: scylladb/scylla-operator#3080

Closes scylladb/scylladb#28567

(cherry picked from commit b4f0eb666f)

Closes scylladb/scylladb#28845

(cherry picked from commit 4cc5c2605f)

Closes scylladb/scylladb#28890
2026-03-05 14:52:36 +02:00
Anna Stuchlik
289f9793ff doc: remove redundant Java-related information
This commit removes:
- Instructions to install scylla-jmx (and all references)
- The Java 11 requirement for Ubuntu.

Fixes https://github.com/scylladb/scylladb/issues/28249
Fixes https://github.com/scylladb/scylladb/issues/28252

Closes scylladb/scylladb#28254

(cherry picked from commit 64b1798513)

Closes scylladb/scylladb#28888
2026-03-05 11:55:40 +01:00
Jenkins Promoter
a1578036d6 Update ScyllaDB version to: 2025.4.6
2026-03-04 11:45:48 +02:00
Botond Dénes
cf5571c93b Merge '[Backport 2025.4] docs: update a documentation of adding/removing DC and rebuilding a node' from Scylladb[bot]
Describe a procedure to convert tablet keyspace replication factor
to rack list. Update the procedures of adding and removing a node
to consider tablet keyspaces.

Fixes: [SCYLLADB-398](https://scylladb.atlassian.net/browse/SCYLLADB-398)
Fixes: https://github.com/scylladb/scylladb/issues/28306.
Fixes: https://github.com/scylladb/scylladb/issues/28307.
Fixes: https://github.com/scylladb/scylladb/issues/28270.

Needs backport to all live branches as they all include tablets.

[SCYLLADB-398]: https://scylladb.atlassian.net/browse/SCYLLADB-398?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

- (cherry picked from commit eefe66b2b2)

- (cherry picked from commit e08ac60161)

- (cherry picked from commit 1c764cf6ea)

- (cherry picked from commit e4c42acd8f)

- (cherry picked from commit 9ccc95808f)

Parent PR: #28521

Closes scylladb/scylladb#28779

* github.com:scylladb/scylladb:
  docs: update nodetool rebuild docs
  docs: update a procedure of decommissioning a DC
  docs: update a procedure of adding a DC
2026-03-03 13:26:18 +02:00
Łukasz Paszkowski
49ee9d9785 test/pylib/util.py: Add retries and additional logging to start_writes()
Consider the following scenario:
1. Let nodes A,B,C form a cluster with RF=3
2. Write query with CL=QUORUM is submitted and is acknowledged by
   nodes B,C
3. Follow-up read query with CL=QUORUM is sent to verify the write
   from the previous step
4. The coordinator sends data/digest requests to nodes A and B. Since
   node A is missing the data, the digests mismatch and data
   reconciliation is triggered
5. Node A or B fails, becomes unavailable, etc.
6. During reconciliation, data requests are sent to nodes A and B and
   fail, failing the entire read query

When the above scenario happens, the tests using `start_writes()` fail
with the following stacktrace:
```
...

>           await finish_writes()

test/cluster/test_tablets_migration.py:259:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
test/pylib/util.py:241: in finish
    await asyncio.gather(*tasks)
test/pylib/util.py:227: in do_writes
    raise e
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

worker_id = 1

...

>                   rows = await cql.run_async(rd_stmt, [pk])
E                   cassandra.ReadFailure: Error from server: code=1300 [Replica(s) failed to execute read] message="Operation failed for test_1767777001181_bmsvk.test - received 1 responses and 1 failures from 2 CL=QUORUM." info={'consistency': 'QUORUM', 'required_responses': 2, 'received_responses': 1, 'failures': 1}
```

Note that when a node failure happens before/during a read query,
there is no test failure as the speculative retries are enabled
by default. Hence an additional data/digest read is sent to the third
remaining node.

However, the same speculative read is cancelled the moment the read
query reaches CL, which may trigger a read repair.

This change:
- Retries the verification read in start_writes() on failure to mitigate
  races between reads and node failures
- Adds additional logging to correlate Python exceptions with Scylla logs
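
The retry behavior listed above can be sketched as follows. This is a minimal illustration, not the code in `test/pylib/util.py`: `read_with_retries`, its parameters, and the logger name are hypothetical, and `read` stands for any coroutine that performs the verification read.

```python
import asyncio
import logging

logger = logging.getLogger("start_writes_sketch")  # hypothetical logger name


async def read_with_retries(read, pk, retries=3, delay=0.01):
    """Retry a verification read, logging each failure for correlation."""
    for attempt in range(1, retries + 1):
        try:
            return await read(pk)
        except Exception as e:
            logger.warning("read of pk=%s failed (attempt %d/%d): %r",
                           pk, attempt, retries, e)
            if attempt == retries:
                raise  # out of attempts: surface the last error
            await asyncio.sleep(delay)


calls = []


async def flaky_read(pk):
    # Simulates a read that fails twice (e.g. during a node failure)
    # and then succeeds.
    calls.append(pk)
    if len(calls) < 3:
        raise RuntimeError("replica failed to execute read")
    return [pk]


rows = asyncio.run(read_with_retries(flaky_read, 7))
```

Logging the key and attempt number on every failure is what makes it possible to line up a Python exception with the corresponding Scylla log entries.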

Fixes https://github.com/scylladb/scylladb/issues/27478
Fixes https://github.com/scylladb/scylladb/issues/27974
Fixes https://github.com/scylladb/scylladb/issues/27494
Fixes https://github.com/scylladb/scylladb/issues/23529

Note that this change addresses test flakiness observed during tablet transitions.
However, it serves as a workaround for a higher-level issue
https://github.com/scylladb/scylladb/issues/28125

Closes scylladb/scylladb#28140

(cherry picked from commit e07fe2536e)

Closes scylladb/scylladb#28825
2026-03-03 13:05:41 +02:00
Jenkins Promoter
91c4814c62 Update pgo profiles - aarch64
2026-03-02 21:28:03 +02:00
Jenkins Promoter
b65523b50f Update ScyllaDB version to: 2025.4.5
2026-03-01 16:09:37 +02:00
Jenkins Promoter
9e907ba935 Update pgo profiles - aarch64
2026-03-01 04:31:52 +02:00
Yaron Kaikov
080e04f686 ci: harden trigger-scylla-ci workflow against credential leaks and untrusted PRs
refs: https://github.com/scylladb/scylladb/security/advisories/GHSA-wrqg-xx2q-r3fv

- Remove -v and -i flags from curl to prevent credentials from being
  logged in workflow output
- Move PR_NUMBER and PR_REPO_NAME into the env block with proper quoting
  to prevent shell injection via crafted PR metadata
- Add org membership verification step for pull_request_target events so
  that only PRs from scylladb org members can trigger Jenkins CI

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-796

Closes scylladb/scylladb#28785

(cherry picked from commit 98494e08eb)

Closes scylladb/scylladb#28809
2026-02-27 11:28:11 +02:00
Łukasz Paszkowski
4f5d10ccd0 compaction_manager: fix maybe_wait_for_sstable_count_reduction() hanging forever
The futurization refactoring in 9d3755f276 ("replica: Futurize
retrieval of sstable sets in compaction_group_view") changed
maybe_wait_for_sstable_count_reduction() from a single predicated
wait:
```
    co_await cstate.compaction_done.wait([..] {
        return num_runs_for_compaction() <= threshold
            || !can_perform_regular_compaction(t);
    });
```
to a while loop with a predicated wait:
```
    while (can_perform_regular_compaction(t)
           && co_await num_runs_for_compaction() > threshold) {
        co_await cstate.compaction_done.wait([this, &t] {
            return !can_perform_regular_compaction(t);
        });
    }
```

This was necessary because num_runs_for_compaction() became a
coroutine (returns future<size_t>) and can no longer be called
inside a condition_variable predicate (which must be synchronous).

However, the inner wait's predicate — !can_perform_regular_compaction(t)
— only returns true when compaction is disabled or the table is being
removed. During normal operation, every signal() from compaction_done
wakes the waiter, the predicate returns false, and the waiter
immediately goes back to sleep without ever re-checking the outer
while loop's num_runs_for_compaction() condition.

This causes memtable flushes to hang forever in
maybe_wait_for_sstable_count_reduction() whenever the sstable run
count exceeds the threshold, because completed compactions signal
compaction_done but the signal is swallowed by the predicate.

Fix by replacing the predicated wait with a bare wait(), so that
any signal (including from completed compactions) causes the outer
while loop to re-evaluate num_runs_for_compaction().
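
The lost-wakeup pattern can be reproduced with a small model. The sketch below uses Python's `asyncio.Condition` as a stand-in for Seastar's condition variable (`wait_for(predicate)` plays the role of the predicated wait); `State`, `wait_buggy`, `wait_fixed`, and `finishes` are hypothetical names, and the run count is a plain attribute rather than a coroutine.

```python
import asyncio


class State:
    def __init__(self, runs, threshold):
        self.runs = runs
        self.threshold = threshold
        self.compaction_enabled = True
        self.cond = asyncio.Condition()


async def wait_buggy(s):
    async with s.cond:
        while s.compaction_enabled and s.runs > s.threshold:
            # The predicate ignores the run count, so a signal from a
            # completed compaction never returns control to the loop.
            await s.cond.wait_for(lambda: not s.compaction_enabled)


async def wait_fixed(s):
    async with s.cond:
        while s.compaction_enabled and s.runs > s.threshold:
            # Bare wait: any signal makes the loop re-check the run count.
            await s.cond.wait()


async def compact_once(s):
    async with s.cond:
        s.runs -= 1
        s.cond.notify_all()


async def finishes(waiter, timeout=0.2):
    s = State(runs=3, threshold=2)
    task = asyncio.create_task(waiter(s))
    await asyncio.sleep(0.01)  # let the waiter block first
    await compact_once(s)      # brings the run count down to the threshold
    try:
        await asyncio.wait_for(task, timeout)
        return True
    except asyncio.TimeoutError:
        return False


buggy_finishes = asyncio.run(finishes(wait_buggy))
fixed_finishes = asyncio.run(finishes(wait_fixed))
```

In this model the buggy waiter is still blocked when the timeout expires even though the run count has dropped to the threshold, while the fixed waiter returns after the first signal.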

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-610

Closes scylladb/scylladb#28801

(cherry picked from commit bb57b0f3b7)
2026-02-27 01:39:05 +02:00
Aleksandra Martyniuk
24bf9ecb14 docs: update nodetool rebuild docs
Update nodetool rebuild docs to mention that the command does not
work for tablet keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/28270.
(cherry picked from commit 9ccc95808f)
2026-02-26 11:41:05 +01:00
Aleksandra Martyniuk
4a3a716b1a docs: update a procedure of decommissioning a DC
Update a procedure of decommissioning a DC for tablet keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/28307.
(cherry picked from commit e4c42acd8f)
2026-02-26 11:41:01 +01:00
Aleksandra Martyniuk
1343359641 docs: update a procedure of adding a DC
Update a procedure of adding a DC for tablet keyspaces.

Fixes: https://github.com/scylladb/scylladb/issues/28306.
(cherry picked from commit 1c764cf6ea)
2026-02-26 11:39:02 +01:00
Tomasz Grabiec
d91b93a4fa Fix lambda-coroutine fiasco in hint_endpoint_manager.cc
Found by copilot.

No issue was observed yet.

Fixes #27520

Closes scylladb/scylladb#27477

(cherry picked from commit 7bc59e93b2)

Closes scylladb/scylladb#27732
2026-02-26 10:02:14 +02:00
Aleksandra Martyniuk
a68a80bd9c test: rename duplicate tests
There are two tests named test_repair_options_hosts_tablets in
test/nodetool/test_cluster_repair.py and two named test_repair_keyspace
in test/nodetool/test_repair.py. As a result, one of each pair is ignored.

Rename the tests so that they are unique.

Fixes: https://github.com/scylladb/scylladb/issues/27701.

Closes scylladb/scylladb#27720

(cherry picked from commit bbe64e0e2a)

Closes scylladb/scylladb#27848
2026-02-26 10:01:32 +02:00
Yaron Kaikov
ab43dda5ad ci: fix PR number extraction for unlabeled events
When the workflow is triggered by removing the 'conflicts' label
(pull_request_target unlabeled event), github.event.issue.number is
not available. Use github.event.pull_request.number as fallback.
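
The fallback can be expressed as a small function over the event payload. This is a Python illustration of the lookup order only; the workflow itself uses GitHub Actions expressions, and `pr_number` is a hypothetical helper.

```python
def pr_number(event: dict):
    """Prefer issue.number; fall back to pull_request.number."""
    issue = event.get("issue") or {}
    pull_request = event.get("pull_request") or {}
    return issue.get("number") or pull_request.get("number")


# Hypothetical payloads: a comment event carries "issue", while an
# unlabeled pull_request_target event carries only "pull_request".
comment_event = {"issue": {"number": 101}}
unlabeled_event = {"pull_request": {"number": 202}}
```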

Fixes: https://scylladb.atlassian.net/browse/RELENG-245

Closes scylladb/scylladb#28543

(cherry picked from commit b30ecb72d5)

Closes scylladb/scylladb#28552
2026-02-26 09:56:58 +02:00
Andrzej Jackowski
1c9d3e14a3 test: fix configuration of test_autoretrain_dict
`test_autoretrain_dict` sporadically fails because the default
compression algorithm was changed after the test was written.

`9ffa62a986815709d0a09c705d2d0caf64776249` was an attempt to fix it by
changing the compression configuration during node startup. However,
the configuration change had an incorrect YAML format and was
ignored by ScyllaDB. This commit fixes it.

Fixes: scylladb/scylladb#28204

Closes scylladb/scylladb#28746

(cherry picked from commit cd4caed3d3)

Closes scylladb/scylladb#28792
2026-02-26 09:55:57 +02:00
Anna Stuchlik
2146f9e4fe doc: remove the tablets limitation for Alternator
This commit removes the information that Alternator doesn't support tablets.
The limitation is no longer valid.

Fixes SCYLLADB-778

Closes scylladb/scylladb#28781

(cherry picked from commit e2333a57ad)

Closes scylladb/scylladb#28793
2026-02-26 09:55:14 +02:00
Yaron Kaikov
1671693e7c .github/workflows: enable automatic backport PR creation with Jira sub-issue integration
This workflow calls the reusable backport-with-jira workflow from
scylladb/github-automation to enable automatic backport PR creation with
Jira sub-issue integration.

The workflow triggers on:
- Push to master/next-*/branch-* branches (for promotion events)
- PR labeled with backport/X.X pattern (for manual backport requests)
- PR closed/merged on version branches (for chain backport processing)

Features enabled by calling the shared workflow:
- Creates Jira sub-issues under the main issue for each backport version
- Sorts versions descending (highest first: 2025.4 -> 2025.3 -> 2025.2)
- Cherry-picks from previous version branch to avoid repeated conflicts
- On Jira API failure: adds comment to main issue, applies 'jira-sub-issue-creation-failed' label, continues with PR
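
The ordering and chaining rules above can be sketched in Python. This is only an illustration: `backport_chain` is a hypothetical helper, and the use of `master` as the first cherry-pick source is an assumption, not something taken from the shared workflow.

```python
def backport_chain(versions, source="master"):
    """Return (target, cherry_pick_from) pairs, highest version first."""
    # Compare versions numerically so that e.g. 2025.10 sorts above 2025.4.
    ordered = sorted(
        versions,
        key=lambda v: tuple(int(p) for p in v.split(".")),
        reverse=True,
    )
    chain = []
    previous = source
    for version in ordered:
        # Each backport is cherry-picked from the previous (newer) target,
        # so a conflict resolved once is not hit again further down.
        chain.append((version, previous))
        previous = version
    return chain
```

For the versions in the example above, the highest version is picked from the assumed source branch and every later one from its predecessor.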

Closes scylladb/scylladb#28804

(cherry picked from commit b211590bc0)

Closes scylladb/scylladb#28813
2026-02-26 09:54:38 +02:00
Marcin Maliszkiewicz
0b3f03f4b7 Merge '[Backport 2025.4] transport: fix connection code to consume only initially taken semaphore units' from Scylladb[bot]
The connection's `cpu_concurrency_t` struct tracks the state of a connection
to manage the admission of new requests and prevent CPU overload during
connection storms. When a connection holds units (allowed only 0 or 1), it is
considered to be in the "CPU state" and contributes to the concurrency limits
used when accepting new connections.

The bug stems from the fact that `counted_data_source_impl::get` and
`counted_data_sink_impl::put` calls can interleave during execution. This
occurs because of `should_parallelize` and `_ready_to_respond`, the latter being
a future chain that can run in the background while requests are being read.
Consequently, while reading request (N), the system may concurrently be
writing the response for request (N-1) on the same connection.

This interleaving allows `return_all()` to be called twice before the
subsequent `consume_units()` is invoked. While the second `return_all()` call
correctly returns 0 units, the matching `consume_units()` call would
mistakenly take an extra unit from the semaphore. Over time, a connection
blocked on a read operation could end up holding an unreturned semaphore
unit. If this pattern repeats across multiple connections, the semaphore
units are eventually depleted, preventing the server from accepting any
new connections.

The fix ensures that we always consume the exact number of units that were
previously returned. With this change, interleaved operations behave as
follows:

get() return_all     — returns 1 unit
put() return_all     — returns 0 units
get() consume_units  — takes back 1 unit
put() consume_units  — takes back 0 units

Logically, the networking phase ends when the first network operation
concludes. But more importantly, when a network operation
starts, we no longer hold any units.

Other solutions are possible but the chosen one seems to be the
simplest and safest to backport.
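
A minimal Python sketch of the accounting described above, assuming a plain counter in place of the real semaphore. The `Semaphore` and `CpuConcurrency` classes and the method shapes are illustrative only and are not the actual transport code; the point is that each operation takes back exactly what its own `return_all()` released.

```python
class Semaphore:
    """Toy counting semaphore; only tracks available units (no waiting)."""
    def __init__(self, units):
        self.available = units

    def take(self, n):
        assert n <= self.available, "semaphore depleted"
        self.available -= n

    def give(self, n):
        self.available += n


class CpuConcurrency:
    """Hypothetical stand-in for the per-connection state (holds 0 or 1 unit)."""
    def __init__(self, sem):
        self.sem = sem
        self.held = 0

    def acquire_initial(self):
        self.sem.take(1)
        self.held = 1

    def return_all(self):
        # Release whatever is held and report how many units that was.
        returned = self.held
        self.sem.give(returned)
        self.held = 0
        return returned

    def consume_units_buggy(self):
        # Old behaviour: always take one unit, even if this operation's
        # return_all() released nothing.
        self.sem.take(1)
        self.held += 1

    def consume_units(self, returned):
        # Fixed behaviour: take back exactly what this operation returned.
        self.sem.take(returned)
        self.held += returned


def interleave(fixed):
    """Run the get()/put() interleaving; return (units held, units available)."""
    sem = Semaphore(10)
    conn = CpuConcurrency(sem)
    conn.acquire_initial()
    get_returned = conn.return_all()  # get() starts: returns 1 unit
    put_returned = conn.return_all()  # put() starts: returns 0 units
    if fixed:
        conn.consume_units(get_returned)
        conn.consume_units(put_returned)
    else:
        conn.consume_units_buggy()
        conn.consume_units_buggy()
    return conn.held, sem.available
```

In this model `interleave(False)` leaves the connection holding two units although only one is allowed, which is the leak that eventually depletes the semaphore; `interleave(True)` ends with a single unit held.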

Fixes SCYLLADB-485
Backport: all supported affected versions, bug introduced with initial feature implementation in: ed3e4f33fd

- (cherry picked from commit 0376d16ad3)

- (cherry picked from commit 3b98451776)

Parent PR: #28530

Closes scylladb/scylladb#28715

* github.com:scylladb/scylladb:
  test: decrease strain in test_startup_response
  test: auth_cluster: add test for hanged AUTHENTICATING connections
  transport: fix connection code to consume only initially taken semaphore units
  transport: remove redundant futurize_invoke from counted data sink and source
2026-02-23 13:21:05 +01:00
Marcin Maliszkiewicz
1f1fc2c2ac test: decrease strain in test_startup_response
For 2025.3 and 2025.4 this test runs an order of magnitude
slower in debug mode, potentially due to passwords::check
running in an alien thread and overwhelming the CPU (this is
fixed in newer versions).

Decreasing the number of connections in the test makes it fast
again, without breaking reproducibility.

As an additional measure, we double the timeout.
2026-02-20 10:13:55 +01:00
Marcin Maliszkiewicz
b7b7fef02c test: auth_cluster: add test for hanged AUTHENTICATING connections
Test runtime:
Release - 2s
Debug - 5s

(cherry picked from commit 3b98451)
2026-02-19 16:24:03 +01:00
Marcin Maliszkiewicz
6c29f0f425 transport: fix connection code to consume only initially taken semaphore units
The connection's cpu_concurrency_t struct tracks the state of a connection
to manage the admission of new requests and prevent CPU overload during
connection storms. When a connection holds units (allowed only 0 or 1), it is
considered to be in the "CPU state" and contributes to the concurrency limits
used when accepting new connections.

The bug stems from the fact that `counted_data_source_impl::get` and
`counted_data_sink_impl::put` calls can interleave during execution. This
occurs because of `should_parallelize` and `_ready_to_respond`, the latter being
a future chain that can run in the background while requests are being read.
Consequently, while reading request (N), the system may concurrently be
writing the response for request (N-1) on the same connection.

This interleaving allows `return_all()` to be called twice before the
subsequent `consume_units()` is invoked. While the second `return_all()` call
correctly returns 0 units, the matching `consume_units()` call would
mistakenly take an extra unit from the semaphore. Over time, a connection
blocked on a read operation could end up holding an unreturned semaphore
unit. If this pattern repeats across multiple connections, the semaphore
units are eventually depleted, preventing the server from accepting any
new connections.

The fix ensures that we always consume the exact number of units that were
previously returned. With this change, interleaved operations behave as
follows:

get() return_all     — returns 1 unit
put() return_all     — returns 0 units
get() consume_units  — takes back 1 unit
put() consume_units  — takes back 0 units

Logically, the networking phase ends when the first network operation
concludes. But more importantly, when a network operation
starts, we no longer hold any units.

Other solutions are possible but the chosen one seems to be the
simplest and safest to backport.

Fixes SCYLLADB-485

(cherry picked from commit 0376d16)
2026-02-19 16:23:47 +01:00
Marcin Maliszkiewicz
7123df1fcc transport: remove redundant futurize_invoke from counted data sink and source
Closes scylladb/scylladb#27526

(cherry picked from commit d5b63df)
2026-02-19 16:21:06 +01:00
Avi Kivity
c8f324682e Merge '[Backport 2025.4] s3_client: Fix s3 part size and number of parts calculation' from Scylladb[bot]
- Correct `calc_part_size` function since it could return more than 10k parts
- Add tests
- Add more checks in `calc_part_size` to comply with S3 limits

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-640
Must be ported back to 2025.3/4 and 2026.1 since we may encounter this bug in production clusters

- (cherry picked from commit 289e910cec)

- (cherry picked from commit 6280cb91ca)

- (cherry picked from commit 960adbb439)

Parent PR: #28592

Closes scylladb/scylladb#28696

* github.com:scylladb/scylladb:
  s3_client: add more constrains to the calc_part_size
  s3_client: add tests for calc_part_size
  s3_client: correct multipart part-size logic to respect 10k limit
2026-02-19 14:13:26 +02:00
Dawid Mędrek
c21767e606 Merge '[Backport 2025.4] raft topology: generate notification about released nodes only once' from Scylladb[bot]
Hints destined for some other node can only be drained after the other node is no longer a replica of any vnode or tablet. In case when tablets are present, a node might still technically be a replica of some tablets after it moved to left state. When it no longer is a replica of any tablet, it becomes "released" and storage service generates a notification about it. Hinted handoff listens to this notification and kicks off draining hints after getting it.

The current implementation of the "released" notification would trigger every time raft topology state is reloaded and a left node without any tokens is present in the raft topology. Although draining hints is idempotent, generating duplicate notifications is wasteful and recently became very noisy after the verbosity of the draining-related log messages was increased in 44de563. The verbosity increase itself makes sense as draining is supposed to be a rare operation, but the duplicate notification bug now needs to be addressed.

Fix the duplicate notification problem by passing the list of previously released nodes to the `storage_service::raft_topology_update_ip` function and filtering based on it. If this function processes the topology state for the first time, it will not produce any notifications. This is fine as hinted handoff is prepared to detect "released" nodes during the startup sequence in main.cc and start draining the hints there, if needed.
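
The filtering idea can be sketched in Python as a set difference. The names `newly_released` and `replay` are hypothetical, and the real function works on raft topology state rather than plain sets; this only illustrates why a reload with an unchanged set produces no notification.

```python
def newly_released(previous, current):
    """Return the nodes to notify about.

    previous: released nodes seen on the last reload, or None on the first load.
    current:  released nodes computed from the freshly loaded state.
    """
    if previous is None:
        # First load: stay silent; the startup sequence is assumed to
        # handle nodes that are already released.
        return set()
    return current - previous


def replay(reloads):
    """Feed successive 'released' sets through the filter."""
    previous = None
    notified = []
    for current in reloads:
        notified.append(sorted(newly_released(previous, current)))
        previous = current
    return notified
```

Replaying the same set several times yields a notification only on the reload where a node first appears in it.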

Fixes: scylladb/scylladb#28301
Refs: scylladb/scylladb#25031

The log messages added in 44de563 cause a lot of noise during topology operations and tablet migrations, so the fix should be backported to all affected versions (2025.4 and 2026.1).

- (cherry picked from commit 10e9672852)

- (cherry picked from commit d28c841fa9)

- (cherry picked from commit 29da20744a)

Parent PR: #28367

Closes scylladb/scylladb#28611

* github.com:scylladb/scylladb:
  storage_service: fix indentation after previous patch
  raft topology: generate notification about released nodes only once
  raft topology: extract "released" nodes calculation to external function
2026-02-19 12:31:08 +01:00
Piotr Dulikowski
dc071fc3a2 storage_service: fix indentation after previous patch
(cherry picked from commit 29da20744a)
2026-02-18 19:42:11 +01:00
Piotr Dulikowski
7282a9d9fb raft topology: generate notification about released nodes only once
Hints destined for some other node can only be drained after the other
node is no longer a replica of any vnode or tablet. In case when tablets
are present, a node might still technically be a replica of some tablets
after it moved to left state. When it no longer is a replica of any
tablet, it becomes "released" and storage service generates a
notification about it. Hinted handoff listens to this notification and
kicks off draining hints after getting it.

The current implementation of the "released" notification would trigger
every time raft topology state is reloaded and a left node without any
tokens is present in the raft topology. Although draining hints is
idempotent, generating duplicate notifications is wasteful and recently
became very noisy after the verbosity of the draining-related log
messages was increased in 44de563. The verbosity increase itself makes sense
as draining is supposed to be a rare operation, but the duplicate
notification bug now needs to be addressed.

Fix the duplicate notification problem by passing the list of previously
released nodes to the `storage_service::raft_topology_update_ip`
function and filtering based on it. If this function processes the
topology state for the first time, it will not produce any
notifications. This is fine as hinted handoff is prepared to detect
"released" nodes during the startup sequence in main.cc and start
draining the hints there, if needed.

Fixes: #28301
Refs: #25031
(cherry picked from commit d28c841fa9)
2026-02-18 19:40:10 +01:00
Piotr Dulikowski
282c3c6a02 raft topology: extract "released" nodes calculation to external function
In the following commits we will need to compare the set of released
nodes before and after reload of raft topology state. Moving the logic
that calculates such a set to a separate function will make it easier to
do.

(cherry picked from commit 10e9672852)
2026-02-18 19:29:51 +01:00
Yehuda Lebi
793e2fa7f2 dist/docker: add configurable blocked-reactor-notify-ms parameter
Add --blocked-reactor-notify-ms argument to allow overriding the default
blocked reactor notification timeout value of 25 ms.

This change provides users the flexibility to customize the reactor
notification timeout as needed.

Fixes: scylladb/scylla-enterprise#5525

Closes scylladb/scylladb#26892

(cherry picked from commit a05ebbbfbb)

Closes scylladb/scylladb#26971
2026-02-18 12:53:55 +02:00
Botond Dénes
83d649e4c0 Merge '[Backport 2025.4] cql3/statements/describe_statement: hide paxos state tables ' from Scylladb[bot]
Paxos state tables are internal tables fully managed by Scylla,
and they should be neither exposed to the user nor backed up.

This commit hides such tables from all listings. If such a table
is described directly with `DESC ks."tbl$paxos"`, the description is generated
within a comment and a note for the user is added.

Fixes https://github.com/scylladb/scylladb/issues/28183

LWT on tablets and paxos state tables are present in 2025.4, so the patch should be backported to this version.

- (cherry picked from commit f89a8c4ec4)

- (cherry picked from commit 9baaddb613)

Parent PR: #28230

Closes scylladb/scylladb#28507

* github.com:scylladb/scylladb:
  test/cqlpy: add reproducer for hidden Paxos table being shown by DESC
  cql3/statements/describe_statement: hide paxos state tables
2026-02-18 12:48:41 +02:00
Michael Litvak
790b0d5627 migration_listener: fix deadlock in nested notifications
When calling a migration notification from the context of a notification
callback, this could lead to a deadlock with unregistering a listener:
A: the parent notification is called. it calls thread_for_each, where it
   acquires a read lock on the vector of listeners, and calls the
   callback function for each listener while holding the lock.
B: a listener is unregistered. it calls `remove` and tries to acquire a
   write lock on the vector of listeners. it waits because the lock is
   held.
A: the callback function calls another notification and calls
   thread_for_each which tries to acquire the read lock again. but it
   waits since there is a waiter.

Currently we have a concrete scenario of this kind when creating a table, where
the callback of `before_create_column_family` in the tablet allocator
calls `before_allocate_tablet_map`, and this could deadlock with node
shutdown where we unregister listeners.

Fix this by not acquiring the read lock again in the nested
notification. There is no need because the read lock is already held by
the parent notification while the child notification is running. We add
a function `thread_for_each_nested` that is similar to `thread_for_each`
except it assumes the read lock is already held and doesn't acquire it,
and it should be used for nested notifications instead of
`thread_for_each`.
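
The A/B/A sequence above can be modelled with a toy, non-blocking lock in Python. `FairRWLock` is a hypothetical model in which new readers queue behind a waiting writer; it is not the lock used by the listener vector, and "would block" is represented by returning `False`.

```python
class FairRWLock:
    """Toy writer-preferring lock: new readers queue behind a waiting writer."""
    def __init__(self):
        self.readers = 0
        self.writer_waiting = False

    def try_read_lock(self):
        if self.writer_waiting:
            return False  # would block behind the waiting writer
        self.readers += 1
        return True

    def request_write_lock(self):
        if self.readers:
            self.writer_waiting = True
            return False  # would block until all readers leave
        return True


def nested_notification(use_nested_variant):
    """Return True if the child notification can run, False if it would block."""
    lock = FairRWLock()
    assert lock.try_read_lock()           # A: parent notification takes the read lock
    assert not lock.request_write_lock()  # B: unregister waits for the write lock
    if use_nested_variant:
        # A: nested call relies on the read lock the parent already holds.
        return True
    # A: nested call asks for the read lock again and queues behind B,
    # while B waits for A to release: neither can make progress.
    return lock.try_read_lock()
```

The `use_nested_variant=True` path corresponds to the role of `thread_for_each_nested`: it never touches the lock, so the waiting writer cannot block it.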

Fixes scylladb/scylladb#27364

Closes scylladb/scylladb#27637

(cherry picked from commit 55f4a2b754)

Closes scylladb/scylladb#28557
2026-02-18 12:47:30 +02:00
Botond Dénes
4e9c84321b Merge '[Backport 2025.4] test: cluster: Fix test_sync_point' from Scylladb[bot]
The test `test_sync_point` had a few shortcomings that made it flaky
or simply wrong:

1. We were verifying that hints were written by checking the size of
   in-flight hints. However, that could potentially lead to problems
   in rare situations.

   For instance, if all of the hints failed to be written to disk, the
   size of in-flight hints would drop to zero, but creating a sync point
   would correspond to the empty state.

   In such a situation, we should fail immediately and indicate what
   the cause was.

2. A sync point corresponds to the hints that have already been written
   to disk. The number of those is tracked by the metric `written`.
   It's a much more reliable way to make sure that hints have been
   written to the commitlog. That ensures that the sync point we'll
   create will really correspond to those hints.

3. The auxiliary function `wait_for` used in the test works like this:
   it executes the passed callback and looks at the result. If it's
   `None`, it retries it. Otherwise, the callback is deemed to have
   finished its execution and no further retries will be attempted.

   Before this commit, we simply returned a bool, and so the code was
   wrong. We improve it.
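
The retry contract from point 3 can be illustrated with a small Python sketch. `wait_for_sketch` is a simplified stand-in written for this example and is not the helper used by the test suite; the metric samples are made up.

```python
import asyncio
import time


async def wait_for_sketch(pred, deadline, period=0.01):
    """Retry pred() until it returns something other than None."""
    while True:
        result = await pred()
        if result is not None:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met before deadline")
        await asyncio.sleep(period)


def make_written_check(samples, expected):
    """Build predicates over a fake stream of `written` metric samples."""
    it = iter(samples)

    async def buggy():
        # Returns a bool: False is not None, so the first call ends the wait.
        return next(it) >= expected

    async def fixed():
        # Returns None to request a retry, True once the condition holds.
        return True if next(it) >= expected else None

    return buggy, fixed


async def demo():
    buggy, _ = make_written_check([0, 3, 10], expected=10)
    _, fixed = make_written_check([0, 3, 10], expected=10)
    deadline = time.monotonic() + 5
    return (
        await wait_for_sketch(buggy, deadline),
        await wait_for_sketch(fixed, deadline),
    )
```

With the bool-returning predicate the wait ends on the first sample even though the condition is false; the `None`-returning one keeps polling until the expected value is seen.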

---

Note that this fixes scylladb/scylladb#28203, which was a manifestation
of scylladb/scylladb#25879. We created a sync point that corresponded
to the empty state, and so it immediately resolved, even when node 3
was still dead.

As a bonus, we rewrite the auxiliary code responsible for fetching
metrics and manipulating sync points. Now it's asynchronous and
uses the existing standard mechanisms available to developers.

Furthermore, we reduce the time needed for executing
`test_sync_point` by 27 seconds.

---

The total difference in time needed to execute the whole test file
(on my local machine, in dev mode):

Before:

    CPU utilization: 0.9%

    real    2m7.811s
    user    0m25.446s
    sys     0m16.733s

After:

    CPU utilization: 1.1%

    real    1m40.288s
    user    0m25.218s
    sys     0m16.566s

---

Refs scylladb/scylladb#25879
Fixes scylladb/scylladb#28203

Backport: This improves the stability of our CI, so let's
          backport it to all supported versions.

- (cherry picked from commit 628e74f157)

- (cherry picked from commit ac4af5f461)

- (cherry picked from commit c5239edf2a)

- (cherry picked from commit a256ba7de0)

- (cherry picked from commit f83f911bae)

Parent PR: #28602

Closes scylladb/scylladb#28622

* github.com:scylladb/scylladb:
  test: cluster: Reduce wait time in test_sync_point
  test: cluster: Fix test_sync_point
  test: cluster: Await sync points asynchronously
  test: cluster: Create sync points asynchronously
  test: cluster: Fetch hint metrics asynchronously
2026-02-18 12:46:35 +02:00
Anna Mikhlin
7f5c2768d1 .github/workflows: ignore quoted comments for trigger CI
Prevent CI from being triggered when the trigger-ci command appears inside
quoted (>) comment text.
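
A rough Python sketch of such a filter, assuming the command is matched as a plain substring. The function name and the default `command` value are placeholders, and the real workflow may match the comment differently.

```python
def should_trigger_ci(comment_body, command="trigger-ci"):
    """True if `command` appears outside Markdown-quoted ('>') lines."""
    for line in comment_body.splitlines():
        if line.lstrip().startswith(">"):
            continue  # quoted text from an earlier comment
        if command in line:
            return True
    return False
```

A reply that only quotes an earlier request is ignored, while a fresh request placed below the quote still triggers.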

Fixes: https://scylladb.atlassian.net/browse/RELENG-271

Closes scylladb/scylladb#28604

(cherry picked from commit 33cf97d688)

Closes scylladb/scylladb#28651
2026-02-18 12:45:45 +02:00
Ernest Zaslavsky
8edc0e6df9 s3_client: limit multipart upload concurrency
Prevent launching hundreds or thousands of fibers during multipart uploads
by capping concurrent part submissions to 16.
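
One way to picture the cap is a fixed pool of workers pulling parts from a shared iterator. The Python sketch below uses asyncio and a fake upload function; `upload_parts` and `demo` are hypothetical names, and the actual client is C++ on Seastar, so this shows only the general pattern.

```python
import asyncio

MAX_CONCURRENT_PARTS = 16  # cap taken from the commit message


async def upload_parts(parts, upload_one, limit=MAX_CONCURRENT_PARTS):
    """Upload all parts using at most `limit` workers."""
    pending = iter(enumerate(parts))
    results = {}

    async def worker():
        # Each worker pulls the next part only after finishing the previous
        # one, so no more than `limit` uploads are ever in flight.
        for index, part in pending:
            results[index] = await upload_one(part)

    await asyncio.gather(*(worker() for _ in range(limit)))
    return [results[i] for i in range(len(results))]


async def demo(num_parts=100):
    in_flight = 0
    peak = 0

    async def fake_upload(part):
        nonlocal in_flight, peak
        in_flight += 1
        peak = max(peak, in_flight)
        await asyncio.sleep(0)  # yield so other workers can run
        in_flight -= 1
        return part

    results = await upload_parts(range(num_parts), fake_upload)
    return peak, results
```

Only `limit` tasks are created regardless of the number of parts, rather than one task per part gated by a semaphore.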

Closes scylladb/scylladb#28554

(cherry picked from commit 034c6fbd87)

Closes scylladb/scylladb#28666
2026-02-18 12:45:11 +02:00
Patryk Jędrzejczak
63abb3e6cd test: test_restart_leaving_replica_during_cleanup: reconnect driver after restart
The test can currently fail like this:
```
>           await cql.run_async(f"ALTER TABLE {ks}.test WITH tablets = {{'min_tablet_count': 1}}")
E           cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.158.27.9:9042 datacenter1>: <Error from server: code=0000 [Server error] message="Failed to apply group 0 change due to concurrent modification">})
```
The following happens:
- node A is restarted and becomes the group0 leader,
- the driver sends the ALTER TABLE request to node B,
- the request hits group 0 concurrent modification error 10 times and fails
  because node A performs tablet migrations at the same time.

What is unexpected is that even though the driver session uses the default
retry policy, the driver doesn't retry the request on node A. The request
is guaranteed to succeed on node A because it's the only node adding group0
entries.

The driver doesn't retry the request on node A because of a missing
`wait_for_cql_and_get_hosts` call. We add it in this commit. We also reconnect
the driver just in case to prevent hitting scylladb/python-driver#295.

Moreover, we can revert the workaround from
4c9efc08d8, as the fix from this commit also
prevents DROP KEYSPACE failures.

The commit has been tested in byo with `_concurrent_ddl_retries{0}` to
verify that node A really can't hit group 0 concurrent modification error
and always receives the ALTER TABLE request from the driver. All 300 runs in
each build mode passed.

Fixes #25938

Closes scylladb/scylladb#28632

(cherry picked from commit 0693091aff)

Closes scylladb/scylladb#28672
2026-02-18 12:43:31 +02:00
Calle Wilund
aef06a78b5 commitlog: Always abort replenish queue on loop exit
Fixes #28678

If the replenish loop exits the sleep condition with an empty queue
when "_shutdown" is already set, a waiter might get stuck, unsignalled,
waiting for segments, even though we are exiting.

Simply move queue abort to always be done on loop exit.

Closes scylladb/scylladb#28679

(cherry picked from commit ab4e4a8ac7)

Closes scylladb/scylladb#28692
2026-02-18 12:42:48 +02:00
Ernest Zaslavsky
07314298f9 s3_client: add more constrains to the calc_part_size
Enforce more checks on part size and object size as defined in
"Amazon S3 multipart upload limits", see
https://docs.aws.amazon.com/AmazonS3/latest/userguide/qfacts.html and
https://docs.aws.amazon.com/AmazonS3/latest/userguide/UsingObjects.html

(cherry picked from commit 960adbb439)
2026-02-18 09:40:50 +00:00
Ernest Zaslavsky
dd35777c29 s3_client: add tests for calc_part_size
Introduce tests that validate the corrected multipart part-size
calculation, including boundary conditions and error cases.

(cherry picked from commit 6280cb91ca)
2026-02-18 09:40:49 +00:00
Ernest Zaslavsky
387849a8a3 s3_client: correct multipart part-size logic to respect 10k limit
The previous calculation could produce more than 10,000 parts for large
uploads because we mixed values in bytes and MiB when determining the
part size. This could result in selecting a part size that still
exceeded the AWS multipart upload limit. The updated logic now ensures
the number of parts never exceeds the allowed maximum.

This change also aligns the implementation with the code comment: we
prefer a 50 MiB part size because it provides the best performance, and
we use it whenever it fits within the 10,000-part limit. If it does not,
we increase the part size (in bytes, aligned to MiB) to stay within the
limit.
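
The rule described above can be sketched in Python as follows. This is not the actual `calc_part_size`: the function name and constants are chosen for illustration, everything is kept in bytes, and the further S3 limits on part and object size are not modelled.

```python
MiB = 1024 * 1024
MAX_PARTS = 10_000
PREFERRED_PART_SIZE = 50 * MiB


def calc_part_size_sketch(object_size):
    """Return (part_size_bytes, number_of_parts) for object_size bytes."""
    if object_size <= 0:
        raise ValueError("object size must be positive")
    part_size = PREFERRED_PART_SIZE
    # -(-a // b) is integer ceiling division.
    if -(-object_size // part_size) > MAX_PARTS:
        # Smallest part size that fits in MAX_PARTS, rounded up to a whole MiB.
        min_size = -(-object_size // MAX_PARTS)
        part_size = -(-min_size // MiB) * MiB
    parts = -(-object_size // part_size)
    return part_size, parts
```

Because the part size is only ever rounded up, the resulting part count cannot exceed `MAX_PARTS`.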

(cherry picked from commit 289e910cec)
2026-02-18 09:40:49 +00:00
Piotr Dulikowski
f12ac5f2d2 Merge '[Backport 2025.4] vector_search: Fix flaky vector_store_client_https_rewrite_ca_cert' from Scylladb[bot]
Most likely, the root cause of the flaky test was that the TLS handshake hung for an extended period (60s). This caused
the test case to fail because the ANN request duration exceeded the test case timeout.

The PR introduces two changes:

* Mitigation of the hanging TLS handshake: This issue likely occurred because the test performed certificate rewrites
simultaneously with ANN requests that utilize those certificates.
* Production code fix: This addresses a bug where the TLS handshake itself was not covered by the connection timeout.
Since tls::connect does not perform the handshake immediately, the handshake only occurs during the first write
operation, potentially bypassing connect timeout.

Fixes: #28012

Backport to 2026.01 and 2025.04 is needed, as these branches are also affected and may experience CI flakiness due to this test.

- (cherry picked from commit aef5ff7491)

- (cherry picked from commit 079fe17e8b)

Parent PR: #28617

Closes scylladb/scylladb#28642

* github.com:scylladb/scylladb:
  vector_search: Fix missing timeout on TLS handshake
  vector_search: test: Fix flaky cert rewrite test
2026-02-17 19:06:09 +01:00
Patryk Jędrzejczak
79a5d3c5b1 Merge '[Backport 2025.4] test: explicitly set compression algorithm in test_autoretrain_dict' from Scylladb[bot]
When `test_autoretrain_dict` was originally written, the default
`sstable_compression_user_table_options` was `LZ4Compressor`. The
test assumed (correctly) that initially the compression doesn't use
a trained dictionary, and later in the test scenario, it changed
the algorithm to one with a dictionary.

However, the default `sstable_compression_user_table_options` is now
`LZ4WithDictsCompressor`, so the old assumption is no longer correct.
As a result, the assertion that data is initially not compressed well
may or may not fail depending on dictionary training timing.

To fix this, this commit explicitly sets `ZstdCompressor`
as the initial `sstable_compression_user_table_options`, ensuring that
the assumption that initial compression is without a dictionary
is always met.

Note: `ZstdCompressor` differs from the former default `LZ4Compressor`.
However, it's a better choice — the test aims to show the benefit of
using a dictionary, not the benefit of Zstd over LZ4 (and the test uses
ZstdWithDictsCompressor as the algorithm with the dictionary).

Fixes: https://github.com/scylladb/scylladb/issues/28204
Backport: 2025.4, as test already failed there (and also backport to 2026.1 to make everything consistent).

- (cherry picked from commit e63cfc38b3)

- (cherry picked from commit 9ffa62a986)

Parent PR: #28625

Closes scylladb/scylladb#28665

* https://github.com/scylladb/scylladb:
  test: explicitly set compression algorithm in test_autoretrain_dict
  test: remove unneeded semicolons from python test
2026-02-17 10:21:43 +01:00
Andrzej Jackowski
c46ae2c2ab test: explicitly set compression algorithm in test_autoretrain_dict
When `test_autoretrain_dict` was originally written, the default
`sstable_compression_user_table_options` was `LZ4Compressor`. The
test assumed (correctly) that initially the compression doesn't use
a trained dictionary, and later in the test scenario, it changed
the algorithm to one with a dictionary.

However, the default `sstable_compression_user_table_options` is now
`LZ4WithDictsCompressor`, so the old assumption is no longer correct.
As a result, the assertion that data is initially not compressed well
may or may not fail depending on dictionary training timing.

To fix this, this commit explicitly sets `ZstdCompressor`
as the initial `sstable_compression_user_table_options`, ensuring that
the assumption that initial compression is without a dictionary
is always met.

Note: `ZstdCompressor` differs from the former default `LZ4Compressor`.
However, it's a better choice — the test aims to show the benefit of
using a dictionary, not the benefit of Zstd over LZ4 (and the test uses
ZstdWithDictsCompressor as the algorithm with the dictionary).

Fixes: scylladb/scylladb#28204
(cherry picked from commit 9ffa62a986)
2026-02-16 16:22:58 +00:00
Andrzej Jackowski
91bf817955 test: remove unneeded semicolons from python test
(cherry picked from commit e63cfc38b3)
2026-02-16 16:22:58 +00:00
Jenkins Promoter
bca5c3658e Update pgo profiles - aarch64 2026-02-15 04:32:32 +02:00
Karol Nowacki
23b6cb3f82 vector_search: Fix missing timeout on TLS handshake
Currently the TLS handshake in the vector search client does not have a timeout.
This is because tls::connect does not perform handshake itself; the handshake
is deferred until the first read/write operation is performed. This can lead to long
hangs on ANN requests.

This commit calls tls::check_session_is_resumed() after tls::connect
to force the handshake to happen immediately and to run under with_timeout.

(cherry picked from commit 079fe17e8b)
2026-02-13 21:24:06 +00:00
Karol Nowacki
38a80f00b8 vector_search: test: Fix flaky cert rewrite test
The test is flaky most likely because when TLS certificate rewrite
happens simultaneously with an ANN request, the handshake can hang for a
long time (~60s). This leads to a timeout in the test case.

This change introduces a checkpoint in the test so that it will
wait for the certificate rewrite to happen before sending an ANN request,
which should prevent the handshake from hanging and make the test more reliable.

Fixes: #28012
(cherry picked from commit aef5ff7491)
2026-02-13 21:24:05 +00:00
Dawid Mędrek
3e7602254a test: cluster: Reduce wait time in test_sync_point
If everything is OK, the sync point will not resolve with node 3 dead.
As a result, the waiting will use all of the time we allocate for it,
i.e. 30 seconds. That's a lot of time.

There's no easy way to verify that the sync point will NOT resolve, but
let's at least reduce the waiting to 3 seconds. If there's a bug, it
should be enough to trigger it at some point, while reducing the average
time needed for CI.

(cherry picked from commit f83f911bae)
2026-02-12 12:12:43 +00:00
Dawid Mędrek
2334f297f2 test: cluster: Fix test_sync_point
The test had a few shortcomings that made it flaky or simply wrong:

1. We were verifying that hints were written by checking the size of
   in-flight hints. However, that could potentially lead to problems
   in rare situations.

   For instance, if all of the hints failed to be written to disk, the
   size of in-flight hints would drop to zero, but creating a sync point
   would correspond to the empty state.

   In such a situation, we should fail immediately and indicate what
   the cause was.

2. A sync point corresponds to the hints that have already been written
   to disk. The number of those is tracked by the metric `written`.
   It's a much more reliable way to make sure that hints have been
   written to the commitlog. That ensures that the sync point we'll
   create will really correspond to those hints.

3. The auxiliary function `wait_for` used in the test works like this:
   it executes the passed callback and looks at the result. If it's
   `None`, it retries it. Otherwise, the callback is deemed to have
   finished its execution and no further retries will be attempted.

   Before this commit, we simply returned a bool, and so the code was
   wrong. We improve it.

Note that this fixes scylladb/scylladb#28203, which was a manifestation
of scylladb/scylladb#25879. We created a sync point that corresponded
to the empty state, and so it immediately resolved, even when node 3
was still dead.

Refs scylladb/scylladb#25879
Fixes scylladb/scylladb#28203

(cherry picked from commit a256ba7de0)
2026-02-12 12:12:43 +00:00
Dawid Mędrek
ebf8281b66 test: cluster: Await sync points asynchronously
There's a dedicated HTTP API for communicating with the cluster, so
let's use it instead of yet another custom solution.

(cherry picked from commit c5239edf2a)
2026-02-12 12:12:43 +00:00
Dawid Mędrek
e9ae597d35 test: cluster: Create sync points asynchronously
There's a dedicated HTTP API for communicating with the nodes, so let's
use it instead of yet another custom solution.

(cherry picked from commit ac4af5f461)
2026-02-12 12:12:43 +00:00
Dawid Mędrek
63d001e141 test: cluster: Fetch hint metrics asynchronously
There's a dedicated API for fetching metrics now. Let's use it instead
of developing yet another solution that's also worse.

(cherry picked from commit 628e74f157)
2026-02-12 12:12:43 +00:00
Nadav Har'El
4bd2dcf00d test/cqlpy: add reproducer for hidden Paxos table being shown by DESC
This patch adds a reproducer test showing issue #28183 - that when LWT
is used, hidden tables "...$paxos" are created but they are unexpectedly
shown by DESC TABLES, DESC SCHEMA and DESC KEYSPACE.

The new test was failing (in three places) on Scylla, as those internal
(and illegally-named) tables are listed, and passes on Cassandra
(which doesn't add hidden tables for LWT).

The commit also contains another test, which verifies that a direct
description of a Paxos state table is wrapped in a comment.

Refs #28183.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 9baaddb613)
2026-02-10 15:44:50 +01:00
Jenkins Promoter
c6283f3be2 Update ScyllaDB version to: 2025.4.4 2026-02-09 14:21:23 +02:00
Patryk Jędrzejczak
92d2bc4b47 Merge '[Backport 2025.4] Introduce TTL and retries to address resolution' from Scylladb[bot]
In production environments, we observed cases where the S3 client would repeatedly fail to connect due to DNS entries becoming stale. Because the existing logic only attempted the first resolved address and lacked a way to refresh DNS state, the client could get stuck in a failure loop.

Introduce RR TTL and connection failure retry to
- re-resolve the RR in a timely manner
- forcefully reset and re-resolve addresses
- add a special case when the TTL is 0 and the record must be resolved for every request

Fixes: CUSTOMER-96
Fixes: CUSTOMER-139

Should be backported to 2025.3/4 and 2026.1 since we already encountered it in the production clusters for 2025.3

- (cherry picked from commit bd9d5ad75b)

- (cherry picked from commit 359d0b7a3e)

- (cherry picked from commit ce0c7b5896)

- (cherry picked from commit 5b3e513cba)

- (cherry picked from commit 66a33619da)

- (cherry picked from commit 6eb7dba352)

- (cherry picked from commit a05a4593a6)

- (cherry picked from commit 3a31380b2c)

- (cherry picked from commit 912c48a806)

Parent PR: #27891

Closes scylladb/scylladb#28404

* https://github.com/scylladb/scylladb:
  connection_factory: includes cleanup
  dns_connection_factory: refine the move constructor
  connection_factory: retry on failure
  connection_factory: introduce TTL timer
  connection_factory: get rid of shared_future in dns_connection_factory
  connection_factory: extract connection logic into a member
  connection_factory: remove unnecessary `else`
  connection_factory: use all resolved DNS addresses
  s3_test: remove client double-close
2026-02-05 16:16:54 +01:00
Michał Hudobski
2bc978e358 auth: add CDC streams and timestamps to vector search permissions
It turns out that the cdc driver requires permissions to two additional system tables. This patch adds them to VECTOR_SEARCH_INDEXING and modifies the unit tests. The integration with vector store was tested manually, integration tests will be added in vector-store repository in a follow up PR.

Fixes: SCYLLADB-522

Closes scylladb/scylladb#28519

(cherry picked from commit 6b9fcc6ca3)

Closes scylladb/scylladb#28537
2026-02-05 08:59:45 +02:00
Patryk Jędrzejczak
0e78bec6e8 Merge '[Backport 2025.4] storage_service: set up topology properly in maintenance mode' from Scylladb[bot]
We currently make the local node the only token owner (that owns the
whole ring) in maintenance mode, but we don't update the topology properly.
The node is present in the topology, but in the `none` state. That's how
it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in
`scylla_main`. As a result, the node started in maintenance mode crashes
in the following way in the presence of a vnodes-based keyspace with the
NetworkTopologyStrategy:
```
scylla: locator/network_topology_strategy.cc:207:
    locator::natural_endpoints_tracker::natural_endpoints_tracker(
    const token_metadata &, const network_topology_strategy::dc_rep_factor_map &):
    Assertion `!_token_owners.empty() && !_racks.empty()' failed.
```
Both `_token_owners` and `_racks` are empty. The reason is that
`_tm.get_datacenter_token_owners()` and
`_tm.get_datacenter_racks_token_owners()` called above filter out nodes
in the `none` state.

This bug basically made maintenance mode unusable in customer clusters.

We fix it by changing the node state to `normal`.

We also extend `test_maintenance_mode` to provide a reproducer for

Fixes #27988

This PR must be backported to all branches, as maintenance mode is
currently unusable everywhere.

- (cherry picked from commit a08c53ae4b)

- (cherry picked from commit 9d4a5ade08)

- (cherry picked from commit c92962ca45)

- (cherry picked from commit 408c6ea3ee)

- (cherry picked from commit 53f58b85b7)

- (cherry picked from commit 867a1ca346)

- (cherry picked from commit 6c547e1692)

- (cherry picked from commit 7e7b9977c5)

Parent PR: #28322

Closes scylladb/scylladb#28498

* https://github.com/scylladb/scylladb:
  test: test_maintenance_mode: enable maintenance mode properly
  test: test_maintenance_mode: shutdown cluster connections
  test: test_maintenance_mode: run with different keyspace options
  test: test_maintenance_mode: check that group0 is disabled by creating a keyspace
  test: test_maintenance_mode: get rid of the conditional skip
  test: test_maintenance_mode: remove the redundant value from the query result
  storage_proxy: skip validate_read_replica in maintenance mode
  storage_service: set up topology properly in maintenance mode
2026-02-04 16:49:02 +01:00
Ernest Zaslavsky
b315641f8c connection_factory: includes cleanup
(cherry picked from commit 912c48a806)
2026-02-04 09:41:36 +02:00
Ernest Zaslavsky
a2b8d34c66 dns_connection_factory: refine the move constructor
Clean up the awkward move constructor that was declared in the header
but defaulted in a separate compilation unit, improving clarity and
consistency.

(cherry picked from commit 3a31380b2c)
2026-02-04 09:41:36 +02:00
Ernest Zaslavsky
70ded86636 connection_factory: retry on failure
If connecting to a provided address throws, renew the address list and
retry once (and only once) before giving up.

(cherry picked from commit a05a4593a6)
2026-02-04 09:41:36 +02:00
Ernest Zaslavsky
2994c71eeb connection_factory: introduce TTL timer
Add a TTL-based timer to connection_factory to automatically refresh
resolved host name addresses when they expire.

(cherry picked from commit 6eb7dba352)
2026-02-04 09:41:36 +02:00
Ernest Zaslavsky
c2dd578b8b connection_factory: get rid of shared_future in dns_connection_factory
Move state management from dns_connection_factory into the state class
itself, so that the class encapsulates its internal state and
`dns_connection_factory` no longer manages it.

(cherry picked from commit 66a33619da)
2026-02-04 09:41:35 +02:00
Ernest Zaslavsky
26a5ae27c5 connection_factory: extract connection logic into a member
Extract connection logic into a private member function to make it reusable.

(cherry picked from commit 5b3e513cba)
2026-02-04 09:41:35 +02:00
Ernest Zaslavsky
2a1a8f404c connection_factory: remove unnecessary else
(cherry picked from commit ce0c7b5896)
2026-02-04 09:41:35 +02:00
Ernest Zaslavsky
2b49f19a74 connection_factory: use all resolved DNS addresses
Improve dns_connection_factory to iterate over all resolved
addresses instead of using only the first one.

(cherry picked from commit 359d0b7a3e)
2026-02-04 09:41:35 +02:00
Ernest Zaslavsky
897ce6f2e1 s3_test: remove client double-close
`test_chunked_download_data_source_with_delays` was calling `close()` on a client twice. Remove the unnecessary call.

(cherry picked from commit bd9d5ad75b)
2026-02-04 09:41:35 +02:00
Tomasz Grabiec
fc039c2be5 Merge '[Backport 2025.4] service: pass topology guard to RBNO' from Scylladb[bot]
Currently, raft-based node operations with streaming use topology guards, but repair-based don't.

Topology guards ensure that if a respective session is closed (the operation has finished), each leftover operation being a part of this session fails. Thanks to that we won't incorrectly assume that e.g. the old rpc received late belongs to the newly started operation. This is especially important if the operation involves writes.

Pass a topology_guard down from raft_topology_cmd_handler to repair tasks. Repair tasks already support topology guards.

Fixes: https://github.com/scylladb/scylladb/issues/27759

No topology_guard in any version; needs backport to all versions

- (cherry picked from commit 3fe596d556)

- (cherry picked from commit 2be5ee9f9d)

Parent PR: #27839

Closes scylladb/scylladb#28298

* github.com:scylladb/scylladb:
  service: use session variable for streaming
  service: pass topology guard to RBNO
2026-02-03 11:47:41 +01:00
Tomasz Grabiec
3243952c47 Merge '[Backport 2025.4] load_stats: fix problem with load_stats refresh throwing no_such_column_family' from Scylladb[bot]
When the topology coordinator refreshes load_stats, it caches load_stats for every node. In case the node becomes unresponsive, and fresh load_stats can not be read from the node, the cached version of load_stats will be used. This is to allow the load balancer to have at least some information about the table sizes and disk capacities of the host.

During load_stats refresh, we aggregate the table sizes from all the nodes. This procedure calls db.find_column_family() for each table_id found in load_stats. This function will throw if the table is not found. This will cause load_stats refresh to fail.

It is also possible for a table to have been dropped between the time load_stats has been prepared on the host, and the time it is processed on the topology coordinator. This would also cause an exception in the refresh procedure.

This fixes this problem by checking if the table still exists.

Fixes: #28359

- (cherry picked from commit 71be10b8d6)

- (cherry picked from commit 92dbde54a5)

Parent PR: #28440

Closes scylladb/scylladb#28470

* github.com:scylladb/scylladb:
  test: add test and reproducer for load_stats refresh exception
  load_stats: handle dropped tables when refreshing load_stats
2026-02-03 11:33:57 +01:00
Patryk Jędrzejczak
dcffababbe test: test_maintenance_mode: enable maintenance mode properly
The same issue as the one fixed in
394207fd69.
This one didn't cause real problems, but it's still cleaner to fix it.

(cherry picked from commit 7e7b9977c5)
2026-02-03 11:33:53 +01:00
Patryk Jędrzejczak
4be6788083 test: test_maintenance_mode: shutdown cluster connections
Leaked connections are known to cause inter-test issues.

(cherry picked from commit 6c547e1692)
2026-02-03 11:33:52 +01:00
Patryk Jędrzejczak
043287ac77 test: test_maintenance_mode: run with different keyspace options
We extend the test to provide a reproducer for #27988 and to avoid
similar bugs in the future.

The test slows down from ~14s to ~19s on my local machine in dev
mode. It seems reasonable.

(cherry picked from commit 867a1ca346)
2026-02-03 11:33:52 +01:00
Patryk Jędrzejczak
2b984f07cf test: test_maintenance_mode: check that group0 is disabled by creating a keyspace
In the following commit, we make the rest run with multiple keyspaces,
and the old check becomes inconvenient. We also move it below to the
part of the code that won't be executed for each keyspace.

Additionally, we check if the error message is as expected.

(cherry picked from commit 53f58b85b7)
2026-02-03 11:33:52 +01:00
Patryk Jędrzejczak
59aee20992 test: test_maintenance_mode: get rid of the conditional skip
This skip has already caused trouble.
After 0668c642a2, the skip was always hit, and
the test was silently doing nothing. This made us miss #26816 for a long
time. The test was fixed in 222eab45f8, but we
should get rid of the skip anyway.

We increase the number of writes from 256 to 1000 to make the chance of not
finding the key on server A even lower. If that still happens, it must be
due to a bug, so we fail the test. We also make the test insert rows until
server A is a replica of one row. The expected number of inserted rows is
a small constant, so it should, in theory, make the test faster and cleaner
(we need one row on server A, so we insert exactly one such row).

It's possible to make the test fully deterministic, by e.g., hardcoding
the key and tokens of all nodes via `initial_token`, but I'm afraid it would
make the test "too deterministic" and could hide a bug.

(cherry picked from commit 408c6ea3ee)
2026-02-03 11:33:52 +01:00
Patryk Jędrzejczak
7f133e72af test: test_maintenance_mode: remove the redundant value from the query result
(cherry picked from commit c92962ca45)
2026-02-03 11:33:52 +01:00
Patryk Jędrzejczak
0df478e02e storage_proxy: skip validate_read_replica in maintenance mode
In maintenance mode, the local node adds only itself to the topology. However,
the effective replication map of a keyspace with tablets enabled contains all
tablet replicas. It gets them from the tablets map, not the topology. Hence,
`network_topology_strategy::sanity_check_read_replicas` hits
```
throw std::runtime_error(format("Requested location for node {} not in topology. backtrace {}", id, lazy_backtrace()));
```
for tablet replicas other than the local node.

As a result, all requests to a keyspace with tablets enabled and RF > 1 fail
in debug mode (`validate_read_replica` does nothing in other modes). We don't
want to skip maintenance mode tests in debug mode, so we skip the check in
maintenance mode.

We move the `is_debug_build()` check because:
- `validate_read_replicas` is a static function with no access to the config,
- we want the `!_db.local().get_config().maintenance_mode()` check to be
  dropped by the compiler in non-debug builds.

We also suppress `-Wunneeded-internal-declaration` with `[[maybe_unused]]`.

(cherry picked from commit 9d4a5ade08)
2026-02-03 11:33:52 +01:00
Patryk Jędrzejczak
3ccfed37f6 storage_service: set up topology properly in maintenance mode
We currently make the local node the only token owner (that owns the
whole ring) in maintenance mode, but we don't update the topology properly.
The node is present in the topology, but in the `none` state. That's how
it's inserted by `tm.get_topology().set_host_id_cfg(host_id);` in
`scylla_main`. As a result, the node started in maintenance mode crashes
in the following way in the presence of a vnodes-based keyspace with the
NetworkTopologyStrategy:
```
scylla: locator/network_topology_strategy.cc:207:
    locator::natural_endpoints_tracker::natural_endpoints_tracker(
    const token_metadata &, const network_topology_strategy::dc_rep_factor_map &):
    Assertion `!_token_owners.empty() && !_racks.empty()' failed.
```
Both `_token_owners` and `_racks` are empty. The reason is that
`_tm.get_datacenter_token_owners()` and
`_tm.get_datacenter_racks_token_owners()` called above filter out nodes
in the `none` state.

This bug basically made maintenance mode unusable in customer clusters.

We fix it by changing the node state to `normal`. We also update its
rack, datacenter, and shards count. Rack and datacenter are present in the
topology somehow, but there is nothing wrong with updating them again.
The shard count is also missing, so we better update it to avoid other
issues.

Fixes #27988

(cherry picked from commit a08c53ae4b)
2026-02-03 11:33:51 +01:00
Tomasz Grabiec
df1b3ede1c Merge '[Backport 2025.4] tablets: Make balancing disabling RPC preempt tablet transitions' from Scylladb[bot]
Disabling of balancing waits for the topology state machine to become idle, to guarantee that no migrations are happening or will happen after the call returns. But it doesn't interrupt the scheduler, which means the call can take an arbitrary amount of time. It may wait for tablet repair to be finished, which can take many hours.

We should do it via topology request, which will interrupt the tablet scheduler.

Enabling of balancing can be immediate.

Fixes https://github.com/scylladb/scylladb/issues/27647
Fixes #27210

- (cherry picked from commit ccdb301731)

- (cherry picked from commit ffa11d6a2d)

Parent PR: #27736

Closes scylladb/scylladb#28294

* github.com:scylladb/scylladb:
  test: Verify that repair doesn't block disabling of tablet load balancing
  tablets: Make balancing disabling call preempt tablet transitions
2026-02-03 10:29:37 +01:00
Pavel Emelyanov
cf99ef21e0 Update seastar submodule (assorted fixes for S3 client update)
* seastar 65f936e38...72df0a926 (2):
  > net: expose DNS TTL via net::hostent
  > http: add virtual close() to connection_factory

refs SCYLLADB-435

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#28482
2026-02-03 11:41:30 +03:00
Michał Jadwiszczak
121bbee2e9 cql3/statements/describe_statement: hide paxos state tables
Paxos state tables are internal tables fully managed by Scylla,
and they should neither be exposed to the user nor backed up.

This commit hides this kind of table from all listings, and if such a table
is directly described with `DESC ks."tbl$paxos"`, the description is generated
within a comment and a note for the user is added.

Fixes scylladb/scylladb#28183

(cherry picked from commit f89a8c4ec4)
2026-02-02 23:31:17 +00:00
Tomasz Grabiec
ab4a4ad03f test: Verify that repair doesn't block disabling of tablet load balancing
Refs #27647

(cherry picked from commit ffa11d6a2d)
2026-02-02 21:26:18 +01:00
Tomasz Grabiec
3c3e5d0a0c tablets: Make balancing disabling call preempt tablet transitions
This patch modifies RESTful API handler which disables tablet
balancing to use topology request to wait for already running tablet
transitions. Before, it was just waiting for topology to be idle, so
it could wait much longer than necessary, also for operations which
are not affected by the flag, like repair. And repair can take hours.

New request type is introduced for this synchronization: noop_request.
It will preempt the tablet scheduler, and when the request executes,
we know all later tablet transitions will respect the "balancing
disabled" flag, and only things which are unaffected by the flag,
like repair, will be scheduled.

Fixes #27647

(cherry picked from commit ccdb301731)
2026-02-02 21:25:11 +01:00
Aleksandra Martyniuk
99af3d7bae service: use session variable for streaming
Use session that was retrieved at the beginning of the handler for
node operations with streaming to ensure that the session id won't
change in between.

(cherry picked from commit 2be5ee9f9d)
2026-02-02 17:38:00 +01:00
Aleksandra Martyniuk
1387113ff0 service: pass topology guard to RBNO
Currently, raft-based node operations with streaming use topology
guards, but repair-based don't.

Topology guards ensure that if a respective session is closed
(the operation has finished), each leftover operation being a part
of this session fails. Thanks to that we won't incorrectly assume
that e.g. the old rpc received late belongs to the newly started
operation. This is especially important if the operation involves
writes.

Pass a topology_guard down from raft_topology_cmd_handler to repair
tasks. Repair tasks already support topology guards.

Fixes: https://github.com/scylladb/scylladb/issues/27759
(cherry picked from commit 3fe596d556)
2026-02-02 17:37:37 +01:00
Ferenc Szili
22e0bafaa7 test: add test and reproducer for load_stats refresh exception
This patch adds a test and reproducer for the issue where the load_stats
refresh procedure throws exceptions if any of the tables have been
dropped since load_stats was produced.

(cherry picked from commit 92dbde54a5)
2026-02-02 16:35:08 +01:00
Jenkins Promoter
6698ee205f Update pgo profiles - aarch64 2026-02-01 04:32:56 +02:00
Jenkins Promoter
5942e636ea Update pgo profiles - x86_64 2026-02-01 04:03:33 +02:00
Ferenc Szili
396b5b63ca load_stats: handle dropped tables when refreshing load_stats
When the topology coordinator refreshes load_stats, it caches load_stats
for every node. In case the node becomes unresponsive, and fresh
load_stats can not be read from the node, the cached version of
load_stats will be used. This is to allow the load balancer to
have at least some information about the table sizes and disk capacities
of the host.

During load_stats refresh, we aggregate the table sizes from all the
nodes. This procedure calls db.find_column_family() for each table_id
found in load_stats. This function will throw if the table is not found.
This will cause load_stats refresh to fail.

It is also possible for a table to have been dropped between the time
load_stats has been prepared on the host, and the time it is processed
on the topology coordinator. This would also cause an exception in the
refresh procedure.

This patch fixes this problem by checking if the table still exists.
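
A minimal Python sketch of the idea follows. It is not the actual C++ code: `aggregate_table_sizes` and its arguments are hypothetical, with plain dicts standing in for `load_stats` and a set standing in for the database's table lookup.

```python
def aggregate_table_sizes(per_node_stats, existing_tables):
    """Sum per-table sizes across nodes, ignoring tables that no longer exist.

    per_node_stats: {node: {table_id: size_in_bytes}} (hypothetical shape)
    existing_tables: set of table ids currently known to the database
    """
    totals = {}
    for node_stats in per_node_stats.values():
        for table_id, size in node_stats.items():
            if table_id not in existing_tables:
                # The table was dropped after the stats were produced
                # (or the stats are a stale cached copy): skip, don't throw.
                continue
            totals[table_id] = totals.get(table_id, 0) + size
    return totals
```

Checking existence up front, rather than catching the lookup failure, keeps a single dropped table from failing the whole refresh.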

(cherry picked from commit 71be10b8d6)
2026-02-01 00:33:46 +00:00
Łukasz Paszkowski
960c952eb5 storage/test_out_of_space_prevention.py: Fix async/await bugs
- Add missing await keywords for async operations on s2_log.wait_for()
  and coord_log.wait_for()
- Fix incorrect regex: "compaction .* Split {cf}" → "compaction.*Split {cf}"
- The commit https://github.com/scylladb/scylladb/commit/f7324a4 demoted
  compaction start/end log messages to debug level. Hence add
  compaction=debug log messages to the following tests:
    test_split_compaction_not_triggered
    test_node_restart_while_tablet_split
    test_repair_failure_on_split_rejection
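
Both kinds of bug can be illustrated with a small, self-contained Python sketch. The log line and the `FakeLog` class are made up for the illustration and are not taken from the test suite; only the two patterns come from the list above.

```python
import asyncio
import re

cf = "ks.tbl"  # hypothetical table name
old_pattern = f"compaction .* Split {cf}"
new_pattern = f"compaction.*Split {cf}"

line = f"compaction - [Split {cf}] started"  # made-up log line

# The old pattern demands a literal space right before "Split".
old_match = re.search(old_pattern, line) is not None
new_match = re.search(new_pattern, line) is not None


class FakeLog:
    """Hypothetical stand-in for a log helper with an async wait_for()."""

    def __init__(self):
        self.waited = False

    async def wait_for(self, pattern):
        self.waited = True


async def demo():
    forgotten, awaited = FakeLog(), FakeLog()
    coro = forgotten.wait_for(new_pattern)  # no await: the body never runs
    coro.close()                            # avoid the "never awaited" warning
    await awaited.wait_for(new_pattern)     # with await: the body runs
    return forgotten.waited, awaited.waited


result = asyncio.run(demo())
```

Here `old_match` is false and `new_match` is true, and only the awaited call actually executes `wait_for()`. A missing `await` therefore turns a wait into a silent no-op.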

Fixes https://github.com/scylladb/scylladb/issues/27931

Closes scylladb/scylladb#27932

(cherry picked from commit 76b84b71d1)

Closes scylladb/scylladb#27949
2026-01-31 20:59:03 +02:00
Aleksandra Martyniuk
196143b830 service: node_ops: remove coroutine::lambda wrappers
In storage_service::raft_topology_cmd_handler we pass a lambda
wrapped in coroutine::lambda to a function that creates streaming_task_impl.
The lambda is kept in streaming_task_impl that invokes it in its run
method.

The lambda captures may be destroyed before the lambda is called, leading
to use after free.

Do not wrap a lambda passed to streaming_task_impl into coroutine::lambda.
Use `this auto` to dissociate the lambda lifetime from the calling statement.

Fixes: https://github.com/scylladb/scylladb/issues/28200.

Closes scylladb/scylladb#28201

(cherry picked from commit 65cba0c3e7)

Closes scylladb/scylladb#28244
2026-01-30 16:12:11 +02:00
Botond Dénes
3573535167 Merge '[Backport 2025.4] schema: Apply sstable_compression_user_table_options to CQL aux and Alternator tables' from Scylladb[bot]
In PR 5b6570be52 we introduced the config option `sstable_compression_user_table_options` to allow adjusting the default compression settings for user tables. However, the new option was hooked into the CQL layer and applied only to CQL base tables, not to the whole spectrum of user tables: CQL auxiliary tables (materialized views, secondary indexes, CDC log tables), Alternator base tables, Alternator auxiliary tables (GSIs, LSIs, Streams).

This gap also led to inconsistent default compression algorithms after we changed the option’s default algorithm from LZ4 to LZ4WithDicts (adf9c426c2).

This series introduces a general “schema initializer” mechanism in `schema_builder` and uses it to apply the default compression settings uniformly across all user tables. This ensures that all base and aux tables take their default compression settings from config.

Fixes #26914.

Backport justification: LZ4WithDicts is the new default since 2025.4, but the config option exists since 2025.2. Based on severity, I suggest we backport only to 2025.4 to maintain consistency of the defaults.

- (cherry picked from commit 4ec7a064a9)

- (cherry picked from commit 76b2d0f961)

- (cherry picked from commit 5b4aa4b6a6)

- (cherry picked from commit d5ec66bc0c)

- (cherry picked from commit 1e37781d86)

- (cherry picked from commit 7fa1f87355)

Parent PR: #27204

Closes scylladb/scylladb#28305

* github.com:scylladb/scylladb:
  db/config: Update sstable_compression_user_table_options description
  schema: Add initializer for compression defaults
  schema: Generalize static configurators into schema initializers
  schema: Initialize static properties eagerly
  db: config: Add accessor for sstable_compression_user_table_options
  test: Check that CQL and Alternator tables respect compression config
  test/cqlpy: test compression setting for auxiliary table
  test/alternator: tests for schema of Alternator table
2026-01-30 16:10:35 +02:00
Ernest Zaslavsky
c9fe14b79d aws_error: handle all restartable nested exception types
Previously we only inspected std::system_error inside
std::nested_exception to support a specific TLS-related failure
mode. However, nested exceptions may contain any type, including
other restartable (retryable) errors. This change unwraps one
nested exception per iteration and re-applies all known handlers
until a match is found or the chain is exhausted.
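
The unwrapping loop can be sketched in Python, using the `__cause__` chain as a loose analogue of `std::nested_exception`. The function name and the handler predicates are hypothetical and do not mirror the real `aws_error` code.

```python
def is_retryable(error, handlers):
    """Apply every handler to each exception in the chain, outermost first."""
    current = error
    while current is not None:
        for handler in handlers:
            if handler(current):
                return True
        current = current.__cause__  # unwrap exactly one nested exception
    return False


# Hypothetical handlers: each one recognizes a single retryable error type.
handlers = [
    lambda e: isinstance(e, TimeoutError),
    lambda e: isinstance(e, ConnectionResetError),
]


def make_chain():
    """Build RuntimeError -> ConnectionResetError for the demonstration."""
    try:
        try:
            raise ConnectionResetError("peer reset")
        except ConnectionResetError as inner:
            raise RuntimeError("request failed") from inner
    except RuntimeError as outer:
        return outer
```

With these definitions, `is_retryable(make_chain(), handlers)` finds the nested `ConnectionResetError`, while a plain `ValueError` exhausts the chain without a match.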

Closes scylladb/scylladb#28240

(cherry picked from commit cb2aa85cf5)

Closes scylladb/scylladb#28344
2026-01-30 16:08:59 +02:00
Avi Kivity
498f74f53d test/cqlpy: restore LWT tests marked XFAIL for tablets
Commit 0156e97560 ("storage_proxy: cas: reject for
tablets-enabled tables") marked a bunch of LWT tests as
XFAIL with tablets enabled, pending resolution of #18066.
But since that event is now in the past, we undo the XFAIL
markings (or in some cases, use an any-keyspace fixture
instead of a vnodes-only fixture).

Ref #18066.

Closes scylladb/scylladb#28336

(cherry picked from commit ec70cea2a1)

[avi: skip counters-with-tablets test as not supported in 2025.4]

Closes scylladb/scylladb#28364
2026-01-30 16:07:27 +02:00
Patryk Jędrzejczak
538295e97b test: test_gossiper_orphan_remover: get host ID of the bootstrapping node before it crashes
The test is currently flaky. It tries to get the host ID of the bootstrapping
node via the REST API after the node crashes. This can obviously fail. The
test usually doesn't fail, though, as it relies on the host ID being saved
in `ScyllaServer._host_id` at this point by `ScyllaServer.try_get_host_id()`
repeatedly called in `ScyllaServer.start()`. However, with a very fast crash
and unlucky timings, no such call may succeed.

We deflake the test by getting the host ID before the crash. Note that at this
point, the bootstrapping node must be serving the REST API requests because
`await log.wait_for("finished do_send_ack2_msg")` above guarantees that the
node has started the gossip shadow round, which happens after starting the REST
API.

Fixes #28385

Closes scylladb/scylladb#28388

(cherry picked from commit a2c1569e04)

Closes scylladb/scylladb#28416
2026-01-29 11:27:37 +01:00
Nikos Dragazis
371cff0e06 db/config: Update sstable_compression_user_table_options description
Clarify what "user table" means.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 7fa1f87355)
2026-01-28 12:42:10 +02:00
Nikos Dragazis
914d3f845a schema: Add initializer for compression defaults
In PR 5b6570be52 we introduced the config option
`sstable_compression_user_table_options` to allow adjusting the default
compression settings for user tables. However, the new option was hooked
into the CQL layer and applied only to CQL base tables, not to the whole
spectrum of user tables: CQL auxiliary tables (materialized views,
secondary indexes, CDC log tables), Alternator base tables, Alternator
auxiliary tables (GSIs, LSIs, Streams).

Fix this by moving the logic into the `schema_builder` via a schema
initializer. This ensures that the default compression settings are
applied uniformly regardless of how the table is created, while also
keeping the logic in a central place.

Register the initializer at startup in all executables where schemas are
being used (`scylla_main()`, `scylla_sstable_main()`, `cql_test_env`).

Finally, remove the ad-hoc logic from `create_table_statement`
(redundant as of this patch), remove the xfail markers from the relevant
tests and adjust `test_describe_cdc_log_table_create_statement` to
expect LZ4WithDicts as the default compressor.

Fixes #26914.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 1e37781d86)
2026-01-28 12:42:10 +02:00
Nikos Dragazis
b6deff2547 schema: Generalize static configurators into schema initializers
Extend the `static_configurator` mechanism to support initialization of
arbitrary schema properties, not only static ones, by passing a
`schema_builder` reference to the configurator interface.

As part of this change, rename `static_configurator` to
`schema_initializer` to better reflect its broader responsibility.

Add a checkpoint/restore mechanism to allow de-registering an
initializer (useful for testing; will be used in the next patch).
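
As a rough Python illustration of the registration and checkpoint/restore idea: the class and method names below are hypothetical and are not the real `schema_builder` interface, and a plain dict stands in for the builder.

```python
class SchemaInitializers:
    """Hypothetical registry of callbacks that seed a new builder's defaults."""

    def __init__(self):
        self._initializers = []

    def register(self, initializer):
        self._initializers.append(initializer)

    def checkpoint(self):
        """Remember how many initializers are registered right now."""
        return len(self._initializers)

    def restore(self, checkpoint):
        """De-register everything added after the checkpoint was taken."""
        del self._initializers[checkpoint:]

    def new_builder(self):
        """Run the initializers at construction time, before any setter."""
        props = {}  # stands in for the builder's property set
        for initializer in self._initializers:
            initializer(props)
        return props
```

Because the initializers run when the builder is created, any value set explicitly afterwards simply overwrites the default.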

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit d5ec66bc0c)
2026-01-28 12:42:10 +02:00
Nikos Dragazis
a817aee457 schema: Initialize static properties eagerly
Schemas maintain a set of so-called "static properties". These are not
user-visible schema properties; they are internal values carried by
in-memory `schema` objects for convenience (349bc1a9b6,
https://github.com/scylladb/scylladb/pull/13170#issuecomment-1469848086).

Currently, the initialization of these properties happens when a
`schema_builder` builds a schema (`schema_builder::build()`), by
invoking all registered "static configurators".

This patch moves the initialization of static properties into the
`schema_builder` constructor. With this change, the builder initializes
the properties once, stores them in a data member, and reuses them for
all schema objects that it builds. This doesn't affect correctness as
the values produced by static configurators are "static" by
nature; they do not depend on runtime state.

In the next patch, we will replace the "static configurator" pattern
with a more general pattern that also supports initialization of regular
schema properties, not just static ones. Regular properties cannot be
initialized in `build()` because users may have already explicitly set
values via setters, and there is no way to distinguish between default
values and explicitly assigned ones.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 5b4aa4b6a6)
2026-01-28 12:42:10 +02:00
Nikos Dragazis
001581c69c db: config: Add accessor for sstable_compression_user_table_options
The `sstable_compression_user_table_options` config option determines
the default compression settings for user tables.

In patch 2fc812a1b9, the default value of this option was changed from
LZ4 to LZ4WithDicts and a fallback logic was introduced during startup
to temporarily revert the option to LZ4 until the dictionary compression
feature is enabled.

Replace this fallback logic with an accessor that returns the correct
settings depending on the feature flag. This is cleaner and more
consistent with the way we handle the `sstable_format` option, where the
same problem appears (see `get_preferred_sstable_version()`).

As a consequence, the configuration option must always be accessed
through this accessor. Add a comment to point this out.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 76b2d0f961)
2026-01-28 12:42:07 +02:00
Nikos Dragazis
80e064d61c test: Check that CQL and Alternator tables respect compression config
In patches 11f6a25d44 and 7b9428d8d7 we added tests to verify that
auxiliary tables for both CQL and Alternator have the same default
compression settings as their base tables. These tests do not check
where these defaults originate from; they just verify that they are
consistent.

Add some more tests to verify the actual source of the defaults, which
is expected to be the `sstable_compression_user_table_options`
from the configuration. Unlike the previous tests, these tests require
dedicated Scylla instances with custom configuration, so they must be
placed under `test/cluster/`.

Mark them as xfail-ing. The marker will be removed later in this series.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 4ec7a064a9)
2026-01-27 19:33:35 +02:00
Nadav Har'El
cac8096e41 test/cqlpy: test compression setting for auxiliary table
In the previous patch we noticed that although recently (commit adf9c42,
Refs #26610) we changed the default sstable compressor from LZ4Compressor
to LZ4WithDictsCompressor, this change was only applied to CQL, not to
Alternator.

In this patch we add tests that demonstrate that it's even worse - the
new compression only applies to CQL's *base* table - all the "auxiliary"
tables -
        * Materialized views
        * Secondary index's materialized views
        * CDC log tables

all still have the old LZ4Compressor, different from the base table's
default compressor.

The new test fails on Scylla, reproducing #26914, and passes on
Cassandra (on Cassandra, we only compare the materialized view table,
because SI and CDC are implemented differently).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 7b9428d8d7)
2026-01-27 19:33:21 +02:00
Nadav Har'El
b395458af2 test/alternator: tests for schema of Alternator table
This patch introduces a new test that exposed a previously unknown bug,
Refs #26914:

Recently we saw a lot of patches that change how we create new schemas
(keyspaces and tables), sometimes changing various long-time defaults.
We started to worry that perhaps some of these defaults were applied only
to CQL and not to Alternator. For example, in Refs #26307 we wondered if
perhaps the default "speculative_retry" option is different in Alternator
than in CQL.

This patch includes a new test file test/alternator/test_cql_schema.py,
with tests for verifying how Alternator configures the underlying tables
it creates. This test shows that the "speculative_retry" doesn't have
this suspected bug - it defaults to "99.0PERCENTILE" in both CQL and
Alternator. But unfortunately, we do have this bug with the "compression"
option:

It turns out that recently (commit adf9c42, Refs #26610) we changed the
default sstable compressor from LZ4Compressor to LZ4WithDictsCompressor,
but the change was only applied to CQL, not Alternator. So the test that
"compression" is the same in both fails, so it is marked "xfail", and
I created a new issue to track it - #26914.

Another test verifies that Alternator's "auxiliary" tables - holding
GSIs, LSIs and Streams - have the same default properties as the base
table. This currently seems to hold (there is no bug).

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 11f6a25d44)
2026-01-27 19:32:27 +02:00
Yaron Kaikov
2c4fef8638 .github/workflows/backport-pr-fixes-validation: support Atlassian URL format in backport PR fixes validation
Add support for matching full Atlassian JIRA URLs in the format
https://scylladb.atlassian.net/browse/SCYLLADB-400 in addition to
the bare JIRA key format (SCYLLADB-400).

This makes the validation more flexible by accepting both formats
that developers commonly use when referencing JIRA issues.
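
A minimal sketch of a pattern that accepts both forms. The regex and the
helper name `extract_jira_keys` are illustrative assumptions, not the
workflow's actual code:

```python
import re

# Hypothetical pattern: an optional Atlassian browse-URL prefix,
# followed by a JIRA key such as SCYLLADB-400.
JIRA_REF = re.compile(
    r"(?:https://scylladb\.atlassian\.net/browse/)?\b([A-Z][A-Z0-9]*-\d+)\b"
)


def extract_jira_keys(text: str) -> list[str]:
    """Return the JIRA keys referenced in text, bare or as full URLs."""
    return JIRA_REF.findall(text)
```

Capturing only the key means both spellings normalize to the same value,
so later checks do not need to care which form the author used.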

Fixes: https://github.com/scylladb/scylladb/issues/28373

Closes scylladb/scylladb#28374

(cherry picked from commit 3f10f44232)

Closes scylladb/scylladb#28393
2026-01-27 16:05:34 +02:00
Avi Kivity
2a457484d2 Merge '[Backport 2025.4] test_lwt_shutdown: fix flakiness by removing storage_proxy::stop injection' from Scylladb[bot]
The storage_proxy::stop() is not called by main (it is commented out due to #293), so the corresponding message injection is never hit. When the test releases paxos_state_learn_after_mutate, shutdown may already be in progress or even completed by the time we try to trigger the storage_proxy::stop injection, which makes the test flaky.

Fix this by completely removing the storage_proxy::stop injection. The injection is not required for test correctness. Shutdown must wait for the background LWT learn to finish, which is released via the paxos_state_learn_after_mutate injection. The shutdown process blocks on in-flight HTTP requests through seastar::httpd::http_server::stop and its _task_gate, so the HTTP request that releases paxos_state_learn_after_mutate is guaranteed to complete before the node is shut down.

Fixes scylladb/scylladb#28260

backport: 2025.4, the `test_lwt_shutdown` test was introduced in this version

- (cherry picked from commit f5ed3e9fea)

- (cherry picked from commit c45244b235)

Parent PR: #28315

Closes scylladb/scylladb#28331

* github.com:scylladb/scylladb:
  storage_proxy: drop stop() method
  test_lwt_shutdown: fix flakiness by removing storage_proxy::stop injection
2026-01-26 12:34:35 +02:00
Petr Gusev
a164887f15 storage_proxy: drop stop() method
It's not called by main.cc and can be confusing.

(cherry picked from commit c45244b235)
2026-01-23 19:24:06 +00:00
Petr Gusev
69a24bb933 test_lwt_shutdown: fix flakiness by removing storage_proxy::stop injection
storage_proxy::stop() is not called by main (it is commented out due to #293),
so the corresponding message injection is never hit. When the test releases
paxos_state_learn_after_mutate, shutdown may already be in progress or even
completed by the time we try to trigger the storage_proxy::stop injection,
which makes the test flaky.

Fix this by completely removing the storage_proxy::stop injection.
The injection is not required for test correctness. Shutdown must wait for the
background LWT learn to finish, which is released via the
paxos_state_learn_after_mutate injection.

The shutdown process blocks on in-flight api HTTP requests through
seastar::httpd::http_server::stop and its _task_gate, so the
shutdown will not prevent the HTTP request that released the
paxos_state_learn_after_mutate from completing successfully.

Fixes scylladb/scylladb#28260

(cherry picked from commit f5ed3e9fea)
2026-01-23 19:24:06 +00:00
Szymon Wasik
9c01dd8d49 Add vector search documentation links to CQL docs
This patch adds links to the Vector Search documentation that is hosted
together with Scylla Cloud docs to the CQL documentation.
It also makes the note about supported capabilities consistent and
removes the experimental label as the feature is GAed.

Fixes: SCYLLADB-371

Closes scylladb/scylladb#28312

(cherry picked from commit 927aebef37)

Closes scylladb/scylladb#28319
2026-01-23 12:00:22 +01:00
Patryk Jędrzejczak
d4277c95e8 test: test_zero_token_nodes_multidc: properly handle reads with CL=LOCAL_ONE
The test is currently flaky. It incorrectly assumes that a read with
CL=LOCAL_ONE will see the data inserted by a preceding write with
CL=LOCAL_ONE in the same datacenter with RF=2.

The same issue has already been fixed for CL=ONE in
21edec1ace. The difference is that
for CL=LOCAL_ONE, only dc1 is problematic, as dc2 has RF=1.

We fix the issue for CL=LOCAL_ONE by skipping the check for dc1.
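
The reasoning can be checked with the usual replica-overlap rule: a read is
guaranteed to see a preceding write only if the number of replicas read
plus the number of replicas written exceeds RF. A small sketch (the
function name is hypothetical):

```python
def read_sees_write(read_replicas: int, write_replicas: int, rf: int) -> bool:
    """A read is guaranteed to overlap a prior write only if R + W > RF."""
    return read_replicas + write_replicas > rf


# dc1 has RF=2: a LOCAL_ONE write and a LOCAL_ONE read may hit different replicas.
dc1_guaranteed = read_sees_write(1, 1, rf=2)
# dc2 has RF=1: the single replica serves both the write and the read.
dc2_guaranteed = read_sees_write(1, 1, rf=1)
```

Here `dc1_guaranteed` is False and `dc2_guaranteed` is True, which matches
skipping the check only for dc1.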

Fixes #28253

The fix addresses CI flakiness and only changes the test, so it
should be backported.

Closes scylladb/scylladb#28274

(cherry picked from commit 1f0f694c9e)

Closes scylladb/scylladb#28304
2026-01-22 18:22:05 +01:00
Patryk Jędrzejczak
a247e19f56 test: test_raft_recovery_during_join: get host ID of the bootstrapping node before it crashes
The test is currently flaky. It tries to get the host ID of the bootstrapping
node via the REST API after the node crashes. This can obviously fail. The
test usually doesn't fail, though, as it relies on the host ID being saved
in `ScyllaServer._host_id` at this point by `ScyllaServer.try_get_host_id()`
repeatedly called in `ScyllaServer.start()`. However, with a very fast crash
and unlucky timings, no such call may succeed.

We deflake the test by getting the host ID before the crash. Note that at this
point, the bootstrapping node must be serving the REST API requests because
`await coordinator_log.wait_for("delay_node_bootstrap: waiting for message")`
above guarantees that the node has submitted the join topology request, which
happens after starting the REST API.

Fixes #28227

Closes scylladb/scylladb#28233

(cherry picked from commit e503340efc)

Closes scylladb/scylladb#28310
2026-01-22 18:19:36 +01:00
Botond Dénes
62f399d8db db/row_cache: make_nonpopulating_reader(): pass cache tracker to snapshot
The API contract in partition_version.hh states that when dealing with
evictable entries, a real cache tracker pointer has to be passed to all
methods that ask for it. The nonpopulating reader violates this, passing
a nullptr to the snapshot. This was observed to cause a crash when a
concurrent cache read accessed the snapshot with the null tracker.

A reproducer is included which fails before and passes after the fix.

Fixes: #26847

Closes scylladb/scylladb#28163

(cherry picked from commit a53f989d2f)

Closes scylladb/scylladb#28279
2026-01-22 12:38:00 +02:00
Jenkins Promoter
ade92555c3 Update ScyllaDB version to: 2025.4.3 2026-01-21 23:16:44 +02:00
Szymon Wasik
9d9ee8ad68 Improve documentation of vector search configuration parameters.
This patch adds a separate group for vector search parameters in the
documentation and fixes small typos and formatting.

Fixes: SCYLLADB-77.

Closes scylladb/scylladb#27385

(cherry picked from commit 4f803aad22)

Closes scylladb/scylladb#27424
2026-01-21 13:21:32 +02:00
Anna Stuchlik
ea222186bc doc: fix the default compaction strategy for Materialized Views
Fixes https://github.com/scylladb/scylladb/issues/24483

Closes scylladb/scylladb#27725

(cherry picked from commit 84e9b94503)

Closes scylladb/scylladb#28285
2026-01-21 06:38:01 +02:00
Botond Dénes
e8f5ac5fb6 reader_concurrency_semaphore: improve handling of base resources
reader_permit::release_base_resources() is a soft evict for the permit:
it releases the resources acquired during admission. This is used in
cases where a single process owns multiple permits, creating a risk of
deadlock, as is the case for repair. In this case,
release_base_resources() acts as a manual eviction mechanism to prevent
permits from blocking each other's admission.

Recently we found a bad interaction between release_base_resources() and
permit eviction. Repair uses both mechanisms: it marks its permits as
inactive and later it also uses release_base_resources(). This practice
might be worth reconsidering, but the fact remains that there is a bug
in the reader permit which causes the base resources to be released
twice when release_base_resources() is called on an already evicted
permit. This is incorrect and is fixed in this patch.

Improve release_base_resources():
* make _base_resources const
* move signal call into the if (_base_resources_consumed()) { }
* use reader_permit::impl::signal() instead of
  reader_concurrency_semaphore::signal()
* all places where base resources are released now call
  release_base_resources()
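
The idea behind the list above, sketched in Python rather than the actual
C++ (the classes `Semaphore` and `Permit` below are simplified stand-ins,
not the real reader_concurrency_semaphore types): every release path goes
through one function, and the signal happens only while the base resources
are still held.

```python
from dataclasses import dataclass


@dataclass
class Semaphore:
    available: int

    def signal(self, units: int) -> None:
        self.available += units


class Permit:
    def __init__(self, semaphore: Semaphore, base_resources: int) -> None:
        self._semaphore = semaphore
        self._base_resources = base_resources  # fixed at admission
        self._base_resources_consumed = True
        semaphore.available -= base_resources

    def release_base_resources(self) -> None:
        # Signal only if the base resources are still held, so a second
        # call (for example after eviction) is a no-op.
        if self._base_resources_consumed:
            self._base_resources_consumed = False
            self._semaphore.signal(self._base_resources)

    def evict(self) -> None:
        # Eviction reuses the same release path instead of signalling directly.
        self.release_base_resources()
```

With this shape, evicting a permit and then calling
release_base_resources() returns the base resources exactly once.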

A reproducer unit test is added, which fails before and passes after the
fix.

Fixes: #28083

Closes scylladb/scylladb#28155

(cherry picked from commit b7bc48e7b7)

Closes scylladb/scylladb#28245
2026-01-21 06:37:38 +02:00
Botond Dénes
ee59610b9b Merge '[Backport 2025.4] The system_replicated_keys should be mark as a system keyspace' from Scylladb[bot]
This PR marks system_replicated_keys as a system keyspace.
It was missing when the keyspace was added.

A side effect is that metrics that are not supposed to be reported are reported.
Fixes #27903

- (cherry picked from commit 83c1103917)

- (cherry picked from commit c6d1c63ddb)

Parent PR: #27954

Closes scylladb/scylladb#28237

* github.com:scylladb/scylladb:
  distributed_loader: system_replicated_keys as system keyspace
  replicated_key_provider: make KSNAME public
2026-01-21 06:36:43 +02:00
Michał Hudobski
4af195c866 cql: fail with a better error when null vector is passed to ann query
Currently when a null vector is passed to an ANN query we fail with a
quite confusing error ("NoHostAvailable: ('Unable to complete the
operation against any hosts', {<Host: 127.0.0.1:9042 datacenter1>:
<Error from server: code=0000 [Server error] message="to_bytes() called
on raw value that is null">})").

This patch fixes that by throwing an InvalidRequestException with an
appropriate message instead.
We also add a test case that validates this behavior.

Fixes: VECTOR-257

Closes scylladb/scylladb#26510

(cherry picked from commit 541b52cdbf)

Closes scylladb/scylladb#28052
2026-01-21 06:35:58 +02:00
Michał Hudobski
9a0849ef36 vector search, paging: add test for paging warnings
We add a test that validates that indexed queries
do not throw a warning related to vector search paging

Fixes: SCYLLADB-248

Closes scylladb/scylladb#28077

(cherry picked from commit c8aa49b196)

Closes scylladb/scylladb#28138
2026-01-20 12:57:54 +02:00
Ernest Zaslavsky
177985a69d aws_error: fix nested exception handling
The loop that unwraps nested exceptions rethrows each nested exception and saves a pointer to the caught std::exception& inner, which lives on the stack, then continues. The pointer is thus left pointing to a released temporary.

Closes scylladb/scylladb#28143

(cherry picked from commit 829bd9b598)

Closes scylladb/scylladb#28243
2026-01-20 11:22:32 +01:00
Patryk Jędrzejczak
8b15975cb8 test: test_group0_schema_versioning: wait for schema sync in system.local
`test_schema_versioning_with_recovery` is currently flaky. It performs
a write with CL=ALL and then checks if the schema version is the same on
all nodes by calling `verify_table_versions_synced`. All nodes are expected
to sync their schema before handling the replica write. The node in
RECOVERY mode should do it through a schema pull, and other nodes should do
it through a group 0 read barrier.

The problem is in `verify_local_schema_versions_synced` that compares the
schema versions in `system.local`. The node in RECOVERY mode updates the
schema version in `system.local` after it acknowledges the replica write
as completed. Hence, the check can fail.

We fix the problem by making the function wait until the schema versions
match.
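
A polling helper of this kind might look as follows. This is only a sketch:
`wait_for_versions_synced` and the `fetch_versions` callback are
hypothetical names, and the real test uses its own helpers.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def wait_for_versions_synced(
    fetch_versions: Callable[[], Awaitable[list[str]]],
    timeout: float = 30.0,
    period: float = 0.1,
) -> str:
    """Poll until every node reports the same schema version, then return it."""
    deadline = time.monotonic() + timeout
    while True:
        versions = await fetch_versions()
        if len(set(versions)) == 1:
            return versions[0]
        if time.monotonic() >= deadline:
            raise TimeoutError(f"schema versions did not converge: {versions}")
        await asyncio.sleep(period)
```

Retrying with a deadline tolerates the window in which the node in
RECOVERY mode has acknowledged the write but not yet updated
`system.local`.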

Note that RECOVERY mode is about to be retired together with the whole
gossip-based topology in 2026.2. So, this test is about to be deleted.
However, we still want to fix it, so that it doesn't bother us in older
branches.

Fixes #23803

Closes scylladb/scylladb#28114

(cherry picked from commit 6b5923c64e)

Closes scylladb/scylladb#28178
2026-01-19 16:35:49 +01:00
Amnon Heiman
1d8089408a distributed_loader: system_replicated_keys as system keyspace
This patch adds system_replicated_keys to the list of known system
keyspaces.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit c6d1c63ddb)
2026-01-19 14:59:26 +00:00
Amnon Heiman
b473736cd6 replicated_key_provider: make KSNAME public
Move KSNAME constant from internal static to public member of
replicated_key_provider_factory class.

It will be used to identify it as a system keyspace.

Signed-off-by: Amnon Heiman <amnon@scylladb.com>
(cherry picked from commit 83c1103917)
2026-01-19 14:59:26 +00:00
Gleb Natapov
4e4bfee41e topology coordinator: complete pending operation for a replaced node
A replaced node may have a pending operation on it. The replace operation
will move the node into the 'left' state and the request will never be
completed. Moreover, the code does not expect a left node to have a
request. It will try to process the request and will crash because the
node for the request will not be found.

The patch checks if the replaced node has a pending request and completes
it with failure. It also changes topology loading code to skip requests
for nodes that are in a left state. This is not strictly needed, but
makes the code more robust.

Fixes #27990

Closes scylladb/scylladb#28009

(cherry picked from commit bee5f63cb6)

Closes scylladb/scylladb#28180
2026-01-19 09:42:20 +02:00
Asias He
09da47a42c repair: Fix sstable_list_to_mark_as_repaired with multishard writer
It was observed:

```
test_repair_disjoint_row_2nodes_diff_shard_count was spuriously failing due to
segfault.

backtrace pointed to a failure when allocating an object from the chain of
freed objects, which indicates memory corruption.

(gdb) bt
    at ./seastar/include/seastar/core/shared_ptr.hh:275
    at ./seastar/include/seastar/core/shared_ptr.hh:430
Usual suspect is use-after-free, so ran the reproducer in the sanitize mode,
which indicated shared ptr was being copied into another cpu through the
multi shard writer:

seastar - shared_ptr accessed on non-owner cpu, at: ...
--------
seastar::smp_message_queue::async_work_item<mutation_writer::multishard_writer::make_shard_writer...

```

The multishard writer itself was fine, the problem was in the streaming consumer
for repair copying a shared ptr. It could work fine with same smp setting, since
there will be only 1 shard in the consumer path, from rpc handler all the way
to the consumer. But with mixed smp setting, the ptr would be copied into the
cpus involved, and since the shared ptr is not cpu safe, the refcount change
can go wrong, causing double free, use-after-free.

To fix, we pass a generic incremental repair handler to the streaming
consumer. The handler is safe to be copied to different shards. It will
be a no op if incremental repair is not enabled or on a different shard.

A reproducer test is added. The test could reproduce the crash
consistently before the fix and work well after the fix.

Fixes #27666

Closes scylladb/scylladb#27870

(cherry picked from commit 0aabf51380)

Closes scylladb/scylladb#28064
2026-01-19 09:39:49 +02:00
Asias He
c5aa29404d repair: Add tablet repair progress report support
This patch adds tablet repair progress report support so that the user
could use the /task_manager/task_status API to query the progress.

In order to support this, a new system table is introduced to record the
user request related info, i.e, start of the request and end of the
request.

The progress is accurate when tablet split or merge happens in the
middle of the request, since the tokens of the tablet are recorded when
the request is started and when repair of each tablet is finished. The
original tablet repair is considered as finished when the finished
ranges cover the original tablet token ranges.

After this patch, the /task_manager/task_status API will report correct
progress_total and progress_completed.
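
The "finished ranges cover the original tablet token ranges" check can be
illustrated with a simple interval-coverage sketch. It assumes
non-wrapping `(start, end]` integer ranges and a hypothetical `covers`
helper; the real implementation works on Scylla's token range types.

```python
Range = tuple[int, int]  # (start, end], non-wrapping; an assumption of this sketch


def covers(original: Range, finished: list[Range]) -> bool:
    """Return True if the union of finished ranges covers the original range."""
    start, end = original
    covered_up_to = start
    for lo, hi in sorted(finished):
        if lo > covered_up_to:
            break  # gap: part of the original range is not repaired yet
        covered_up_to = max(covered_up_to, hi)
        if covered_up_to >= end:
            return True
    return covered_up_to >= end
```

Because only coverage matters, the check gives the same answer whether the
tablet was split into smaller ranges or merged into a larger one while the
request was running.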

Fixes #22564
Fixes #26896

Closes scylladb/scylladb#27679

(cherry picked from commit 4f77dd058d)

Closes scylladb/scylladb#28065
2026-01-19 09:39:13 +02:00
Nikos Dragazis
29c534c6e7 test: database_test: Fix serialization of partition key
The `make_key` lambda erroneously allocates a fixed 8-byte buffer
(`sizeof(s.size())`) for variable-length strings, potentially causing
uninitialized bytes to be included. If such bytes exist and they are
not valid UTF-8 characters, deserialization fails:

```
ERROR 2026-01-16 08:18:26,062 [shard 0:main] testlog - snapshot_list_contains_dropped_tables: cql env callback failed, error: exceptions::invalid_request_exception (Exception while binding column p1: marshaling error: Validation failed - non-UTF8 character in a UTF8 string, at byte offset 7)
```
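
The failure mode can be imitated in Python. This is an analogue of the C++
bug, not the test's code: `make_key_buggy` fills a fixed 8-byte buffer with
a stand-in "garbage" byte to play the role of uninitialized memory, and
both function names are hypothetical.

```python
def make_key_buggy(s: str, garbage: int = 0xFF) -> bytes:
    """Write s into a fixed 8-byte buffer; unused bytes keep their garbage value."""
    buf = bytearray([garbage] * 8)  # stands in for uninitialized memory
    data = s.encode("utf-8")
    buf[: len(data)] = data
    return bytes(buf)


def make_key_fixed(s: str) -> bytes:
    """Size the buffer from the string's contents, not from sizeof(size_t)."""
    return s.encode("utf-8")
```

Decoding `make_key_buggy("abc")` as UTF-8 fails on the trailing garbage
bytes, while `make_key_fixed("abc")` round-trips cleanly.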

Fixes #28195.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>

Closes scylladb/scylladb#28197

(cherry picked from commit 8aca7b0eb9)

Closes scylladb/scylladb#28209
2026-01-19 09:38:45 +02:00
Łukasz Paszkowski
c04007c755 database: Log message after critical_disk_utilization mode is set
This is a follow-up of the previous fix: https://github.com/scylladb/scylladb/pull/26030

The test test_user_writes_rejection starts a 3-node cluster and
creates a large file on one of the nodes, to trigger the out-of-space
prevention mechanism, which should reject writes on that node.

It waits for the log message 'Setting critical disk utilization mode: true'
and then executes a write expecting the node to reject it.

Currently, the message is logged before the `_critical_disk_utilization`
variable is actually updated. This causes the test to fail sporadically
if it runs quickly enough.

The fix splits the logging into two steps:
1. "Asked to set critical disk utilization mode" - logged before any action
2. "Set critical disk utilization mode" - logged after `_critical_disk_utilization` has been updated

The tests are updated to wait for the second message.

Fixes https://github.com/scylladb/scylladb/issues/26004

Closes scylladb/scylladb#26392

(cherry picked from commit 7ec369b900)

Closes scylladb/scylladb#26626
2026-01-19 06:42:01 +02:00
Botond Dénes
b738be094f Merge '[Backport 2025.4] Make commitlog replay handle files with corrupt file header (non-zero) as data loss, not startup failure' from Scylladb[bot]
Fixes #26744

If a segment to replay is broken such that the main header is not zero, but still broken, we throw header_checksum_error. This was not handled in replayer, which grouped this into the "user error/fundamental problem" category.

However, assuming we allow for "real" disk corruption, this should really be treated same as data corruption, i.e. reported data loss, not failure to start up.

The `test_one_big_mutation_corrupted_on_startup` test accidentally sometimes provoked this issue, by doing random file wrecking, which on rare occasions provoked this, and thus failed test due to scylla not starting up, instead of losing data as expected.

- (cherry picked from commit 9b5f3d12a3)

- (cherry picked from commit e48170ca8e)

- (cherry picked from commit 8c4ac457af)

Parent PR: #27556

Closes scylladb/scylladb#27682

* github.com:scylladb/scylladb:
  test::cluster::dtest::tools::files: Remove file
  commitlog_replay: Handle fully corrupt files same as partial corruption.
  test::pylib::suite::base: Split options.name test specifier only once
2026-01-19 06:39:29 +02:00
Anna Stuchlik
64039588db doc: clarify the information about SSTable version support
Fixes https://github.com/scylladb/scylladb/issues/27765

Closes scylladb/scylladb#27835

(cherry picked from commit 791ab4ed02)

Closes scylladb/scylladb#28122
2026-01-16 16:19:47 +02:00
Calle Wilund
999dfb0e5e db::commitlog: Fix sanity check error on race between segment flushing and oversized alloc
Fixes #27992

When doing a commit log oversized allocation, we lock out all other writers by grabbing
the _request_controller semaphore fully (max capacity).
We thereafter assert that the semaphore is in fact zero. However, due to how things work
with the bookkeeping here, the semaphore can in fact become negative (some paths will not
actually wait for the semaphore, because this could deadlock).

Thus, if, after we grab the semaphore and execution actually returns to us (task schedule),
new_buffer via segment::allocate is called (due to a non-fully-full segment), we might
in fact grab the segment overhead from zero, resulting in a negative semaphore.

The same problem applies later when we try to sanity check the return of our permits.

Fix is trivial, just accept less-than-zero values, and take same possible ltz-value
into account in exit check (returning units)

Added whitebox (special callback interface for sync) unit test that provokes/creates
the race condition explicitly (and reliably).

Closes scylladb/scylladb#27998

(cherry picked from commit a7cdb602e1)

Closes scylladb/scylladb#28099
2026-01-16 16:19:18 +02:00
Sergey Zolotukhin
96275adf1c test: disable test_start_bootstrapped_with_invalid_seed
The test intermittently fails when an invalid DNS name is resolved,
likely due to ISP DNS error hijacking (see scylladb/scylladb#28153).

Disable this test to unblock CI.

Fixes scylladb/scylladb#28153

Closes scylladb/scylladb#28162

(cherry picked from commit 799d837295)
2026-01-15 17:01:30 +02:00
Avi Kivity
8ae3db2aff Update seastar submodule (accept unbounded recursion)
* seastar 60e4b3b921...65f936e384 (1):
  > net: posix_server_socket_impl: coroutinize accept(), fix unbounded recursion

Fixes #28166
2026-01-15 14:23:09 +02:00
Jenkins Promoter
59da649741 Update pgo profiles - aarch64 2026-01-15 04:49:34 +02:00
Jenkins Promoter
f79fab6264 Update pgo profiles - x86_64 2026-01-15 03:50:54 +02:00
Michał Hudobski
2dc74d66cd auth: fix cdc vector search indexing permission bug
VECTOR_SEARCH_INDEXING permission didn't work on cdc tables as we mistakenly checked for vector indexes on the cdc table instead of the base.
This patch fixes that and adds a test that validates this behavior.

Fixes: VECTOR-476

Closes scylladb/scylladb#28050

(cherry picked from commit e2e479f20d)

Closes scylladb/scylladb#28068
2026-01-13 17:58:46 +01:00
Patryk Jędrzejczak
8a833e0400 Merge '[Backport 2025.4] raft topology: preserve IP -> ID mapping of a replacing node on restart' from Scylladb[bot]
We currently do it only for a bootstrapping node, which is a bug. The
missing IP can cause an internal error, for example, in the following
scenario:
- replace fails during streaming,
- all live nodes are shut down before the rollback of replace completes,
- all live nodes are restarted,
- live nodes start hitting internal error in all operations that
  require IP of the replacing node (like client requests or REST API
  requests coming from nodetool).

We fix the bug here, but we do it separately for replace with different
IP and replace with the same IP.

For replace with different IP, we persist the IP -> host ID mapping
in `system.peers` just like for bootstrap. That's necessary, since there
is no other way to determine IP of the replacing node on restart.

For replace with the same IP, we can't do the same. This would require
deleting the row corresponding to the node being replaced from
`system.peers`. That's fine in theory, as that node is permanently
banned, so its IP shouldn't be needed. Unfortunately, we have many
places in the code where we assume that IP of a topology member is always
present in the address map or that a topology member is always present in
the gossiper endpoint set. Examples of such places:
- nodetool operations,
- REST API endpoints,
- `db::hints::manager::store_hint`,
- `group0_voter_handler::update_nodes`.

We could fix all those places and verify that drivers work properly when
they see a node in the token metadata, but not in `system.peers`.
However, that would be too risky to backport.

We take a different approach. We recover IP of the replacing node on
restart based on the state of the topology state machine and
`system.peers` just after loading `system.peers`.

We rely on the fact that group 0 is set up at this point. The only case
where this assumption is incorrect is a restart in the Raft-based
recovery procedure. However, hitting this problem then seems improbable,
and even if it happens, we can restart the node again after ensuring
that no client and REST API requests come before replace is rolled back
on the new topology coordinator. Hence, it's not worth complicating the
fix (by e.g. looking at the persistent topology state instead of the
in-memory state machine).

Fixes #28057

Backport this PR to all branches as it fixes a problematic bug.

- (cherry picked from commit fc4c2df2ce)

- (cherry picked from commit 4526dd93b1)

- (cherry picked from commit 749b0278e5)

- (cherry picked from commit 0fed9f94f8)

Parent PR: #27435

Closes scylladb/scylladb#28100

* https://github.com/scylladb/scylladb:
  gossiper: add_saved_endpoint: make generations of excluded nodes negative
  test: introduce test_full_shutdown_during_replace
  utils: error_injection: allow aborting wait_for_message
  raft topology: preserve IP -> ID mapping of a replacing node on restart
2026-01-13 17:19:39 +01:00
Michał Jadwiszczak
cbbe8ef273 docs/dev/service_levels: update docs to service levels on raft
Since Scylla 6.0, service levels are managed by Raft group0.
This patch updates table name used by service levels and adds a
paragraph describing service levels on raft.

Fixes scylladb/scylladb#18177

Closes scylladb/scylladb#26556

(cherry picked from commit 649efd198f)

Closes scylladb/scylladb#28130
2026-01-13 16:06:36 +01:00
Patryk Jędrzejczak
1104022d91 gossiper: add_saved_endpoint: make generations of excluded nodes negative
The explanation is in the new comment in `gossiper::add_saved_endpoint`.

We add a test for this change. It's "extremely white-box", but it's better
than nothing.

(cherry picked from commit 0fed9f94f8)
2026-01-13 12:07:18 +01:00
Patryk Jędrzejczak
bd0f876fdb test: introduce test_full_shutdown_during_replace
(cherry picked from commit 749b0278e5)
2026-01-13 12:07:18 +01:00
Patryk Jędrzejczak
e2d97bd1f4 utils: error_injection: allow aborting wait_for_message
The test added in the following commit utilizes it.

(cherry picked from commit 4526dd93b1)
2026-01-13 12:07:18 +01:00
Patryk Jędrzejczak
ef95e1efeb raft topology: preserve IP -> ID mapping of a replacing node on restart
We currently do it only for a bootstrapping node, which is a bug. The
missing IP can cause an internal error, for example, in the following
scenario:
- replace fails during streaming,
- all live nodes are shut down before the rollback of replace completes,
- all live nodes are restarted,
- live nodes start hitting internal error in all operations that
  require IP of the replacing node (like client requests or REST API
  requests coming from nodetool).

We fix the bug here, but we do it separately for replace with different
IP and replace with the same IP.

For replace with different IP, we persist the IP -> host ID mapping
in `system.peers` just like for bootstrap. That's necessary, since there
is no other way to determine IP of the replacing node on restart.

For replace with the same IP, we can't do the same. This would require
deleting the row corresponding to the node being replaced from
`system.peers`. That's fine in theory, as that node is permanently
banned, so its IP shouldn't be needed. Unfortunately, we have many
places in the code where we assume that IP of a topology member is always
present in the address map or that a topology member is always present in
the gossiper endpoint set. Examples of such places:
- nodetool operations,
- REST API endpoints,
- `db::hints::manager::store_hint`,
- `group0_voter_handler::update_nodes`.

We could fix all those places and verify that drivers work properly when
they see a node in the token metadata, but not in `system.peers`.
However, that would be too risky to backport.

We take a different approach. We recover IP of the replacing node on
restart based on the state of the topology state machine and
`system.peers` just after loading `system.peers`.

We rely on the fact that group 0 is set up at this point. The only case
where this assumption is incorrect is a restart in the Raft-based
recovery procedure. However, hitting this problem then seems improbable,
and even if it happens, we can restart the node again after ensuring
that no client and REST API requests come before replace is rolled back
on the new topology coordinator. Hence, it's not worth complicating the
fix (by e.g. looking at the persistent topology state instead of the
in-memory state machine).

(cherry picked from commit fc4c2df2ce)
2026-01-13 12:07:16 +01:00
Łukasz Paszkowski
d70c049e07 test_user_writes_rejection: Fix test flakiness caused by typo and non-local CL=ONE reads
The current code:
```
try:
   cql.execute(f"INSERT INTO {cf} (pk, t) VALUES (-1, 'x')", host=host[0], execution_profile=cl_one_profile).result()
except Exception:
   pass
```

contains a typo: `host=host[0]` which throws an exception because the Host
object is not subscriptable. The test does not fail because the except
block is too broad and suppresses all exceptions.

Fixing the typo alone is insufficient. The write still succeeds because
the remaining nodes are UP and the query uses CL=ONE, so no failure
should be expected.

Another source of flakiness is data verification:
```
SELECT * FROM {cf} WHERE pk = 0;
```

Even when a coordinator is explicitly provided, using CL=ONE does not
guarantee a local read. The coordinator may forward the read request to
another replica, causing the verification to fail nondeterministically.

This patch rewrites the tests to address these issues:
- Fix the typo: `host[0]` to `hosts[0]`
- Verify data using `MUTATION_FRAGMENTS({cf})` which guarantees a local
  read on the coordinator node
- Reconnect the driver after node restart
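
How a broad `except` hides such a typo can be reproduced without a
cluster. In this sketch `Host`, `execute`, and `WriteRejected` are
stand-ins invented for illustration; only the "Host is not subscriptable"
behaviour is taken from the description above.

```python
class Host:
    """Stand-in for a driver host object; it is not subscriptable."""

    def __init__(self, address: str) -> None:
        self.address = address


class WriteRejected(Exception):
    """Hypothetical error type for the rejection the test expects."""


def execute(query: str, host: Host) -> str:
    return f"ran {query!r} on {host.address}"


hosts = [Host("127.0.0.1"), Host("127.0.0.2")]
host = hosts[0]


def buggy() -> str:
    try:
        # Typo: host[0] raises TypeError before execute() is even called.
        execute("INSERT ...", host=host[0])
    except Exception:
        return "swallowed"  # the TypeError is silently treated as a rejection
    return "executed"


def narrowed() -> str:
    try:
        execute("INSERT ...", host=host[0])
    except WriteRejected:
        return "rejected"
    return "executed"
```

`buggy()` returns "swallowed", whereas `narrowed()` lets the TypeError
propagate, so the typo would have failed the test immediately.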

Fixes https://github.com/scylladb/scylladb/issues/27933

Closes scylladb/scylladb#27934

(cherry picked from commit 7bf26ece4d)

Closes scylladb/scylladb#28094
2026-01-12 14:15:10 +01:00
Patryk Jędrzejczak
7feafe9a62 Merge '[Backport 2025.4] database: truncate_table_on_all_shards: consider can_flush on all shards' from Scylladb[bot]
Currently, database::truncate_table_on_all_shards calls the table::can_flush only on the coordinator shard
and therefore it may miss shards with dirty data if the coordinator shard happens to have empty memtables, leading to clearing the memtables with dirty data rather than flushing them.

This change fixes that by making flush safe to be called, even if the memtable list is empty, and calling it on every shard that can flush (i.e. seal_immediate_fn is engaged).
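
A simplified model of the difference, with a hypothetical `Shard` type and
a single `has_dirty_data` flag standing in for the per-shard flush
condition (shard 0 plays the coordinator):

```python
from dataclasses import dataclass


@dataclass
class Shard:
    has_dirty_data: bool


def shards_to_flush_buggy(shards: list[Shard]) -> list[int]:
    # The decision is taken from the coordinator shard only.
    if shards[0].has_dirty_data:
        return list(range(len(shards)))
    return []


def shards_to_flush_fixed(shards: list[Shard]) -> list[int]:
    # Each shard is asked individually.
    return [i for i, shard in enumerate(shards) if shard.has_dirty_data]
```

With a clean coordinator and a dirty shard 1, the buggy variant flushes
nothing, so the dirty memtable would be cleared instead of flushed; the
fixed variant selects shard 1.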

Also, change database_test::do_with_some_data to use random keys instead of hard-coded key names, to reproduce this issue with `snapshot_list_contains_dropped_tables`.

Fixes #27639

* The issue exists since forever and might cause data loss due to wrongly clearing the memtable, so it needs backport to all live versions

- (cherry picked from commit ec4069246d)

- (cherry picked from commit 5be6b80936)

- (cherry picked from commit 0342a24ee0)

- (cherry picked from commit 02ee341a03)

- (cherry picked from commit 2a803d2261)

- (cherry picked from commit 93b827c185)

- (cherry picked from commit ebd667a8e0)

Parent PR: #27643

Closes scylladb/scylladb#28074

* https://github.com/scylladb/scylladb:
  test: database_test: do_with_some_data: randomize keys
  database: truncate_table_on_all_shards: drop outdated TODO comment
  database: truncate_table_on_all_shards: consider can_flush on all shards
  memtable_list: unify can_flush and may_flush
  test: database_test: add test_flush_empty_table_waits_on_outstanding_flush
  replica: table, storage_group, compaction_group: add needs_flush
  test: database_test: do_with_some_data_in_thread: accept void callback function
2026-01-12 11:19:41 +01:00
Michael Litvak
3580fcaa79 db/view/view_update_generator: move discover_staging_sstables to start
Call discover_staging_sstables in view_update_generator::start() instead
of in the constructor, because the constructor is called during
initialization before sstables are loaded.

The initialization order was changed in 5d1f74b86a and caused this
regression. It means the view update generator won't discover staging
sstables on startup and view updates won't be generated for them. It
also causes issues in sstable cleanup.

view_update_generator::start() is called in a later stage of the
initialization, after sstable loading, so do the discovery of staging
sstables there.

Fixes scylladb/scylladb#27956

(cherry picked from commit 5077b69c06)

Closes scylladb/scylladb#28090
2026-01-12 10:33:11 +01:00
Tomasz Grabiec
5cb3900d90 test: cluster: Fix NoHostAvailable error in test_not_enough_token_owners
The driver must see server_c before we stop server_a, otherwise
there will be no live host in the pool when we attempt to drop
the keyspace:

```
   @pytest.mark.asyncio
    async def test_not_enough_token_owners(manager: ManagerClient):
        """
        Test that:
        - the first node in the cluster cannot be a zero-token node
        - removenode and decommission of the only token owner fail in the presence of zero-token nodes
        - removenode and decommission of a token owner fail in the presence of zero-token nodes if the number of token
          owners would fall below the RF of some keyspace using tablets
        """
        logging.info('Trying to add a zero-token server as the first server in the cluster')
        await manager.server_add(config={'join_ring': False},
                                 property_file={"dc": "dc1", "rack": "rz"},
                                 expected_error='Cannot start the first node in the cluster as zero-token')

        logging.info('Adding the first server')
        server_a = await manager.server_add(property_file={"dc": "dc1", "rack": "r1"})

        logging.info('Adding two zero-token servers')
        # The second server is needed only to preserve the Raft majority.
        server_b = (await manager.servers_add(2, config={'join_ring': False}, property_file={"dc": "dc1", "rack": "rz"}))[0]

        logging.info(f'Trying to decommission the only token owner {server_a}')
        await manager.decommission_node(server_a.server_id,
                                        expected_error='Cannot decommission the last token-owning node in the cluster')

        logging.info(f'Stopping {server_a}')
        await manager.server_stop_gracefully(server_a.server_id)

        logging.info(f'Trying to remove the only token owner {server_a} by {server_b}')
        await manager.remove_node(server_b.server_id, server_a.server_id,
                                  expected_error='cannot be removed because it is the last token-owning node in the cluster')

        logging.info(f'Starting {server_a}')
        await manager.server_start(server_a.server_id)

        logging.info('Adding a normal server')
        await manager.server_add(property_file={"dc": "dc1", "rack": "r2"})

        cql = manager.get_cql()

        await wait_for_cql_and_get_hosts(cql, [server_a], time.time() + 60)

>       async with new_test_keyspace(manager, "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 2} AND tablets = { 'enabled': true }") as ks_name:

test/cluster/test_not_enough_token_owners.py:57:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
/usr/lib64/python3.14/contextlib.py:221: in __aexit__
    await anext(self.gen)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _

manager = <test.pylib.manager_client.ManagerClient object at 0x7f37efe00830>
opts = "WITH replication = {'class': 'NetworkTopologyStrategy', 'replication_factor': 2} AND tablets = { 'enabled': true }"
host = None

    @asynccontextmanager
    async def new_test_keyspace(manager: ManagerClient, opts, host=None):
        """
        A utility function for creating a new temporary keyspace with given
        options. It can be used in a "async with", as:
            async with new_test_keyspace(ManagerClient, '...') as keyspace:
        """
        keyspace = await create_new_test_keyspace(manager.get_cql(), opts, host)
        try:
            yield keyspace
        except:
            logger.info(f"Error happened while using keyspace '{keyspace}', the keyspace is left in place for investigation")
            raise
        else:
>           await manager.get_cql().run_async("DROP KEYSPACE " + keyspace, host=host)
E           cassandra.cluster.NoHostAvailable: ('Unable to complete the operation against any hosts', {<Host: 127.69.108.39:9042 dc1>: ConnectionException('Pool for 127.69.108.39:9042 is shutdown')})

test/cluster/util.py:544: NoHostAvailable
```

Fixes #28011

Closes scylladb/scylladb#28040

(cherry picked from commit 34df158605)

Closes scylladb/scylladb#28073
2026-01-09 19:10:11 +01:00
Łukasz Paszkowski
6c8663b1ec load_sketch: Allow populating load_sketch with normalized current load
Currently, tablet allocation intentionally ignores the current load
(introduced by commit #1e407ab), which could cause identical shard
selection when allocating a small number of tablets in the same topology.
When a tablet allocator is asked to allocate N tablets (where N is smaller
than the number of shards on a node), it selects the first N lowest shards.
If multiple such tables are created, each allocator run picks the same
shards, leading to tablet imbalance across shards.

This change initializes the load sketch with the current shard load,
scaled into the [0,1] range, ensuring allocation still remains even
while starting from globally least-loaded shards.
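
A toy Python sketch of the effect follows. It is not the real allocator: `allocate` is a hypothetical function, and scaling by dividing by the peak load is an assumption made only for illustration.

```python
def allocate(num_tablets, num_shards, current_load=None):
    """Toy allocator: place tablets one by one on the least-loaded shard.

    current_load is an optional list of per-shard tablet counts. When given,
    it is scaled into [0, 1] (here: divided by the peak load, an assumption)
    and used as the starting load sketch.
    """
    if current_load is None:
        sketch = [0.0] * num_shards
    else:
        peak = max(current_load)
        sketch = [load / peak if peak else 0.0 for load in current_load]
    chosen = []
    for _ in range(num_tablets):
        # Ties are broken by the lowest shard index.
        shard = min(range(num_shards), key=lambda s: (sketch[s], s))
        chosen.append(shard)
        sketch[shard] += 1.0
    return chosen


def create_tables(num_tables, use_current_load):
    # Create several tables with 2 tablets each on a 4-shard node.
    totals = [0, 0, 0, 0]
    for _ in range(num_tables):
        load = totals if use_current_load else None
        for shard in allocate(2, 4, load):
            totals[shard] += 1
    return totals
```

With an empty starting sketch every table lands on shards 0 and 1. Seeding the sketch with the current load spreads the second table onto shards 2 and 3.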

Fixes https://github.com/scylladb/scylladb/issues/27620

Closes https://github.com/scylladb/scylladb/pull/27802

Closes scylladb/scylladb#28060
2026-01-09 18:42:03 +01:00
Piotr Dulikowski
1df6ef365e Merge '[Backport 2025.4] service/storage_service: update service levels cache after upgrade to v2' from Scylladb[bot]
Service levels cache is empty after upgrade to consistent topology
if no mutations are committed to `system.service_levels_v2` or rolling
restart is not done.

To fix the bug, this patch adds service levels cache reloading after
upgrading the SL data accessor to v2 in `storage_service::topology_state_load()`.

Fixes [SCYLLADB-90](https://scylladb.atlassian.net/browse/SCYLLADB-90)

This fix should be backported to all versions containing service levels on Raft.

[SCYLLADB-90]: https://scylladb.atlassian.net/browse/SCYLLADB-90?atlOrigin=eyJpIjoiNWRkNTljNzYxNjVmNDY3MDlhMDU5Y2ZhYzA5YTRkZjUiLCJwIjoiZ2l0aHViLWNvbS1KU1cifQ

- (cherry picked from commit 53d0a2b5dc)

- (cherry picked from commit be16e42cb0)

Parent PR: #27585

Closes scylladb/scylladb#28075

* github.com:scylladb/scylladb:
  service/storage_service: update service levels cache after upgrade to v2
  service/storage_service: check if service levels were already upgraded before doing migration to raft
2026-01-09 17:51:49 +01:00
Anna Stuchlik
ec917fd5e7 doc: add the patch release upgrade procedure for version 2025.4
Adds the patch upgrade guide based on previous upgrade guides.

Fixes https://github.com/scylladb/scylladb/issues/27982

Closes scylladb/scylladb#27985

(cherry picked from commit f614482e66)

Closes scylladb/scylladb#28066
2026-01-09 16:23:27 +02:00
Michał Hudobski
133a92e86c auth: add system table permissions to VECTOR_SEARCH_INDEXING
Due to the recent changes in the vector store service,
the service needs to read two of the system tables
to function correctly. This was not accounted for
when the new permission was added. This patch fixes that
by allowing these tables (group0_history and versions)
to be read with the VECTOR_SEARCH_INDEXING permission.

We also add a test that validates this behavior.

Fixes: SCYLLADB-73

Closes scylladb/scylladb#27546

(cherry picked from commit ce3320a3ff)

Closes scylladb/scylladb#28042

Parent PR: #27546
2026-01-09 14:55:34 +01:00
Botond Dénes
1632853ebd reader_concurrency_semaphore: add protection against negative count resource leaks
The semaphore has detection and protection against regular resource
leaks, where some resources go unaccounted for and are not released by
the time the semaphore is destroyed. There is no detection or protection
against negative leaks: where resources are "made up" out of thin air. This
kind of leak looks benign at first sight: a few extra resources won't
hurt anyone so long as this is a small amount. But it turns out that even a
single extra count resource can defeat a very important anti-deadlock
protection in can_admit_read(): the special case which admits a new
permit regardless of memory resources, when all original count resources
are available. This check uses ==, so if resource > original, the
protection is defeated indefinitely. Instead of just changing == to >=,
we add detection of such negative leaks to signal(), via
on_internal_error_noexcept().
At this time I still don't know how this negative leak happens (the code
doesn't confess); with this detection, hopefully we'll get a clue from
tests or the field. Note that on_internal_error_noexcept() will not
generate a coredump, unless ScyllaDB is explicitly configured to do so.
In production, it will just generate an error log with a backtrace.
The detection also clamps the _resources to _initial_resources, to
prevent any damage from the negative leak.
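
The interaction between the == check and the clamping can be sketched with a toy Python model. This is a sketch only: `ToySemaphore` and its method names are hypothetical, and it models the count resource alone.

```python
class ToySemaphore:
    """Hypothetical model of the count-resource accounting only."""

    def __init__(self, initial_count):
        self.initial = initial_count
        self.available = initial_count
        self.errors = []

    def can_admit_regardless_of_memory(self):
        # Anti-deadlock special case: all original count resources are free.
        # Note the == comparison, as described in the commit message.
        return self.available == self.initial

    def consume(self, n=1):
        self.available -= n

    def signal(self, n=1):
        self.available += n
        if self.available > self.initial:
            # Negative leak detected: record an error and clamp, so the
            # == check above keeps working.
            self.errors.append(
                f"negative leak: {self.available} > {self.initial}"
            )
            self.available = self.initial


sem = ToySemaphore(initial_count=2)
sem.signal(1)  # a resource "made up" out of thin air
```

Without the clamp, `available` would stay at 3 and the special case would never trigger again.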

I just noticed that there is no unit test for the deadlock protection
described above, so one is added in this PR, even if only loosely
related to the rest of the patch.

Fixes: SCYLLADB-163

Closes scylladb/scylladb#27764

(cherry picked from commit e4da0afb8d)

Closes scylladb/scylladb#28004
2026-01-09 13:27:43 +02:00
Benny Halevy
07197ff820 test: database_test: do_with_some_data: randomize keys
With randomized keys, and since we're inserting only 2 keys,
it is possible that they would end up owned only by a single shard,
reproducing #27639 in snapshot_list_contains_dropped_tables.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ebd667a8e0)
2026-01-09 08:12:51 +02:00
Benny Halevy
67c23add98 database: truncate_table_on_all_shards: drop outdated TODO comment
The comment was added in 83323e155e
Since then, table::seal_active_memtable was improved to guarantee
waiting on outstanding flushes on success (See d55a2ac762), so
we can remove this TODO comment (it is also not covered by any issue,
so nobody is planning to work on it).

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 93b827c185)
2026-01-09 08:12:25 +02:00
Benny Halevy
9ff984f0c9 database: truncate_table_on_all_shards: consider can_flush on all shards
can_flush might return a different value for each shard
so check it right before deciding whether to flush or clear a memtable
shard.

Note that under normal condition can_flush would always return true
now that it checks only the presence of the seal memtable function
rather than check memtable_list::empty().

Fixes #27639

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 2a803d2261)
2026-01-09 08:11:58 +02:00
Benny Halevy
ebfdb83270 memtable_list: unify can_flush and may_flush
Now that we have a unit test proving that it's safe to flush an
empty memtable list there is no need to distinguish between
may_flush and can_flush.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 02ee341a03)
2026-01-09 08:09:06 +02:00
Benny Halevy
e9f76e46d1 test: database_test: add test_flush_empty_table_waits_on_outstanding_flush
Test that table::flush waits on outstanding flushes, even if the active memtable is empty

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 0342a24ee0)
2026-01-09 08:09:04 +02:00
Benny Halevy
e9ea043980 replica: table, storage_group, compaction_group: add needs_flush
Table needs flush if not all its memtable lists are empty.
To be used in the next patch for a unit test.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 5be6b80936)
2026-01-09 08:04:58 +02:00
Benny Halevy
a95f7c7aaa test: database_test: do_with_some_data_in_thread: accept void callback function
Many test cases already assume `func` is being called in a seastar
thread and although the function they pass returns a (ready) future,
it serves no purpose other than to conform to the interface.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit ec4069246d)
2026-01-09 07:58:12 +02:00
Jenkins Promoter
488be3c52d Update ScyllaDB version to: 2025.4.2 2026-01-09 06:29:53 +02:00
Michał Jadwiszczak
29fc0d480c service/storage_service: update service levels cache after upgrade to v2
Service levels cache is empty after upgrade to consistent topology
if no mutations are committed to `system.service_levels_v2` or rolling
restart is not done.

To fix the bug, this commit adds service levels cache reloading after
upgrading the SL data accessor to v2 in `storage_service::topology_state_load()`.

Fixes SCYLLADB-90

(cherry picked from commit be16e42cb0)
2026-01-08 22:46:36 +00:00
Michał Jadwiszczak
875b1ecacf service/storage_service: check if service levels were already upgraded
before doing migration to raft

There is no need to call `service_level_controller::upgrade_to_v2()`
on every topology state load, we only need to do it once.

(cherry picked from commit 53d0a2b5dc)
2026-01-08 22:46:36 +00:00
Dawid Mędrek
f5c6310c66 db/hints: Prevent draining hints before hint replay is allowed
Context
-------
The procedure of hint draining boils down to the following steps:

1. Drain a hint sender. That should get rid of all hints stored
   for the corresponding endpoint.
2. Remove the hint directory corresponding to that endpoint.

Obviously, it gets more complex than this high-level perspective.
Without blurring the view, the relevant information is that step 1
in the algorithm above may not be executed.

Breaking it down, it consists of two calls to
`hint_sender::send_hints_maybe()`. The function is responsible for
sending out hints, but it's not unconditional and will only be performed
if both of the following bullets are satisfied:

* `hint_sender::replay_allowed()` is `true`. It is not when
  hint replay hasn't been turned on yet.
* `hint_sender::can_send()` is `true`. It is not if the
  corresponding endpoint is not alive AND it hasn't left the cluster
  AND it's still a normal token owner.

There is one more relevant point: sending hints can be stopped if
replaying hints fails and `hint_sender::send_hints_maybe()` returns
`false`. However, that's not possible in the case of draining.
In that case, if Scylla comes across any failure, it'll simply delete
the corresponding hint segment. Because of that, we ignore it and
only focus on the two bullets.

---

Why is it a problem?
--------------------
If a hint directory is not purged of all hint segments in it,
any attempt to remove it will fail and we'll observe an error like this:

```
Exception when draining <host ID>: std::filesystem::__cxx11::filesystem_error
(error system:39, filesystem error: remove failed: Directory not empty [<path>])
```

The folder with the remaining hints will also stay on disk, which is, of
course, undesired.

---

When can it happen?
-------------------
As highlighted in the Context section of this commit message, the
key part of the code that can lead to a dangerous situation like that
is `hint_sender::send_hints_maybe()`. The function is called twice when
draining a hint endpoint manager: once to purge all of the existing
hints, and another time after flushing all hints stored in commitlog
instances, but not listed by `hint_sender` yet. If any of those calls
misbehaves, we may end up with a problem. That's why it's crucial to
ensure that the function always goes through ALL of the hints.

Dangerous situations:

1. We try to drain hints before hint replay is allowed. That will
   violate the first bullet above.
2. The node we're draining is dead, but it hasn't left the cluster,
   and it still possesses some tokens.

---

How do we solve that?
---------------------
Hint replay is turned on in `main.cc`. Once enabled, it cannot be
disabled. So to address the first bullet above, it suffices to ensure
that no draining occurs beforehand. It's perfectly fine to prevent it.
Soon after hint replay is allowed, `main.cc` also asks the hint manager
to drain all of the endpoint managers whose endpoints are no longer
normal token owners (cf. `db::hints::manager::drain_left_nodes()`).

The other bullet is more tricky. It's important here to know that
draining is only initiated in three situations:

1. As part of the call to `storage_service::notify_left()`.
2. As part of the call to `storage_service::notify_released()`.
3. As part of the call to `db::hints::manager::drain_left_nodes()`.

The last one is trivially non-problematic. The nodes that it'll try to
drain are no longer normal token owners, so `can_send()` must always
return `true`.

The second situation is similar. As we read in the commit message of
scylladb/scylladb@eb92f50413, which
introduced the notion of released nodes, the nodes are no longer
normal token owners:

> In this patch we postpone the hint draining for the "left" nodes to
> the time when we know that the target nodes no longer hold ownership
> of any tokens - so they're no longer referenced in topology. I'm
> calling such nodes "released".

I suggest reading the full commit message there because the problems
described there are somewhat similar to the ones these changes try to solve.

Finally, the first situation: unfortunately, it's more tricky. The same
commit message says:

> When a node is being replaced, it enters a "left" state while still
> owning tokens. Before this patch, this is also the time when we start
> draining hints targeted to this node, so the hints may get sent before
> the token ownership gets migrated to another replica, and these hints
> may get lost.

This suggests that `storage_service::notify_left()` may be called when
the corresponding node still has some tokens! That's something that may
prevent properly draining hints.

Fortunately, no hope is lost. We only drain hints via `notify_left()`
when hinted handoff hasn't been upgraded to being host-ID-based yet.
If it has, draining always happens via `notify_released()`.

When I write this commit message, all of the supported versions of
Scylla 2025.1+ use host-ID-based hinted handoff. That means that
problems can only arise when upgrading from an older version of Scylla
(2024.1 downwards). Because of that, we don't cover it. It would most
likely require more extensive changes.

---

Non-issues
----------
There are notions that are closely related to sending hints. One of them
is the host filter that hinted handoff uses. It decides which endpoints
are eligible for receiving hints, and which are not. Fortunately, all
endpoints rejected by the host filter lose their hint endpoint managers
-- they're stopped as part of that procedure. What's more, draining
hints and changing the host filter cannot be happening at the same time,
so it cannot lead to any problems.

The solution
------------
To solve the described issue, we simply prevent draining hints before
hint replay is allowed. No reproducer test is attached because it's not
feasible to write one.

Fixes scylladb/scylladb#27693

Closes scylladb/scylladb#27713

(cherry picked from commit 77a934e5b9)

Closes scylladb/scylladb#27972
2026-01-08 17:52:28 +02:00
Anna Stuchlik
f68b032ce9 doc: remove cassandra-stress from installation instructions
The cassandra-stress tool is no longer part of the default package
and cannot be run in the way described.

This commit removes the instruction to run cassandra-stress.

Fixes https://github.com/scylladb/scylladb/issues/24994

Closes scylladb/scylladb#27726

(cherry picked from commit 624869de86)

Closes scylladb/scylladb#27951
2026-01-08 16:43:29 +02:00
Benny Halevy
c6593b3e8f db: system_keyspace: get_group0_history: unfreeze_gently
Prevent stall when the group0 history is too long using unfreeze_gently
rather than the synchronous unfreeze() function

Fixes #27872

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#27873

(cherry picked from commit f60033db63)

Closes scylladb/scylladb#27909
2026-01-08 16:43:06 +02:00
Avi Kivity
85f42de3f4 test: sstable_validation_test: actually test ms version
sstable_validation_test tests the `scylla sstable validate` command
by passing it intentionally corrupted sstables. It uses an sstable
cache to avoid re-creating the same sstables. However, the cache
does not consider the sstable version, so if called twice with the
same inputs for different versions, it will return an sstable with
the original version for both calls. As a result, `ms` sstables
were not tested. Fix this bug by adding the sstable version (and
the schema for good measure) to the cache key.
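
The cache bug can be reproduced in a few lines of Python. This is a toy model, not the test's actual helper: `make_sstable` and `SSTableCache` are hypothetical names, and the version strings are used only as labels.

```python
def make_sstable(data, version):
    # Hypothetical stand-in for writing an sstable; returns a descriptor.
    return {"data": data, "version": version}


class SSTableCache:
    def __init__(self, key_includes_version):
        self._key_includes_version = key_includes_version
        self._entries = {}

    def get(self, data, version):
        # The buggy variant keys only on the input data.
        if self._key_includes_version:
            key = (data, version)
        else:
            key = (data,)
        if key not in self._entries:
            self._entries[key] = make_sstable(data, version)
        return self._entries[key]


buggy = SSTableCache(key_includes_version=False)
buggy.get("rows", "me")
buggy_second = buggy.get("rows", "ms")

fixed = SSTableCache(key_includes_version=True)
fixed.get("rows", "me")
fixed_second = fixed.get("rows", "ms")
```

In the buggy variant the second call returns the sstable created for the first version, so the `ms` case is never exercised.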

An additional bug, hidden by the first, was that we corrupted the
sstable by overwriting its Index.db component. But `ms` sstables
don't have an Index.db component, they have a Partitions.db component.
Adjust the corrupting code to take that into account.

With these two fixes, test_scylla_sstable_validate_mismatching_partition_large
fails on `ms` sstables. Disable it for that version. Since it was
previously practically untested, we're not losing any coverage.

Fixing this test unblocks further work on making pytest take charge
of running the tests. pytest exposed this problem, likely by running
it on different runners (and thus reducing the effectiveness of the
cache).

Fixes #27822.

Closes scylladb/scylladb#27825

(cherry picked from commit fc81983d42)

Closes scylladb/scylladb#27863
2026-01-08 16:42:41 +02:00
Michał Jadwiszczak
99842b30e3 test/cluster/test_view_building_coordinator: fix flakiness in test_file_streaming
The test generates a staging sstable on a node and verifies whether
the view is correctly populated.

However view updates generated by a staging sstable
(`view_update_generator::generate_and_propagate_view_updates()`) aren't
awaited by sstable consumer.
It's possible that the view building coordinator may see the task as finished
(so the staging sstable was processed) but not all view updates were
written yet.

This patch fixes the flakiness by waiting until
`scylla_database_view_update_backlog` drops down to 0 on all shards.

Fixes scylladb/scylladb#26683

Closes scylladb/scylladb#27389

(cherry picked from commit 74ab5addd3)

Closes scylladb/scylladb#27739
2026-01-08 16:42:08 +02:00
Lakshmi Narayanan Sreethar
d27849b1d3 sstables: prevent oversized allocation when parsing summary positions
During sstable summary parsing, the entire header was read into a single
buffer upfront and then parsed to obtain the positions. If the header
was too large, it could trigger oversized allocation warnings.

This commit updates the parse method to read one position at a time from
the input stream instead of reading the entire header at once. Since
`random_access_reader` already maintains an internal buffer of 128 KB,
there is no need to pre-read the entire header upfront.
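
The idea of reading one position at a time can be sketched in Python. The sketch assumes a hypothetical layout of 4-byte little-endian positions and uses `io.BytesIO` in place of `random_access_reader`; it does not reflect the real summary format.

```python
import io
import struct


def read_positions(stream, count):
    """Read `count` positions from `stream`, one fixed-size value at a time.

    Assumes (for illustration only) 4-byte little-endian unsigned integers.
    No buffer proportional to `count` is ever allocated for the raw bytes.
    """
    positions = []
    for _ in range(count):
        raw = stream.read(4)
        if len(raw) != 4:
            raise EOFError("truncated positions array")
        positions.append(struct.unpack("<I", raw)[0])
    return positions


payload = struct.pack("<3I", 10, 20, 30)
positions = read_positions(io.BytesIO(payload), 3)
```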

Fixes #24428
Fixes #27590

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#26846

(cherry picked from commit 3eba90041f)

Closes scylladb/scylladb#27638
2026-01-08 16:41:30 +02:00
Benny Halevy
67cce6434b utils: error_injection: wait_for_message: print injection_name and caller source_location on timeout
When waiting for the condition variable times out
we call on_internal_error, but unfortunately, the backtrace
it generates is obfuscated by
`coroutine_handle<seastar::internal::coroutine_traits_base<void>::promise_type>::resume`.

To make the log more useful, print the error injection name
and the caller's source_location in the timeout error message.

Fixes #27531

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#27532

(cherry picked from commit 5f13880a91)

Closes scylladb/scylladb#27584
2026-01-08 16:40:58 +02:00
Botond Dénes
6fb37ae77f Merge '[Backport 2025.4] alternator: fix batch writes during intranode tablet migrations' from Scylladb[bot]
Scylla implements `LWT` in the `storage_proxy::cas` method. This method expects to be called on a specific shard, represented by the `cas_shard` parameter. Clients must create this object before calling `storage_proxy::cas`, check its `this_shard()` method, and jump to `cas_shard.shard()` if it returns false.

The nuance is that by the time the request reaches the destination shard, the tablet may have already advanced in its migration state machine. For example, a client may acquire a `cas_shard` at the `streaming` tablet state, then submit a request to another shard via `smp::submit_to(cas_shard.shard())`. However, the new `cas_shard` created on that other shard might already be in the `write_both_read_new` state, and its `cas_shard.shard()` would not be equal to `this_shard_id()`. Such a broken invariant results in an `on_internal_error` in `storage_proxy::cas`.

Clients of `storage_proxy::cas` are expected to check `cas_shard.this_shard()` and recursively jump to another shard if it returns false. Most calls to `storage_proxy::cas` already implement this logic. The only exception is `executor::do_batch_write`, which currently checks `cas_shard.this_shard()` only once. This can break the invariant if the tablet state changes more than once during the operation.

This PR fixes the issue by implementing recursive `cas_shard.this_shard()` checks in `executor::do_batch_write`. It also adds a test that reproduces the problem.
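
The repeated check can be modelled with a small Python loop. This is a sketch, not the executor code: `run_on_owning_shard`, `owner_of` and the `max_hops` bound are hypothetical and exist only for the illustration.

```python
def run_on_owning_shard(current_shard, owner_of, operation, max_hops=16):
    """Hop between shards until the current one owns the key, then run.

    owner_of() models creating a fresh cas_shard on the shard we are on;
    its answer may change between hops while the tablet migrates.
    """
    for _ in range(max_hops):
        owner = owner_of()
        if owner == current_shard:
            return operation(current_shard)
        # Models smp::submit_to(cas_shard.shard()): continue on the owner.
        current_shard = owner
    raise RuntimeError("tablet kept migrating, giving up")


# The owner changes twice before it settles on shard 2.
owners = iter([1, 2, 2])
ran_on = run_on_owning_shard(0, lambda: next(owners), lambda shard: shard)
```

A single check would have run the operation on shard 1, which no longer owns the tablet by the time the request arrives there.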

Fixes: scylladb/scylladb#27353

backport: needs to be backported to 2025.4

- (cherry picked from commit e60bcd0011)

- (cherry picked from commit 74bf24a4a7)

- (cherry picked from commit 9bef142328)

- (cherry picked from commit c6eec4eeef)

- (cherry picked from commit 3a865fe991)

- (cherry picked from commit 0bcc2977bb)

- (cherry picked from commit 608eee0357)

Parent PR: #27396

Closes scylladb/scylladb#27529

* github.com:scylladb/scylladb:
  alternator/executor.cc: eliminate redundant dk copy
  alternator/executor.cc: release cas_shard on the original shard
  alternator/executor.cc: move shard check into cas_write
  alternator/executor.cc: make cas_write a private method
  alternator/executor.cc: make do_batch_write a private method
  alternator/executor.cc: fix indent
  test_alternator: add test_alternator_invalid_shard_for_lwt
  alternator/executor.cc: avoid cross-shard free
2026-01-08 16:40:20 +02:00
Dawid Mędrek
088be56347 test/cluster/mv: Rewrite test_view_building_scheduling_group
We rewrite the test to avoid flakiness. Instead of looking at the
metrics, we make a trade-off and start depending on a less reliable
mechanism -- logs. We grep all relevant messages printed by Scylla
in TRACE mode and make sure that they were all printed from a context
using the streaming scheduling group.

Although it's a "less proper" way of testing, it should be much more
dependable and avoid flakiness.

Fixes scylladb/scylladb#25957

Closes scylladb/scylladb#26656

(cherry picked from commit 58dc414912)

Closes scylladb/scylladb#27504
2026-01-08 16:39:46 +02:00
Asias He
4bbdee8089 repair: Allow min max range to be updated for repair history
It is observed that:

repair - repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: Failed to update
system.repair_history table of node d27de212-6f32-4649ad76-a9ef1165fdcb:
seastar::rpc::remote_verb_error (repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: range (minimum
token,maximum token) is not in the format of (start, end])

This is because repair checks that the end of the range to be repaired
is inclusive. When small_table_optimization is enabled for regular
repair, a (minimum token,maximum token) range will be used.

To fix, we can relax the check of (start, end] for the min max range.

Fixes #27220

Closes scylladb/scylladb#27357

(cherry picked from commit e97a504775)

Closes scylladb/scylladb#27461
2026-01-08 16:39:10 +02:00
Calle Wilund
ddc22a349c encryption::gcp_host: Add exponential retry for server errors
Fixes #27242

Similar to AWS, Google services may at times simply return a 503,
more or less meaning "busy, please retry". In most cases we rely on
higher-level layers to handle said retry, but we cannot fully do so,
both because we sometimes reach this code through paths that do
no such thing, and because it would be slightly inefficient,
since we'd like to, for example, control the back-off for auth etc.

This simply changes the existing retry loop in gcp_host to
be a little more forgiving, special case 503 errors and extend
the retry to the auth part, as well as re-use the
exponential_backoff_retry primitive.
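
The general shape of such a retry loop can be sketched in Python. This is not the gcp_host code: `ServerBusy`, `call_with_backoff` and all default values are hypothetical and chosen only for illustration.

```python
import time


class ServerBusy(Exception):
    """Hypothetical error standing in for an HTTP 503 response."""


def call_with_backoff(request, max_attempts=5, base_delay=0.1,
                      max_delay=2.0, sleep=time.sleep):
    """Call request(), retrying on ServerBusy with exponential back-off."""
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        try:
            return request()
        except ServerBusy:
            if attempt == max_attempts:
                raise  # out of attempts, let the caller handle it
            sleep(delay)
            delay = min(delay * 2, max_delay)


# Usage: a request that fails twice before succeeding. The sleep function
# is replaced so that the example records delays instead of waiting.
outcomes = iter([ServerBusy(), ServerBusy(), "ok"])
recorded_delays = []


def flaky_request():
    outcome = next(outcomes)
    if isinstance(outcome, Exception):
        raise outcome
    return outcome


result = call_with_backoff(flaky_request, sleep=recorded_delays.append)
```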

v2:
* Avoid backoff if refreshing credentials. Should not add latency due to this.
* Only allow re-auth once per (non-service-failure-backoff) try.
* Add abort source to both request and retry
v3:
* Include timeout and other server errors in retry-backoff
v4:
* Reorder error code handling correctly

Closes scylladb/scylladb#27267

(cherry picked from commit 4169bdb7a6)

Closes scylladb/scylladb#27443
2026-01-08 16:38:32 +02:00
Botond Dénes
6308304c9b Merge '[Backport 2025.4] topology_coordinator: Add barrier to cleanup_target' from Scylladb[bot]
Consider the following scenario:
1. A table has RF=3 and writes use CL=QUORUM
2. One node is down
3. There is a pending tablet migration from the unavailable node
   that is reverted

During the revert, there can be a time window where the pending replica
being cleaned up still accepts writes. This leads to write failures,
as only two nodes (out of four) are able to acknowledge writes.

This patch fixes the issue by adding a barrier to the cleanup_target
tablet transition state, ensuring that the coordinator switches back to
the previous replica set before cleanup is triggered.

Fixes https://github.com/scylladb/scylladb/issues/26512

It's a pre-existing issue. Backport is required to all recent 2025.x versions.

- (cherry picked from commit 669286b1d6)

- (cherry picked from commit 67f1c6d36c)

- (cherry picked from commit 6163fedd2e)

Parent PR: #27413

Closes scylladb/scylladb#27428

* github.com:scylladb/scylladb:
  topology_coordinator: Fix the indentation for the cleanup_target case
  topology_coordinator: Add barrier to cleanup_target
  test_node_failure_during_tablet_migration: Increase RF from 2 to 3
2026-01-08 16:37:52 +02:00
Calle Wilund
5b8d6e21f1 commitlog::read_log_file: Check for eof position on all data reads
Fixes #24346

When reading, we check for each entry and each chunk, if advancing there
will hit EOF of the segment. However, IFF the last chunk being read has
the last entry _exactly_ matching the chunk size, and the chunk ending
at _exactly_ segment size (preset size, typically 32MB), we did not check
the position, and instead complained about not being able to read.

This has literally _never_ happened in actual commitlog (that was replayed
at least), but has apparently happened more and more in hints replay.

Fix is simple, just check the file position against size when advancing
said position, i.e. when reading (skipping already does).

v2:

* Added unit test

Closes scylladb/scylladb#27236

(cherry picked from commit 59c87025d1)

Closes scylladb/scylladb#27346
2026-01-08 16:37:22 +02:00
Aleksandra Martyniuk
50e2a1a9b0 replica: database: change type of tables_metadata::_ks_cf_to_uuid
If there are a lot of tables, a node reports oversized allocation
in _ks_cf_to_uuid of type flat_hash_map.

Change the type to std::unordered_map to prevent oversized allocations.

Fixes: https://github.com/scylladb/scylladb/issues/26787.

Closes scylladb/scylladb#27165

(cherry picked from commit 19a7d8e248)

Closes scylladb/scylladb#27200
2026-01-08 16:36:34 +02:00
Botond Dénes
bd58857680 Merge '[Backport 2025.4] db: batchlog_manager: update _last_replay only if all batches were re…' from Scylladb[bot]
…played

Currently, if flushing hints falls within the repair cache timeout, then the flush_time is set to batchlog_manager::_last_replay. _last_replay is updated on each replay, even if some batches weren't replayed. Due to that, we risk data resurrection.

Update _last_replay only if all batches were replayed.

Fixes: https://github.com/scylladb/scylladb/issues/24415.

Needs backport to all live versions.

- (cherry picked from commit 4d0de1126f)

- (cherry picked from commit e3dcb7e827)

Parent PR: #26793

Closes scylladb/scylladb#27094

* github.com:scylladb/scylladb:
  test: extend test_batchlog_replay_failure_during_repair
  db: batchlog_manager: update _last_replay only if all batches were replayed
2026-01-08 16:36:04 +02:00
Patryk Jędrzejczak
ae5eebc04a Merge '[Backport 2025.4] test/raft: use valid sentinel in liveness check to prevent digest errors' from Scylladb[bot]
Replace -1 with 0 for the liveness check operation to avoid triggering digest validation failures. This prevents rare fatal errors when the cluster is recovering and ensures the test does not violate append_seq invariants.

The value -1 was causing invalid digest results in the append_seq structure, leading to assertion failures. This could happen when the sentinel value was the first (or only) element being appended, resulting in a digest that did not match the expected value.

By using 0 instead, we ensure that the digest calculations remain valid and consistent with the expected behavior of the test.

The specific value of the sentinel is not important, as long as it is a valid elem_t that does not violate the invariants of the append_seq structure. In particular, the sentinel value is typically used only when no valid result is received from any server in the current loop iteration, in which case the loop will retry.

Fixes: scylladb/scylladb#27307

Backporting to active branches - this is a test-only fix (low risk) for a flaky test that exists in older branches (thus affects the CI of active branches).

- (cherry picked from commit 3af5183633)

- (cherry picked from commit 4ba3e90f33)

Parent PR: #28010

Closes scylladb/scylladb#28038

* https://github.com/scylladb/scylladb:
  test/raft: use valid sentinel in liveness check to prevent digest errors
  test/raft: improve debugging in randomized_nemesis_test
  test/raft: improve reporting in the randomized_nemesis_test digest functions
2026-01-08 15:34:00 +01:00
Anna Stuchlik
eea09f8565 doc: remove references to ScyllaDB versions 4.3 and 4.4
We should never refer to the no longer supported OSS versions.
This is a leftover - other mentions were removed a long time ago.

Fixes https://github.com/scylladb/scylladb/issues/19569

Closes scylladb/scylladb#27656

(cherry picked from commit ea6f2a21c6)

Closes scylladb/scylladb#27683
2026-01-08 15:14:50 +01:00
Anna Stuchlik
a003f47def doc: fix the syntax of internal links
Some internal links had the wrong syntax: they were formatted as external links.
As a result, they redirected the user to the outdated Open Source documentation.
This commit fixes that bug.

Fixes https://github.com/scylladb/scylladb/issues/25899

Closes scylladb/scylladb#27905

(cherry picked from commit 375479d96c)

Closes scylladb/scylladb#28003
2026-01-08 14:57:39 +02:00
Emil Maskovsky
b1dcbc2199 test/raft: use valid sentinel in liveness check to prevent digest errors
Replace -1 with 0 for the liveness check operation to avoid triggering
digest validation failures. This prevents rare fatal errors when the
cluster is recovering and ensures the test does not violate append_seq
invariants.

The value -1 was causing invalid digest results in the append_seq
structure, leading to assertion failures. This could happen when the
sentinel value was the first (or only) element being appended, resulting
in a digest that did not match the expected value.

By using 0 instead, we ensure that the digest calculations remain valid
and consistent with the expected behavior of the test.

The specific value of the sentinel is not important, as long as it is
a valid elem_t that does not violate the invariants of the append_seq
structure. In particular, the sentinel value is typically used only
when no valid result is received from any server in the current loop
iteration, in which case the loop will retry.

Fixes: scylladb/scylladb#27307

(cherry picked from commit 4ba3e90f33)
2026-01-08 11:53:19 +01:00
Emil Maskovsky
404fee4568 test/raft: improve debugging in randomized_nemesis_test
Move the post-condition check before the assertion to ensure it is
always executed first. Before, the wrong value could be passed to the
digest_remove assertion, making the pre-check trigger there instead of
the post-check as expected.

Also, add a check in the append_seq constructor to ensure that the
digest value is valid when creating an append_seq object.

(cherry picked from commit 3af5183633)
2026-01-08 11:53:15 +01:00
Emil Maskovsky
0fe860910a test/raft: improve reporting in the randomized_nemesis_test digest functions
The Boost ASSERTs in the digest functions of the randomized_nemesis_test
were not working well inside the state machine digest functions, leading
to unhelpful boost::execution_exception errors that terminated the apply
fiber, and didn't provide any helpful information.

Replaced by explicit checks with on_fatal_internal_error calls that
provide more context about the failure. Also added validation of the
digest value after appending or removing an element, which makes it possible to
determine which operation produced the wrong value.

This effectively reverts the changes done in https://github.com/scylladb/scylladb/pull/19282,
but adds improved error reporting.

Refs: scylladb/scylladb#27307
Refs: scylladb/scylladb#17030

(cherry picked from commit d60b908a8e)
2026-01-08 11:53:07 +01:00
Jenkins Promoter
812750f770 Update pgo profiles - x86_64 2026-01-04 01:40:49 -05:00
Asias He
392c65b83f topology_coordinator: Ensure repair_update_compaction_ctrl is executed
Consider this:

- n1 is a coordinator and schedules tablet repair
- n1 detects tablet repair failed, so it schedules tablet transition to end_repair state
- n1 loses leadership and n2 becomes the new topology coordinator
- n2 runs end_repair on the tablet with session_id=00000000-0000-0000-0000-000000000000
- when a new tablet repair is scheduled, it hangs since the lock is already taken because it was not removed in previous step

To fix, we use the global_tablet_id to index the lock instead of the
session id.

In addition, we retry the repair_update_compaction_ctrl verb in case of
error to ensure the verb is eventually executed. The verb handler is
also updated to check if it is still in end_repair stage.
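
The locking change can be sketched roughly as follows. This is an illustration under assumptions, not the coordinator code: `RepairLocks`, `try_lock` and `unlock` are hypothetical names, and a plain set stands in for the real lock table.

```python
NIL_SESSION = "00000000-0000-0000-0000-000000000000"

class RepairLocks:
    """Locks indexed by tablet id, so the releasing side needs no session id."""

    def __init__(self):
        self._held = set()

    def try_lock(self, tablet_id) -> bool:
        if tablet_id in self._held:
            return False  # a repair of this tablet still holds the lock
        self._held.add(tablet_id)
        return True

    def unlock(self, tablet_id, session_id=NIL_SESSION) -> None:
        # session_id is deliberately ignored: a new coordinator may only
        # know the nil session, but it always knows the tablet id.
        self._held.discard(tablet_id)
```

If n2 calls `unlock(tablet, NIL_SESSION)` during end_repair, the lock taken under n1 is still released and the next repair of that tablet can proceed.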

Fixes #26346

Closes scylladb/scylladb#27740

(cherry picked from commit 3abda7d15e)

Closes scylladb/scylladb#27940
2026-01-01 14:08:29 +02:00
Jenkins Promoter
8dcbe011e6 Update pgo profiles - aarch64 2026-01-01 04:53:35 +02:00
Botond Dénes
c697c6633b Merge 'Remove noexcept from storage_group and table functions to allow exception propagation' from Tomasz Grabiec
Fixed a critical bug where
`storage_group::for_each_compaction_group()` was incorrectly marked
`noexcept`, causing `std::terminate` when actions threw exceptions
(e.g., `utils::memory_limit_reached` during memory-constrained reader
creation).

**Changes made:**

1. Removed `noexcept` from `storage_group::for_each_compaction_group()` declaration and implementation
2. Removed `noexcept` from `storage_group::compaction_groups()` overloads (they call for_each_compaction_group)
3. Removed `noexcept` from `storage_group::live_disk_space_used()` and `memtable_count()` (they call compaction_groups())
4. Kept `noexcept` on `storage_group::flush()` - it's a coroutine that automatically captures exceptions and returns them as exceptional futures
5. Removed `noexcept` from `table_load_stats()` functions in base class, table, and storage group managers

**Rationale:**

There's no reason to kill the server if these functions throw. For
coroutines returning futures, `noexcept` is appropriate because
Seastar automatically captures exceptions and returns them as
exceptional futures. For other functions, proper exception handling
allows the system to recover gracefully instead of terminating.

Fixes #27475

Closes scylladb/scylladb#27476

* github.com:scylladb/scylladb:
  replica: Remove unnecessary noexcept
  replica: Remove noexcept from compaction_groups() functions
  replica: Remove noexcept from storage_group::for_each_compaction_group

(cherry picked from commit 730eca5dac)

Closes scylladb/scylladb#27914
2025-12-30 14:23:30 +01:00
Gleb Natapov
9e205cc3a6 raft topology: Notify that a node was removed only once
Raft topology goes over all nodes in a 'left' state and triggers 'remove
node' notification in case id/ip mapping is available (meaning the node
left recently), but the problem is that, since the mapping is not removed
immediately, when multiple nodes are removed in succession a notification
for the same node can be sent several times. Fix that by sending a
notification only if the node still exists in the peers table. The node
will be removed by the first notification, so subsequent notifications
will not be sent.
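
A rough sketch of the check, with hypothetical names (`notify_removed_nodes`, `on_removed`) and a set standing in for the peers table:

```python
def notify_removed_nodes(left_nodes, peers, on_removed):
    """Send a 'remove node' notification once per node in the 'left' state."""
    for node in left_nodes:
        if node not in peers:
            continue  # an earlier pass already notified about this node
        on_removed(node)
        # Stands in for the notification handler deleting the peers row.
        peers.discard(node)
```

Calling it again while the id/ip mapping for a node is still present sends nothing for that node, because its peers entry is gone.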

Closes scylladb/scylladb#27743

(cherry picked from commit 4a5292e815)

Closes scylladb/scylladb#27913
2025-12-30 11:17:41 +01:00
Dario Mirovic
fa3146e76f test: dtest: audit_test.py: fix audit error log detection
`test_insert_failure_doesnt_report_success` test in `test/cluster/dtest/audit_test.py`
has an insert statement that is expected to fail. Dtest environment uses
`FlakyRetryPolicy`, which has `max_retries = 5`. 1 initial fail and 5 retry fails
means we expect 6 error audit logs.

The test failed because `create keyspace ks` failed once, then succeeded on retry.
It allowed the test to proceed properly, but the last part of the test that expects
exactly 6 failed queries actually had 7.

The goal of this patch is to make sure there are exactly 6 = 1 + `max_retries` failed
queries, counting only the query expected to fail. If other queries fail with
successful retry, it's fine. If other queries fail without successful retry, the test
will fail, as it should in such situations. They are not related to this expected
failed insert statement.
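
The counting rule can be sketched like this. The row layout (`operation` and `error` keys) is an assumption for illustration and not the real audit log schema:

```python
MAX_RETRIES = 5  # FlakyRetryPolicy.max_retries, as stated above

def count_failures_of(audit_rows, query):
    """Count failed audit entries for one specific query only."""
    return sum(
        1 for row in audit_rows
        if row["error"] and row["operation"] == query
    )
```

A `create keyspace ks` entry that failed once and then succeeded does not affect the count for the insert, which must be exactly `1 + MAX_RETRIES`.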

Fixes #27322

Closes scylladb/scylladb#27378

(cherry picked from commit f545ed37bc)

Closes scylladb/scylladb#27582
2025-12-29 18:12:45 +02:00
Nadav Har'El
bc87366b32 Merge '[Backport 2025.4] test: cqlpy: test_protocol_exceptions.py: increase cpp exceptions thr…' from Scylladb[bot]
…eshold

The initial problem:

Some of the tests in test_protocol_exceptions.py started failing. The failure is on the condition that no more than `cpp_exception_threshold` happened.

Test logic:

These tests assert that specific code paths do not throw an exception anymore. Initial implementation ran a code path once, and asserted there were 0 exceptions. Sometimes an exception or several can occur, not directly related to the code paths the tests check, but those would fail the tests.

The solution was to run the tests multiple times. If there is a regression, there would be at least as many exceptions thrown as there are test runs. If there is no regression, a few exceptions might happen, up to 10 per 100 test runs. I have arbitrarily chosen `run_count = 100` and `cpp_exception_threshold = 10` values.

Note that the exceptions are counted per shard, not per code path.

The new problem:

Some parts of the server now throw occasional exceptions a bit more often than before. Based on the logs linked on the issues, it is usually 12.

There are possibly multiple ways to resolve the issue. I have considered logging exceptions and parsing them. I would have to filter exception logs only for wanted exceptions. However, if a new, different exception is introduced, it might not be counted.

Another approach is to just increase the threshold a bit. The issue of throwing more exceptions than before in some other server modules should be addressed by a set of tests for that module, just like these tests check protocol exceptions, not caring who used protocol check code paths.

For those reasons, the solution implemented here is to increase `cpp_exception_threshold` to `20`. It will not make the tests unreliable, because, as mentioned, if there is a regression, there would be at least `run_count` exceptions per `run_count` test runs (1 exception per single test run).

Still, to make the occurrence of "background exceptions" a bit more normalized, `run_count` is doubled too, from `100` to `200`. At first glance this looks like nothing has changed, but doubling both the run count and the exception threshold reflects that the burst does not scale with the run count; it is just that the "jitter" is bigger than the old threshold.
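
The arithmetic behind the new values, as a small sketch. The helper names are hypothetical; only `run_count`, `cpp_exception_threshold` and the numbers come from the description above.

```python
def check_passes(exceptions_seen: int, cpp_exception_threshold: int) -> bool:
    """A test passes when no more than the threshold of exceptions occurred."""
    return exceptions_seen <= cpp_exception_threshold

def min_exceptions_on_regression(run_count: int) -> int:
    """A regressed code path throws at least once per run."""
    return run_count
```

With about 12 background exceptions the old threshold of 10 fails and the new threshold of 20 passes, while a real regression at `run_count = 200` yields at least 200 exceptions and still fails.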

Also, this patch series enables debug logging for `exception` logger. This will allow us to inspect which exceptions happened if a protocol exceptions test fails again.

Fixes #27247
Fixes #27325

Issue observed on master and branch-2025.4. The tests, in the same form, exist on master, branch-2025.4, branch-2025.3, branch-2025.2, and branch-2025.1. Code change is simple, and no issue is expected with backport automation. Thus, backports for all the aforementioned versions are requested.

- (cherry picked from commit 807fc68dc5)

- (cherry picked from commit c30b326033)

Parent PR: #27412

Closes scylladb/scylladb#27555

* github.com:scylladb/scylladb:
  test: cqlpy: test_protocol_exceptions.py: enable debug exception logging
  test: cqlpy: test_protocol_exceptions.py: increase cpp exceptions threshold
2025-12-29 11:27:16 +02:00
Gleb Natapov
1f8c2744a4 topology coordinator: set session id for streaming at the correct time
Commit d3efb3ab6f added a streaming session for rebuild, but it set
the session at request submission time. The session should be set when
the request starts executing, so this patch moves it to the correct
place.

Closes scylladb/scylladb#27757

(cherry picked from commit 04976875cc)

Closes scylladb/scylladb#27867
2025-12-28 13:32:44 +02:00
Ferenc Szili
c08b2290dc test: fix flakyness caused by TRUNCATE retries
The test test_truncate_during_topology_change tests TRUNCATE TABLE while
bootstrapping a new node. With tablets enabled TRUNCATE is a global
topology operation which needs to serialize with bootstrap.

When TRUNCATE TABLE is issued, it first checks if there is an already
queued truncate for the same table. This can happen if a previous
TRUNCATE operation has timed out, and the client retried. The newly
issued truncate will only join the queued one if it is waiting to be
processed, and will fail immediately if the TRUNCATE is already being
processed.

In this test, TRUNCATE will be retried after a timeout (1 minute) due to
the default retry policy, and will be retried up to 3 times, while the
bootstrap is delayed by 2 minutes. This means that the test can validate
the result of a truncate which was started after bootstrap was
completed.

Because of the way truncate joins existing truncate operations, we can
also have the following scenario:
- TRUNCATE times out after one minute because the new node is being
  bootstrapped
- the client retries the TRUNCATE command which also times out after 1m
- the third attempt is received during TRUNCATE being processed which
  fails the test

This patch changes the retry policy of the TRUNCATE operation to
FallthroughRetryPolicy which guarantees that TRUNCATE will not be
retried on timeout. It also increases the timeout of the TRUNCATE from 1
to 4 minutes. This way the test will actually validate the performance
of the TRUNCATE operation which was issued during bootstrap, instead of
the subsequent, retried TRUNCATEs which could have been issued after the
bootstrap was complete.

Fixes: #26347

Closes scylladb/scylladb#27245

(cherry picked from commit d883ff2317)

Closes scylladb/scylladb#27507
2025-12-23 17:06:48 +02:00
Anna Stuchlik
bfff9ebe15 doc: document support for i8g and i8ge instances
Fixes https://github.com/scylladb/scylladb/issues/27703

Closes scylladb/scylladb#27754

(cherry picked from commit 4c247a5d08)

Closes scylladb/scylladb#27827
2025-12-23 10:47:34 +02:00
Anna Stuchlik
0ed82c1877 doc: add a Vector Search page under Features
This commit adds a page with an overview of Vector Search under the Features section.
It includes a link to the VS documentation in ScyllaDB Cloud,
as the feature is only available in ScyllaDB Cloud.

The purpose of the page is to raise awareness of the feature.

Fixes https://scylladb.atlassian.net/browse/VECTOR-215

Closes scylladb/scylladb#27787

(cherry picked from commit 9793a45288)

Closes scylladb/scylladb#27826
2025-12-23 10:15:23 +02:00
Karol Nowacki
1c0891d577 vector_search: test: Fix flaky DNS resolution test
The `vector_store_client_test_dns_resolving_repeated` test had race
conditions causing it to be flaky. Two main issues were identified:

1. Race between initial refresh and manual trigger: The test assumes
    a specific resolution sequence, but timing variations between the
    initial DNS refresh (on client creation) and the first manual
    trigger (in the test loop) can cause unexpected delayed scheduling.

2. Extra triggers from resolve_hostname fiber: During the client
    refresh phase, the background DNS fiber clears the client list.
    If resolve_hostname executes in the window after clearing but
    before the update completes, pending triggers are processed,
    incrementing the resolution count unexpectedly. At count 6, the
    mock resolver returns a valid address (count % 3 == 0), causing
    the test to fail.

The fix relaxes test assertions to verify retry behavior and client
clearing on DNS address loss, rather than enforcing exact resolution
counts.

Fixes: #27074

Closes scylladb/scylladb#27685

(cherry picked from commit addac8b3f7)

Closes scylladb/scylladb#27799
2025-12-23 09:13:22 +02:00
Aleksandra Martyniuk
fc9aac0a58 test: extend test_batchlog_replay_failure_during_repair
Modify test_batchlog_replay_failure_during_repair to also check
that there isn't data resurrection if flushing hints falls within
the repair cache timeout.

(cherry picked from commit e3dcb7e827)
2025-12-22 14:45:08 +01:00
Aleksandra Martyniuk
9f339ec3e0 db: batchlog_manager: update _last_replay only if all batches were replayed
Currently, if flushing hints falls within the repair cache timeout,
then the flush_time is set to batchlog_manager::_last_replay.
_last_replay is updated on each replay, even if some batches weren't
replayed. Due to that, we risk data resurrection.

Update _last_replay only if all batches were replayed.
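
In outline, the rule looks like the sketch below. It is a simplified model and not the batchlog_manager code: the class, `replay_all` and the `send` callback are hypothetical.

```python
class BatchlogManager:
    def __init__(self):
        self.last_replay = None  # time of the last fully successful replay

    def replay_all(self, batches, send, now):
        """Try to send every batch; return the ones that still need replay."""
        failed = [batch for batch in batches if not send(batch)]
        if not failed:
            # Only a complete replay may advance the timestamp.
            self.last_replay = now
        return failed
```

A replay that leaves any batch behind keeps the old `last_replay`, so a later flush_time cannot claim that those batches were already applied.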

Fixes: https://github.com/scylladb/scylladb/issues/24415.
(cherry picked from commit 4d0de1126f)
2025-12-22 14:44:52 +01:00
Michał Hudobski
af14df5459 vector_search: throw an error when we restrict primary in vector search
We currently allow restrictions on single column primary key,
but we ignore the restriction and return all results.
This can confuse the users. We change it so such a restriction
will throw an error and add a test to validate it.

Fixes: VECTOR-331

Closes scylladb/scylladb#27668
2025-12-21 19:29:03 +02:00
Emil Maskovsky
49306c76f0 test/raft: fix race condition in failure_detector_test
The test had a sporadic failure due to a broken promise exception.
The issue was in `test_pinger::ping()` which captured the promise by
move into the subscription lambda, causing the promise to be destroyed
when the lambda was destroyed during coroutine unwinding.

Simplify `test_pinger::ping()` by replacing manual abort_source/promise
logic with `seastar::sleep_abortable()`.
This removes the risk of promise lifetime/race issues and makes the code
simpler and more robust.

Fixes: scylladb/scylladb#27136

Backport to active branches: This fixes a CI test issue, so it is
beneficial to backport the fix. As this is a test-only fix, it is a low
risk change.

Closes scylladb/scylladb#27737

(cherry picked from commit 2a75b1374e)

Closes scylladb/scylladb#27784
2025-12-21 19:26:20 +02:00
Anna Stuchlik
4888f5b008 doc: remove the links to the Download Center
This commit removes the remaining links to the Download Center on the website.
We no longer use it for installation, and we don't want users to infer that
something like that still exists.

Fixes https://github.com/scylladb/scylladb/issues/27753

Closes scylladb/scylladb#27756

(cherry picked from commit f65db4e8eb)

Closes scylladb/scylladb#27785
2025-12-21 19:22:51 +02:00
Benny Halevy
d59beb52ce sstable: add _mutate_sem to serialize link/move with components rewrite
We currently have races, like between moving an sstable from staging
using change_state, or when taking a snapshot, to e.g.
rewrite_statistics that replaces one of the sstable component files
when called, for example, from update_repaired_at by incremental repair.

Use a semaphore as a mutex to serialize those functions.
Note that there is no need for rwlock since the operations
are rare and read-only operations like snapshot don't
need to run in parallel.
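
A semaphore with a single unit behaves as a mutex. The sketch below shows the pattern with Python threads rather than Seastar fibers; the class layout and the `log` list are assumptions for illustration.

```python
import threading

class SSTable:
    def __init__(self):
        self._mutate_sem = threading.Semaphore(1)  # one unit: acts as a mutex
        self.log = []

    def _serialized(self, name, action):
        # Every operation that touches component files takes the only unit.
        with self._mutate_sem:
            self.log.append(f"{name}:begin")
            action()
            self.log.append(f"{name}:end")

    def change_state(self, action=lambda: None):
        self._serialized("change_state", action)

    def rewrite_statistics(self, action=lambda: None):
        self._serialized("rewrite_statistics", action)
```

Because both operations go through the same semaphore, a begin entry in `log` is always followed by the matching end entry.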

Fixes #25919

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 9e18cfbe17)

Closes scylladb/scylladb#27751
2025-12-21 19:22:26 +02:00
Ernest Zaslavsky
404965808c streaming:: add more logging
Start logging all missed streaming options like `scope`, `primary_replica` and `skip_reshape` flags

Fixes: https://github.com/scylladb/scylladb/issues/27299

Closes scylladb/scylladb#27311

(cherry picked from commit 1d5f60baac)

Closes scylladb/scylladb#27344
2025-12-21 17:49:10 +02:00
Łukasz Paszkowski
566dbd0b19 test_user_writes_rejection: Disable speculative retries
This test starts a 3-node cluster and creates a large blob file so that one
node reaches critical disk utilization, triggering write rejections on that
node. The test then writes data with CL=QUORUM and validates that the data:
- did not reach the critically utilized node
- did reach the remaining two nodes

By default, tables use speculative retries to determine when coordinators may
query additional replicas.

Since the validation uses CL=ONE, it is possible that an additional request
is sent to satisfy the consistency level. As a result:
- the first check may fail if the additional request is sent to a node that
  already contains data, making it appear as if data reached the critically
  utilized node
- the second check may fail if the additional request is sent to the critically
  utilized node, making it appear as if data did not reach the healthy node

The patch fixes the flakiness by disabling the speculative retries.

Fixes https://github.com/scylladb/scylladb/issues/27212

Closes scylladb/scylladb#27488

(cherry picked from commit 2cb9bb8f3a)

Closes scylladb/scylladb#27773
2025-12-21 17:48:32 +02:00
Avi Kivity
d507568eca Merge '[Backport 2025.4] db: repair: do not update repair_time if batchlog replay failed' from Scylladb[bot]
Currently, batchlog replay is considered successful even if all batches fail
to be sent (they are replayed later). However, repair requires all batches
to be sent successfully. Currently, if batchlog isn't cleared, the repair never
learns and updates the repair_time. If GC mode is set to "repair", this means
that the tombstones written before the repair_time (minus propagation_delay)
can be GC'd while not all batches were replayed.

Consider a scenario:
- Table t has a row with (pk=1, v=0);
- There is an entry in the batchlog that sets (pk=1, v=1) in table t;
- The row with pk=1 is deleted from table t;
- Table t is repaired:
    - batchlog replay fails;
    - repair_time is updated;
- propagation_delay seconds pass and the tombstone of pk=1 is GC'd;
- batchlog is replayed and (pk=1, v=1) inserted - data resurrection!

Do not update repair_time if sending any batch fails. The data is still repaired.
For tablet repair the repair runs, but at the end the exception is passed
to topology coordinator. Thanks to that the repair_time isn't updated.
The repair request isn't removed as well, due to which the repair will need
to rerun.
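
The scenario above can be replayed with a toy model. Everything in it is an assumption made for illustration: the timestamps, the `simulate` function, and the rule that a write is shadowed by any tombstone with an equal or newer timestamp.

```python
def simulate(update_repair_time_on_failed_replay: bool) -> dict:
    PROPAGATION_DELAY = 10
    table = {1: (0, 0)}     # pk -> (value, write timestamp)
    tombstones = {}         # pk -> deletion timestamp
    batchlog = [(1, 1, 1)]  # (pk, value, write timestamp), not yet replayed

    # The row is deleted at t=2.
    table.pop(1)
    tombstones[1] = 2

    # Repair runs at t=100 and its batchlog replay fails.
    repair_time = 100 if update_repair_time_on_failed_replay else None

    # Tombstones written before repair_time - PROPAGATION_DELAY are GC'd.
    if repair_time is not None:
        tombstones = {pk: ts for pk, ts in tombstones.items()
                      if ts >= repair_time - PROPAGATION_DELAY}

    # The batchlog is replayed at last; surviving tombstones shadow old writes.
    for pk, value, ts in batchlog:
        if tombstones.get(pk, -1) < ts:
            table[pk] = (value, ts)
    return {pk: value for pk, (value, _) in table.items()}
```

`simulate(True)` returns `{1: 1}`, the resurrected row, while `simulate(False)` returns an empty table because the tombstone is still there to shadow the replayed write.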

Apart from that, a batch is removed from the batchlog if its version is invalid
or unknown. The condition on which we consider a batch too fresh to replay
is updated to consider propagation_delay.

Fixes: https://github.com/scylladb/scylladb/issues/24415

Data resurrection fix; needs backport to all versions

- (cherry picked from commit 502b03dbc6)

- (cherry picked from commit 904183734f)

- (cherry picked from commit 7f20b66eff)

- (cherry picked from commit e1b2180092)

- (cherry picked from commit d436233209)

- (cherry picked from commit 1935268a87)

- (cherry picked from commit 6fc43f27d0)

Parent PR: #26319

Closes scylladb/scylladb#26766

* github.com:scylladb/scylladb:
  repair: throw if flush failed in get_flush_time
  db: fix indentation
  test: add reproducer for data resurrection
  repair: fail tablet repair if any batch wasn't sent successfully
  db/batchlog_manager: fix making decision to skip batch replay
  db: repair: throw if replay fails
  db/batchlog_manager: delete batch with incorrect or unknown version
  db/batchlog_manager: coroutinize replay_all_failed_batches
2025-12-21 14:14:13 +02:00
Radosław Cybulski
b104c80c8e Fix use-after-free in encode_paging_state in Alternator
Fix unlikely use-after-free in `encode_paging_state`. The function
incorrectly assumes that current position to encode will always have
data for all clustering columns the schema defines. It's possible to
encounter current position having less than all columns specified, for
example in case of a range tombstone. Those don't happen in Alternator
tables as DynamoDB doesn't allow range deletions and clustering key
might be of size at most 1. Alternator API can be used to read
scylla system tables and those do have range tombstones with more
than single clustering column.

The fix is to stop trying to encode columns, that don't have the value -
they are not needed anyway, as there's no possible position with those
values (range tombstone made sure of that).
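
The shape of the fix, sketched in Python instead of the C++ of the real function. `encode_position` and its arguments are hypothetical; the point is only that encoding stops at the last clustering column that has a value.

```python
def encode_position(clustering_columns, position_values):
    """Encode only the clustering columns that the position has values for."""
    # zip() stops at the shorter sequence, so trailing columns without a
    # value are skipped instead of being read past the end of the position.
    return dict(zip(clustering_columns, position_values))
```

For a schema with columns `c1, c2` and a range tombstone position holding only `c1`, the result contains just `c1`.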

Fixes #27001
Fixes #27125

Closes scylladb/scylladb#26960

(cherry picked from commit b54a9f4613)

Closes scylladb/scylladb#27347
2025-12-21 14:12:23 +02:00
Michael Litvak
14886c56fa test: fix test flakiness in test_colocated_tables_gc_mode
The test executes a LWT query in order to create a paxos state table and
verify the table properties. However, after executing the LWT query, the
table may not exist on all nodes but only on a quorum of nodes, thus
checking the properties of the table may fail if the table doesn't exist
on the queried node.

To fix that, execute a group0 read barrier to ensure the table is
created on all nodes.

Fixes scylladb/scylladb#27398

Closes scylladb/scylladb#27401

(cherry picked from commit 9213a163cb)

Closes scylladb/scylladb#27411
2025-12-19 17:32:12 +01:00
Michael Litvak
fbca8a7644 docs: document restrictions of colocated tables
Currently some things are not supported for colocated tables: it's not
possible to repair a colocated table, and due to this it's also not
possible to use the tombstone_gc=repair mode on a colocated table.

Extend the documentation to explain what colocated tables are and
document these restrictions.

Fixes scylladb/scylladb#27261

Closes scylladb/scylladb#27516

(cherry picked from commit 33f7bc28da)

Closes scylladb/scylladb#27772
2025-12-19 12:26:44 +01:00
Emil Maskovsky
6d81dc8ba8 topology_coordinator: handle seastar::abort_requested_exception alongside raft::request_aborted
In several exception handlers, only raft::request_aborted was being
caught and rethrown, while seastar::abort_requested_exception was
falling through to the generic catch(...) block. This caused the
exception to be incorrectly treated as a failure that triggers
rollback, instead of being recognized as an abort signal.

For example, during tablet draining, the error log showed:
"tablets draining failed with seastar::abort_requested_exception
(abort requested). Aborting the topology operation"

This change adds seastar::abort_requested_exception handling
alongside raft::request_aborted in all places where it was missing.
When rethrown, these exceptions propagate up to the main run() loop
where handle_topology_coordinator_error() recognizes them as normal
abort signals and allows the coordinator to exit gracefully without
triggering unnecessary rollback operations.
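
The handler pattern, sketched as a Python analogy of the C++ catch blocks. The two exception classes stand in for raft::request_aborted and seastar::abort_requested_exception, and `run_step` is a hypothetical name.

```python
class RequestAborted(Exception):
    """Stand-in for raft::request_aborted."""

class AbortRequested(Exception):
    """Stand-in for seastar::abort_requested_exception."""

def run_step(step):
    try:
        step()
    except (RequestAborted, AbortRequested):
        raise  # abort signals propagate to the main loop untouched
    except Exception:
        return "rollback"  # only real failures trigger rollback
    return "done"
```

Before the change only the first class was in the rethrow clause, so the second one fell into the generic handler and caused a rollback.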

Fixes: scylladb/scylladb#27255

(cherry picked from commit 37e3dacf33)

Closes scylladb/scylladb#27663
2025-12-19 11:50:02 +01:00
Patryk Jędrzejczak
1cadf057ce Merge '[Backport 2025.4] Make direct failure detector verb handler more efficient' from Scylladb[bot]
We saw that in large clusters the direct failure detector may cause large task queues to accumulate. The series addresses this issue and also moves the code into the correct scheduling group.

Fixes https://github.com/scylladb/scylladb/issues/27142

Backport to all versions where 60f1053087 was backported to since it should improve performance in large clusters.

- (cherry picked from commit 82f80478b8)

- (cherry picked from commit 6a6bbbf1a6)

- (cherry picked from commit 86dde50c0d)

Parent PR: #27387

Closes scylladb/scylladb#27483

* https://github.com/scylladb/scylladb:
  direct_failure_detector: run direct failure detector in the gossiper scheduling group
  raft: drop invoke_on from the pinger verb handler
  direct_failure_detector: pass timeout to direct_fd_ping verb
2025-12-19 11:13:11 +01:00
Amnon Heiman
cbf6250021 scylla-node-exporter: Add ethtool to node exporter
AWS suggests following multiple network performance metrics:
https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/monitoring-network-performance-ena.html#network-performance-metrics

This patch enables the ethtool collector with the specific list of
metrics

After this patch the relevant metrics look like:

$ curl http://localhost:9100/metrics |& grep ethtool
node_ethtool_bw_in_allowance_exceeded{device="ens5"} 0
node_ethtool_bw_out_allowance_exceeded{device="ens5"} 0
node_ethtool_conntrack_allowance_available{device="ens5"} 51303
node_ethtool_conntrack_allowance_exceeded{device="ens5"} 0
node_ethtool_info{bus_info="0000:00:05.0",device="ens5",driver="ena",expansion_rom_version="",firmware_version="",version="6.14.0-1015-aws"} 1
node_ethtool_linklocal_allowance_exceeded{device="ens5"} 0
node_scrape_collector_duration_seconds{collector="ethtool"} 0.001091436
node_scrape_collector_success{collector="ethtool"} 1

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes scylladb/scylladb#27358

(cherry picked from commit a213e41250)

Closes scylladb/scylladb#27508
2025-12-19 09:15:51 +02:00
Pawel Pery
0d7a32cece unittest: fix vector_store_client_test_dns_refresh_aborted hangs
The root cause of the hanging test is a concurrency deadlock.
`vector_store_client` runs a dns refresh fiber, which waits for the condition
variable. After aborting the dns request, the test signals the condition variable.
Stopping the vector_store_client takes long enough to trigger the next dns
refresh, and this time the condition variable won't be signalled, so
vector_store_client will wait forever for the dns refresh fiber to finish.

The commit fixes the problem by waiting for the condition variable only once.

Fixes: #27237
Fixes: VECTOR-370

Closes scylladb/scylladb#27239

(cherry picked from commit b5c85d08bb)

Closes scylladb/scylladb#27393
2025-12-19 09:14:49 +02:00
Ernest Zaslavsky
4b81530e8a s3_client: handle additional transient network errors
Add handling for a broader set of transient network-related `std::errc` values in `aws_error::from_system_error`. Treat these conditions as retryable when the client re-creates the socket for each request.

Fixes: https://github.com/scylladb/scylladb/issues/27349

Closes scylladb/scylladb#27350

(cherry picked from commit 605f71d074)

Closes scylladb/scylladb#27392
2025-12-19 09:14:00 +02:00
Michael Litvak
34ede10db9 tablet: scheduler: Do not emit conflicting migration in merge colocation
The tablet scheduler should not emit conflicting migrations for the same
tablet. This was addressed initially in scylladb/scylladb#26038 but the
check is missing in the merge colocation plan, so add it there as well.

Without this check, the merge colocation plan could generate a
conflicting migration for a tablet that is already scheduled for
migration, as the test demonstrates.

This can cause correctness problems, because if the load balancer
generates two migrations for a single tablet, both will be written as
mutations, and the resulting mutation could contain mixed cells from
both migrations.

Fixes scylladb/scylladb#27304

Closes scylladb/scylladb#27312

(cherry picked from commit 97b7c03709)

Closes scylladb/scylladb#27331
2025-12-19 09:13:29 +02:00
Amnon Heiman
2e0c41b32b vector_index: require tablets for vector indexes
This patch enforces that vector indexes can only be created on keyspaces
that use tablets. During index validation, `check_uses_tablets()` verifies
the base keyspace configuration and rejects creation otherwise.

To support this, the `custom_index::validate()` API now receives a
`const data_dictionary::database&` parameter, allowing index
implementations to access keyspace-level settings during DDL validation.

Fixes https://scylladb.atlassian.net/browse/VECTOR-322

Closes scylladb/scylladb#26786

(cherry picked from commit 68c7236acb)

Closes scylladb/scylladb#27272
2025-12-19 09:12:18 +02:00
Amnon Heiman
9ad7bd8070 index/vector_index.cc: Don't allow zero as an index option
This patch requires vector_index option values to be strictly positive numbers,
as zero would make no sense.

Fixes https://scylladb.atlassian.net/browse/VECTOR-249

Signed-off-by: Amnon Heiman <amnon@scylladb.com>

Closes scylladb/scylladb#27191

(cherry picked from commit b2c2a99741)

Closes scylladb/scylladb#27234
2025-12-19 09:11:34 +02:00
Jenkins Promoter
7a365e0973 Update ScyllaDB version to: 2025.4.1 2025-12-18 18:36:06 +02:00
Jenkins Promoter
fe8b2f1092 Update ScyllaDB version to: 2025.4.0 2025-12-17 09:46:48 +02:00
Aleksandra Martyniuk
03ccf3915e repair: throw if flush failed in get_flush_time
Currently, _flush_time is stored as a std::optional<gc_clock::time_point>,
where std::nullopt indicates that the flush was needed but failed. This is confusing
for the caller and does not work as expected, since _flush_time is initialized
with a value (not optional).

Change _flush_time type to gc_clock::time_point. If a flush is needed but failed,
get_flush_time() throws an exception.

This was supposed to be a part of https://github.com/scylladb/scylladb/pull/26319
but it was mistakenly overwritten during rebases.

Refs: https://github.com/scylladb/scylladb/issues/24415.

Closes scylladb/scylladb#26794

(cherry picked from commit e3e81a9a7a)
2025-12-16 15:18:32 +01:00
Aleksandra Martyniuk
71cec8bff3 db: fix indentation
(cherry picked from commit 6fc43f27d0)
2025-12-16 15:18:20 +01:00
Aleksandra Martyniuk
a973209b32 test: add reproducer for data resurrection
Add a reproducer to check that the repair_time isn't updated
if the batchlog replay fails.

If repair_time was updated, tombstones could be GC'd before the
batchlog is replayed. The replay could later cause the data
resurrection.

(cherry picked from commit 1935268a87)
2025-12-16 15:11:05 +01:00
Aleksandra Martyniuk
9ad108acbf repair: fail tablet repair if any batch wasn't sent successfully
If any batch replay failed, we cannot update repair_time as we risk the
data resurrection.

If replay of any batch needs to be retried, run the whole repair but
fail at the very end, so that the repair_time for it won't be updated.

(cherry picked from commit d436233209)
2025-12-16 15:11:05 +01:00
Aleksandra Martyniuk
be05e1b8e1 db/batchlog_manager: fix making decision to skip batch replay
Currently, we skip batch replay if less than batch_log_timeout has passed
since the batch was written. The batch_log_timeout value can
be configured. If it is large, the batch won't be replayed for a long time.
If the tombstone is GC'd before the batch is replayed, we
risk data resurrection.

To ensure safety we can skip only the batches that won't be GC'd.
In this patch we skip replay of the batches for which:
    now() < written_at + min(timeout, propagation_delay)

repair_time is set as a start of batchlog replay, so at the moment
of the check we will have:
    repair_time <= now()

So we know that:
    repair_time < written_at + propagation_delay

With this condition we are sure that GC won't happen.
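
The skip condition can be sketched as a small predicate. This is only an
illustration, not the batchlog_manager code: the name `may_skip_replay` is
hypothetical, times are plain `std::chrono::seconds` counted from an arbitrary
epoch, and the skip window is read as the smaller of `timeout` and
`propagation_delay`, which is what makes the bound
`repair_time < written_at + propagation_delay` hold.

```c++
#include <algorithm>
#include <cassert>
#include <chrono>

// Hypothetical helper: returns true when a batch is still fresh enough that
// skipping its replay cannot let a tombstone be GC'd before the batch lands.
bool may_skip_replay(std::chrono::seconds now,
                     std::chrono::seconds written_at,
                     std::chrono::seconds timeout,
                     std::chrono::seconds propagation_delay) {
    assert(timeout.count() >= 0 && propagation_delay.count() >= 0);
    // The window never exceeds propagation_delay, however large timeout is.
    return now < written_at + std::min(timeout, propagation_delay);
}
```

With a very large `timeout`, the window is capped by `propagation_delay`, so an
old batch is no longer skipped indefinitely.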

(cherry picked from commit e1b2180092)
2025-12-16 15:11:05 +01:00
Aleksandra Martyniuk
767d8793b6 db: repair: throw if replay fails
Return a flag determining whether all the batches were sent successfully in
batchlog_manager::replay_all_failed_batches (batches skipped due to being
too fresh are not counted). Throw in repair_flush_hints_batchlog_handler
if not all batches were replayed, to ensure that repair_time isn't updated.

(cherry picked from commit 7f20b66eff)
2025-12-16 15:11:05 +01:00
Aleksandra Martyniuk
34de8387a4 db/batchlog_manager: delete batch with incorrect or unknown version
batchlog_manager::replay_all_failed_batches skips batches that have
unknown or incorrect version. Next round will process these batches
again.

Such batches will probably be skipped every time, so there is no point
in keeping them. Even if at some point the version becomes correct,
we should not replay the batch - it might be old and this may lead
to data resurrection.

(cherry picked from commit 904183734f)
2025-12-16 15:11:05 +01:00
Aleksandra Martyniuk
3dd028f881 db/batchlog_manager: coroutinize replay_all_failed_batches
(cherry picked from commit 502b03dbc6)
2025-12-16 15:11:04 +01:00
Nadav Har'El
55b78e56a9 test/cqlpy: fix flaky test test_view_in_system_tables
The cqlpy test test_materialized_view.py::test_view_in_system_tables
checks that the system table "system.built_views" can inform us that
a view has been built. This test was flaky, starting to fail quite
often recently, and this patch fixes the problem in the test.

For historical reasons, this test began by calling a utility function
wait_for_view_built() - which uses a different system table,
system_distributed.view_build_status, to wait until the view was built.
The test then immediately tries to verify that also system.built_views
lists this view.

But there is no real reason why we could assume - or want to assume -
that these two tables are updated in this order, or how much time
passed between the two tables being changed. The authors of this
test already acknowledged there is a problem - they included a hack
purporting to be a "read barrier" that claimed to solve this exact
problem - but it seems it doesn't, or at least no longer does after
recent changes to the view builder's implementation.

The solution is simple - just remove the call to wait_for_view_built()
and the "hack" after it. We should just wait in a loop (until a timeout)
for the system table that we really wanted to check - system.built_views.
It's as simple as that. No need for any other assumptions or hacks.

Fixes #27296

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27626

(cherry picked from commit ccacea621f)

Closes scylladb/scylladb#27670
2025-12-16 15:45:59 +02:00
Calle Wilund
02908ca0f9 test::cluster::dtest::tools::files: Remove file
This contained only one routine, `corrupt_file`, which is
highly problematic and not used. If you want to "corrupt" a
file, it should be done in a controlled way, not at random.

(cherry picked from commit 8c4ac457af)
2025-12-16 09:27:27 +00:00
Calle Wilund
7266be67fb commitlog_replay: Handle fully corrupt files same as partial corruption.
Fixes #26744

If a segment to replay is broken such that the main header is not zero,
but still broken, we throw header_checksum_error. This was not handled in
replayer, which grouped this into the "user error/fundamental problem"
category. However, assuming we allow for "real" disk corruption, this should
really be treated same as data corruption, i.e. reported data loss, not
failure to start up.

The `test_one_big_mutation_corrupted_on_startup` test sometimes accidentally
provoked this issue by wrecking files at random, which on rare occasions
produced this error, so the test failed because scylla did not start up, instead
of losing data as expected.

Changed test to consistently cause this exact error instead.

(cherry picked from commit e48170ca8e)
2025-12-16 09:27:26 +00:00
Calle Wilund
04862a600d test::pylib::suite::base: Split options.name test specifier only once
For some arcane reason, we split the optional test pattern given to
test.py twice across '::' to get the file + case specifiers later given
to pytest etc. This means that for a test with a class group (such as some
migrated dtests), we cannot really specify the exact test to run
(pattern <file>::<class>::test).

Simply splitting only on first '::' fixes this. Should not affect any
other tests.
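
The difference between splitting on every '::' and splitting only on the first
one can be sketched as follows. The test runner itself is Python; this C++
helper, `split_once`, is a hypothetical stand-in used purely to illustrate the
behavior, and the sample pattern is made up.

```c++
#include <cassert>
#include <string>
#include <utility>

// Splits `s` at the first occurrence of `sep` only.
// Returns {s, ""} when `sep` does not occur in `s`.
std::pair<std::string, std::string> split_once(const std::string& s, const std::string& sep) {
    assert(!sep.empty());
    const auto pos = s.find(sep);
    if (pos == std::string::npos) {
        return {s, ""};
    }
    // Everything after the first separator stays intact, including later "::".
    return {s.substr(0, pos), s.substr(pos + sep.size())};
}
```

For a pattern such as `test_file::TestClass::test_case`, the second element
keeps `TestClass::test_case` whole, so the class-qualified case name survives.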

(cherry picked from commit 9b5f3d12a3)
2025-12-16 09:27:26 +00:00
Michael Litvak
4b26a86cb0 alternator: require rf_rack_valid_keyspaces when creating index
When creating an alternator table with tablets, if it has an index, LSI
or GSI, require the config option rf_rack_valid_keyspaces to be enabled.

The option is required for materialized views in tablets keyspaces to
function properly and avoid consistency issues that could happen due to
cross-rack migrations and pairing switches when RF-rack validity is not
enforced.

Currently the option is validated when creating a materialized view via
the CQL interface, but it's missing from the alternator interface. Since
alternator indexes are based on materialized views, the same check
should be added there as well.

Fixes scylladb/scylladb#27612

Closes scylladb/scylladb#27622

(cherry picked from commit b9ec1180f5)

Closes scylladb/scylladb#27671
2025-12-16 10:13:31 +02:00
Jenkins Promoter
344f648703 Update pgo profiles - aarch64 2025-12-15 10:31:11 +02:00
Yaron Kaikov
b70794e6ed auto-backport.py: modify instruction for making PR ready for review
Update the comment sent when a PR has conflicts with clear instructions on how to make the PR ready for review

Fixes: https://scylladb.atlassian.net/browse/RELENG-152

Closes scylladb/scylladb#27547

(cherry picked from commit d3e199984e)

Closes scylladb/scylladb#27565
2025-12-15 09:59:44 +02:00
Yaron Kaikov
3c7ff856c3 workflows: trigger CI automatically when conflicts label is removed
Add pull_request_target event with unlabeled type to trigger-scylla-ci
workflow. This allows automatic CI triggering when the 'conflicts' label
is removed from a PR, in addition to the existing manual trigger via
comment.

The workflow now runs when:
- A user posts a comment with '@scylladbbot trigger-ci' (existing)
- The 'conflicts' label is removed from a PR (new)

Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-84

Closes scylladb/scylladb#27521

(cherry picked from commit f7ffa395a8)

Closes scylladb/scylladb#27602
2025-12-15 09:58:52 +02:00
Jenkins Promoter
99e9f4b07c Update pgo profiles - x86_64 2025-12-15 09:42:11 +02:00
Botond Dénes
ffc6953850 Merge '[Backport 2025.4] api: storage_service/tablets/repair: disable incremental repair by default' from Scylladb[bot]
Change the default incremental_mode to `disabled` due to https://github.com/scylladb/scylladb/issues/26041 and https://github.com/scylladb/scylladb/issues/27414

** Backport to 2025.4 where 611918056a was introduced **

- (cherry picked from commit 5fae4cdf80)

- (cherry picked from commit c8cff94a5a)

Parent PR: #27530

Closes scylladb/scylladb#27595

* github.com:scylladb/scylladb:
  api: storage_service/tablets/repair: disable incremental repair by default
  docs: nodetool-commands: cluster: repair: fix incremental-mode example
2025-12-15 08:47:49 +02:00
Yaron Kaikov
434956af0f Add JIRA issue validation to backport PR fixes check
Extend the Fixes validation pattern to also accept JIRA issue references
(format: [A-Z]+-\d+) in addition to GitHub issue references. This allows
backport PRs to reference JIRA issues in the format 'Fixes: PROJECT-123'.
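
A minimal sketch of the JIRA key format check, assuming the whole reference
must match `[A-Z]+-\d+`. The function name `is_jira_key` is hypothetical and
the real check lives in the backport tooling rather than in C++; only the
pattern itself comes from the commit message.

```c++
#include <cassert>
#include <regex>
#include <string>

// Returns true when `ref` looks like a JIRA issue key, e.g. "PROJECT-123":
// one or more uppercase letters, a dash, then one or more digits.
bool is_jira_key(const std::string& ref) {
    static const std::regex jira_key(R"([A-Z]+-\d+)");
    // regex_match requires the entire string to match, not just a substring.
    return std::regex_match(ref, jira_key);
}
```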

Fixes: https://github.com/scylladb/scylladb/issues/27571

Closes scylladb/scylladb#27572

(cherry picked from commit 3dfa5ebd7f)

Closes scylladb/scylladb#27601
2025-12-12 09:34:45 +02:00
Benny Halevy
2f4f3ff980 api: storage_service/tablets/repair: disable incremental repair by default
Change the default incremental_mode to `disabled` due to
https://github.com/scylladb/scylladb/issues/26041 and
https://github.com/scylladb/scylladb/issues/27414

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit c8cff94a5a)
2025-12-11 23:49:07 +00:00
Benny Halevy
20134b9ade docs: nodetool-commands: cluster: repair: fix incremental-mode example
There is no 'regular' incremental mode anymore.
The example seems to have meant 'disabled'.

Fixes #27587

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
(cherry picked from commit 5fae4cdf80)
2025-12-11 23:49:07 +00:00
Dario Mirovic
57b90034c6 test: cqlpy: test_protocol_exceptions.py: enable debug exception logging
Enable debug logging for "exception" logger inside protocol exception tests.
The exceptions will be logged, and it will be possible to see which ones
occurred if a protocol exceptions test fails.

Refs #27272
Refs #27325

(cherry picked from commit c30b326033)
2025-12-10 14:21:24 +00:00
Dario Mirovic
d8cc029a5e test: cqlpy: test_protocol_exceptions.py: increase cpp exceptions threshold
The initial problem:

Some of the tests in test_protocol_exceptions.py started failing. The failure is
on the condition that no more than `cpp_exception_threshold` happened.

Test logic:

These tests assert that specific code paths do not throw an exception anymore.
Initial implementation ran a code path once, and asserted there were 0 exceptions.
Sometimes an exception or several can occur, not directly related to the code paths
the tests check, but those would fail the tests.

The solution was to run the tests multiple times. If there is a regression, there
would be at least as many exceptions thrown as there are test runs. If there is no
regression, a few exceptions might happen, up to 10 per 100 test runs.
I have arbitrarily chosen `run_count = 100` and `cpp_exception_threshold = 10` values.

Note that the exceptions are counted per shard, not per code path.

The new problem:

The occasional exceptions thrown by some parts of the server are now a bit more
numerous than before. Based on the logs linked on the issues, the count is usually 12.

There are possibly multiple ways to resolve the issue. I have considered logging
exceptions and parsing them. I would have to filter exception logs only for wanted
exceptions. However, if a new, different exception is introduced, it might not be
counted.

Another approach is to just increase the threshold a bit. The issue of throwing
more exceptions than before in some other server modules should be addressed by
a set of tests for that module, just like these tests check protocol exceptions,
not caring who used protocol check code paths.

For those reasons, the solution implemented here is to increase `cpp_exception_threshold`
to `20`. It will not make the tests unreliable, because, as mentioned, if there is a
regression, there would be at least `run_count` exceptions per `run_count` test runs
(1 exception per single test run).

Still, to make "background exceptions" occurrence a bit more normalized, `run_count` too
is doubled, from `100` to `200`. At first glance this looks like nothing has changed,
but actually doubling both run count and exception threshold here implies that the
exception burst does not scale as much as run count, it is just that the "jitter" is
bigger than the old threshold.

Fixes #27247
Fixes #27325

(cherry picked from commit 807fc68dc5)
2025-12-10 14:21:23 +00:00
Petr Gusev
14fea2c5ba alternator/executor.cc: eliminate redundant dk copy
A small refactoring/optimization.

(cherry picked from commit 608eee0357)
2025-12-10 11:49:04 +01:00
Petr Gusev
e5223415e4 alternator/executor.cc: release cas_shard on the original shard
Before this series, we kept the cas_shard on the original shard to
guard against tablet movements running in parallel with
storage_proxy::cas.

The bug addressed by this PR shows that this approach is flawed:
keeping the cas_shard on the original shard does not guarantee that
a new cas_shard acquired on the target shard won’t require another
jump.

We fixed this in the previous commit by checking cas_shard.this_shard()
on the target shard and continuing to jump to another shard if
necessary. Once cas_shard.this_shard() on the target shard returns
true, the storage_proxy::cas invariants are satisfied, and no other
cas_shard instances need to remain alive except the one passed
into storage_proxy::cas.

(cherry picked from commit 0bcc2977bb)
2025-12-10 11:49:04 +01:00
Petr Gusev
8cf9c3eb8d alternator/executor.cc: move shard check into cas_write
This change ensures that if cas_shard points to a different shard,
the executor will continue issuing shard jumps until
cas_shard.this_shard() returns true. The commit simply moves the
this_shard() check from the parallel_for_each lambda into cas_write,
with minimal functional changes.

We enable test_alternator_invalid_shard_for_lwt since now it should
pass.

Fixes scylladb/scylladb#27353

(cherry picked from commit 3a865fe991)
2025-12-10 11:49:04 +01:00
Petr Gusev
5ceabe90fc alternator/executor.cc: make cas_write a private method
We will need to access the executor::_stats field from cas_write. We could
pass it as a parameter, but it seems simpler to just make cas_write
an instance method too.

(cherry picked from commit c6eec4eeef)
2025-12-10 11:49:04 +01:00
Petr Gusev
c90578a1fb alternator/executor.cc: make do_batch_write a private method
We will need to access executor::_stats field on other shards.

(cherry picked from commit 9bef142328)
2025-12-10 11:44:52 +01:00
Petr Gusev
bdc43c62b1 alternator/executor.cc: fix indent
(cherry picked from commit 74bf24a4a7)
2025-12-10 11:43:02 +01:00
Petr Gusev
37c575c104 test_alternator: add test_alternator_invalid_shard_for_lwt
This test reproduces scylladb/scylladb#27353 using two injection
points. First, the test triggers an intra-node tablet migration and
suspends it at the streaming stage using the
intranode_migration_streaming_wait injection. Next, it enables the
alternator_executor_batch_write_wait injection, which suspends a
batch write after its cas_shard has already been created.
The test then issues several batch writes and waits until one of them
hits this injection on the destination shard. At this point, the
cas_shard.erm for that write is still in the streaming state,
meaning the executor would need to jump back to the source shard.
The test then resumes the suspended tablet migration, allowing it to
update the ERM on the source shard to write_both_read_new. After that,
the test releases the suspended batch write and expects it to perform
two shard jumps: first from the destination to the source shard, and
then again back to the destination shard.

This commit adds the alternator_executor_batch_write_wait injection to
alternator/executor.cc. Coroutines are intentionally avoided in the
parallel_for_each lambda to prevent unnecessary coroutine-frame
allocations.

(cherry picked from commit e60bcd0011)
2025-12-10 11:42:54 +01:00
Petr Gusev
b88ed6156b alternator/executor.cc: avoid cross-shard free
This commit is an optimization: avoiding destruction of
foreign objects on the wrong shard. Releasing objects allocated on a
different shard causes their ::free calls to be executed remotely,
which adds unnecessary load to the SMP subsystem.

Before this patch, a std::vector could be moved
to another shard. When the vector was eventually destroyed,
its ::free had to be marshalled back to the shard where the memory had
originally been allocated. This change avoids that overhead by passing
the vector by const reference instead.

The referenced objects lifetime correctness reasoning:
* the put_or_delete_item refs usages in put_or_delete_item_cas_request
are bound to its lifetime
* cas_request lifetime is bound to storage_proxy::cas future
* we don't release put_or_delete_item-s until all storage_proxy::cas
calls are done.

(cherry picked from commit f00f7976c1)
2025-12-10 11:41:55 +01:00
Anna Stuchlik
cc885a3f35 replace the Driver pages with a link to the new Drivers pages
This commit removes the now redundant driver pages from
the ScyllaDB documentation. Instead, a link to the pages
where we moved the driver information is added.
Also, the links are updated across the ScyllaDB manual.

Redirections are added for all the removed pages.

Fixes https://github.com/scylladb/scylladb/issues/26871

Closes scylladb/scylladb#27277

(cherry picked from commit c5580399a8)

Closes scylladb/scylladb#27442
2025-12-10 09:18:24 +01:00
Jenkins Promoter
582b9f83db Update ScyllaDB version to: 2025.4.0-rc7 2025-12-09 17:21:23 +02:00
Gleb Natapov
726c1f5734 direct_failure_detector: run direct failure detector in the gossiper scheduling group
When the direct failure detector was introduced, the idea was that it would
run on the same connection that raft group0 verbs run on, but in
60f1053087 raft verbs were moved to run on the gossiper connection
while DIRECT_FD_PING was left where it was. This patch moves it to the
gossiper connection as well and fixes the pinger code to run in the gossiper
scheduling group.

(cherry picked from commit 86dde50c0d)
2025-12-09 17:19:31 +02:00
Anna Stuchlik
d2c24bf42d doc: update the upgrade policy to cover non-consecutive minor upgrades
Fixes https://github.com/scylladb/scylladb/issues/27308

Closes scylladb/scylladb#27319

(cherry picked from commit a5c971d21c)

Closes scylladb/scylladb#27457
2025-12-09 11:38:00 +03:00
Anna Stuchlik
72ee6396d2 doc: add the upgrade guide from 2025.x to 2025.4
Fixes https://github.com/scylladb/scylladb/issues/26451

Fixes https://github.com/scylladb/scylladb/issues/26452

Closes scylladb/scylladb#27310

(cherry picked from commit 48cf84064c)

Closes scylladb/scylladb#27407
2025-12-09 11:37:27 +03:00
Jenkins Promoter
f64b5d375e Update ScyllaDB version to: 2025.4.0-rc6 2025-12-08 13:37:52 +02:00
Gleb Natapov
6427044681 raft: drop invoke_on from the pinger verb handler
Currently raft direct pinger verb jumps to shard 0 to check if group0 is
alive before replying. The verb runs relatively often, so it is not very
efficient. The patch distributes group0 liveness information (as it
changes) to all shards instead, so that the handler itself does not need
to jump to shard 0.

(cherry picked from commit 6a6bbbf1a6)
2025-12-07 14:57:50 +00:00
Gleb Natapov
000ac0227e direct_failure_detector: pass timeout to direct_fd_ping verb
Currently direct_fd_ping runs without a timeout. The caller does not
wait forever, because the wait is canceled after a timeout, but this timeout
is simply not passed to the rpc. This may create a situation where the
rpc callback runs on the destination but is no longer waited on.
Change the code to pass the timeout to the rpc as well and return early from
the rpc handler if the timeout has been reached by the time the callback is
called. This is backwards compatible since the timeout is passed as
optional.

(cherry picked from commit 82f80478b8)
2025-12-07 14:57:50 +00:00
Piotr Dulikowski
3040e7aedf index: allow vector indexes without rf_rack_valid_keyspces
The rf_rack_valid_keyspaces option needs to be turned on in order to
allow creating materialized views in tablet keyspaces with numeric RF
per DC. This is also necessary for secondary indexes because they use
materialized views underneath. However, this option is _not_ necessary
for vector store indexes because those use the external vector store
service for querying the list of keys to fetch from the main table, they
do not create a materialized view. The rf_rack_valid_keyspaces was, by
accident, required for vector indexes, too.

Remove the restriction for vector store indexes as it is completely
unnecessary.

Fixes: SCYLLADB-81

Closes scylladb/scylladb#27447

(cherry picked from commit bb6e41f97a)

Closes scylladb/scylladb#27455
2025-12-05 20:13:02 +01:00
Karol Nowacki
71c47b8d18 vector_search: Fix requests hanging on unreachable nodes
When a vector store node becomes unreachable, a client request sent
before the keep-alive timer fires would hang until the CQL query
timeout was reached.

This occurred because the HTTP request writes to the TCP buffer and then
waits for a response. While data is in the buffer, TCP retransmissions
prevent the keep-alive timer from detecting the dead connection.

This patch resolves the issue by setting the `TCP_USER_TIMEOUT` socket
option, which applies an effective timeout to TCP retransmissions,
allowing the connection to fail faster.
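
Setting the option on a socket can be sketched with the plain Linux sockets
API. This is an illustration only: the helper name `set_tcp_user_timeout` is
hypothetical, the actual patch goes through the project's own networking
layer, and `TCP_USER_TIMEOUT` is Linux-specific.

```c++
#include <cassert>
#include <netinet/in.h>
#include <netinet/tcp.h>
#include <sys/socket.h>

// Limits how long transmitted data may stay unacknowledged before the kernel
// closes the connection. The value is in milliseconds.
// Returns true on success, false if setsockopt fails (errno is set).
bool set_tcp_user_timeout(int fd, unsigned int timeout_ms) {
    return ::setsockopt(fd, IPPROTO_TCP, TCP_USER_TIMEOUT,
                        &timeout_ms, sizeof(timeout_ms)) == 0;
}
```

Because the timeout covers unacknowledged data, it bounds the retransmission
phase that otherwise keeps the keep-alive timer from noticing a dead peer.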

Closes scylladb/scylladb#27388

(cherry picked from commit a54bf50290)

Closes scylladb/scylladb#27423
2025-12-04 19:54:08 +01:00
Avi Kivity
4db6d3e924 database: fix overflow when computing data distribution over shards
We store the per-shard chunk count in a uint64_t vector
global_offset, and then convert the counts to offsets with
a prefix sum:

```c++
        // [1, 2, 3, 0] --> [0, 1, 3, 6]
        std::exclusive_scan(global_offset.begin(), global_offset.end(), global_offset.begin(), 0, std::plus());
```

However, std::exclusive_scan takes the accumulator type from the
initial value, 0, which is an int, instead of from the range being
iterated, which is of uint64_t.

As a result, the prefix sum is computed as a 32-bit integer value. If
it exceeds 0x8000'0000, it becomes negative. It is then extended to
64 bits and stored. The result is a huge 64-bit number. Later on
we try to find an sstable with this chunk and fail, crashing on
an assertion.

An example of the failure can be seen here: https://godbolt.org/z/6M8aEbo57

The fix is simple: the initial value is passed as uint64_t instead of int.
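
A standalone sketch of the corrected prefix sum, assuming nothing beyond the
standard library. The function name `counts_to_offsets` is hypothetical; the
only point is that the initial value is a `uint64_t`, so the accumulator is
64 bits wide.

```c++
#include <cassert>
#include <cstdint>
#include <functional>
#include <numeric>
#include <vector>

// Converts per-shard chunk counts into starting offsets (exclusive prefix sum),
// e.g. [1, 2, 3, 0] --> [0, 1, 3, 6].
std::vector<uint64_t> counts_to_offsets(std::vector<uint64_t> counts) {
    // uint64_t{0} makes the accumulator 64-bit; a plain 0 would make it int.
    std::exclusive_scan(counts.begin(), counts.end(), counts.begin(),
                        uint64_t{0}, std::plus<>());
    return counts;
}
```

With two counts of 0x8000'0000 each, the third offset is 0x1'0000'0000, which
no longer wraps through a 32-bit accumulator.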

Fixes https://github.com/scylladb/scylladb/issues/27417

Closes scylladb/scylladb#27418

(cherry picked from commit 9696ee64d0)
2025-12-04 20:17:19 +02:00
Łukasz Paszkowski
3a7f56c8ce topology_coordinator: Fix the indentation for the cleanup_target case
(cherry picked from commit 6163fedd2e)
2025-12-04 03:57:51 +00:00
Łukasz Paszkowski
b3ce1e8c86 topology_coordinator: Add barrier to cleanup_target
Consider the following scenario:
1. A table has RF=3 and writes use CL=QUORUM
2. One node is down
3. There is a pending tablet migration from the unavailable node
   that is reverted

During the revert, there can be a time window where the pending replica
being cleaned up still accepts writes. This leads to write failures,
as only two nodes (out of four) are able to acknowledge writes.

This patch fixes the issue by adding a barrier to the cleanup_target
tablet transition state, ensuring that the coordinator switches back to
the previous replica set before cleanup is triggered.

Fixes https://github.com/scylladb/scylladb/issues/26512

(cherry picked from commit 67f1c6d36c)
2025-12-04 03:57:50 +00:00
Łukasz Paszkowski
1ad0d9fafc test_node_failure_during_tablet_migration: Increase RF from 2 to 3
The patch prepares the test for additional write workload to be
executed in parallel with node failures. With the original RF=2,
QUORUM is also 2, which causes writes to fail during node outage.

To address it, the third rack with a single node is added and the
replication factor is increased to 3.
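
The arithmetic behind the RF change, as a small sketch. QUORUM is a strict
majority of replicas; the helper names `quorum` and `tolerated_failures` are
hypothetical and exist only for this illustration.

```c++
#include <cassert>

// QUORUM is a strict majority of the replicas: floor(rf / 2) + 1.
constexpr unsigned quorum(unsigned rf) {
    return rf / 2 + 1;
}

// Number of replicas that may be down while QUORUM writes still succeed.
constexpr unsigned tolerated_failures(unsigned rf) {
    return rf - quorum(rf);
}
```

With RF=2 the quorum is 2 and no replica may be down; with RF=3 the quorum is
still 2, so writes survive the loss of one node.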

(cherry picked from commit 669286b1d6)
2025-12-04 03:57:50 +00:00
Piotr Dulikowski
e4b1c1f38b db/view/view_building_coordinator: skip work if no view is built
Even though `view_building_coordinator::work_on_view_building` has
an `if` at the very beginning which checks whether the currently
processed base table is set, it only prints a message and continues
executing the rest of the function regardless of the result of the
check. However, some of the logic in the function assumes that the
currently processed base table field is set and tries to access the
value of the field. This can lead to the view building coordinator
accessing a disengaged optional, which is undefined behavior.

Fix the function by adding the clearly missing `co_await` to the check.
A regression test is added which checks that the view building state
observer - a different fiber which used to print a weird message due to
erroneous view building coordinator behavior - does not print a warning.

Fixes: scylladb/scylladb#27363

Closes scylladb/scylladb#27373

(cherry picked from commit 654ac9099b)

Closes scylladb/scylladb#27406
2025-12-03 17:12:17 +01:00
Piotr Dulikowski
2787ac6cba Merge '[Backport 2025.4] vector_search: Fix high availability during timeouts' from Scylladb[bot]
This PR introduces two key improvements to the robustness and resource management of vector search:

Proper Abort on CQL Timeout: Previously, when a CQL query involving a vector search timed out, the underlying ANN query to the vector store was not aborted and would continue to run. This has been fixed by ensuring the abort source is correctly signaled, terminating the ANN request when its parent CQL query expires and preventing unnecessary resource consumption.

Faster Failure Detection: The connection and keep-alive timeouts for vector store nodes were excessively long (2 and 11 minutes, respectively), causing significant delays in detecting and recovering from unreachable nodes. These timeouts are now aligned with the request_timeout_in_ms setting, allowing for much faster failure detection and improving high availability by failing over from unresponsive nodes more quickly.

Fixes: SCYLLADB-76

This issue affects the 2025.4 branch, where similar HA recovery delays have been observed.

- (cherry picked from commit b6afacfc1e)

- (cherry picked from commit 086c6992f5)

Parent PR: #27377

Closes scylladb/scylladb#27391

* github.com:scylladb/scylladb:
  vector_search: Fix ANN query abort on CQL timeout
  vector_search: Reduce connection and keep-alive timeouts
2025-12-03 07:20:11 +01:00
Karol Nowacki
26599e79f2 vector_search: Fix ANN query abort on CQL timeout
When a CQL vector search request timed out, the underlying ANN query was
not aborted and continued to run. This happened because the abort source
was not being signaled upon request expiration.
This commit ensures the ANN query is aborted when the CQL request times out,
preventing unnecessary resource consumption.
2025-12-02 16:58:55 +01:00
Karol Nowacki
d4c199a1ec vector_search: Reduce connection and keep-alive timeouts
The connection timeout was 2 minutes and the keep-alive
timeout was 11 minutes. If a vector store node became unreachable, these
long timeouts caused significant delays before the system could recover,
negatively impacting high availability.

This change aligns both timeouts with the `request_timeout`
configuration, which defaults to 10 seconds. This allows for much
faster failure detection and recovery, ensuring that unresponsive nodes
are failed over from more quickly.
2025-12-02 16:52:53 +01:00
Asias He
4e7202ee32 repair: Fix deadlock when topology coordinator steps down in the middle
Consider this:

1) n1 is the topology coordinator
2) n1 schedules and executes a tablet repair with session id s1 for a
tablet on n3 and n4.
3) n3 and n4 take the lock and store it in _rs._repair_compaction_locks[s1]
4) n1 steps down before it executes
locator::tablet_transition_stage::end_repair
5) n2 becomes the new topology coordinator
6) n2 runs locator::tablet_transition_stage::repair again
7) n3 and n4 try to take the lock again and hang since the lock is
already taken.

To avoid the deadlock, we can throw in step 7 so that n2 will
proceed to end_repair stage and release the lock. After that, the
scheduler could schedule the tablet repair request again.
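
The "throw instead of wait" idea can be sketched with a toy lock table keyed
by session id. Everything here is hypothetical (the class `repair_lock_table`,
string session ids, the exception type); the real code holds compaction locks
in `_rs._repair_compaction_locks`.

```c++
#include <cassert>
#include <set>
#include <stdexcept>
#include <string>

// Toy model: at most one lock per repair session id.
class repair_lock_table {
    std::set<std::string> _held;
public:
    // Throws instead of blocking when the session already holds the lock,
    // so a retried repair stage fails fast rather than hanging.
    void take(const std::string& session_id) {
        if (!_held.insert(session_id).second) {
            throw std::runtime_error("repair lock already held for session " + session_id);
        }
    }
    void release(const std::string& session_id) { _held.erase(session_id); }
    bool holds(const std::string& session_id) const { return _held.count(session_id) != 0; }
};
```

A second `take` for the same session raises an error that the caller can turn
into a failed stage, after which the lock is released and the repair can be
scheduled again.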

Fixes #26346

Closes scylladb/scylladb#27163

(cherry picked from commit da5cc13e97)

Closes scylladb/scylladb#27337
2025-12-01 13:06:02 +01:00
Jenkins Promoter
6b5d334be3 Update pgo profiles - aarch64 2025-12-01 04:47:40 +02:00
Jenkins Promoter
58f1597831 Update pgo profiles - x86_64 2025-12-01 03:56:10 +02:00
Anna Stuchlik
d9bfb8c607 doc: fix the info about object storage
This commit fixes the information about object storage:

- Object storage configuration is no longer marked as experimental.
- Redundant information has been removed from the description.
- Information related to object storage for SSTables has been removed
  as the feature is not working.

Fixes https://github.com/scylladb/scylladb/issues/26985

Closes scylladb/scylladb#26987

(cherry picked from commit 724dc1e582)

Closes scylladb/scylladb#27211
2025-11-28 12:38:08 +01:00
Patryk Jędrzejczak
1dab04666c Merge '[Backport 2025.4] doc: update Cloud Instance Recommendations for GCP' from Scylladb[bot]
This PR:
- Removes n1-highmem instances from Recommended Instances.
- Adds missing support for n2-highmem-96.
- Updates the reference to n2 instances in the Google Cloud docs (fixes a broken link to GCP).
- Adds the missing information about processors for n2-highmem-instance - Ice Lake and Cascade Lake (requested by CX).

Fixes https://github.com/scylladb/scylladb/issues/25946
Fixes https://github.com/scylladb/scylladb/issues/24223
Fixes https://github.com/scylladb/scylladb/issues/23976

No backport needed if this PR is merged before 2025.4 branching.

- (cherry picked from commit b18b052d26)

- (cherry picked from commit dab74471cc)

Parent PR: #26182

Closes scylladb/scylladb#27168

* https://github.com/scylladb/scylladb:
  doc: update information for n2-highmem instances
  doc: remove n1-highmem instances from Recommended Instances
2025-11-28 12:31:33 +01:00
Asias He
fc54aedd8f topology_coordinator: Send incremental repair rpc only when the feature is enabled
Otherwise, in a mixed cluster, the handle_tablet_resize_finalization
would fail because of the unknown rpc verb.

Fixes #26309

Closes scylladb/scylladb#27218

(cherry picked from commit ab4896dc70)

Closes scylladb/scylladb#27284
2025-11-27 18:42:14 +01:00
Patryk Jędrzejczak
6b3b05c10b Merge '[Backport 2025.4] fix notification about expiring erm held for to long' from Scylladb[bot]
Commit 6e4803a750 broke notification about expired erms held for too long since it resets the tracker without calling its destructor (where notification is triggered). Fix the assignment operator to call the destructor, as it should.

Fixes https://github.com/scylladb/scylladb/issues/27141

- (cherry picked from commit 9f97c376f1)

- (cherry picked from commit 5dcdaa6f66)

Parent PR: #27140

Closes scylladb/scylladb#27276

* https://github.com/scylladb/scylladb:
  test: test that expired erm that held for too long triggers notification
  token_metadata: fix notification about expiring erm held for to long
2025-11-27 16:58:16 +01:00
Patryk Jędrzejczak
30e02b6658 Merge '[Backport 2025.4] locator/node: include _excluded in missing places' from Scylladb[bot]
We currently ignore the `_excluded` field in `node::clone()` and the verbose
formatter of `locator::node`. The first one is a bug that can have
unpredictable consequences on the system. The second one can be a minor
inconvenience during debugging.

We fix both places in this PR.

Fixes https://scylladb.atlassian.net/browse/SCYLLADB-72

This PR is a bugfix that should be backported to all supported branches.

- (cherry picked from commit 4160ae94c1)

- (cherry picked from commit 287c9eea65)

Parent PR: #27265

Closes scylladb/scylladb#27291

* https://github.com/scylladb/scylladb:
  locator/node: include _excluded in verbose formatter
  locator/node: preserve _excluded in clone()
2025-11-27 12:27:06 +01:00
Nadav Har'El
a0916179a3 Merge '[Backport 2025.4] Alternator: enable tablets by default - depending on tablets_mode_for_new_keyspaces' from Scylladb[bot]
Before this series, Alternator's CreateTable operation defaults to creating a table replicated with vnodes, not tablets. The reasons for this default included missing support for LWT, Materialized Views, Alternator TTL and Alternator Streams if tablets are used. But today, all of these (except the still-experimental Alternator Streams) are now fully available with tablets, so we are finally ready to switch Alternator to use tablets by default in new tables.

We will use the same configuration parameter that CQL uses, tablets_mode_for_new_keyspaces, to determine whether new keyspaces use tablets by default. If set to `enabled`, tablets are used by default on new tables. If set to `disabled`, tablets will not be used by default (i.e., vnodes will be used, as before). A third value, `enforced` is similar to `enabled` but forbids overriding the default to vnodes when creating a table.

As before, the user can set a tag during the CreateTable operation to override the default choice of tablets or vnodes (unless in `enforced` mode). This tag is now named `system:initial_tablets` - whereas before this patch it was called `experimental:initial_tablets`. The rules stay the same as with the earlier, experimental:initial_tablets tag: when supplied with a numeric value, the table will use tablets. When supplied with something else (like a string "none"), the table will use vnodes.
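
The rules above can be summarized in a small Python sketch. It is illustrative only and not the Alternator implementation: the function name `uses_tablets`, the use of `ValueError` for the rejected case, and the choice of treating any string accepted by `int()` as "numeric" are all assumptions.

```python
def uses_tablets(mode, tag=None):
    """Return True if a new table should use tablets, False for vnodes.

    mode: value of tablets_mode_for_new_keyspaces
          ("enabled", "disabled" or "enforced").
    tag:  value of the system:initial_tablets tag, or None if absent.
    """
    if mode not in ("enabled", "disabled", "enforced"):
        raise ValueError(f"unknown mode: {mode}")
    if tag is None:
        # No override: follow the configured default.
        return mode in ("enabled", "enforced")
    try:
        int(tag)
        numeric = True
    except ValueError:
        numeric = False
    if numeric:
        # A numeric tag value asks for tablets.
        return True
    # Anything else (for example "none") asks for vnodes,
    # which is not allowed in enforced mode.
    if mode == "enforced":
        raise ValueError("vnodes requested but tablets are enforced")
    return False
```

For example, `uses_tablets("disabled", "4")` returns `True` because the tag overrides the default, while `uses_tablets("enforced", "none")` is rejected.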

Fixes https://github.com/scylladb/scylladb/issues/22463

Backport to 2025.4, it's important not to delay phasing out vnodes.

- (cherry picked from commit 403068cb3d)

- (cherry picked from commit af00b59930)

- (cherry picked from commit 376a2f2109)

- (cherry picked from commit 35216d2f01)

- (cherry picked from commit 7466325028)

- (cherry picked from commit c7de7e76f4)

- (cherry picked from commit 63897370cb)

- (cherry picked from commit 274d0b6d62)

- (cherry picked from commit 345747775b)

- (cherry picked from commit a659698c6d)

- (cherry picked from commit eeb3a40afb)

- (cherry picked from commit b34f28dae2)

- (cherry picked from commit 25439127c8)

- (cherry picked from commit c03081eb12)

- (cherry picked from commit 65ed678109)

Parent PR: #26836

Closes scylladb/scylladb#26949

* github.com:scylladb/scylladb:
  test/cluster: modify test to not fail on 2025.4 branch
  Fix backport conflicts
  test,alternator: use 3-rack clusters in tests
  alternator: improve error in tablets_mode_for_new_keyspaces=enforced
  config: make tablets_mode_for_new_keyspaces live-updatable
  alternator: improve comment about non-hidden system tags
  alternator: Fix test_ttl_expiration_streams()
  alternator: Fix test_scan_paging_missing_limit()
  alternator: Don't require vnodes for TTL tests
  alternator: Remove obsolete test from test_table.py
  alternator: Fix tag name to request vnodes
  alternator: Fix test name clash in test_tablets.py
  alternator: test_tablets.py handles new policy reg. tablets
  alternator: Update doc regarding tablets support
  alternator: Support `tablets_mode_for_new_keyspaces` config flag
  Fix incorrect hint for tablets_mode_for_new_keyspaces
  Fix comment for tablets_mode_for_new_keyspaces
2025-11-27 09:05:18 +02:00
Piotr Dulikowski
863aae84fd Merge '[Backport 2025.4] db/view/view_building_coordinator: get rid of task's state in group0' from Scylladb[bot]
Previously, the view building coordinator relied on setting each task's state to STARTED and then explicitly removing these state entries once tasks finished, before scheduling new ones. This approach induced a significant number of group0 commits, particularly in large clusters with many nodes and tablets, negatively impacting performance and scalability.

With the update, the coordinator and worker logic has been restructured to operate without maintaining per-task states. Instead, tasks are simply tracked with an aborted boolean flag, which is still essential for certain tablet operations. This change removes much of the coordination complexity, simplifies the view building code, and reduces operational overhead.

In addition, the coordinator now batches reports of finished tasks before making commits. Rather than committing task completions individually, it aggregates them and reports in groups, significantly minimizing the frequency of group0 commits. This new approach is expected to improve efficiency and scalability during materialized view construction, especially in large deployments.

Fixes https://github.com/scylladb/scylladb/issues/26311

This patch needs to be backported to 2025.4.

- (cherry picked from commit 6d853c8f11)

- (cherry picked from commit 08974e1d50)

- (cherry picked from commit eb04af5020)

- (cherry picked from commit 24d69b4005)

- (cherry picked from commit fb8cbf1615)

- (cherry picked from commit fe9581f54c)

Parent PR: #26897

Closes scylladb/scylladb#27266

* github.com:scylladb/scylladb:
  docs/dev/view-building-coordinator: update the docs after recent changes
  db/view/view_building: send coordinator's term in the RPC
  db/view/view_building_state: replace task's state with `aborted` flag
  db/view/view_building_coordinator: batch finished tasks reporting
  db/view/view_building_worker: change internal implementation
  db/view/view_building_coordinator: change `work_on_tasks` RPC return type
2025-11-27 01:47:53 +01:00
Patryk Jędrzejczak
3c635037df locator/node: include _excluded in verbose formatter
It can be helpful during debugging.

(cherry picked from commit 287c9eea65)
2025-11-26 23:05:25 +00:00
Patryk Jędrzejczak
30790b9af4 locator/node: preserve _excluded in clone()
We currently ignore the `_excluded` field in `clone()`. Losing
information about exclusion can have unpredictable consequences. One
observed effect (that led to finding this issue) is that the
`/storage_service/nodes/excluded` API endpoint sometimes misses excluded
nodes.
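
The class of bug described here can be shown with a minimal Python sketch. The real code is C++ (`locator::node`); the `Node` dataclass and its field names below are hypothetical stand-ins.

```python
from dataclasses import dataclass, replace


@dataclass
class Node:
    host_id: str
    excluded: bool = False

    def clone_buggy(self):
        # Rebuilds the node field by field and forgets `excluded`,
        # so the copy silently falls back to the default (False).
        return Node(host_id=self.host_id)

    def clone(self):
        # Copies every field, including `excluded`.
        return replace(self)


original = Node(host_id="n1", excluded=True)
lost = original.clone_buggy().excluded   # exclusion flag is dropped
kept = original.clone().excluded         # exclusion flag is preserved
```

A copy made with `clone_buggy()` no longer reports the node as excluded, which mirrors how an endpoint listing excluded nodes could miss some of them.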

(cherry picked from commit 4160ae94c1)
2025-11-26 23:05:25 +00:00
Nadav Har'El
b2c3b28617 test/cluster: modify test to not fail on 2025.4 branch
The purpose of the test

cluster/test_alternator::test_alternator_ttl_scheduling_group

is to verify that during TTL expiration scans and deletions, all of the
CPU is used in the "streaming" scheduling group, not in the statement
scheduling group ("sl:default") as we had in the past due to bugs.

It appears that in branch 2025.4 we have a new bug - which doesn't exist
in master - that causes some tablets-related work which I couldn't
identify to be done in sl:default, and cause this test to fail.

The simple fix is to sleep for 5 seconds after writing the data, and
it seems that by that time, the sl:default work is done.

This change doesn't make the Alternator TTL test any weaker, so we
need to make this change to allow Alternator to go forward.

Sadly, it does mean that the only test we have for this apparent
bug (which has nothing to do with Alternator) will be gone.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-11-26 20:42:09 +02:00
Nadav Har'El
433bc4c17f Fix backport conflicts 2025-11-26 20:42:08 +02:00
Nadav Har'El
5e48fb9601 test,alternator: use 3-rack clusters in tests
With tablets enabled, we can't create an Alternator table on a three-
node cluster with a single rack, since Scylla refuses RF=3 with just
one rack and we get the error:

    An error occurred (InternalServerError) when calling the CreateTable
    operation: ... Replication factor 3 exceeds the number of racks (1) in
    dc datacenter1

So in test/cluster/test_alternator.py we need to use the incantation
"auto_rack_dc='dc1'" every time that we create a three-node cluster.

Before this patch, several tests in test/cluster/test_alternator.py
failed on this error, with this patch all of them pass.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 65ed678109)
2025-11-26 20:42:08 +02:00
Nadav Har'El
a5ba028d40 alternator: improve error in tablets_mode_for_new_keyspaces=enforced
When in tablets_mode_for_new_keyspaces=enforced mode, Alternator is
supposed to fail when CreateTable asks explicitly for vnodes. Before
this patch, this error was an ugly "Internal Server Error" (an
exception thrown from deep inside the implementation). This patch
checks for this case in the right place, to generate a proper
ValidationException with a proper error message.

We also enable the test test_tablets_tag_vs_config which should have
caught this error, but didn't because it was marked xfail because
tablets_mode_for_new_keyspaces had not been live-updatable. Now that
it is, we can enable the test. I also improved the test to be slightly
faster (no need to change the configuration so many times) and also
check the ordinary case - where the schema chooses neither
vnodes nor tablets explicitly and we should just use the default.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit c03081eb12)
2025-11-26 20:42:08 +02:00
Nadav Har'El
5972790a71 config: make tablets_mode_for_new_keyspaces live-updatable
We have a configuration option "tablets_mode_for_new_keyspaces" which
determines whether new keyspaces should use tablets or vnodes.

For some reason, this configuration parameter was never marked
live-updatable, so in this patch we add the flag. No other changes are needed -
the existing code that uses this flag always uses it through the
up-to-date configuration.

In the previous patches we start to honor tablets_mode_for_new_keyspaces
also in Alternator CreateTable, and we wanted to test this but couldn't
do this in test/alternator because the option was not live-updatable.
Now that it will be, we'll be able to test this feature in
test/alternator.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit 25439127c8)
2025-11-26 20:42:08 +02:00
Nadav Har'El
58b89b4d28 alternator: improve comment about non-hidden system tags
The previous patches added a somewhat misleading comment in front of
system:initial_tablets, which this patch improves.

That tag is NOT where Alternator "stores" table properties like the
existing comment claimed. In fact, the whole point is that it's the
opposite - Alternator never writes to this tag - it's a user-writable
tag which Alternator *reads*, to configure the new table. And this is
why it obviously can't be hidden from the user.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
(cherry picked from commit b34f28dae2)
2025-11-26 20:42:08 +02:00
Piotr Szymaniak
915fa6694b alternator: Fix test_ttl_expiration_streams()
The test is now aware of the new name of the
`system:initial_tablets` tag.

(cherry picked from commit eeb3a40afb)
2025-11-26 20:42:07 +02:00
Piotr Szymaniak
42dc583467 alternator: Fix test_scan_paging_missing_limit()
With tablets, the test began failing. The failure was correlated with
the number of initial tablets, which when kept at default, equals
4 tablets per shard in release build and 2 tablets per shard in dev
build.

In this patch we split the test into two - one with more data in
the table to check the original purpose of this test - that Scan
doesn't return the entire table in one page if "Limit" is missing.
The other test reproduces issue #10327 - that when the table is
small, Scan's page size isn't strictly limited to 1MB as it is in
DynamoDB.

Experimentally, 8000 KB of data (compared to 6000 KB before this patch)
is enough when we have up to 4 initial tablets per shard (so 8 initial
tablets on a two-shard node as we typically run in tests).

Original patch by Piotr Szymaniak <piotr.szymaniak@scylladb.com>
modified by Nadav Har'El <nyh@scylladb.com>

(cherry picked from commit a659698c6d)
2025-11-26 20:42:07 +02:00
Piotr Szymaniak
80fc6a7951 alternator: Don't require vnodes for TTL tests
Since #23662 Alternator supports TTL with tablets too. Let's clear some
leftovers causing Alternator to test TTL with vnodes instead of with
what is default for Alternator (tablets or vnodes).

(cherry picked from commit 345747775b)
2025-11-26 20:42:07 +02:00
Piotr Szymaniak
44f46099fd alternator: Remove obsolete test from test_table.py
Since Alternator is capable of running with tablets according to the
flag in config, remove the obsolete test that is making sure
that Alternator runs with vnodes.

(cherry picked from commit 274d0b6d62)
2025-11-26 20:42:07 +02:00
Piotr Szymaniak
7489db0097 alternator: Fix tag name to request vnodes
The tag was lately renamed from `experimental:initial_tablets` to
`system:initial_tablets`. This commit fixes both the tests as well as
the exceptions sent to the user instructing how to create table with
vnodes.

(cherry picked from commit 63897370cb)
2025-11-26 20:42:07 +02:00
Piotr Szymaniak
2b3f921971 alternator: Fix test name clash in test_tablets.py
(cherry picked from commit c7de7e76f4)
2025-11-26 20:42:07 +02:00
Piotr Szymaniak
e286a72fd8 alternator: test_tablets.py handles new policy reg. tablets
Adjust the tests so they are in-line with the config flag
`tablets_mode_for_new_keyspaces` that Alternator learned to honour.

(cherry picked from commit 7466325028)
2025-11-26 20:42:06 +02:00
Piotr Szymaniak
04363b86ea alternator: Update doc regarding tablets support
Reflect that Alternator now honours the value of the config flag
`tablets_mode_for_new_keyspaces`, as well as the renaming of the tag
`experimental:initial_tablets` to `system:initial_tablets`.

(cherry picked from commit 35216d2f01)
2025-11-26 20:42:06 +02:00
Piotr Szymaniak
65ceb83f42 alternator: Support tablets_mode_for_new_keyspaces config flag
Until now, tablets in Alternator were an experimental feature enabled only
when a TAG "experimental:initial_tablets" was present when creating a
table and associated with a numeric value.

After this patch, Alternator honours the value of
`tablets_mode_for_new_keyspaces` config flag.

Each table can be overridden to use tablets or not by supplying a new TAG
"system:initial_tablets". The rules stay the same as with the earlier,
experimental tag: when supplied with a numeric value, the table will use
tablets (as long as they are supported). When supplied with something
else (like a string "none"), the table will use vnodes, provided that
tablets are not `enforced` by the config flag.

Fixes #22463

(cherry picked from commit 376a2f2109)
2025-11-26 20:42:06 +02:00
Piotr Szymaniak
8b9de27ff8 Fix incorrect hint for tablets_mode_for_new_keyspaces
(cherry picked from commit af00b59930)
2025-11-26 20:42:06 +02:00
Piotr Szymaniak
b000904d0f Fix comment for tablets_mode_for_new_keyspaces
The comment was not listing all 3 possible values correctly,
even though the explanation just below it covers all 3 values.

(cherry picked from commit 403068cb3d)
2025-11-26 20:42:06 +02:00
Michał Jadwiszczak
88d55e9236 docs/dev/view-building-coordinator: update the docs after recent changes
Remove information about view building task state and explain the
current lifetime of the task.

(cherry picked from commit fe9581f54c)
2025-11-26 17:47:16 +01:00
Michał Jadwiszczak
64e0405ba2 db/view/view_building: send coordinator's term in the RPC
To avoid the case where an old coordinator (which hasn't been stopped yet)
dictates what should be done, add the raft term to the `work_on_view_building_tasks`
RPC.
The worker needs to check if the term matches the current term from raft
server, and deny the request when the term is bad.
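
A minimal Python sketch of the term check follows. It is not the actual RPC handler (which is C++); `Worker`, `StaleTermError` and the handler signature are hypothetical, and the worker's `current_term` stands in for the term read from the raft server.

```python
class StaleTermError(Exception):
    """Raised when a request carries a term that differs from the worker's."""


class Worker:
    def __init__(self, current_term):
        # In the real system this would come from the raft server.
        self.current_term = current_term

    def work_on_view_building_tasks(self, term, task_ids):
        # Deny requests whose term does not match the current raft term,
        # e.g. those sent by a coordinator that has already been replaced.
        if term != self.current_term:
            raise StaleTermError(
                f"request term {term} does not match current term {self.current_term}"
            )
        return list(task_ids)


worker = Worker(current_term=7)
accepted = worker.work_on_view_building_tasks(7, ["t1", "t2"])
try:
    worker.work_on_view_building_tasks(6, ["t3"])  # old coordinator
    denied = False
except StaleTermError:
    denied = True
```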

(cherry picked from commit fb8cbf1615)
2025-11-26 17:47:16 +01:00
Michał Jadwiszczak
0ffc8c5987 db/view/view_building_state: replace task's state with aborted flag
After previous commits, we can drop entire task's state and replace it
with single boolean flag, which determines if a task was aborted.

Once a task was aborted, it cannot get resurrected to a normal state.

(cherry picked from commit 24d69b4005)
2025-11-26 17:47:16 +01:00
Michał Jadwiszczak
098082a8d9 db/view/view_building_coordinator: batch finished tasks reporting
In the previous implementation, to execute view building tasks, the
coordinator first needed to set their states to `STARTED`
and then it needed to remove them before it could start the next ones.
This logic required a lot of group0 commits, especially in large
clusters with higher number of nodes and big tablet count.

After previous commit to the view building worker, the coordinator
can start view building tasks without setting the `STARTED` state
and deleting finished tasks.

This patch adjusts the coordinator to save finished tasks locally,
so it can continue to execute next ones and the finished tasks are
periodically removed from the group0 by `finished_task_gc_fiber()`.
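
The batching idea can be sketched in Python as follows. This is a toy model, not the coordinator code: the `Coordinator` class, its attributes and the `commits` counter are hypothetical, and `gc_finished_tasks` only loosely mirrors the role of `finished_task_gc_fiber()`.

```python
class Coordinator:
    def __init__(self, tasks):
        self.group0_tasks = set(tasks)  # tasks recorded in group0 (toy model)
        self.finished = set()           # finished tasks kept locally
        self.commits = 0                # number of simulated group0 commits

    def on_tasks_finished(self, task_ids):
        # Remember finished tasks locally; no group0 commit happens here,
        # so the coordinator can immediately schedule the next tasks.
        self.finished.update(task_ids)

    def gc_finished_tasks(self):
        # Periodically remove all locally recorded finished tasks
        # from group0 in a single commit.
        if not self.finished:
            return
        self.group0_tasks -= self.finished
        self.finished.clear()
        self.commits += 1


coord = Coordinator(["t1", "t2", "t3", "t4"])
coord.on_tasks_finished(["t1"])
coord.on_tasks_finished(["t2", "t3"])
coord.gc_finished_tasks()   # one commit removes three tasks
coord.gc_finished_tasks()   # nothing to do, no extra commit
```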

(cherry picked from commit eb04af5020)
2025-11-26 17:47:12 +01:00
Gleb Natapov
26606c8801 test: test that expired erm that held for too long triggers notification
(cherry picked from commit 5dcdaa6f66)
2025-11-26 15:09:15 +00:00
Gleb Natapov
f29911cb73 token_metadata: fix notification about expiring erm held for to long
Commit 6e4803a750 broke notification about expired erms held for too
long since it resets the tracker without calling its destructor (where
notification is triggered). Fix the assignment operator to call the destructor.

(cherry picked from commit 9f97c376f1)
2025-11-26 15:09:15 +00:00
Wojciech Mitros
33eef5122c alternator: use storage_proxy from the correct shard in executor::delete_table
When we delete a table in alternator, the schema change is performed on shard 0.
However, we actually use the storage_proxy from the shard that is handling the
delete_table command. This can lead to problems because some information is
stored only on shard 0 and using storage_proxy from another shard may make
us miss it.
In this patch we fix this by using the storage_proxy from shard 0 instead.

Fixes https://github.com/scylladb/scylladb/issues/27223

Closes scylladb/scylladb#27224

(cherry picked from commit 3c376d1b64)

Closes scylladb/scylladb#27260
2025-11-26 14:33:21 +02:00
Michał Jadwiszczak
ab9878c2df db/view/view_building_worker: change internal implementation
This commit doesn't change the logic behind the view building worker but
it changes how the worker is executing view building tasks.

Previously, the worker had a state only on shard0 and it was reacting to
changes in group0 state. When it noticed some tasks were moved to
`STARTED` state, the worker was creating a batch for it on the shard0
state.
The RPC call was used only to start the batch and to get its result.

Now, the main logic of batch management was moved to the RPC call
handler.
The worker has a local state on each shard and the state
contains:
- unique ptr to the batch
- set of completed tasks
- information for which views the base table was flushed

So currently, each batch lives on a shard where it has its work to do
exclusively. This eliminates a need to do a synchronization between
shard0 and work shard, which was a painful point in previous
implementation.

The worker still reacts to changes in group0 view building state, but
currently it's only used to observe whether any view building task was
aborted by setting `ABORTED` state.

To prepare for further changes to drop the view building task state,
the worker ignores `IDLE` and `STARTED` states completely.

(cherry picked from commit 08974e1d50)
2025-11-26 12:24:42 +00:00
Michał Jadwiszczak
8df72bb658 db/view/view_building_coordinator: change work_on_tasks RPC return type
During the initial implementation of the view building coordinator,
we decided that if a view building task fails locally on the worker
(example reason: view update's target replica is not available),
the worker will retry this work instead of reporting a failure to the
coordinator.

However, we left the return type of the RPC, which was telling if a task was
finished successfully or aborted.
But the worker doesn't need to report that a task was aborted, because
it's the coordinator, who decides to abort a task.

So, this commit changes the return type to list of UUIDs of completed
tasks.
Previously length of the returned vector needed to be the same as length
of the vector sent in the request.
Now we can drop this restriction, and the RPC handler returns a list of UUIDs
of completed tasks (subset of vector sent in the request).
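
As a rough Python illustration of the new return shape (the function name and parameters are hypothetical, not the real RPC signature): the handler returns only the IDs of requested tasks that have completed, so the result may be shorter than the request.

```python
from uuid import uuid4


def work_on_tasks(requested, completed):
    """Return the requested task IDs that have completed, in request order."""
    return [task_id for task_id in requested if task_id in completed]


t1, t2, t3 = uuid4(), uuid4(), uuid4()
result = work_on_tasks([t1, t2, t3], completed={t1, t3})
```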

This change is required to drop `STARTED` state in next commits.

Since Scylla 2025.4 wasn't released yet and we're going to merge this
patch before releasing, no RPC versioning or cluster feature is needed.

(cherry picked from commit 6d853c8f11)
2025-11-26 12:24:42 +00:00
Piotr Dulikowski
deda14b614 Merge '[Backport 2025.4] topology_coordinator: don't repair colocated tablets' from Scylladb[bot]
This backport additionally includes commit 7e201eea from
scylladb/scylla#26543 which causes the tombstone_gc property to be added
to secondary indexes when they are created. This fix was necessary to
resolve the merge conflict. The original commit message of
scylladb/scylladb#27120 follows:

With the introduction of colocated tables, all the tablet transitions
now operate on groups of colocated tablets instead of individual
tablets. This applies to tablet migration and also to tablet repair.

The tablet repair currently doesn't work on individual tablets due to
the limitations in the tablet map being shared. The way it was
implemented to work on a group of colocated tablets is by repairing all
the colocated tablets together, using a dedicated rpc, and setting a
shared repair_time in the shared tablet map.  It was implemented this
way because we wanted to have some way to repair the tablets of a
colocated table.

However, we want to change this in the next release so that it will be
possible to repair the tablets of a colocated table individually. In
order to simplify and prepare for the future change, we prefer until
then to not repair colocated tables at all. Otherwise, we will need to
support both the shared repair and individual repair together for a long
time, and the upgrade will be more complicated.

We change the handling of the tablet 'repair' transition to repair only
the base table's tablets. It means it will not be possible to request
tablet repair for a non-base colocated table such as local MV, CDC and
paxos table. This restriction will be temporary until a later release
where we will support repairing colocated tablets.

This is a reasonable restriction because repair for these kinds of tables
is not required or as important as for normal tables.

Fixes https://github.com/scylladb/scylladb/issues/27119

backport to 2025.4 since we must change it in the same version it's introduced before it's released

- (cherry picked from commit 273f664496)

- (cherry picked from commit 005807ebb8)

- (cherry picked from commit 7e201eea1a)

- (cherry picked from commit 868ac42a8b)

Parent PR: #27120

Closes scylladb/scylladb#27256

* github.com:scylladb/scylladb:
  tombstone_gc: don't use 'repair' mode for colocated tables
  index: Set tombstone_gc when creating secondary index
  Revert "storage service: add repair colocated tablets rpc"
  topology_coordinator: don't repair colocated tablets
2025-11-26 12:29:48 +01:00
Karol Nowacki
3edd12eba0 vector_search: Restrict vector index tests to tablets only
Vector indexes are going to be supported only for tablets (see VECTOR-322).
As a result, tests using vector indexes will be failing when run with vnodes.

This change ensures tests using vector indexes run exclusively with tablets.

Fixes: VECTOR-49

Closes scylladb/scylladb#27233
2025-11-26 11:00:30 +02:00
Michael Litvak
ff113dd43a tombstone_gc: don't use 'repair' mode for colocated tables
For tables of special types that can be colocated: MV, CDC, and paxos
table, we should not use tombstone_gc=repair mode because colocated
tablets are never repaired, hence they will not have repair_time set and
will never be GC'd using 'repair' mode.

(cherry picked from commit 868ac42a8b)
2025-11-26 08:36:52 +01:00
Dawid Mędrek
ecc969ca23 index: Set tombstone_gc when creating secondary index
Before this commit, when the underlying materialized view was created,
it didn't have the property `tombstone_gc` set to any value. That
was a bug and we fix it now.

Two reproducer tests are added for validation. They reproduce the problem
and don't pass before this commit.

Fixes scylladb/scylladb#26542
2025-11-26 08:22:08 +01:00
Michael Litvak
0e7be65bb1 Revert "storage service: add repair colocated tablets rpc"
This reverts commit 11f045bb7c.

The rpc was added together with colocated tablets in 2025.4 to support a
"shared repair" operation of a group of colocated tablets that repairs
all of them and allows also for special behavior as opposed to repairing
a single specific tablet.

It is not used anymore because we decided to not repair all colocated
tablets in a single shared operation, but to repair only the base table,
and in a later release support repairing colocated tables individually.

We can remove the rpc in 2025.4 because it is introduced in the same
version.

(cherry picked from commit 005807ebb8)
2025-11-26 07:52:18 +01:00
Michael Litvak
4fd00902fc topology_coordinator: don't repair colocated tablets
With the introduction of colocated tables, all the tablet transitions
now operate on groups of colocated tablets instead of individual
tablets. This applies to tablet migration and also to tablet repair.

The tablet repair currently doesn't work on individual tablets due to
the limitations in the tablet map being shared. The way it was
implemented to work on a group of colocated tablets is by repairing all
the colocated tablets together, using a dedicated rpc, and setting a
shared repair_time in the shared tablet map.  It was implemented this
way because we wanted to have some way to repair the tablets of a
colocated table.

However, we want to change this in the next release so that it will be
possible to repair the tablets of a colocated table individually. In
order to simplify and prepare for the future change, we prefer until
then to not repair colocated tables at all. Otherwise, we will need to
support both the shared repair and individual repair together for a long
time, and the upgrade will be more complicated.

We change the handling of the tablet 'repair' transition to repair only
the base table's tablets. It means it will not be possible to request
tablet repair for a non-base colocated table such as local MV, CDC and
paxos table. This restriction will be temporary until a later release
where we will support repairing colocated tablets.

This is a reasonable restriction because repair for these kinds of tables
is not required or as important as for normal tables.

Fixes scylladb/scylladb#27119

(cherry picked from commit 273f664496)
2025-11-26 05:20:02 +00:00
Pavel Emelyanov
c75375b148 Merge '[Backport 2025.4] streaming: fix loop break condition in tablet_sstable_streamer::stream' from Scylladb[bot]
When streaming SSTables by tablet range, the original implementation of tablet_sstable_streamer::stream may break out of the loop too early when encountering a non-overlapping SSTable. As a result, subsequent SSTables that should be classified as partially contained are skipped entirely.

Tablet range: [4, 5]
SSTable ranges:
[0,5]
[0, 3] <--- is considered exhausted, and causes skip to next tablet
[2, 5] <--- is missed for range [4, 5]

The loop used if (!overlaps) break; semantics, which conflated “no overlap” with “done scanning.” This caused premature termination when an SSTable did not overlap but the following one did.

Correct logic should be:

before(sst_last) → skip and continue.

after(sst_first) → break (no further SSTables can overlap).

Otherwise → `contains` to classify as full or partial.
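
The corrected loop can be sketched in Python using the example ranges above. This is an illustration under stated assumptions, not the C++ code in tablet_sstable_streamer::stream: `classify_sstables` is a hypothetical name, ranges are modelled as inclusive integer pairs, and the SSTables are assumed to be sorted by their first token (which is what makes the early `break` safe).

```python
def classify_sstables(tablet, sstables):
    """Split SSTables into fully and partially contained for one tablet range.

    tablet:   inclusive (start, end) token range.
    sstables: inclusive (first, last) token ranges, sorted by first token.
    """
    t_start, t_end = tablet
    full, partial = [], []
    for first, last in sstables:
        if last < t_start:
            # Entirely before the tablet range: skip, later ones may overlap.
            continue
        if first > t_end:
            # Entirely after the tablet range: no later SSTable can overlap.
            break
        if t_start <= first and last <= t_end:
            full.append((first, last))
        else:
            partial.append((first, last))
    return full, partial


full, partial = classify_sstables((4, 5), [(0, 5), (0, 3), (2, 5)])
```

With these inputs, `[0, 3]` is skipped rather than ending the scan, so `[2, 5]` is still classified as partially contained.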

Missing SSTables in streaming and potential data loss or incomplete streaming in repair/streaming operations.

1. Correct the loop termination logic that previously caused certain SSTables to be prematurely excluded, resulting in lost mutations. This change ensures all relevant SSTables are properly streamed and their mutations preserved.
2. Refactor the loop to use before() and after() checks explicitly, and only break when the SSTable is entirely after the tablet range
3. Add pytest to cover this case, full streaming flow by means of `restore`
4. Add boost tests to test the new refactored function

This data corruption fix should be ported back to 2024.2, 2025.1, 2025.2, 2025.3 and 2025.4

Fixes: https://github.com/scylladb/scylladb/issues/26979

- (cherry picked from commit 656ce27e7f)

- (cherry picked from commit dedc8bdf71)

Parent PR: #26980

Closes scylladb/scylladb#27156

* github.com:scylladb/scylladb:
  streaming: fix loop break condition in tablet_sstable_streamer::stream
  streaming: add pytest case to reproduce mutation loss issue
2025-11-25 20:20:45 +03:00
Ernest Zaslavsky
b4c8748393 streaming: fix loop break condition in tablet_sstable_streamer::stream
Correct the loop termination logic that previously caused
certain SSTables to be prematurely excluded, resulting in
lost mutations. This change ensures all relevant SSTables
are properly streamed and their mutations preserved.

(cherry picked from commit dedc8bdf71)
2025-11-25 11:42:34 +02:00
Ernest Zaslavsky
fa58e526a0 streaming: add pytest case to reproduce mutation loss issue
Introduce a test that demonstrates mutation loss caused by premature
loop termination in tablet_sstable_streamer::stream. The code broke
out of the SSTable iteration when encountering a non-overlapping range,
which skipped subsequent SSTables that should have been partially
contained. This test showcases the problem only.

Example:
Tablet range: [4, 5]
SSTable ranges:
[0, 5]
[0, 3] <--- is considered exhausted, and causes skip to next tablet
[2, 5] <--- is missed for range [4, 5]

(cherry picked from commit 656ce27e7f)
2025-11-25 11:42:18 +02:00
Tomasz Grabiec
612c4321c8 Merge '[Backport 2025.4] address_map: Use more efficient and reliable replication method' from Scylladb[bot]
Primary issue with the old method is that each update is a separate
cross-shard call, and all later updates queue behind it. If one of the
shards has high latency for such calls, the queue may accumulate and
system will appear unresponsive for mapping changes on non-zero shards.

This happened in the field when one of the shards was overloaded with
sstables and compaction work, which caused frequent stalls which
delayed polling for ~100ms. A queue of 3k address updates
accumulated, because we update mapping on each change of gossip
states. This made bootstrap impossible because nodes couldn't
learn about the IP mapping for the bootstrapping node and streaming
failed.

To protect against that, use a more efficient method of replication
which requires a single cross-shard call to replicate all prior
updates.

It is also more reliable, if replication fails transiently for some
reason, we don't give up and fail all later updates.
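
The idea of replicating all prior updates with one call can be sketched as below. This is a simplified Python model, not the helper introduced in this series: the `Owner` and `Replica` classes and the versioned log are assumptions made for illustration, and the real code uses cross-shard calls rather than direct method calls.

```python
class Owner:
    """Owning shard: applies updates locally and records them in a log."""

    def __init__(self):
        self.log = []  # ordered (key, value) updates; len(log) is the version

    def update(self, key, value):
        self.log.append((key, value))

    def changes_since(self, version):
        # One call returns every update the replica has not seen yet.
        return self.log[version:], len(self.log)


class Replica:
    """Non-owning shard: catches up with a single call per sync."""

    def __init__(self, owner):
        self.owner = owner
        self.version = 0
        self.data = {}
        self.calls = 0

    def sync(self):
        self.calls += 1
        try:
            changes, new_version = self.owner.changes_since(self.version)
        except ConnectionError:
            # Transient failure: the version is unchanged, so the next
            # sync retries the same updates instead of dropping them.
            return False
        for key, value in changes:
            self.data[key] = value
        self.version = new_version
        return True


owner = Owner()
for i in range(3000):
    owner.update(f"host-{i}", f"10.0.{i // 256}.{i % 256}")
replica = Replica(owner)
replica.sync()
print(replica.calls, len(replica.data))
```

In this model a backlog of 3000 updates costs the replica one call instead of 3000, and a failed sync leaves the replica's version untouched so nothing is lost.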

Fixes #26865

- (cherry picked from commit ed8d127457)

- (cherry picked from commit 4a85ea8eb2)

- (cherry picked from commit f83c4ffc68)

Parent PR: #26941

Closes scylladb/scylladb#27189

* github.com:scylladb/scylladb:
  address_map: Use barrier() to wait for replication
  address_map: Use more efficient and reliable replication method
  utils: Introduce helper for replicated data structures
2025-11-24 16:43:56 +01:00
Pavel Emelyanov
36234e18ee Merge '[Backport 2025.4] Support local primary-replica-only for native restore' from Scylladb[bot]
This PR extends the restore API so that it accepts primary_replica_only as parameter and it combines the concepts of primary-replica-only with scoped streaming so that with:
- `scope=all primary_replica_only=true` The restoring node will stream to the global primary replica only
- `scope=dc primary_replica_only=true` The restoring node will stream to the local primary replica only.
- `scope=rack primary_replica_only=true` The restoring node will stream only to the primary replica from within its own rack (with rf=#racks, the restoring node will stream only to itself)
- `scope=node primary_replica_only=true` is not allowed, the restoring node will always stream only to itself so the primary_replica_only parameter wouldn't make sense.
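
The combinations above can be summarized in a small sketch. This is illustrative Python only, not the restore API: the function name `streaming_targets` and the returned descriptions are hypothetical, and the behavior without `primary_replica_only` is an assumption of the sketch.

```python
VALID_SCOPES = ("all", "dc", "rack", "node")


def streaming_targets(scope, primary_replica_only):
    """Describe where the restoring node streams for a given combination."""
    if scope not in VALID_SCOPES:
        raise ValueError(f"unknown scope: {scope!r}")
    if not primary_replica_only:
        # Assumption: without the flag, every replica in scope is a target.
        return f"all replicas within scope={scope}"
    if scope == "node":
        # Rejected: the node only ever streams to itself in this scope.
        raise ValueError("primary_replica_only is not allowed with scope=node")
    return {
        "all": "global primary replica",
        "dc": "primary replica in the local datacenter",
        "rack": "primary replica in the local rack",
    }[scope]


for scope in ("all", "dc", "rack"):
    print(scope, "->", streaming_targets(scope, primary_replica_only=True))
```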

The PR also adjusts the `nodetool refresh` restriction on running restore with both primary_replica_only and scope, it adds primary_replica_only to `nodetool restore` and it adds cluster tests for primary replica within scope.

Fixes #26584

- (cherry picked from commit 965a16ce6f)

- (cherry picked from commit 136b45d657)

- (cherry picked from commit 83aee954b4)

- (cherry picked from commit c1b3fe30be)

- (cherry picked from commit d4e43bd34c)

- (cherry picked from commit 817fdadd49)

- (cherry picked from commit a04ebb829c)

Parent PR: #26609

Closes scylladb/scylladb#27011

* github.com:scylladb/scylladb:
  Add cluster tests for checking scoped primary_replica_only streaming
  Improve choice distribution for primary replica
  Refactor cluster/object_store/test_backup
  nodetool restore: add primary-replica-only option
  nodetool refresh: Enable scope={all,dc,rack} with primary_replica_only
  Enable scoped primary replica only streaming
  Support primary_replica_only for native restore API
2025-11-24 13:37:25 +03:00
Tomasz Grabiec
9e5e8bec69 address_map: Use barrier() to wait for replication
More efficient than 100 pings.

There was one ping in the test which was done "so this shard notices the
clock advance". It's not necessary, since observing a completed SMP
call implies that the local shard sees the clock advancement done within it.

(cherry picked from commit f83c4ffc68)
2025-11-23 21:05:34 +00:00
Tomasz Grabiec
f6e90d5a5f address_map: Use more efficient and reliable replication method
Primary issue with the old method is that each update is a separate
cross-shard call, and all later updates queue behind it. If one of the
shards has high latency for such calls, the queue may accumulate and
system will appear unresponsive for mapping changes on non-zero shards.

This happened in the field when one of the shards was overloaded with
sstables and compaction work, which caused frequent stalls which
delayed polling for ~100ms. A queue of 3k address updates
accumulated. This made bootstrap impossible, since nodes couldn't
learn about the IP mapping for the bootstrapping node and streaming
failed.

To protect against that, use a more efficient method of replication
which requires a single cross-shard call to replicate all prior
updates.

It is also more reliable, if replication fails transiently for some
reason, we don't give up and fail all later updates.

Fixes #26865
Fixes #26835

(cherry picked from commit 4a85ea8eb2)
2025-11-23 21:05:33 +00:00
Tomasz Grabiec
8069948f0a utils: Introduce helper for replicated data structures
Key goals:
  - efficient (batching updates)
  - reliable (no lost updates)

Will be used in data structures maintained on one designated owning
shard and replicated to other shards.

(cherry picked from commit ed8d127457)
2025-11-23 21:05:33 +00:00
Piotr Dulikowski
05efc8fd21 Merge '[Backport 2025.4] vector search: Add HTTPS requests support' from Scylladb[bot]
vector_search: Add HTTPS support for vector store connections

This commit introduces TLS encryption support for vector store connections.
A new configuration option is added:
- vector_store_encryption_options.truststore: path to the trust store file

To enable secure connections, use the https:// scheme in the
vector_store_primary_uri/vector_store_secondary_uri configuration options.

Fixes: VECTOR-327

Backport to 2025.4 as this feature is expected to be available in 2025.4.

- (cherry picked from commit c40b3ba4b3)

- (cherry picked from commit 58456455e3)

Parent PR: #26935

Closes scylladb/scylladb#27180

* github.com:scylladb/scylladb:
  test: vector_search: Ensure all clients are stopped on shutdown
  vector_search: Add HTTPS support for vector store connections
2025-11-23 01:24:18 +01:00
Karol Nowacki
487286d296 test: vector_search: Ensure all clients are stopped on shutdown
A flaky test revealed that after `clients::stop()` was called,
the `old_clients` collection was sometimes not empty,
indicating that some clients were not being stopped correctly.
This resulted in sanitizer errors when objects went out of scope at the end of the test.

This patch modifies `stop()` to ensure all clients, including those in `old_clients`,
are stopped, guaranteeing a clean shutdown.
2025-11-22 21:53:37 +01:00
Karol Nowacki
f4ced4c31a vector_search: Add HTTPS support for vector store connections
This commit introduces TLS encryption support for vector store connections.
A new configuration option is added:
- vector_store_encryption_options.truststore: path to the trust store file

To enable secure connections, use the https:// scheme in the
vector_store_primary_uri/vector_store_secondary_uri configuration options.

Fixes: VECTOR-327
2025-11-22 21:53:29 +01:00
Piotr Dulikowski
a70fe7dec1 Merge '[Backport 2025.4] vector_search: Fix error handling and status parsing' from Scylladb[bot]
vector_search: Fix error handling and status parsing

This change addresses two issues in the vector search client that caused
validator test failures: incorrect handling of 5xx server errors and
faulty status response parsing.

1.  5xx Error Handling:
    Previously, a 5xx response (e.g., 503 Service Unavailable) from the
    underlying vector store for an `/ann` search request was incorrectly
    interpreted as a node failure. This would cause the node to be marked
    as down, even for transient issues like an index scan being in progress.

    This change ensures that 5xx errors are treated as transient search
    failures, not node failures, preventing nodes from being incorrectly
    marked as down.

2.  Status Response Parsing:
    The logic for parsing status responses from the vector store was
    flawed. This has been corrected to ensure proper parsing.

Fixes: SCYLLADB-50

Backport to 2025.4 as this problem is present on this branch.

- (cherry picked from commit 05b9cafb57)

- (cherry picked from commit 366ecef1b9)

- (cherry picked from commit 9563d87f74)

Parent PR: #27111

Closes scylladb/scylladb#27145

* github.com:scylladb/scylladb:
  vector_search: Don't mark nodes as down on 5xx server errors
  test: vector_search: Move unavailable_server to dedicated file
  vector_search: Fix status response parsing
2025-11-22 18:53:44 +01:00
Karol Nowacki
727d2a2148 vector_search: Don't mark nodes as down on 5xx server errors
For an `/ann` search request, a 5xx server response does not
indicate that the node is down. It can signify a transient state, such
as the index full scan being in progress.

Previously, treating a 503 error as a node fault would cause the node
to be incorrectly marked as down, for example, when a new index was
being created. This commit ensures that such errors are treated as
transient search failures, not node failures.
2025-11-22 14:43:53 +01:00
Karol Nowacki
e44ea45844 test: vector_search: Move unavailable_server to dedicated file
The unavailable_server code will be reused in upcoming client unit tests.
2025-11-22 14:43:53 +01:00
Karol Nowacki
7b71af1186 vector_search: Fix status response parsing
The response was incorrectly parsed as a plain string and compared
directly with C++ string. However, the body contains a JSON string,
which includes escaped quotes that caused comparison failures.
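
The failure mode is easy to reproduce outside the client. The Python sketch below is an illustration, not the client code: the helper name `is_serving` is hypothetical, and it assumes the status body is a bare JSON string such as `"SERVING"`.

```python
import json


def is_serving(body: str) -> bool:
    """Return True if a JSON-encoded status body decodes to "SERVING"."""
    try:
        status = json.loads(body)
    except json.JSONDecodeError:
        return False
    return status == "SERVING"


raw_body = '"SERVING"'          # body as sent: a JSON string, quotes included
print(raw_body == "SERVING")    # naive comparison of the raw body
print(is_serving(raw_body))     # comparison after decoding the JSON string
```

The raw comparison fails because the quotes are part of the body, while decoding first yields the plain status text.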
2025-11-22 14:43:53 +01:00
Karol Nowacki
cbc9bb8d05 vector_search: Add support for secondary vector store clients
This change adds support for secondary vector store clients, typically
located in different availability zones. Secondary clients serve as
fallback targets when all primary clients are unavailable.
New configuration option allows specifying secondary client addresses
and ports.

Fixes: VECTOR-187

Closes scylladb/scylladb#27159
2025-11-22 11:05:34 +01:00
Piotr Dulikowski
12bfdfc312 Merge '[Backport 2025.4] vector_search: Improve vector-store health checking' from Scylladb[bot]
A Vector Store node is now considered down if it returns an HTTP 500
server error. This can happen, for example, if the node fails to
connect to the database or has not completed its initial full scan.

The logic for marking a node as 'up' is also enhanced. A node is now
only considered up when its status is explicitly 'SERVING'.

Fixes: VECTOR-187

Backport to 2025.4 as this feature is expected to be available in 2025.4.

- (cherry picked from commit ee3b83c9b0)

- (cherry picked from commit f665564537)

- (cherry picked from commit cb654d2286)

- (cherry picked from commit 4bbba099d7)

- (cherry picked from commit 5c30994bc5)

- (cherry picked from commit 7f45f15237)

Parent PR: #26413

Closes scylladb/scylladb#27087

* github.com:scylladb/scylladb:
  vector_search: Improve vector-store health checking
  vector_search: Move response_content_to_sstring to utils.hh
  vector_search: Add unit tests for client error handling
  vector_search: Enable mocking of status requests
  vector_search: Extract abort_source_timeout and repeat_until
  vector_search: Move vs_mock_server to dedicated files
2025-11-21 20:49:09 +01:00
Anna Stuchlik
10bba1d8a7 doc: update information for n2-highmem instances
This commit updates the section for n2-highmem instances
on the Cloud Instance Recommendations page

- Added missing support for n2-highmem-96.
- Updated the reference to n2 instances in the Google Cloud docs.
- Added the missing information about processors for this instance
  type (Ice Lake and Cascade Lake).

(cherry picked from commit dab74471cc)
2025-11-21 18:30:09 +00:00
Anna Stuchlik
a02a8c7460 doc: remove n1-highmem instances from Recommended Instances
(cherry picked from commit b18b052d26)
2025-11-21 18:30:09 +00:00
Raphael S. Carvalho
f2ee409fdd replica: Fail timed-out single-key read on cleaned up tablet replica
Consider the following:
1) single-key read starts, blocks on replica e.g. waiting for memory.
2) the same replica is migrated away
3) single-key read expires, coordinator abandons it, releases erm.
4) migration advances to cleanup stage, barrier doesn't wait on
   timed-out read
5) compaction group of the replica is deallocated on cleanup
6) that single-key read resumes, but doesn't find sstable set (post cleanup)
7) with abort-on-internal-error turned on, node crashes

It's fine for abandoned (= timed out) reads to fail, since the
coordinator is gone.
For active reads (non timed out), the barrier will wait for them
since their coordinator holds erm.
This solution consists of failing reads whose underlying tablet
replica has been cleaned up, by just converting the internal error
to a plain exception.

Fixes #26229.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#27078

(cherry picked from commit 74ecedfb5c)

Closes scylladb/scylladb#27158
2025-11-21 17:50:21 +03:00
Geoff Montee
8893ef4903 Update update-topology-strategy-from-simple-to-network.rst: Multiple clarifications to page and sub-procedures
Fixes #27077

Multiple points can be clarified relating to:

* Names of each sub-procedure could be clearer
* Requirements of each sub-procedure could be clearer
* Clarify which keyspaces are relevant and how to check them
* Fix typos in keyspace name

Closes scylladb/scylladb#26855

(cherry picked from commit a0734b8605)

Closes scylladb/scylladb#27157
2025-11-21 17:50:00 +03:00
Asias He
f401d3d3b1 docs: Add feature page for incremental repair
Adds a new documentation page for the incremental repair feature.

The page covers:
- What incremental repair is and its benefits over the standard repair process.
- How it works at a high level by tracking the repair status of SSTables.
- The prerequisite of using the tablets architecture.
- The different user-configurable modes: 'regular', 'full', and 'disabled'.

Fixes #25600

Closes scylladb/scylladb#26221

(cherry picked from commit 3cf1225ae6)

Closes scylladb/scylladb#27149
2025-11-21 17:49:35 +03:00
Karol Nowacki
a309a488f1 vector_search: Improve vector-store health checking
A Vector Store node is now considered down if it returns an HTTP 5xx status.
This can happen, for example, if the node fails to
connect to the database or has not completed its initial full scan.

The logic for marking a node as 'up' is also enhanced. A node is now
only considered up when its status is 'SERVING'.
2025-11-21 14:04:11 +01:00
Karol Nowacki
a9ddc35d1c vector_search: Move response_content_to_sstring to utils.hh
Move the response_content_to_sstring utility function from
vector_store_client.cc to utils.hh to enable reuse across
multiple files.

This refactoring prepares for the upcoming `client.cc` implementation
that will also need this functionality.
2025-11-21 14:04:11 +01:00
Karol Nowacki
e5addfcb14 vector_search: Add unit tests for client error handling
Introduce dedicated unit tests for the client class to verify existing
functionality and serve as regression tests.
These tests ensure that invalid client requests do not cause nodes to
be marked as down.
2025-11-21 14:04:11 +01:00
Karol Nowacki
440f656adb vector_search: Enable mocking of status requests
Extend the mock server to allow inspecting incoming status requests and
configuring their responses.

This enables client unit tests to simulate various server behaviors,
such as handling node failures and backoff logic.
2025-11-21 14:04:11 +01:00
Karol Nowacki
1524fa2665 vector_search: Extract abort_source_timeout and repeat_until
The `abort_source_timeout` and `repeat_until` functions are moved to
the shared utility header `test/vector_search/utils.hh`.

This allows them to be reused by upcoming `client` unit tests, avoiding
code duplication.
2025-11-21 14:04:10 +01:00
Karol Nowacki
6e7a06719c vector_search: Move vs_mock_server to dedicated files
The mock server utility is extracted into its own files so it can be
reused by future `client` unit tests.
2025-11-21 14:04:10 +01:00
Benny Halevy
537a7d6203 test/pylib/cpp: increase max-networking-io-control-blocks value
Increase the value of the max-networking-io-control-blocks option
for the cpp tests as it is too low and causes flakiness
as seen in vector_search.vector_store_client_test.vector_store_client_single_status_check_after_concurrent_failures:
```
seastar/src/core/reactor_backend.cc:342: void seastar::aio_general_context::queue(linux_abi::iocb *): Assertion `last < end` failed.
```

See also https://github.com/scylladb/seastar/issues/976

Fixes #27056

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#27117

(cherry picked from commit fd81333181)

Closes scylladb/scylladb#27135
2025-11-21 11:32:25 +01:00
Piotr Dulikowski
1fac32156b Merge '[Backport 2025.4] vector_store_client: Add support for failed-node backoff' from Scylladb[bot]
vector_search: Add backoff for failed nodes

Introduces logic to mark nodes that fail to answer an ANN request as
"down". Down nodes are omitted from further requests until they
successfully respond to a health check.

Health checks for down nodes are performed in the background using the
`status` endpoint, with an exponential backoff retry policy ranging
from 100ms to 20s.
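
A retry schedule of this shape can be sketched as below. This is a Python illustration rather than the client code: the generator name `backoff_delays` is hypothetical, and the growth factor of 2 is an assumption, since only the 100ms and 20s bounds are stated above.

```python
from itertools import islice


def backoff_delays(initial_ms=100, max_ms=20_000, factor=2):
    """Yield retry delays that grow by `factor` and are capped at `max_ms`."""
    delay = initial_ms
    while True:
        yield delay
        delay = min(delay * factor, max_ms)


# First ten delays, in milliseconds, between health checks of a down node.
print(list(islice(backoff_delays(), 10)))
```

Once the cap is reached the delay stays constant, so a node that remains down is still probed periodically.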

Client list management is moved to separate files (clients.cc/clients.hh)
to improve code organization and modularity.

Fixes: VECTOR-187.

Backport to 2025.4 as this feature is expected to be available in 2025.4.

- (cherry picked from commit 62f8b26bd7)

- (cherry picked from commit 49a177b51e)

- (cherry picked from commit 190459aefa)

- (cherry picked from commit 009d3ea278)

- (cherry picked from commit 940ed239b2)

- (cherry picked from commit 097c0f9592)

- (cherry picked from commit 1972fb315b)

Parent PR: #26308

Closes scylladb/scylladb#27054

* github.com:scylladb/scylladb:
  vector_search: Set max backoff delay to 2x read request timeout
  vector_search: Report status check exception via on_internal_error_noexcept
  vector_search: Extract client management into dedicated class
  vector_search: Add backoff for failed clients
  vector_search: Make endpoint available
  vector_search: Use std::expected for low-level client errors
  vector_search: Extract client class
2025-11-20 19:41:49 +01:00
Karol Nowacki
1268983895 vector_search: Set max backoff delay to 2x read request timeout
The maximum backoff delay for status checking now depends on the
`read_request_timeout_in_ms` configuration option. The delay is set
to twice the value of this parameter.
2025-11-20 14:58:35 +01:00
Karol Nowacki
4aa131f09d vector_search: Report status check exception via on_internal_error_noexcept
This exception should only occur due to internal errors, not client or external issues.
If triggered, it indicates an internal problem. Therefore, we notify about this exception
using on_internal_error_noexcept.
2025-11-20 14:58:32 +01:00
Karol Nowacki
c3520babac vector_search: Extract client management into dedicated class
Refactor client list management by moving it to separate files
(clients.cc/clients.hh) to improve code organization and modularity.
2025-11-20 14:58:18 +01:00
Karol Nowacki
7efefad606 vector_search: Add backoff for failed clients
Introduces logic to mark clients that fail to answer an ANN request as
"down". Down clients are omitted from further requests until they
successfully respond to a health check.

Health checks for down clients are performed in the background using the
`status` endpoint, with an exponential backoff retry policy ranging
from 100ms to 20s.
2025-11-20 14:57:36 +01:00
Karol Nowacki
580df81065 vector_search: Make endpoint available
In preparation for a new feature, the tests need the ability to make
an endpoint that was previously unavailable, available again.

This is achieved by adding an `unavailable_server::take_socket` method.
This method allows transferring the listening socket from the
`unavailable_server` to the `mock_vs_server`, ensuring they both
operate on the same endpoint.
2025-11-20 14:57:32 +01:00
Karol Nowacki
21da7017ef vector_search: Use std::expected for low-level client errors
To unify error handling, the low-level client methods now return
`std::expected` instead of throwing exceptions. This allows for
consistent and explicit error propagation from the client up to the
caller.

The relevant error types have been moved to a new `vector_search/error.hh`
header to centralize their definitions.
2025-11-20 14:57:29 +01:00
Karol Nowacki
c803fb31b0 vector_search: Extract client class
This refactoring extracts low-level client logic into a new, dedicated
`client` class. The new class is responsible for connecting to the
server and serializing requests.

This change prepares for extending the `vector_store_client` to check
node status via the `api/v1/status` endpoint.

`/ann` Response deserialization remains in the `vector_store_client` as it
is schema-dependent.
2025-11-20 14:57:24 +01:00
Michał Chojnowski
30a6a2c7a7 sstables/trie/trie_writer: free nodes after they are flushed
Somehow, the line of code responsible for freeing flushed nodes
in `trie_writer` is missing from the implementation.

This effectively means that `trie_writer` keeps the whole index in
memory until the index writer is closed, which for many datasets
is a guaranteed OOM.

Fix that, and add some test that catches this.

Fixes scylladb/scylladb#27082

Closes scylladb/scylladb#27083

(cherry picked from commit d8e299dbb2)

Closes scylladb/scylladb#27122
2025-11-20 10:37:40 +02:00
Patryk Jędrzejczak
56915477ed test: test_raft_recovery_stuck: ensure mutual visibility before using driver
Not waiting for nodes to see each other as alive can cause the driver to
fail the request sent in `wait_for_upgrade_state()`.

scylladb/scylladb#19771 has already replaced concurrent restarts with
`ManagerClient.rolling_restart()`, but it has missed this single place,
probably because we do concurrent starts here.

Fixes #27055

Closes scylladb/scylladb#27075

(cherry picked from commit e35ba974ce)

Closes scylladb/scylladb#27110
2025-11-20 10:36:54 +02:00
Szymon Wasik
7464c21e71 Add documentation about lack of returning similarity distances
This patch adds the missing warning that similarity distances
cannot be returned. Support for this will be added in the next
iteration.

Fixes #27086

It has to be backported to 2025.4 as this is a limitation in 2025.4.

Closes scylladb/scylladb#27096

(cherry picked from commit f714876eaf)

Closes scylladb/scylladb#27105
2025-11-20 10:36:02 +02:00
Botond Dénes
dd9d3edeae Merge '[Backport 2025.4] Automatic cleanup improvements' from Scylladb[bot]
This series allows an operator to reset the 'cleanup needed' flag if they have already cleaned up the node, so that automatic cleanup will not do it again. We also change 'nodetool cleanup' back to run cleanup on one node only (and reset the 'cleanup needed' flag at the end), but the new '--global' option allows running cleanup simultaneously on all nodes that need it.

Fixes https://github.com/scylladb/scylladb/issues/26866

Backport to all supported versions, since the current automatic cleanup behaviour may create load that the operator does not expect during cluster resizing.

- (cherry picked from commit e872f9cb4e)

- (cherry picked from commit 0f0ab11311)

Parent PR: #26868

Closes scylladb/scylladb#27095

* github.com:scylladb/scylladb:
  cleanup: introduce "nodetool cluster cleanup" command  to run cleanup on all dirty nodes in the cluster
  cleanup: Add RESTful API to allow reset cleanup needed flag
2025-11-20 10:35:26 +02:00
Botond Dénes
df1e988ed7 Merge '[Backport 2025.4] encryption::kms_host: Add exponential backoff-retry for 503 errors' from Scylladb[bot]
Refs #26822
Fixes #27062

AWS says to treat 503 errors, at least in the case of ec2 metadata query, as backoff-retry (generally, we do _not_ retry on provider level, but delegate this to higher levels). This patch adds special treatment for 503s (service unavailable) for both ec2 meta and the actual endpoint, doing exponential backoff.

Note: we do _not_ retry forever.
Not tested as such, since I don't get any errors when testing (doh!). Should try to set up a mock ec2 meta with injected errors maybe.
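
A bounded backoff-retry of this kind, together with the sort of injected-error test mentioned above, could look like the following Python sketch. It is not the kms_host code: `ServiceUnavailable`, `call_with_backoff`, the attempt limit and the doubling delays are all assumptions made for illustration.

```python
import time


class ServiceUnavailable(Exception):
    """Stand-in for an HTTP 503 response (hypothetical type)."""


def call_with_backoff(request, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Call `request`, retrying on 503 with doubling delays, up to a limit."""
    for attempt in range(max_attempts):
        try:
            return request()
        except ServiceUnavailable:
            if attempt == max_attempts - 1:
                # Give up: retries are bounded, so the error propagates.
                raise
            sleep(base_delay * (2 ** attempt))


# Injected errors: fail twice with 503, then succeed.
responses = [ServiceUnavailable(), ServiceUnavailable(), "ok"]
slept = []


def flaky_request():
    item = responses.pop(0)
    if isinstance(item, Exception):
        raise item
    return item


print(call_with_backoff(flaky_request, sleep=slept.append), slept)
```

Passing `sleep` as a parameter lets the test record the delays instead of actually waiting.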

- (cherry picked from commit 190e3666cb)

- (cherry picked from commit d22e0acf0b)

Parent PR: #26934

Closes scylladb/scylladb#27064

* github.com:scylladb/scylladb:
  encryption::kms_host: Add exponential backoff-retry for 503 errors
  encryption::kms_host: Include http error code in kms_error
2025-11-20 10:34:46 +02:00
Michał Hudobski
b41dad499b secondary_index: disallow multiple vector indexes on the same column
We currently allow creating multiple vector indexes on one column.
This doesn't make much sense as we do not support picking one when
making ann queries.

To make this less confusing and to make our behavior similar
to Cassandra we disallow the creation of multiple vector indexes
on one column.

We also add a test that checks this behavior.

Fixes: VECTOR-254
Fixes: #26672

Closes scylladb/scylladb#26508

(cherry picked from commit 46589bc64c)

Closes scylladb/scylladb#27057
2025-11-20 10:33:55 +02:00
Botond Dénes
a59805a926 Merge '[Backport 2025.4] topology_coordinator: include joining node in barrier' from Scylladb[bot]
Previously, only nodes in the 'normal' state and decommissioning nodes
were included in the set of nodes participating in barrier and
barrier_and_drain commands. Joining nodes are not included because they
don't coordinate requests, given their cql port is closed.

However, joining nodes may receive mutations from other nodes, for which
they may generate and coordinate materialized view updates. If their
group0 state is not synchronized it could cause lost view updates.
For example:

1. On the topology coordinator, the join completes and the joining node
   becomes normal, but the joining node's state lags behind. Since it's
   not synchronized by the barrier, it could be in an old state such as
   `write_both_read_old`.
2. A normal node coordinates a write and sends it to the new node as the
   new replica.
3. The new node applies the base mutation but doesn't generate a view
   update for it, because it calculates the base-view pairing according
   to its own state and replication map, and determines that it doesn't
   participate in the base-view pairing.

Therefore, since the joining node participates as a coordinator for view
updates, it should be included in these barriers as well. This ensures
that before the join completes, the joining node's state is
`write_both_read_new`, where it does generate view updates.

Fixes https://github.com/scylladb/scylladb/issues/26976

backport to previous versions since it fixes a bug in MV with vnodes

- (cherry picked from commit 13d94576e5)

- (cherry picked from commit b925e047be)

Parent PR: #27008

Closes scylladb/scylladb#27040

* github.com:scylladb/scylladb:
  test: add mv write during node join test
  topology_coordinator: include joining node in barrier
2025-11-20 10:32:53 +02:00
Botond Dénes
e8a01ef8c6 Merge '[Backport 2025.4] db/view: Add backoff when RPC fails' from Scylladb[bot]
The view building coordinator manages the process by sending RPC
requests to all nodes in the cluster, instructing them what to do.
If processing that message fails, the coordinator decides if it
wants to retry it or (temporarily) abandon the work.

An example of the latter scenario could be if one of the target nodes
dies and any attempts to communicate with it would fail.

Unfortunately, the current approach to it is not perfect and may result
in a storm of warnings, effectively clogging the logs. As an example,
take a look at scylladb/scylladb#26686: the gossiper failed to mark
one of the dead nodes as DOWN fast enough, and it resulted in a warning storm.

To prevent situations like that, we implement a form of backoff.
If processing an RPC message fails, we postpone finishing the task for
a second. That should reduce the number of messages in the logs and avoid
retries that are likely to fail as well.

We provide a reproducer test.

Fixes scylladb/scylladb#26686

Backport: impact on the user. We should backport it to 2025.4.

- (cherry picked from commit 4a5b1ab40a)

- (cherry picked from commit acd9120181)

- (cherry picked from commit 393f1ca6e6)

Parent PR: #26729

Closes scylladb/scylladb#27027

* github.com:scylladb/scylladb:
  tet/cluster/mv: Clean up test_backoff_when_node_fails_task_rpc
  db/view/view_building_coordinator: Rate limit logging failed RPC
  db/view: Add backoff when RPC fails
2025-11-20 10:31:55 +02:00
Michał Hudobski
c159ec4778 select_statement: add a warning about unsupported paging for vs queries
Currently we do not support paging for vector search queries.
When we get such a query with paging enabled we ignore the paging
and return the entire result. This behavior can be confusing for users,
as there is no warning about paging not working with vector search.
This patch fixes that by adding a warning to the result of ANN queries
with paging enabled.

Closes scylladb/scylladb#26384

(cherry picked from commit 7646dde25b)

Closes scylladb/scylladb#27016
2025-11-20 10:31:03 +02:00
Botond Dénes
034845e6eb Merge '[Backport 2025.4] sstables/sstable_directory: don't forget to delete other components when deleting TemporaryHashes.db' from Scylladb[bot]
TemporaryHashes.db is a temporary sstable component used during ms
sstable writes. It's different from other sstable components in that
it's not included in the TOC. Because of this, it has a special case in
the logic that deletes unfinished sstables on boot.
(After Scylla dies in the middle of a sstable write).

But there's a bug in that special case,
which causes Scylla to forget to delete other components from the same unfinished sstable.

The code intends only to delete the TemporaryHashes.db file from the
`_state->generations_found` multimap, but it accidentally also deletes
the file's sibling components from the multimap. Fix that.
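
The difference between the two behaviors can be modeled as below. This is a Python illustration, not the sstable_directory code: a dict of lists stands in for the multimap, the function names are hypothetical, and erasing by key is only one plausible way for the siblings to be dropped.

```python
from collections import defaultdict

# generation -> list of component file names (a dict of lists standing in
# for the multimap)
generations_found = defaultdict(list)
for name in ("Data.db", "Index.db", "TemporaryHashes.db"):
    generations_found[42].append(name)


def erase_all(found, generation):
    """Buggy variant: drops every component recorded for the generation."""
    found.pop(generation, None)


def erase_component(found, generation, component):
    """Fixed variant: drops only the named component, keeps its siblings."""
    found[generation] = [c for c in found[generation] if c != component]
    if not found[generation]:
        del found[generation]


erase_component(generations_found, 42, "TemporaryHashes.db")
print(generations_found[42])
```

With the fixed variant the sibling components stay in the map, so the later cleanup pass still sees them and can delete the rest of the unfinished sstable.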

Also, extend a related test so that it would catch the problem before the fix.

Fixes scylladb/scylladb#26393

Bugfix, needs backport to 2025.4.

- (cherry picked from commit 16cb223d7f)

- (cherry picked from commit 6efb807c1a)

Parent PR: #26394

Closes scylladb/scylladb#26409

* github.com:scylladb/scylladb:
  sstables/sstable_directory: don't forget to delete other components when deleting TemporaryHashes.db
  test/boost/database_test: fix two no-op distributed loader tests
2025-11-20 10:29:51 +02:00
Robert Bindar
e8d376c2ea Add cluster tests for checking scoped primary_replica_only streaming
This commit adds tests checking various scenarios of restoring
via load and stream with primary_replica_only and a scope specified.

The tests check that in a few topologies, a mutation is replicated
the correct number of times given primary_replica_only and that
streaming happens according to the scope rule passed.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
(cherry picked from commit a04ebb829c)
2025-11-19 11:55:12 +02:00
Gleb Natapov
58d395f1d2 cleanup: introduce "nodetool cluster cleanup" command to run cleanup on all dirty nodes in the cluster
97ab3f6622 changed "nodetool cleanup" (without arguments) to run
cleanup on all dirty nodes in the cluster. This was somewhat unexpected,
so this patch changes it back to run cleanup on the target node only (and
reset "cleanup needed" flag afterwards) and it adds "nodetool cluster
cleanup" command that runs the cleanup on all dirty nodes in the
cluster.

(cherry picked from commit 0f0ab11311)
2025-11-19 11:19:07 +02:00
Gleb Natapov
62caf6ac59 cleanup: Add RESTful API to allow reset cleanup needed flag
Cleaning up a node using the per keyspace/table interface does not reset the
"cleanup needed" flag in the topology. The assumption was that running cleanup on
an already clean node does nothing and completes quickly. But due to
https://github.com/scylladb/scylladb/issues/12215 (which is closed as
WONTFIX) this is not the case. This patch provides the ability to reset
the flag in the topology if the operator has already cleaned up the node
manually.

(cherry picked from commit e872f9cb4e)
2025-11-19 11:10:05 +02:00
Robert Bindar
3d4b318051 Improve choice distribution for primary replica
I noticed during tests that `maybe_get_primary_replica`
did not distribute the choice of primary replica uniformly:
`info.replicas` could be ordered differently on different shards,
so the function chose the same node as primary replica multiple
times when it clearly could have chosen a different node.

This patch sorts the replica set before passing it through the
scope filter.
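
Why a stable order matters can be shown with a minimal Python sketch. The index-based selection below is a hypothetical stand-in (the actual logic of `maybe_get_primary_replica` is not shown in this message), and the node names are made up.

```python
# Hypothetical stand-in for choosing a primary replica from a replica list.
def pick_primary(replicas, tablet_id):
    """Pick a replica by index; the result depends on the list order."""
    return replicas[tablet_id % len(replicas)]


def pick_primary_sorted(replicas, tablet_id):
    """Sort first, so every caller sees the same order."""
    return pick_primary(sorted(replicas), tablet_id)


# The same replica set, seen in different orders on two shards.
shard0_view = ["node-b", "node-a", "node-c"]
shard1_view = ["node-c", "node-b", "node-a"]
```

Without sorting, the two shards disagree about the primary for the same tablet; with sorting, they agree.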

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
(cherry picked from commit 817fdadd49)
2025-11-18 15:44:58 +02:00
Robert Bindar
6a05e150f8 Refactor cluster/object_store/test_backup
This PR splits the support code from test_backup.py
into multiple functions so less duplicated code is
produced by new tests using it. It also makes it a bit
easier to understand.

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
(cherry picked from commit d4e43bd34c)
2025-11-18 15:44:55 +02:00
Robert Bindar
c3e7d6d652 nodetool restore: add primary-replica-only option
Add --primary-replica-only and update docs page for
nodetool restore.

The relationship with the scope parameter is:
- scope=all primary_replica_only=true gets the global primary replica
- scope=dc primary_replica_only=true gets the local primary replica
- scope=rack primary_replica_only=true is like a noop, it gets the only
  replica in the rack (rf=#racks)
- scope=node primary_replica_only=true is not allowed
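
The rules above can be sketched as a small validation helper. This is only an illustration in Python: the function name and the error messages are hypothetical, and the real check lives in Scylla's C++ code.

```python
# Hypothetical validation of the scope / primary_replica_only combination.
ALLOWED_SCOPES = {"all", "dc", "rack", "node"}


def validate_restore_options(scope, primary_replica_only):
    """Return the options unchanged if the combination is allowed."""
    if scope not in ALLOWED_SCOPES:
        raise ValueError(f"unknown scope: {scope}")
    if primary_replica_only and scope == "node":
        # Node scope combined with primary replica only is rejected.
        raise ValueError("primary_replica_only is not allowed with scope=node")
    return scope, primary_replica_only
```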

Fixes #26584

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
(cherry picked from commit c1b3fe30be)
2025-11-18 12:47:02 +02:00
Robert Bindar
d2ebb7af38 nodetool refresh: Enable scope={all,dc,rack} with primary_replica_only
So far it was not allowed to pass a scope when using
the primary_replica_only option. This patch enables
it because the concepts are now combined so that:
- scope=all primary_replica_only=true gets the global primary replica
- scope=dc primary_replica_only=true gets the local primary replica
- scope=rack primary_replica_only=true is like a noop, it gets the only
  replica in the rack (rf=#racks)
- scope=node primary_replica_only=true is not allowed

Fixes #26584

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
(cherry picked from commit 83aee954b4)
2025-11-18 12:47:02 +02:00
Robert Bindar
5383509a47 Enable scoped primary replica only streaming
This patch removes the restriction for streaming
to primary replica only within a scope.
Node scope streaming to primary replica is disallowed.

Fixes #26584

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>
(cherry picked from commit 136b45d657)
2025-11-18 12:46:55 +02:00
Robert Bindar
e08251adb5 Support primary_replica_only for native restore API
Current native restore does not support primary_replica_only, it is
hard-coded disabled and this may lead to data amplification issues.

This patch extends the restore REST API to accept a
primary_replica_only parameter and propagates it to
sstables_loader so it gets correctly passed to
load_and_stream.

Fixes #26584

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

(cherry picked from commit 965a16ce6f)
2025-11-18 12:43:31 +02:00
Piotr Dulikowski
f0396b83c2 Merge '[Backport 2025.4] service/qos: Fall back to default scheduling group when using maintenance socket' from Scylladb[bot]
The service level controller relies on `auth::service` to collect
information about roles and the relation between them and the service
levels (those attached to them). Unfortunately, the service level
controller is initialized way earlier than `auth::service` and so we
had to prevent potential invalid queries of user service levels
(cf. 46193f5e79).

Unfortunately, that came at a price: it made the maintenance socket
incompatible with the current implementation of the service level
controller. The maintenance socket starts early, before the
`auth::service` is fully initialized and registered, and is exposed
almost immediately. If the user attempts to connect to Scylla within
this time window, via the maintenance socket, one of the things that
will happen is choosing the right service level for the connection.
Since the `auth::service` is not registered, Scylla will fail an
assertion and crash.

A similar scenario occurs when using maintenance mode. The maintenance
socket is how the user communicates with the database, and we're not
prepared for that either.

To avoid unnecessary crashes, we add new branches if the passed user is
absent or if it corresponds to the anonymous role. Since the role
corresponding to a connection via the maintenance socket is the anonymous
role, that solves the problem.

Some accesses to `auth::service` are not affected and we do not modify
those.

Fixes scylladb/scylladb#26816

Backport: yes. This is a fix of a regression.

- (cherry picked from commit c0f7622d12)

- (cherry picked from commit 222eab45f8)

- (cherry picked from commit 394207fd69)

- (cherry picked from commit b357c8278f)

Parent PR: #26856

Closes scylladb/scylladb#27043

* github.com:scylladb/scylladb:
  test/cluster/test_maintenance_mode.py: Wait for initialization
  test: Disable maintenance mode correctly in test_maintenance_mode.py
  test: Fix keyspace in test_maintenance_mode.py
  service/qos: Do not crash Scylla if auth_integration absent
2025-11-17 17:22:23 +01:00
Piotr Dulikowski
f86b50eedd Merge '[Backport 2025.4] db/view/view_building_worker: support staging sstables intra-node migration and tablet merge' from Scylladb[bot]
This PR fixes the handling of staging sstables by the view building coordinator in case of intra-node tablet migration or tablet merge.

To support tablet merge, the worker stores the sstables grouped only by `table_id`, instead of by the `(table_id, last_token)` pair.
There shouldn't be that many staging sstables, so selecting the relevant ones for each `process_staging` task is fine.
For intra-node migration support, the patch adds methods to load migrated sstables on the destination shard and to clean them up on the source shard.

The patch should be backported to 2025.4

Fixes https://github.com/scylladb/scylladb/issues/26244

- (cherry picked from commit 2e8c096930)

- (cherry picked from commit c99231c4c2)

- (cherry picked from commit 4bc6361766)

- (cherry picked from commit 9345c33d27)

Parent PR: #26454

Closes scylladb/scylladb#27058

* github.com:scylladb/scylladb:
  service/storage_service: migrate staging sstables in view building worker during intra-node migration
  db/view/view_building_worker: support sstables intra-node migration
  db/view_building_worker: fix indent
  db/view/view_building_worker: don't organize staging sstables by last token
2025-11-17 17:10:54 +01:00
Botond Dénes
fd29c162b7 Merge '[Backport 2025.4] api: storage_service: tasks: unify sync and async compaction APIs' from Scylladb[bot]
Currently, all apis that start a compaction have two versions:
synchronous and asynchronous. They share most of the implementation,
but some checks and params have diverged.

Unify the handlers of synchronous and asynchronous cleanup, major
compaction, and upgrade_sstables.

Fixes: https://github.com/scylladb/scylladb/issues/26715.

Requires backports to all live versions

- (cherry picked from commit 12dabdec66)

- (cherry picked from commit 044b001bb4)

- (cherry picked from commit fdd623e6bc)

Parent PR: #26746

Closes scylladb/scylladb#26890

* github.com:scylladb/scylladb:
  api: storage_service: tasks: unify upgrade_sstable
  api: storage_service: tasks: force_keyspace_cleanup
  api: storage_service: tasks: unify force_keyspace_compaction
2025-11-17 14:56:18 +02:00
Calle Wilund
bb2a275ed3 encryption::kms_host: Add exponential backoff-retry for 503 errors
Refs #26822

AWS says to treat 503 errors, at least in the case of ec2 metadata
query, as backoff-retry (generally, we do _not_ retry on provider
level, but delegate this to higher levels). This patch adds special
treatment for 503:s (service unavailable) for both ec2 meta and
actual endpoint, doing exponential backoff.

Note: we do _not_ retry forever.
Not tested as such, since I don't get any errors when testing
(doh!). Should try to set up a mock ec2 meta with injected errors
maybe.

v2:
* Use utils::exponential_backoff_retry
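
The retry policy described above can be sketched as follows. This is a minimal Python illustration, not the `utils::exponential_backoff_retry` implementation: the exception class, the retry limit, and the base delay are assumptions chosen for the example.

```python
import time


class ServiceUnavailable(Exception):
    """Stands in for an HTTP 503 response (hypothetical)."""


def call_with_backoff(request, max_retries=4, base_delay=0.1, sleep=time.sleep):
    """Call `request`, retrying on ServiceUnavailable with doubling delays."""
    delay = base_delay
    for attempt in range(max_retries + 1):
        try:
            return request()
        except ServiceUnavailable:
            if attempt == max_retries:
                raise  # give up: retries are bounded, not infinite
            sleep(delay)
            delay *= 2  # the wait grows exponentially
```

Passing `sleep` as a parameter lets a test record the delays instead of actually waiting, which is one way to exercise this path without a real failing endpoint.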

(cherry picked from commit d22e0acf0b)
2025-11-17 11:49:17 +00:00
Calle Wilund
1ecdb37cfa encryption::kms_host: Include http error code in kms_error
Keep track of actual HTTP failure.

(cherry picked from commit 190e3666cb)
2025-11-17 11:49:17 +00:00
Aleksandra Martyniuk
d3fae2c2f2 locator: use get_primary_replica for get_primary_endpoints
Currently, tablet_sstable_streamer::get_primary_endpoints is out of
sync with tablet_map::get_primary_replica. The get_primary_replica
optimizes the choice of the replica so that the work is fairly
distributes among nodes. Meanwhile, get_primary_endpoints always
chooses the first replica.

Use get_primary_replica for get_primary_endpoints.

Fixes: https://github.com/scylladb/scylladb/issues/21883.

Closes scylladb/scylladb#26385

(cherry picked from commit 910cd0918b)

Closes scylladb/scylladb#27020
2025-11-17 12:41:29 +02:00
Michał Jadwiszczak
3670afb840 service/storage_service: migrate staging sstables in view building
worker during intra-node migration

Use the methods introduced in the previous commit and:
- load staging sstables to the view building worker on the target
  shard, at the end of `streaming` stage
- clear migrated staging sstables on source shard in `cleanup` stage

This patch also removes skip mark in `test_staging_sstables_with_tablet_merge`.

Fixes scylladb/scylladb#26244

(cherry picked from commit 9345c33d27)
2025-11-17 10:28:35 +00:00
Michał Jadwiszczak
7a19786185 db/view/view_building_worker: support sstables intra-node migration
We need to be able to load sstables on the target shard during
intra-node tablet migration and to cleanup migrated sstables on the
source shard.

(cherry picked from commit 4bc6361766)
2025-11-17 10:28:35 +00:00
Michał Jadwiszczak
2ae1b8c5cd db/view_building_worker: fix indent
(cherry picked from commit c99231c4c2)
2025-11-17 10:28:35 +00:00
Michał Jadwiszczak
67ba46d7ea db/view/view_building_worker: don't organize staging sstables by last token
There was a problem with staging sstables after tablet merge.
Let's say there were 2 tablets and tablet 1 (lower last token)
had a staging sstable. Then a tablet merge occurred, so there is only
one tablet now (higher last token).
But entries in `_staging_sstables`, which are grouped by last token, are
never adjusted.

Since there shouldn't be thousands of sstables, we can just hold a list of
sstables per table and filter necessary entries when doing
`process_staging` view building task.
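
A minimal Python sketch of the per-table grouping with filtering at task time. The tuple layout, the function names, and the inclusive token bounds are assumptions for illustration and do not reflect the worker's actual data structures.

```python
from collections import defaultdict

# Hypothetical model: each staging sstable is (name, first_token, last_token).
staging_sstables = defaultdict(list)


def register_staging_sstable(table_id, sstable):
    """Group staging sstables by table only, not by tablet."""
    staging_sstables[table_id].append(sstable)


def sstables_for_task(table_id, range_start, range_end):
    """Select sstables overlapping the tablet's token range (inclusive bounds assumed)."""
    return [
        s for s in staging_sstables[table_id]
        if s[1] <= range_end and s[2] >= range_start
    ]
```

Because the selection is computed from the tablet's current token range, a merge that widens the range needs no adjustment of the stored entries.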

(cherry picked from commit 2e8c096930)
2025-11-17 10:28:35 +00:00
Piotr Dulikowski
85efcf35c7 Merge '[Backport 2025.4] cdc: set column drop timestamp in the future' from Scylladb[bot]
When dropping a column from a CDC log table, set the column drop
timestamp several seconds into the future.

If a value is written to a column concurrently with dropping that
column, the value's timestamp may be after the column drop timestamp. If
this value is also flushed to an SSTable, the SSTable would be
corrupted, because it considers the column missing after the drop
timestamp and doesn't allow values for it.

While this issue affects general tables, it especially impacts CDC tables
because this scenario can occur when writing to a table with CDC preimage
enabled while dropping a column from the base table. This happens even if
the base mutation doesn't write to the dropped column, because CDC log
mutations can generate values for a column even if the base mutation doesn't.
For general tables, this issue can be avoided by simply not writing to a
column while dropping it.

We fix this for the more problematic case of CDC log tables by setting
the column drop timestamp several seconds into the future, ensuring that
writes concurrent with column drops are much less likely to have
timestamps greater than the column drop timestamp.

Fixes https://github.com/scylladb/scylladb/issues/26340

the issue affects all previous releases, backport to improve stability

- (cherry picked from commit eefae4cc4e)

- (cherry picked from commit 48298e38ab)

- (cherry picked from commit 039323d889)

- (cherry picked from commit e85051068d)

Parent PR: #26533

Closes scylladb/scylladb#27041

* github.com:scylladb/scylladb:
  test: test concurrent writes with column drop with cdc preimage
  cdc: check if recreating a column too soon
  cdc: set column drop timestamp in the future
  migration_manager: pass timestamp to pre_create
2025-11-17 08:59:12 +01:00
Benny Halevy
35e5f59c75 scylla-sstable: correctly dump sharding_metadata
This patch fixes 2 issues in one go:

First, currently sstables::load clears the sharding metadata
(via open_data()), and so scylla-sstable always prints
an empty array for it.

Second, printing token values would generate invalid json
as they are currently printed as binary bytes, and they
should be printed simply as numbers, as we do elsewhere,
for example, for the first and last keys.
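
The second point can be illustrated with a short Python sketch that emits tokens as plain integers. The field names `token_ranges`, `left`, and `right` are hypothetical and are not taken from the scylla-sstable output format.

```python
import json


def dump_sharding_metadata(token_ranges):
    """Render token ranges as JSON, with tokens as plain integers."""
    return json.dumps({
        "token_ranges": [
            {"left": int(left), "right": int(right)}
            for left, right in token_ranges
        ]
    })
```

Python's `json` module refuses raw `bytes` values outright, which mirrors why binary token bytes cannot be placed into JSON as-is.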

Fixes #26982

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#26991

(cherry picked from commit f9ce98384a)

Closes scylladb/scylladb#27042
2025-11-16 16:58:58 +02:00
Michael Litvak
05ac346927 test: test concurrent writes with column drop with cdc preimage
add a test that writes to a table concurrently with dropping a column,
where the table has CDC enabled with preimage.

the test reproduces issue #26340 where this results in a malformed
sstable.

(cherry picked from commit e85051068d)
2025-11-16 09:29:27 +01:00
Michael Litvak
2beb4ee4f7 cdc: check if recreating a column too soon
When we drop a column from a CDC log table, we set the column drop
timestamp a few seconds into the future. This can cause unexpected
problems if a user tries to recreate a CDC column too soon, before
the drop timestamp has passed.

To prevent this issue, when creating a CDC column we check its
creation timestamp against the existing drop timestamp, if any, and
fail with an informative error if the recreation attempt is too soon.
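
The check can be sketched as follows. This is a minimal Python illustration: the margin value, the function names, and the exact comparison are assumptions, since the message only says "a few seconds" and does not show the code.

```python
# Hypothetical margin; the commit message only says "a few seconds".
DROP_MARGIN_US = 5_000_000  # microseconds


def column_drop_timestamp(now_us):
    """Place the drop timestamp slightly in the future."""
    return now_us + DROP_MARGIN_US


def check_recreate(create_ts_us, drop_ts_us):
    """Reject recreating a column before its drop timestamp has passed."""
    if drop_ts_us is not None and create_ts_us <= drop_ts_us:
        raise ValueError("column was dropped too recently; retry later")
```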

(cherry picked from commit 039323d889)
2025-11-16 09:28:49 +01:00
Dawid Mędrek
8ae42c1f5f test/cluster/test_maintenance_mode.py: Wait for initialization
If we try to perform queries too early, before the call to
`storage_service::start_maintenance_mode` has finished, we will
fail with the following error:

```
ERROR 2025-11-12 20:32:27,064 [shard 0:sl:d] token_metadata - sorted_tokens is empty in first_token_index!
```

To avoid that, we should wait until initialization is complete.

(cherry picked from commit b357c8278f)
2025-11-15 22:11:06 +00:00
Dawid Mędrek
8446a0543a test: Disable maintenance mode correctly in test_maintenance_mode.py
Although setting the value of `maintenance_mode` to the string `"false"`
disables maintenance mode, the testing framework misinterprets the value
and thinks that it's actually enabled. As a result, it might try to
connect to Scylla via the maintenance socket, which we don't want.

(cherry picked from commit 394207fd69)
2025-11-15 22:11:06 +00:00
Dawid Mędrek
de8783c9cd test: Fix keyspace in test_maintenance_mode.py
The keyspace used in the test is not necessarily called `ks`.

(cherry picked from commit 222eab45f8)
2025-11-15 22:11:06 +00:00
Dawid Mędrek
64293cc7c7 service/qos: Do not crash Scylla if auth_integration absent
If the user connects to Scylla via the maintenance socket, it may happen
that `auth_integration` has not been registered in the service level
controller yet. One example is maintenance mode when that will never
happen; another when the connection occurs before Scylla is fully
initialized.

To avoid unnecessary crashes, we add new branches if the passed user is
absent or if it corresponds to the anonymous role. Since the role
corresponding to a connection via the maintenance socket is the anonymous
role, that solves the problem.
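
The new branches can be pictured with a minimal Python sketch. The function name, the dict-based user model, and the group names are hypothetical. The sketch only reflects the idea of bypassing the role lookup for an absent or anonymous (nameless) user.

```python
DEFAULT_GROUP = "default"  # hypothetical name of the fallback group


def pick_scheduling_group(user, lookup):
    """Return a scheduling group, bypassing `lookup` for absent/anonymous users.

    `user` is None or a dict with an optional "name" key; `lookup` maps a
    role name to a group and may be unavailable early during startup.
    """
    if user is None or user.get("name") is None:
        return DEFAULT_GROUP
    return lookup(user["name"])
```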

In those cases, we completely circumvent any calls to `auth_integration`
and handle them separately. The modified methods are:

* `get_user_scheduling_group`,
* `with_user_service_level`,
* `describe_service_levels`.

For the first two, the new behavior is in line with the previous
implementation of those functions. The last behaves differently now,
but since it's a soft error, crashing the node is not necessary anyway.
We throw an exception instead, whose error message should give the user
a hint of what might be wrong.

The other uses of `auth_integration` within the service level controller
are not problematic:

* `find_effective_service_level`,
* `find_cached_effective_service_level`.

They take the name of a role as their argument. Since the anonymous role
doesn't have a name, it's not possible to call them with it.

Fixes scylladb/scylladb#26816

(cherry picked from commit c0f7622d12)
2025-11-15 22:11:06 +00:00
Michael Litvak
40d6890f1b cdc: set column drop timestamp in the future
When dropping a column from a CDC log table, set the column drop
timestamp several seconds into the future.

If a value is written to a column concurrently with dropping that
column, the value's timestamp may be after the column drop timestamp. If
this value is also flushed to an SSTable, the SSTable would be
corrupted, because it considers the column missing after the drop
timestamp and doesn't allow values for it.

While this issue affects general tables, it especially impacts CDC tables
because this scenario can occur when writing to a table with CDC preimage
enabled while dropping a column from the base table. This happens even if
the base mutation doesn't write to the dropped column, because CDC log
mutations can generate values for a column even if the base mutation doesn't.
For general tables, this issue can be avoided by simply not writing to a
column while dropping it.

We fix this for the more problematic case of CDC log tables by setting
the column drop timestamp several seconds into the future, ensuring that
writes concurrent with column drops are much less likely to have
timestamps greater than the column drop timestamp.

Fixes scylladb/scylladb#26340

(cherry picked from commit 48298e38ab)
2025-11-15 22:10:55 +00:00
Michael Litvak
153b089d25 migration_manager: pass timestamp to pre_create
pass the write timestamp as parameter to the
on_pre_create_column_families notification.

(cherry picked from commit eefae4cc4e)
2025-11-15 22:10:55 +00:00
Michael Litvak
225bbca3a5 test: add mv write during node join test
Add a test that reproduces the issue scylladb/scylladb#26976.

The test adds a new node with delayed group0 apply, and does writes with
MV updates right after the join completes on the coordinator and while
the joining node's state is behind.

The test fails before fixing the issue and passes after.

(cherry picked from commit b925e047be)
2025-11-15 22:10:33 +00:00
Michael Litvak
fa3947e0d0 topology_coordinator: include joining node in barrier
Previously, only nodes in the 'normal' state and decommissioning nodes
were included in the set of nodes participating in barrier and
barrier_and_drain commands. Joining nodes are not included because they
don't coordinate requests, given their cql port is closed.

However, joining nodes may receive mutations from other nodes, for which
they may generate and coordinate materialized view updates. If their
group0 state is not synchronized it could cause lost view updates.
For example:

1. On the topology coordinator, the join completes and the joining node
   becomes normal, but the joining node's state lags behind. Since it's
   not synchronized by the barrier, it could be in an old state such as
   `write_both_read_old`.
2. A normal node coordinates a write and sends it to the new node as the
   new replica.
3. The new node applies the base mutation but doesn't generate a view
   update for it, because it calculates the base-view pairing according
   to its own state and replication map, and determines that it doesn't
   participate in the base-view pairing.

Therefore, since the joining node participates as a coordinator for view
updates, it should be included in these barriers as well. This ensures
that before the join completes, the joining node's state is
`write_both_read_new`, where it does generate view updates.

Fixes scylladb/scylladb#26976

(cherry picked from commit 13d94576e5)
2025-11-15 22:10:33 +00:00
Dawid Mędrek
df2e2851ce test/cluster/mv: Clean up test_backoff_when_node_fails_task_rpc
After the changes in the test, we clean up its syntax. It boils
down to very simple modifications.

(cherry picked from commit 393f1ca6e6)
2025-11-15 22:09:07 +00:00
Dawid Mędrek
c848384c6c db/view/view_building_coordinator: Rate limit logging failed RPC
The view building coordinator sends tasks in form of RPC messages
to other nodes in the cluster. If processing that RPC fails, the
coordinator logs the error.

However, since tasks are per replica (so per shard), it may happen
that we end up with a large number of similar messages, e.g. if the
target node has died, because every shard will fail to process its
RPC message. It might become even worse in the case of a network
partition.

To mitigate that, we rate limit the logging with a 1-second interval.

We extend the test `test_backoff_when_node_fails_task_rpc` so that
it allows the view building coordinator to have multiple tablet
replica targets. If not for rate limiting the warning messages,
we should start getting more of them, potentially leading to
a test failure.
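
A time-based rate limiter for log messages can be sketched like this. It is a minimal Python illustration with a hypothetical class name; the coordinator uses its own C++ logging facilities, which are not shown here.

```python
import time


class RateLimitedLogger:
    """Emit at most one message per `interval` seconds; drop the rest."""

    def __init__(self, interval=1.0, clock=time.monotonic):
        self.interval = interval
        self.clock = clock
        self.last_emit = None
        self.emitted = []

    def warn(self, message):
        now = self.clock()
        if self.last_emit is not None and now - self.last_emit < self.interval:
            return False  # suppressed: too soon after the previous message
        self.last_emit = now
        self.emitted.append(message)
        return True
```

With such a limiter, many shards failing at the same moment produce one warning per interval rather than one per shard.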

(cherry picked from commit acd9120181)
2025-11-15 22:09:07 +00:00
Dawid Mędrek
d6dd7c2e80 db/view: Add backoff when RPC fails
The view building coordinator manages the process of view building
by sending RPC requests to all nodes in the cluster, instructing them
what to do. If processing that message fails, the coordinator decides
if it wants to retry it or (temporarily) abandon the work.

An example of the latter scenario could be if one of the target nodes
dies and any attempts to communicate with it would fail.

Unfortunately, the current approach to it is not perfect and may result
in a storm of warnings, effectively clogging the logs. As an example,
take a look at scylladb/scylladb#26686: the gossiper failed to mark
one of the dead nodes as DOWN fast enough, and it resulted in a warning storm.

To prevent situations like that, we implement a form of backoff.
If processing an RPC message fails, we postpone finishing the task for
a second. That should reduce the number of messages in the logs and avoid
retries that are likely to fail as well.

We provide a reproducer test: it fails before this commit and succeeds
with it.

Fixes scylladb/scylladb#26686

(cherry picked from commit 4a5b1ab40a)
2025-11-15 22:09:07 +00:00
Jenkins Promoter
bc1e2a54fc Update pgo profiles - aarch64 2025-11-15 04:40:58 +02:00
Jenkins Promoter
3dbc8702b3 Update pgo profiles - x86_64 2025-11-15 03:56:33 +02:00
Jenkins Promoter
11c9c1c16e Update ScyllaDB version to: 2025.4.0-rc5 2025-11-14 22:26:02 +02:00
Aleksandra Martyniuk
323bf7847d api: storage_service: tasks: unify upgrade_sstable
Currently, all apis that start a compaction have two versions:
synchronous and asynchronous. They share most of the implementation,
but some checks and params have diverged.

Unify the handlers of /storage_service/keyspace_upgrade_sstables/{keyspace}
and /tasks/compaction/keyspace_upgrade_sstables/{keyspace}.

(cherry picked from commit fdd623e6bc)
2025-11-14 14:31:30 +01:00
Aleksandra Martyniuk
dad097adce api: storage_service: tasks: force_keyspace_cleanup
Currently, all apis that start a compaction have two versions:
synchronous and asynchronous. They share most of the implementation,
but some checks and params have diverged.

Unify the handlers of /storage_service/keyspace_cleanup/{keyspace}
and /tasks/compaction/keyspace_cleanup/{keyspace}.

(cherry picked from commit 044b001bb4)
2025-11-14 14:31:26 +01:00
Botond Dénes
7c30cee2d6 service/storage_proxy: send batches with CL=EACH_QUORUM
Batches that fail on the initial send are retried later, until they
succeed. These retries happen with CL=ALL, regardless of what the
original CL of the batch was. This is unnecessarily strict. We tried to
follow Cassandra here, but Cassandra has a big caveat in their use of
CL=ALL for batches. They accept saving just a hint for any/all of the
endpoints, so a batch which was just logged in hints is good enough for
them.
We do not plan on replicating this usage of hints at this time, so as a
middle ground, the CL is changed to EACH_QUORUM.

Fixes: scylladb/scylladb#25432

Closes scylladb/scylladb#26304

(cherry picked from commit d9c3772e20)

Closes scylladb/scylladb#26930
2025-11-14 15:30:26 +02:00
Botond Dénes
65633dcd5a Merge '[Backport 2025.4] cql3: Fix std::bad_cast when deserializing vectors of collections' from Scylladb[bot]
cql3: Fix std::bad_cast when deserializing vectors of collections

This PR fixes a bug where attempting to INSERT a vector containing collections (e.g., `vector<set<int>,1>`) would fail. On the client side, this manifested as a `ServerError: std::bad_cast`.

The cause was a "type slicing" issue in the reserialize_value function. When retrieving the vector's element type, the result was being assigned by value (using auto) instead of by reference.
This "sliced" the polymorphic abstract_type object, stripping it of its actual derived type information. As a result, a subsequent dynamic_cast would fail, even if the underlying type was correct.

To prevent this entire class of bugs from happening again, I've made the polymorphic base class `abstract_type` explicitly uncopyable.

Fixes: #26704

This fix needs to be backported as these releases are affected: `2025.4` , `2025.3`.

- (cherry picked from commit 960fe3da60)

- (cherry picked from commit 77da4517d2)

Parent PR: #26740

Closes scylladb/scylladb#26998

* github.com:scylladb/scylladb:
  cql3: Make abstract_type explicitly noncopyable
  cql3: Fix std::bad_cast when deserializing vectors of collections
2025-11-14 10:17:55 +02:00
Botond Dénes
cee0011220 Merge '[Backport 2025.4] tablet_allocator: allow merges in base tables if rf-rack-valid=true' from Scylladb[bot]
Tablet merge of base tables is only safe if there is at most one replica in each rack. For more details on why it is the case please see scylladb/scylladb#17265. If the rf-rack-valid-keyspaces is turned on, this condition is satisfied, so allow it in that case.

Fixes: scylladb/scylladb#26273

Marked for backport to 2025.4 as MVs are leaving experimental status there.

- (cherry picked from commit 189ad96728)

- (cherry picked from commit 359ed964e3)

- (cherry picked from commit a8d92f2abd)

Parent PR: #26278

Closes scylladb/scylladb#26632

* github.com:scylladb/scylladb:
  test: mv: add a test for tablet merge
  tablet_allocator, tests: remove allow_tablet_merge_with_views injection
  tablet_allocator: allow merges in base tables if rf-rack-valid=true
2025-11-14 10:10:59 +02:00
Piotr Dulikowski
482e003234 Merge '[Backport 2025.4] [schema] Speculative retry rounding fix' from Scylladb[bot]
This patch series re-enables support for the speculative retry values `0` and `100`. These values were supported until [schema: fix issue 21825: add validation for PERCENTILE values in speculative_retry configuration. #21879](https://github.com/scylladb/scylladb/pull/21879). When that PR started rejecting invalid values such as `101PERCENTILE`, it also rejected the valid `100PERCENTILE` and `0PERCENTILE` values.

Reproduction steps from [[Bug]: drop schema and all tables after apply speculative_retry = '99.99PERCENTILE' #26369](https://github.com/scylladb/scylladb/issues/26369) are unable to reproduce the issue after the fix. A test is added to make sure the inclusive border values `0` and `100` are supported.

Documentation is updated to give more information to the users. It now states that these border values are inclusive, and also that the precision, with automatic rounding, is 1 decimal digit.

Fixes #26369

This is a bug fix. If a client uses a value >= 99.5 and < 100, a raft error occurs. Backport is needed. The code that introduced the inconsistency was added in 2025.2, so there is no need to backport to 2025.1.
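
How rounding and an exclusive upper bound interact can be seen in a minimal Python sketch. It assumes one-decimal rounding, as the documentation note describes. The function names are hypothetical, and the exact formatting in `speculative_retry::to_sstring` is not shown here.

```python
def format_percentile(value):
    """Format with one decimal digit, rounding as needed."""
    return f"{value:.1f}PERCENTILE"


def parse_percentile(text, inclusive):
    """Parse and validate; `inclusive` selects [0, 100] versus (0, 100)."""
    number = float(text.replace("PERCENTILE", ""))
    valid = 0 <= number <= 100 if inclusive else 0 < number < 100
    if not valid:
        raise ValueError(f"invalid percentile: {text}")
    return number
```

Under these assumptions, `99.99` is formatted as `100.0PERCENTILE`: a value that passed validation on input becomes a string that exclusive bounds reject when it is parsed again, while inclusive bounds accept it.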

- (cherry picked from commit da2ac90bb6)

- (cherry picked from commit 5d1913a502)

- (cherry picked from commit aba4c006ba)

- (cherry picked from commit 85f059c148)

- (cherry picked from commit 7ec9e23ee3)

Parent PR: #26909

Closes scylladb/scylladb#27015

* github.com:scylladb/scylladb:
  test: cqlpy: add test case for non-numeric PERCENTILE value
  schema: speculative_retry: update exception type for sstring ops
  docs: cql: ddl.rst: update speculative-retry-options
  test: cqlpy: add test for valid speculative_retry values
  schema: speculative_retry: allow 0 and 100 PERCENTILE values
2025-11-14 08:34:15 +01:00
Dario Mirovic
8c4033b78c test: cqlpy: add test case for non-numeric PERCENTILE value
Add test case for non-numeric PERCENTILE value, which raises an error
different to the out-of-range invalid values. Regex in the test
test_invalid_percentile_speculative_retry_values is expanded.

Refs #26369

(cherry picked from commit 7ec9e23ee3)
2025-11-13 19:45:17 +00:00
Dario Mirovic
e082df4324 schema: speculative_retry: update exception type for sstring ops
Change speculative_retry::to_sstring and speculative_retry::from_sstring
to throw exceptions::configuration_exception instead of std::invalid_argument.
These errors can be triggered by CQL, so appropriate CQL exception should be
used.
Reference: https://github.com/scylladb/scylladb/issues/24748#issuecomment-3025213304

Refs #26369

(cherry picked from commit 85f059c148)
2025-11-13 19:45:16 +00:00
Dario Mirovic
e86af5d2dd docs: cql: ddl.rst: update speculative-retry-options
Clarify how the value of `XPERCENTILE` is handled:
- Values 0 and 100 are supported
- The percentile value is rounded to the nearest 0.1 (1 decimal place)

Refs #26369

(cherry picked from commit aba4c006ba)
2025-11-13 19:45:16 +00:00
Dario Mirovic
780fd0ffe9 test: cqlpy: add test for valid speculative_retry values
test_valid_percentile_speculative_retry_values is introduced to test that
valid values for speculative_retry are properly accepted.

Some of the values are moved from the
test_invalid_percentile_speculative_retry_values test, because
the previous commit added support for them.

Refs #26369

(cherry picked from commit 5d1913a502)
2025-11-13 19:45:16 +00:00
Dario Mirovic
7e5275d03a schema: speculative_retry: allow 0 and 100 PERCENTILE values
This patch allows specifying 0 and 100 PERCENTILE values in speculative_retry.
It was possible to specify these values before #21825. #21825 prevented specifying
invalid values, like -1 and 101, but also prevented using 0 and 100.

On top of that, speculative_retry::to_sstring function did rounding when
formatting the string, which introduced inconsistency.
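
A minimal sketch of the intended behavior, written in Python rather than the C++ of speculative_retry. The names parse_percentile and format_percentile are hypothetical, and the one-decimal rounding is taken from the docs commit in this series; the real parsing code may differ:

```python
SUFFIX = "PERCENTILE"


def parse_percentile(text: str) -> float:
    """Parse a value such as '99.9PERCENTILE' and validate its range."""
    upper = text.strip().upper()
    if not upper.endswith(SUFFIX):
        raise ValueError(f"not a PERCENTILE value: {text!r}")
    number = upper[: -len(SUFFIX)]
    try:
        value = float(number)
    except ValueError:
        raise ValueError(f"non-numeric PERCENTILE value: {text!r}") from None
    # Both bounds are inclusive: 0 and 100 are accepted, -1 and 101 are not.
    if not 0.0 <= value <= 100.0:
        raise ValueError(f"PERCENTILE out of range [0, 100]: {text!r}")
    # Round once, at parse time, so formatting never changes the value.
    return round(value, 1)


def format_percentile(value: float) -> str:
    """Format an already-rounded value (hypothetical output format)."""
    return f"{value:.1f}{SUFFIX}"
```

Rounding in one place only (at parse time) is one way to avoid the inconsistency described above, where the stored value and its string form could disagree.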

Fixes #26369

(cherry picked from commit da2ac90bb6)
2025-11-13 19:45:16 +00:00
Karol Nowacki
28cb5c5a4c cql3: Make abstract_type explicitly noncopyable
The polymorphic abstract_type class serves as an interface and should not be copied.
To prevent accidental and unsafe copies, make it explicitly uncopyable.

(cherry picked from commit 77da4517d2)
2025-11-13 11:53:04 +01:00
Karol Nowacki
0c0d3e798c cql3: Fix std::bad_cast when deserializing vectors of collections
When deserializing a vector whose elements are collections (e.g., set, list),
the operation raises a `std::bad_cast` exception.

This was caused by type slicing due to an incorrect assignment of a
polymorphic type by value instead of by reference. This resulted in a
failed `dynamic_cast` even when the underlying type was correct.

(cherry picked from commit 960fe3da60)
2025-11-13 11:52:55 +01:00
Yaron Kaikov
a538050657 install-dependencies.sh: update node_exporter to 1.10.2
Update node exporter to solve CVE-2025-22871

[regenerate frozen toolchain with optimized clang from
	https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-aarch64.tar.gz
	https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-x86_64.tar.gz
]
Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-5

Closes scylladb/scylladb#26916

(cherry picked from commit c601371b57)

Closes scylladb/scylladb#26953
2025-11-12 13:04:39 +02:00
Yaron Kaikov
b4a8f9fca0 auto-backport: Add support for JIRA issue references
- Added support for JIRA issue references in PR body and commit messages
- Supports both short format (PKG-92) and full URL format
- Maintains existing GitHub issue reference support
- JIRA pattern matches https://scylladb.atlassian.net/browse/{PROJECT-ID}
- Allows backporting for PRs that reference JIRA issues with 'fixes' keyword
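
The matching described in the list above could look roughly like the following sketch. The regular expressions and the jira_keys helper are assumptions made for illustration; the patterns in the actual auto-backport script may differ:

```python
import re

# Assumed patterns, not copied from the auto-backport script.
JIRA_KEY = r"[A-Z][A-Z0-9]*-\d+"
JIRA_URL = rf"https://scylladb\.atlassian\.net/browse/(?P<url_key>{JIRA_KEY})"
JIRA_SHORT = rf"(?P<short_key>{JIRA_KEY})"

# Only the 'fixes' keyword is case-insensitive; project keys stay upper-case.
FIXES = re.compile(rf"\b(?i:fixes):?\s+(?:{JIRA_URL}|{JIRA_SHORT})\b")


def jira_keys(text: str) -> list:
    """Return JIRA issue keys referenced with a 'fixes' keyword in text."""
    return [m.group("url_key") or m.group("short_key")
            for m in FIXES.finditer(text)]
```

For example, `jira_keys("Fixes: https://scylladb.atlassian.net/browse/SCYLLADB-5")` yields the key from the URL, while a line that only says `Refs PKG-92` yields nothing.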

Fixes: https://github.com/scylladb/scylladb/issues/26955

Closes scylladb/scylladb#26954

(cherry picked from commit 3ade3d8f5b)

Closes scylladb/scylladb#26966
2025-11-12 12:34:51 +02:00
Michał Chojnowski
db0e209a9b sstables/trie: fix an assertion violation in bti_partition_index_writer_impl::write_last_key
_last_key is a multi-fragment buffer.

Some prefix of _last_key (up to _last_key_mismatch) is
unneeded because it's already a part of the trie.
Some suffix of _last_key (after needed_prefix) is unneeded
because _last_key can be differentiated from its neighbors even without it.

The job of write_last_key() is to find the middle fragments,
(containing the range `[_last_key_mismatch, needed_prefix)`)
trim the first and last of the middle fragments appropriately,
and feed them to the trie writer.

But there's an error in the current logic,
in the case where `_last_key_mismatch` falls on a fragment boundary.
To describe it with an example, if the key is fragmented like
`aaa|bbb|ccc`, `_last_key_mismatch == 3`, and `needed_prefix == 7`,
then the intended output to the trie writer is `bbb|c`,
but the actual output is `|bbb|c`. (I.e. the first fragment is empty).

Technically the trie writer could handle empty fragments,
but it has an assertion against them, because they are a questionable thing.
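
The intended slicing can be illustrated with a small Python sketch. The middle_fragments function is hypothetical and is not the code in bti_partition_index_writer_impl; it only shows how a strict comparison avoids emitting an empty first fragment when the range starts on a fragment boundary:

```python
def middle_fragments(fragments, start, end):
    """Return the pieces of fragments covering byte range [start, end).

    Pieces that would be empty are skipped, including the case where
    start falls exactly on a fragment boundary.
    """
    out = []
    offset = 0  # absolute position of the first byte of the current fragment
    for frag in fragments:
        lo = max(start, offset) - offset
        hi = min(end, offset + len(frag)) - offset
        if lo < hi:  # strict comparison drops empty slices
            out.append(frag[lo:hi])
        offset += len(frag)
    return out
```

With the key from the example, `middle_fragments([b"aaa", b"bbb", b"ccc"], 3, 7)` gives `[b"bbb", b"c"]`, with no empty leading fragment.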

Fix that.

We also extend bti_index_test so that it's able to hit the assert
violation (before the patch). The reason why it wasn't able to do that
before the patch is that the violation requires decorated keys to differ
on the _first_ byte of a partition key column, but the keys generated
by the test only differed on the last byte of the column.
(Because the test was using sequential integers to make the values more
human-readable during debugging). So we modify the key generation
to use random values that can differ on any position.

Fixes scylladb/scylladb#26819

Closes scylladb/scylladb#26839

(cherry picked from commit b82c2aec96)

Closes scylladb/scylladb#26903
2025-11-11 10:40:01 +03:00
Raphael S. Carvalho
4361e41728 sstables_loader: Don't bypass synchronization with busy topology
The patch c543059f86 fixed the synchronization issue between tablet
split and load-and-stream. The synchronization worked only with
raft topology, and therefore was disabled with gossip.
The check used storage_service::raft_topology_change_enabled(),
but the topology kind is only available/set on shard 0, so the
synchronization was bypassed when load-and-stream ran on
any shard other than 0.

The reason the reproducer didn't catch it is that it was restricted
to a single CPU. It will now run with multiple CPUs and catch the
problem observed.

Fixes #22707

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#26730

(cherry picked from commit 7f34366b9d)

Closes scylladb/scylladb#26851
2025-11-11 10:39:39 +03:00
Ran Regev
1857721717 nodetool refresh primary-replica-only
Fixes: #26440

1. Added description to primary-replica-only option
2. Fixed code text to better reflect the constraints checked in the code
   itself, namely: that both primary-replica-only and scope can be
   applied only if load-and-stream is applied too, and that they are
   mutually exclusive.
Note: when https://github.com/scylladb/scylladb/issues/26584 is
implemented (with #26609) there will be a need to align the docs as
well - namely, primary-replica-only and scope will no longer be
mutually exclusive

Signed-off-by: Ran Regev <ran.regev@scylladb.com>

Closes scylladb/scylladb#26480

(cherry picked from commit aaf53e9c42)

Closes scylladb/scylladb#26906
2025-11-07 16:49:31 +02:00
Pavel Emelyanov
9d997458d7 Merge 'Alternator: allow warning on auth errors before enabling enforcement' from Nadav Har'El
An Alternator user was recently "bit" when switching `alternator_enforce_authorization` from "false" to "true": After the configuration change, all application requests suddenly failed because unbeknownst to the user, their application used incorrect secret keys.

This series introduces a solution for users who want to **safely** switch `alternator_enforce_authorization`  from "false" to "true": Before switching from "false" to "true", the user can temporarily switch a new option, `alternator_warn_authorization`, to true. In this "warn" mode, authentication and authorization errors are counted in metrics (`scylla_alternator_authentication_failures` and `scylla_alternator_authorization_failures`) and logged as WARNings, but the user's application continues to work. The user can use these metrics or log messages to learn of errors in their application's setup, fix them, and only do the switch of `alternator_enforce_authorization` when the metrics or log messages show there are no more errors.
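
The three modes can be summarized with the following sketch. AuthPolicy and its fields are hypothetical stand-ins for the two configuration options and one of the metrics; whether the real counters also increase in enforce mode is an assumption here, not something stated above:

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("alternator-sketch")


@dataclass
class AuthPolicy:
    enforce: bool = False  # models alternator_enforce_authorization
    warn: bool = False     # models alternator_warn_authorization
    authentication_failures: int = 0  # models the failure metric

    def on_authentication_failure(self, reason: str) -> bool:
        """Return True if the request may proceed despite the failure."""
        if self.enforce:
            # Assumption: failures are counted in enforce mode as well.
            self.authentication_failures += 1
            return False
        if self.warn:
            # Warn mode: count and log, but let the request through.
            self.authentication_failures += 1
            log.warning("authentication failure (not enforced): %s", reason)
        return True
```

A user would run with `warn=True` until the counter stops increasing, and only then switch to `enforce=True`.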

The first patch is the implementation of the feature (the new configuration option, the metrics, and the log messages), the second patch is a test for the new feature, and the third patch is documentation recommending how to use the warn mode and the associated metrics or log messages to safely switch `alternator_enforce_authorization` from false to true.

Fixes #25308

This is a feature that users need, so it should probably be backported to live branches.

Closes scylladb/scylladb#25457

* github.com:scylladb/scylladb:
  docs/alternator: explain alternator_warn_authorization
  test/alternator: tests for new auth failure metrics and log messages
  alternator: add alternator_warn_authorization config

(cherry picked from commit 59019bc9a9)

Closes scylladb/scylladb#26894
2025-11-07 16:48:50 +02:00
Botond Dénes
8e7652d986 Merge '[Backport 2025.4] storage_proxy: use gates to track write handlers destruction' from Scylladb[bot]
In [#26408](https://github.com/scylladb/scylladb/pull/26408) a `write_handler_destroy_promise` class was introduced to wait for the destruction of `abstract_write_response_handler` instances. We strove to minimize the memory footprint of `abstract_write_response_handler`; with `write_handler_destroy_promise`-es we required only a single additional int. It turned out that in some cases a lot of write handlers can be scheduled for deletion at the same time; in such cases the `vector<write_handler_destroy_promise>` can become big and cause 'oversized allocation' seastar warnings.

Another concern with `write_handler_destroy_promise`-es [was that they were more complicated than it was worth](https://github.com/scylladb/scylladb/pull/26408#pullrequestreview-3361001103).

In this commit we replace `write_handler_destroy_promise` with simple gates. One or more gates can be attached to an `abstract_write_response_handler` to wait for its destruction. We use `utils::small_vector<gate::holder, 2>` to store the attached gates. The limit 2 was chosen because we expect two gates at the same time in most cases. One is `storage_proxy::_write_handlers_gate`, which is used to wait for all handlers in `cancel_all_write_response_handlers`. Another one can be attached by a caller of `cancel_write_handlers`. Nothing stops several `cancel_write_handlers` calls from running at the same time, but that should be rare.

`sizeof(utils::small_vector<gate::holder, 2>) == 40`, which is a `40.0 / 488 * 100 ~ 8%` increase in `sizeof(abstract_write_response_handler)`; this seems acceptable.

Fixes [scylladb/scylladb#26788](https://github.com/scylladb/scylladb/issues/26788)

backport: need to backport to 2025.4 (LWT for tablets release)

- (cherry picked from commit 4578304b76)

- (cherry picked from commit 5bda226ff6)

Parent PR: #26827

Closes scylladb/scylladb#26893

* github.com:scylladb/scylladb:
  storage_proxy: use coroutine::maybe_yield();
  storage_proxy: use gates to track write handlers destruction
2025-11-07 16:48:02 +02:00
Wojciech Mitros
8e9ab7618e mv: don't mark the view as built if the reader produced no partitions
When we build a materialized view we read the entire base table from start to
end to generate all required view updates. If a view is created while another view
is being built on the same base table, this is optimized - we start generating
view updates for the new view from the base table rows that we're currently
reading, and we read the missed initial range again after the previous view
finishes building.
The view building progress is only updated after generating view updates for
some read partitions. However, there are scenarios where we'll generate no
view updates for the entire read range. If this was not handled we could
end up in an infinite view building loop like we did in https://github.com/scylladb/scylladb/issues/17293
To handle this, we mark the view as built if the reader generated no partitions.
However, this is not always the correct conclusion. Another scenario where
the reader won't encounter any partitions is when view building is interrupted,
and then we perform a reshard. In this scenario, we set the reader for all
shards to the last unbuilt token for an existing partition before the reshard.
However, this partition may not exist on a shard after reshard, and if there
are also no partitions with higher tokens, the reader will generate no partitions
even though it hasn't finished view building.
Additionally, we already have a check that prevents infinite view building loops
without taking the partitions generated by the reader into account. At the end
of stream, before looping back to the start, we advance current_key to the end
of the built range and check for built views in that range. This handles the case
where the entire range is empty - the conditions for a built view are:
1. the "next_token" is no greater than "first_token" (the view building process
looped back, so we've built all tokens above "first_token")
2. the "current_token" is no less than "first_token" (after looping back, we've
built all tokens below "first_token")

If the range is empty, we'll pass these conditions on an empty range after advancing
"current_key" to the end because:
1. after looping back, "next_token" will be set to `dht::minimum_token`
2. "current_key" will be set to `dht::ring_position::max()`
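
The two conditions can be written down as a small predicate. This is a sketch with plain numbers standing in for tokens and ring positions; view_is_built, MIN_TOKEN and MAX_POSITION are hypothetical names used only for illustration:

```python
MIN_TOKEN = float("-inf")    # stands in for dht::minimum_token
MAX_POSITION = float("inf")  # stands in for dht::ring_position::max()


def view_is_built(first_token, next_token, current_token):
    """Check the two 'built view' conditions listed above."""
    # 1. The build looped back, so everything above first_token is built.
    looped_back = next_token <= first_token
    # 2. After looping back, everything below first_token is built too.
    covered_below = current_token >= first_token
    return looped_back and covered_below
```

For an empty range, `view_is_built(42, MIN_TOKEN, MAX_POSITION)` holds for any starting token, which is why the extra check on partitions produced by the reader is not needed to stop the loop.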

In this patch we remove the check for partitions generated by the reader. This fixes
the issue with resharding and it does not resurrect the issue with infinite view building
that the check was introduced for.

Fixes https://github.com/scylladb/scylladb/issues/26523

Closes scylladb/scylladb#26635

(cherry picked from commit 0a22ac3c9e)

Closes scylladb/scylladb#26889
2025-11-07 16:46:46 +02:00
Asias He
1288adb3a6 repair: Add metric for time spent on tablet repair
It is useful to check time spent on tablet repair. It can be used to
compare incremental repair and non-incremental repair. The time does not
include the time waiting for the tablet scheduler to schedule the tablet
repair task.

Fixes #26505

Closes scylladb/scylladb#26502

(cherry picked from commit dbeca7c14d)

Closes scylladb/scylladb#26881
2025-11-07 16:45:32 +02:00
Wojciech Mitros
092be62fca view_building_coordinator: rollback tasks on the leaving tablet replica
When a tablet migration is started, we abort the corresponding view
building tasks (i.e. we change the state of those tasks to "ABORTED").
However, we don't change the host and shard of these tasks until the
migration successfully completes. When for some reason we have to
roll back the migration, that means the migration didn't finish and
the aborted task still has the host and shard of the migration
source. So when we recreate tasks that should no longer be aborted
due to a rolled-back migration, we should look at the aborted tasks
of the source (leaving) replica. But we don't do that; instead, we
look at the aborted tasks of the target replica.
In this patch we adjust the rollback mechanism to recreate tasks
for the migration source instead of destination. We also fix the
test that should have detected this issue - the injection that
the test was using didn't make us roll back; instead, we simply retried
a stage of the tablet migration. By using one_shot=False and adding
a second injection, we can now guarantee that the migration will
eventually fail and we'll continue to the 'cleanup_target' and
'revert_migration' stages.

Fixes https://github.com/scylladb/scylladb/issues/26691

Closes scylladb/scylladb#26825

(cherry picked from commit 977fa91e3d)

Closes scylladb/scylladb#26879
2025-11-07 16:44:04 +02:00
Botond Dénes
c70656c0db Merge '[Backport 2025.4] db/config: Change default SSTable compressor to LZ4WithDictsCompressor' from Scylladb[bot]
`sstable_compression_user_table_options` allows configuring a node-global SSTable compression algorithm for user tables via scylla.yaml. The current default is LZ4Compressor (inherited from Cassandra).

Make LZ4WithDictsCompressor the new default. Metrics from real datasets in the field have shown significant improvements in compression ratios.

If the dictionary compression feature is not enabled in the cluster (e.g., during an upgrade), fall back to the `LZ4Compressor`. Once the feature is enabled, flip the default back to the dictionary compressor using a listener callback.
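
The fallback-then-flip behavior can be sketched as follows. ClusterFeature and Config are toy classes invented for this illustration and do not mirror the real db::config or feature service APIs; the sketch also ignores any value the user sets explicitly:

```python
class ClusterFeature:
    """Toy cluster feature flag that notifies listeners when enabled."""

    def __init__(self):
        self.enabled = False
        self._listeners = []

    def when_enabled(self, callback):
        # Run immediately if already enabled, otherwise remember the listener.
        if self.enabled:
            callback()
        else:
            self._listeners.append(callback)

    def enable(self):
        if self.enabled:
            return
        self.enabled = True
        for callback in self._listeners:
            callback()
        self._listeners.clear()


class Config:
    """Toy config whose default compressor depends on the feature."""

    def __init__(self, dicts_feature):
        # Fall back to the plain compressor until the feature is enabled.
        self.default_compressor = "LZ4Compressor"
        dicts_feature.when_enabled(self._use_dicts)

    def _use_dicts(self):
        self.default_compressor = "LZ4WithDictsCompressor"
```

During an upgrade the feature stays disabled and the default remains `LZ4Compressor`; calling `enable()` once the whole cluster supports the feature flips it.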

Fixes #26610.

- (cherry picked from commit d95ebe7058)

- (cherry picked from commit 96e727d7b9)

- (cherry picked from commit 2fc812a1b9)

- (cherry picked from commit a0bf932caa)

Parent PR: #26697

Closes scylladb/scylladb#26830

* github.com:scylladb/scylladb:
  test/cluster: Add test for default SSTable compressor
  db/config: Change default SSTable compressor to LZ4WithDictsCompressor
  db/config: Deprecate sstable_compression_dictionaries_allow_in_ddl
  boost/cql_query_test: Get expected compressor from config
2025-11-07 16:43:11 +02:00
Botond Dénes
bb142bdc10 Merge '[Backport 2025.4] auth: implement vector store authorization' from Scylladb[bot]
This patch implements the changes required by the Vector Store authorization, as described in https://scylladb.atlassian.net/wiki/spaces/RND/pages/107085899/Vector+Store+Authentication+And+Authorization+To+ScyllaDB, that is:

- adding a new permission VECTOR_SEARCH_INDEXING, grantable only on ALL KEYSPACES
- allowing users with that permission to perform SELECT queries, but only on tables with a vector index
- increasing the number of scheduling groups by one to allow users to create a service level for a vector store user
- adjusting the tests and documentation

These changes are needed, as the vector indexes are managed by the external service, Vector Store, which needs to read the tables to create the indexes in its memory. We would like to limit the privileges of that service to a minimum to maintain the principle of least privilege, hence the new permission, which allows SELECTs conditional on the existence of a vector index on the table.

Fixes: VECTOR-201
Fixes: https://github.com/scylladb/scylladb/issues/26804

Backport reasoning:
Backport to 2025.4 is required, as adding this in 2026.1 could make upgrading clusters more difficult. As of now, Scylla Cloud requires version 2025.4 to enable vector search and the permission is set by the orchestrator, so there is no chance that someone will try to add this permission during an upgrade. In 2026.1 it will be more difficult.

- (cherry picked from commit ae86bfadac)

- (cherry picked from commit 3025a35aa6)

- (cherry picked from commit 6a69bd770a)

- (cherry picked from commit e8fb745965)

- (cherry picked from commit 3db2e67478)

Parent PR: #25976

Closes scylladb/scylladb#26805

* github.com:scylladb/scylladb:
  docs: adjust docs for VS auth changes
  test: add tests for VECTOR_SEARCH_INDEXING permission
  cql: allow VECTOR_SEARCH_INDEXING users to select
  auth: add possibilty to check for any permission in set
  auth: add a new permission VECTOR_SEARCH_INDEXING
2025-11-07 16:42:33 +02:00
Piotr Dulikowski
b52055b7a1 Merge '[Backport 2025.4] transport: call update_scheduling_group for non-auth connections' from Andrzej Jackowski
This is backport of fix for https://github.com/scylladb/scylladb/issues/26040 and related test (https://github.com/scylladb/scylladb/pull/26589) to 2025.4.

Before this change, unauthorized connections stayed in the main
scheduling group. This is not ideal: in that case, sl:default
should be used instead, to behave consistently with the scenario
where a user is authenticated but no service level is assigned
to the user.

This commit adds a call to update_scheduling_group at the end of
connection creation for an unauthenticated user, to make sure the
service level is switched to sl:default.

Fixes: https://github.com/scylladb/scylladb/issues/26040
Fixes: https://github.com/scylladb/scylladb/issues/26581

(cherry picked from commit 278019c328)
(cherry picked from commit 8642629e8e)

No backport, as it's already a backport

Closes scylladb/scylladb#26815

* github.com:scylladb/scylladb:
  test: add test_anonymous_user to test_raft_service_levels
  transport: call update_scheduling_group for non-auth connections
2025-11-07 12:58:43 +01:00
Petr Gusev
5c7a9e89ea storage_proxy: use coroutine::maybe_yield();
This is a small "while at it" refactoring -- better to use
coroutine::maybe_yield with co_await-s.

(cherry picked from commit 5bda226ff6)
2025-11-06 13:04:52 +00:00
Petr Gusev
2e55f9521b storage_proxy: use gates to track write handlers destruction
In #26408 a write_handler_destroy_promise class was introduced to
wait for abstract_write_response_handler instances destruction. We
strived to minimize the memory footprint of
abstract_write_response_handler, with write_handler_destroy_promise-es
we required only a single additional int. It turned out that in some
cases a lot of write handlers can be scheduled for deletion
at the same time, in such cases the
vector<write_handler_destroy_promise> can become big and cause
'oversized allocation' seastar warnings.

Another concern with write_handler_destroy_promise-es was that they
were more complicated than it was worth.

In this commit we replace write_handler_destroy_promise with simple
gates. One or more gates can be attached to an
abstract_write_response_handler to wait for its destruction. We use
utils::small_vector<gate::holder, 2> to store the attached gates.
The limit 2 was chosen because we expect two gates at the same time
in most cases. One is storage_proxy::_write_handlers_gate,
which is used to wait for all handlers in
cancel_all_write_response_handlers. Another one can be attached by
a caller of cancel_write_handlers. Nothing stops several
cancel_write_handlers calls from running at the same time, but that
should be rare.

The sizeof(utils::small_vector<gate::holder, 2>) == 40, this is
40.0 / 488 * 100 ~ 8% increase in
sizeof(abstract_write_response_handler), which seems acceptable.
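
The gate-based scheme can be illustrated with a small asyncio sketch. Gate, GateHolder and WriteHandler are toy classes written for this example; a real seastar gate also rejects new holders once closed, which is left out here:

```python
import asyncio


class Gate:
    """Toy gate: counts holders and lets a closer wait until there are none."""

    def __init__(self):
        self._holders = 0
        self._drained = asyncio.Event()
        self._drained.set()

    def hold(self):
        self._holders += 1
        self._drained.clear()
        return GateHolder(self)

    def _release(self):
        self._holders -= 1
        if self._holders == 0:
            self._drained.set()

    async def close(self):
        await self._drained.wait()


class GateHolder:
    def __init__(self, gate):
        self._gate = gate

    def release(self):
        if self._gate is not None:  # releasing twice is a no-op
            self._gate._release()
            self._gate = None


class WriteHandler:
    """Toy handler that keeps a holder for every gate attached to it."""

    def __init__(self):
        self.holders = []

    def attach(self, gate):
        self.holders.append(gate.hold())

    def destroy(self):
        for holder in self.holders:
            holder.release()
        self.holders.clear()


async def demo():
    all_handlers_gate = Gate()  # plays the role of _write_handlers_gate
    cancel_gate = Gate()        # attached by one caller cancelling a handler
    handlers = [WriteHandler() for _ in range(3)]
    for handler in handlers:
        handler.attach(all_handlers_gate)
    handlers[0].attach(cancel_gate)

    waiter = asyncio.ensure_future(cancel_gate.close())
    await asyncio.sleep(0)  # let the waiter start waiting
    still_waiting = not waiter.done()
    handlers[0].destroy()   # releases both gates held by this handler
    await waiter
    for handler in handlers[1:]:
        handler.destroy()
    await all_handlers_gate.close()
    return still_waiting
```

Running `asyncio.run(demo())` shows the caller's wait completing only after the handler it is interested in has been destroyed, while the shared gate completes after all handlers are gone.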

Fixes scylladb/scylladb#26788

(cherry picked from commit 4578304b76)
2025-11-06 13:04:52 +00:00
Aleksandra Martyniuk
4c71ff1506 api: storage_service: tasks: unify force_keyspace_compaction
Currently, all apis that start a compaction have two versions:
synchronous and asynchronous. They share most of the implementation,
but some checks and params have diverged.

Add consider_only_existing_data parameter to /tasks/compaction/keyspace_compaction/{keyspace},
to match the synchronous version of the api (/storage_service/keyspace_compaction/{keyspace}).

Unify the handlers of both apis.

(cherry picked from commit 12dabdec66)
2025-11-06 09:51:16 +00:00
Nikos Dragazis
64a0efe297 test/cluster: Add test for default SSTable compressor
The previous patch made the default compressor dependent on the
SSTABLE_COMPRESSION_DICTS feature:
* LZ4Compressor if the feature is disabled
* LZ4WithDictsCompressor if the feature is enabled

Add a test to verify that the cluster uses the right default in every
case.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit a0bf932caa)
2025-11-04 15:41:40 +02:00
Nikos Dragazis
3b801f3d80 db/config: Change default SSTable compressor to LZ4WithDictsCompressor
`sstable_compression_user_table_options` allows configuring a
node-global SSTable compression algorithm for user tables via
scylla.yaml. The current default is `LZ4Compressor` (inherited from
Cassandra).

Make `LZ4WithDictsCompressor` the new default. Metrics from real datasets
in the field have shown significant improvements in compression ratios.

If the dictionary compression feature is not enabled in the cluster
(e.g., during an upgrade), fall back to the `LZ4Compressor`. Once the
feature is enabled, flip the default back to the dictionary compressor
using a listener callback.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 2fc812a1b9)
2025-11-04 15:41:40 +02:00
Nikos Dragazis
bafe2bbbbc db/config: Deprecate sstable_compression_dictionaries_allow_in_ddl
The option is a knob that allows rejecting dictionary-aware compressors
in the validation stage of CREATE/ALTER statements, and in the
validation of `sstable_compression_user_table_options`. It was
introduced in 7d26d3c7cb to allow the admins of Scylla Cloud to
selectively enable it in certain clusters. For more details, check:
https://github.com/scylladb/scylla-enterprise/issues/5435

As of this series, we want to start offering dictionary compression as
the default option in all clusters, i.e., treat it as a generally
available feature. This makes the knob redundant.

Additionally, making dictionary compression the default choice in
`sstable_compression_user_table_options` creates an awkward dependency
with the knob (disabling the knob should cause
`sstable_compression_user_table_options` to fall back to a non-dict
compressor as default). That may not be very clear to the end user.

For these reasons, mark the option as "Deprecated", remove all relevant
tests, and adjust the business logic as if dictionary compression is
always available.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit 96e727d7b9)
2025-11-04 15:40:46 +02:00
Yauheni Khatsianevich
1f5082d1ce tests(lwt): new test for LWT testing during tablet resize
- Workload: N workers perform CAS updates
  UPDATE … SET s{i}=new WHERE pk=? IF (∀j≠i: s{j}>=guard_j) AND s{i}=prev
  at CL=LOCAL_QUORUM / SERIAL=LOCAL_SERIAL. Non-apply without timeout is treated
  as contention; “uncertainty” timeouts are resolved via LOCAL_SERIAL read.
- Enable balancing and increase min_tablet_count to force split,
  flush and lower min_tablet_count to merge.
- “Uncertainty” timeouts (write timeout due to uncertainty) are resolved via a
  LOCAL_SERIAL read to determine whether the CAS actually applied.
- Invariants: after the run, for every pk and column s{i}, the stored value
  equals the number of confirmed CAS by worker i (no lost or phantom updates)
  despite ongoing tablet moves.

Closes scylladb/scylladb#26113

refs: scylladb/qa-tasks#1918
Refs #18068
Fixes #24502 (to satisfy backport rules)

(cherry picked from commit 99dc31e71a)

Closes scylladb/scylladb#26790
2025-11-04 12:47:24 +01:00
Michael Litvak
8f7a6fd5eb cdc: use chunked_vector instead of vector for stream ids
use utils::chunked_vector instead of std::vector to store cdc stream
sets for tablets.

a cdc stream set usually represents all streams for a specific table and
timestamp, and has a stream id per each tablet of the table. each stream
id is represented by 16 bytes. thus the vector could require quite large
contiguous allocations for a table that has many tablets. change it to
chunked_vector to avoid large contiguous allocations.

Fixes scylladb/scylladb#26791

Closes scylladb/scylladb#26792

(cherry picked from commit e7dbccd59e)

Closes scylladb/scylladb#26828
2025-11-04 12:41:30 +01:00
Petr Gusev
0656a73c52 paxos_state: get_replica_lock: remove shard check
This check is incorrect: the current shard may be looking at
the old version of tablets map:
* an accept RPC comes to replica shard 0, which is already at write_both_read_new
* the new shard is shard 1, so paxos_state::accept is called on shard 1
* shard 1 is still at "streaming" -> shards_ready_for_reads() returns old
shard 0

Fixes scylladb/scylladb#26801

Closes scylladb/scylladb#26809

(cherry picked from commit 88765f627a)

Closes scylladb/scylladb#26832
2025-11-02 11:14:48 +01:00
Jenkins Promoter
527124d363 Update pgo profiles - aarch64
2025-11-01 04:55:16 +02:00
Jenkins Promoter
0b30c111d6 Update pgo profiles - x86_64
2025-11-01 04:20:45 +02:00
Nikos Dragazis
260c9972b0 boost/cql_query_test: Get expected compressor from config
Since 5b6570be52, the default SSTable compression algorithm for user
tables is no longer hardcoded; it can be configured via the
`sstable_compression_user_table_options.sstable_compression` option in
scylla.yaml.

Modify the `test_table_compression` test to get the expected value from
the configuration.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
(cherry picked from commit d95ebe7058)
2025-10-31 23:50:20 +00:00
Andrzej Jackowski
1c04ce3415 test: add test_anonymous_user to test_raft_service_levels
The primary goal of this test is to reproduce scylladb/scylladb#26040
so the fix (278019c328) can be backported
to older branches.

Scenario: connect via CQL as an anonymous user and verify that the
`sl:default` scheduling group is used. Before the fix for #26040
`main` scheduling group was incorrectly used instead of `sl:default`.

Control connections may legitimately use `sl:driver`, so the test
accepts those occurrences while still asserting that regular anonymous
queries use `sl:default`.

This adds explicit coverage on master. After scylladb#24411 was
implemented, some other tests started to fail when scylladb#26040
was unfixed. However, none of the tests asserted this exact behavior.

Refs: scylladb/scylladb#26040
Refs: scylladb/scylladb#26581

Closes scylladb/scylladb#26589

(cherry picked from commit 8642629e8e)
2025-10-30 20:10:47 +01:00
Andrzej Jackowski
af8df987f5 transport: call update_scheduling_group for non-auth connections
Before this change, unauthorized connections stayed in the `main`
scheduling group. This is not ideal: in that case, `sl:default`
should be used instead, to behave consistently with the scenario
where a user is authenticated but no service level is assigned
to the user.

This commit adds a call to `update_scheduling_group` at the end of
connection creation for an unauthenticated user, to make sure the
service level is switched to `sl:default`.

Fixes: scylladb/scylladb#26040
(cherry picked from commit 278019c328)
2025-10-30 19:34:18 +01:00
Michał Hudobski
09d59de736 docs: adjust docs for VS auth changes
We adjust the documentation to include the new
VECTOR_SEARCH_INDEXING permission and its usage
and also to reflect the changes in the maximum
number of service levels.

(cherry picked from commit 3db2e67478)
2025-10-30 10:13:16 +00:00
Michał Hudobski
ce04e2cb7d test: add tests for VECTOR_SEARCH_INDEXING permission
This commit adds tests to verify the expected
behavior of the VECTOR_SEARCH_INDEXING permission,
that is, allowing GRANTing this permission only on
ALL KEYSPACES and allowing SELECT queries only on tables
with vector indexes when the user has this permission

(cherry picked from commit e8fb745965)
2025-10-30 10:13:16 +00:00
Michał Hudobski
3367ffa14f cql: allow VECTOR_SEARCH_INDEXING users to select
This patch allows users with the VECTOR_SEARCH_INDEXING permission
to perform SELECT queries on tables that have a vector index.
This is needed for the Vector Store service, which
reads the vector-indexed tables, but does not require
the full SELECT permission.

(cherry picked from commit 6a69bd770a)
2025-10-30 10:13:16 +00:00
Michał Hudobski
ad1e5718ae auth: add possibilty to check for any permission in set
This commit adds a new version of command_desc struct
that contains a set of permissions instead of a singular
permission. When this struct is passed to ensure/check_has_permission,
we check if the user has any of the included permissions on the resource.
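
The "any permission in the set" check can be sketched as follows. CommandDesc and check_has_permission here are simplified Python stand-ins for the C++ names above; the flat resource lookup is an assumption and ignores the resource hierarchy (for example, grants on ALL KEYSPACES):

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet, Set


@dataclass(frozen=True)
class CommandDesc:
    """Simplified command description carrying a set of permissions."""
    permissions: FrozenSet[str]
    resource: str


def check_has_permission(granted: Dict[str, Set[str]], cmd: CommandDesc) -> bool:
    """Return True if any permission in cmd is granted on cmd.resource."""
    held = granted.get(cmd.resource, set())
    return not held.isdisjoint(cmd.permissions)
```

A SELECT on a vector-indexed table could then be described with `frozenset({"SELECT", "VECTOR_SEARCH_INDEXING"})`, so that holding either permission is enough.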

(cherry picked from commit 3025a35aa6)
2025-10-30 10:13:16 +00:00
Michał Hudobski
27bf0fdf60 auth: add a new permission VECTOR_SEARCH_INDEXING
This patch adds a new permission: VECTOR_SEARCH_INDEXING,
that is grantable only for ALL KEYSPACES. It will allow selecting
from tables with vector search indexes. It is meant to be used
by the Vector Store service to allow it to build indexes without
having full SELECT permissions on the tables.

(cherry picked from commit ae86bfadac)
2025-10-30 10:13:16 +00:00
Pavel Emelyanov
9459a58116 Merge '[Backport 2025.4] cdc: improve cdc metadata loading' from Scylladb[bot]
when loading CDC streams metadata for tablets from the tables, read only
new entries from the history table instead of reading all entries. This
improves the CDC metadata reloading, making it more efficient and
predictable.

the CDC metadata is loaded as part of group0 reload whenever the
internal CDC tables are modified. on tablet split / merge, we create a
new CDC timestamp and streams by writing them to the cdc_streams_history
table by group0 operation, and when it's applied we reload the in-memory
CDC streams map by reading from the tables and constructing the updated map.

Previously, on every update, we would read the entire
cdc_streams_history entries for the changed table, constructing all its
streams and creating a new map from scratch.

We improve this now by reading only new entries from cdc_streams_history
and append them to the existing map. we can do this because we only
append new entries to cdc_streams_history with higher timestamp than all
previous entries.

This makes this reloading more efficient and predictable, because
previously we would read a number of entries that depends on the number
of tablets splits and merges, which increases over time and is
unbounded, whereas now we read only a single stream set on each update.

Fixes https://github.com/scylladb/scylladb/issues/26732

backport to 2025.4 where cdc with tablets is introduced

- (cherry picked from commit 8743422241)

- (cherry picked from commit 4cc0a80b79)

Parent PR: #26160

Closes scylladb/scylladb#26798

* github.com:scylladb/scylladb:
  test: cdc: extend cdc with tablets tests
  cdc: improve cdc metadata loading
2025-10-30 10:32:27 +03:00
Michael Litvak
59f97d0b71 test: cdc: extend cdc with tablets tests
extend and improve the tests of virtual tables for cdc with tablets.
split the existing virtual tables test into one test that validates the
virtual tables against the internal cdc tables and triggers some
tablet splits in order to create entries in the cdc_streams_history
table, and add another test with basic validation of the virtual tables
when there are multiple cdc tables.

(cherry picked from commit 4cc0a80b79)
2025-10-30 02:44:47 +00:00
Michael Litvak
0a07c2cb19 cdc: improve cdc metadata loading
when loading CDC streams metadata for tablets from the tables, read only
new entries from the history table instead of reading all entries. This
improves the CDC metadata reloading, making it more efficient and
predictable.

the CDC metadata is loaded as part of group0 reload whenever the
internal CDC tables are modified. on tablet split / merge, we create a
new CDC timestamp and streams by writing them to the cdc_streams_history
table by group0 operation, and when it's applied we reload the in-memory
CDC streams map by reading from the tables and constructing the updated map.

Previously, on every update, we would read the entire
cdc_streams_history entries for the changed table, constructing all its
streams and creating a new map from scratch.

We improve this now by reading only new entries from cdc_streams_history
and appending them to the existing map. We can do this because we only
append new entries to cdc_streams_history with higher timestamp than all
previous entries.

This makes this reloading more efficient and predictable, because
previously we would read a number of entries that depends on the number
of tablet splits and merges, which increases over time and is
unbounded, whereas now we read only a single stream set on each update.
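
The append-only reload described above can be sketched in Python (the actual
change is in C++). This is only an illustration under stated assumptions:
`HistoryTable`, `read_after` and `reload_streams` are hypothetical names, and
the history table is modelled as an in-memory list ordered by timestamp.

```python
import bisect


class HistoryTable:
    """Hypothetical stand-in for cdc_streams_history: append-only, ordered by timestamp."""

    def __init__(self):
        self._timestamps = []
        self._stream_sets = []

    def append(self, timestamp, stream_set):
        # New entries always carry a higher timestamp than all existing ones.
        assert not self._timestamps or timestamp > self._timestamps[-1]
        self._timestamps.append(timestamp)
        self._stream_sets.append(stream_set)

    def read_after(self, timestamp):
        """Return only the entries with a timestamp strictly greater than `timestamp`."""
        if timestamp is None:
            start = 0
        else:
            start = bisect.bisect_right(self._timestamps, timestamp)
        return list(zip(self._timestamps[start:], self._stream_sets[start:]))


def reload_streams(streams_map, history):
    """Append new history entries to `streams_map`, a list of (timestamp, streams) pairs.

    Returns the number of entries that were appended.
    """
    last = streams_map[-1][0] if streams_map else None
    new_entries = history.read_after(last)
    streams_map.extend(new_entries)
    return len(new_entries)
```

With this shape, the cost of a reload depends on the number of entries added
since the previous reload, not on the total length of the history.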

Fixes scylladb/scylladb#26732

(cherry picked from commit 8743422241)
2025-10-30 02:44:47 +00:00
Piotr Dulikowski
8d952fdbec test: mv: add a test for tablet merge
The test test_mv_tablets_replace verifies that merging tablets of both a
view and its base table is allowed if rf-rack-valid-keyspaces option is
enabled (and it is enabled by default in the test suite).

(cherry picked from commit a8d92f2abd)
2025-10-29 11:58:03 +01:00
Piotr Dulikowski
a0c1c72774 tablet_allocator, tests: remove allow_tablet_merge_with_views injection
The `allow_tablet_merge_with_views` error injection was previously used
to allow merging tablets in a table which has materialized views
attached to it. Now, the error injection is not needed because this is
allowed under the rf-rack-valid condition, which is enabled by default
in tests.

Remove the error injection from the code and adjust the tests not to use
it.

(cherry picked from commit 359ed964e3)
2025-10-29 11:58:02 +01:00
Pavel Emelyanov
080c55a115 lister: Fix race between readdir and stat
Sometimes file::list_directory() returns entries without type set. In
that case lister calls file_type() on the entry name to get it. In case
the call returns a disengaged type, the code assumes that some error
occurred and resolves into an exception.

That's not correct. The file_type() method returns a disengaged type only
if the file being inspected is missing (i.e. on ENOENT errno). But this
can validly happen if a file is removed between readdir and stat. In
that case it's not "some error happened"; the entry should just be
skipped. If some error did happen, file_type() would resolve into an
exceptional future on its own.
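
The same skip-on-ENOENT rule can be illustrated with Python's standard library
instead of the Seastar API used by the fix. The function name
`list_with_types` and the two-valued "dir"/"other" classification are
assumptions made for this sketch only.

```python
import os
import stat


def list_with_types(directory):
    """Return (name, kind) pairs, skipping entries removed between listing and stat."""
    result = []
    for name in os.listdir(directory):
        try:
            mode = os.lstat(os.path.join(directory, name)).st_mode
        except FileNotFoundError:
            # The entry vanished after the directory was read: skip it.
            continue
        # Any other OSError (EACCES, EIO, ...) propagates to the caller.
        result.append((name, "dir" if stat.S_ISDIR(mode) else "other"))
    return result
```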

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26595

(cherry picked from commit d9bfbeda9a)

Closes scylladb/scylladb#26767
2025-10-29 11:29:57 +02:00
Anna Stuchlik
93de570e33 doc: add --list-active-releases to Web Installer
Fixes https://github.com/scylladb/scylladb/issues/26688

V2 of https://github.com/scylladb/scylladb/pull/26687

Closes scylladb/scylladb#26689

(cherry picked from commit bd5b966208)

Closes scylladb/scylladb#26765
2025-10-29 11:28:51 +02:00
Patryk Jędrzejczak
680bfa9ab7 test: test_raft_recovery_stuck: reconnect driver after rolling restarts
It turns out that #21477 wasn't sufficient to fix the issue. The driver
may still decide to reconnect the connection after `rolling_restart`
returns. One possible explanation is that the driver sometimes handles
the DOWN notification after all nodes consider each other UP.

Reconnecting the driver after restarting nodes seems to be a reliable
workaround that many tests use. We also use it here.

Fixes #19959

Closes scylladb/scylladb#26638

(cherry picked from commit 5321720853)

Closes scylladb/scylladb#26763
2025-10-29 11:27:49 +02:00
Anna Stuchlik
ed58815199 doc: add OS support for version 2025.4
Fixes https://github.com/scylladb/scylladb/issues/26450

Closes scylladb/scylladb#26616

(cherry picked from commit 6fa342fb18)

Closes scylladb/scylladb#26750
2025-10-29 11:27:08 +02:00
Botond Dénes
9c8812a154 Merge '[Backport 2025.4] LWT: use shards_ready_for_reads for replica locks' from Scylladb[bot]
When a tablet is migrated between shards on the same node, during the write_both_read_new state we begin switching reads to the new shard. Until the corresponding global barrier completes, some requests may still use write_both_read_old erm, while others already use the write_both_read_new erm. To ensure mutual exclusion between these two types of requests, we must acquire locks on both the old and new shards. Once the global barrier completes, no requests remain on the old shard, so we can safely switch to acquiring locks only on the new shard.

The idea came from the similar locking problem in the [counters for tablets PR](https://github.com/scylladb/scylladb/pull/26636#discussion_r2463932395).

Fixes scylladb/scylladb#26727

backport: need to backport to 2025.4

- (cherry picked from commit 5ab2db9613)

- (cherry picked from commit 478f7f545a)

Parent PR: #26719

Closes scylladb/scylladb#26748

* github.com:scylladb/scylladb:
  paxos_state: use shards_ready_for_reads
  paxos_state: inline shards_for_writes into get_replica_lock
2025-10-29 11:26:29 +02:00
Botond Dénes
aac49601c6 Merge '[Backport 2025.4] cdc: garbage collect CDC streams for tablets' from Scylladb[bot]
introduce helper functions that can be used for garbage collecting old
cdc streams for tablets-based keyspaces.

add a background fiber to the topology coordinator that runs
periodically and checks for old CDC streams for tablets keyspaces that
can be garbage collected.

the garbage collection works by finding the newest cdc timestamp that has been
closed for more than the configured cdc TTL, and removing all information from
the cdc internal tables about cdc timestamps and streams up to this timestamp.

in general it should be safe to remove information about these streams because
they are closed for more than TTL, therefore all rows that were written to these streams
with the configured TTL should be dead.
the exception is if the TTL is altered to a smaller value, and then we may remove information
about streams that still have live rows that were written with the longer ttl.

Fixes https://github.com/scylladb/scylladb/issues/26669

- (cherry picked from commit 440caeabcb)

- (cherry picked from commit 6109cb66be)

Parent PR: #26410

Closes scylladb/scylladb#26728

* github.com:scylladb/scylladb:
  cdc: garbage collect CDC streams periodically
  cdc: helpers for garbage collecting old streams for tablets
2025-10-29 11:25:31 +02:00
Asias He
89364d3576 repair: Remove the regular mode name in the tablet repair api
The patch e34deb72f9 (repair: Rename incremental mode name)
missed one place that references the removed regular mode name.

Fixes #26503

Closes scylladb/scylladb#26660

(cherry picked from commit 5f1febf545)

Closes scylladb/scylladb#26684
2025-10-29 11:22:56 +02:00
Anna Stuchlik
68ea778b6b doc: add support for Debian 12
Fixes https://github.com/scylladb/scylladb/issues/26640

Closes scylladb/scylladb#26668

(cherry picked from commit 9c0ff7c46b)

Closes scylladb/scylladb#26681
2025-10-29 11:22:29 +02:00
Botond Dénes
087f739bf9 Merge '[Backport 2025.4] alternator/executor: instantly mark view as built when creating it with base table' from Scylladb[bot]
When a `CreateTable` request creates a GSI/LSI together with the base table,
the base table is empty and we don't need to actually build the view.

In tablet-based keyspaces we can simply skip creating view building tasks
and mark the view build status as SUCCESS on all nodes. Then, the view
building worker on each node will mark the view as built in
`system.built_views` (`view_building_worker::update_built_views()`).

Vnode-based keyspaces will use the "old" logic of view builder, which
will process the view and mark it as built.

Fixes scylladb/scylladb#26615

This fix should be backported to 2025.4.

- (cherry picked from commit 8fbf122277)

- (cherry picked from commit bdab455cbb)

- (cherry picked from commit 34503f43a1)

Parent PR: #26657

Closes scylladb/scylladb#26670

* github.com:scylladb/scylladb:
  test/alternator/test_tablets: add test for GSI backfill with tablets
  test/alternator/test_tablets: add reproducer for GSI with tablets
  alternator/executor: instantly mark view as built when creating it with base table
2025-10-29 11:21:27 +02:00
Petr Gusev
332b776e87 paxos_state: use shards_ready_for_reads
Acquiring locks on both shards for the entire tablet migration period
is redundant. In most cases, locking only the old shard or only the new
shard is sufficient. Using shards_ready_for_reads reduces the
situations in which we need to lock both shards to:
* intra-node migrations only
* only during the write_both_read_new state
Once the global barrier completes in the write_both_read_new state, no
requests remain on the old shard, so we can safely acquire locks
only on the new shard.

Fixes scylladb/scylladb#26727

(cherry picked from commit 478f7f545a)
2025-10-28 16:59:47 +00:00
Petr Gusev
ff0e7ac853 paxos_state: inline shards_for_writes into get_replica_lock
No need to have two functions since both callers of get_replica_lock()
use shards_for_writes() to compute the shards where the locks
must be acquired.

Also while at it, inline the acquire() lambda in get_replica_lock()
and replace it with a loop over shards. This makes the code
more straightforward.

(cherry picked from commit 5ab2db9613)
2025-10-28 16:59:47 +00:00
Michael Litvak
5319759bdb cdc: garbage collect CDC streams periodically
add a background fiber to the topology coordinator that runs
periodically and checks for old CDC streams for tablets keyspaces that
can be garbage collected.

(cherry picked from commit 6109cb66be)
2025-10-27 19:53:04 +00:00
Michael Litvak
55d9d5e7c2 cdc: helpers for garbage collecting old streams for tablets
introduce helper functions that can be used for garbage collecting old
cdc streams for tablets-based keyspaces.

- get_new_base_for_gc: finds a new base timestamp given a TTL, such that
  all older timestamps and streams can be removed.
- get_cdc_stream_gc_mutations: given new base timestamp and streams,
  builds mutations that update the internal cdc tables and remove the
  older streams.
- garbage_collect_cdc_streams_for_table: combines the two functions
  above to find a new base and build mutations to update it for a
  specific table
- garbage_collect_cdc_streams: builds gc mutations for all cdc tables
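
One plausible reading of the first helper, sketched in Python. The helpers
themselves are C++ and may differ. The assumption here is that the stream set
starting at `timestamps[i]` is closed at the moment `timestamps[i + 1]` begins.
The function `streams_to_remove` and all parameter names are hypothetical.

```python
def get_new_base_for_gc(timestamps, ttl, now):
    """Pick a new base timestamp, or None if nothing can be collected yet.

    `timestamps` are the stream-set timestamps in increasing order.
    Assumption: the set starting at timestamps[i] is closed when
    timestamps[i + 1] begins, so it has been closed for now - timestamps[i + 1].
    """
    new_base = None
    for closed_at in timestamps[1:]:
        if now - closed_at > ttl:
            # Every set older than `closed_at` has been closed for more than ttl.
            new_base = closed_at
        else:
            break
    return new_base


def streams_to_remove(timestamps, new_base):
    """Timestamps whose stream information can be dropped for the given base."""
    if new_base is None:
        return []
    return [ts for ts in timestamps if ts < new_base]
```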

(cherry picked from commit 440caeabcb)
2025-10-27 19:53:04 +00:00
Jenkins Promoter
7f08d0a6cf Update ScyllaDB version to: 2025.4.0-rc4 2025-10-27 14:57:11 +02:00
Patryk Jędrzejczak
c406e1dd17 Merge '[Backport 2025.4] raft topology: fix group0 tombstone GC in the Raft-based recovery procedure' from Scylladb[bot]
Group0 tombstone GC considers only the current group 0 members
while computing the group 0 tombstone GC time. It's not enough
because in the Raft-based recovery procedure, there can be nodes
that haven't joined the current group 0 yet, but they have belonged
to a different group 0 and thus have a non-empty group 0 state ID.
The current code can cause a data resurrection in group 0 tables.

We fix this issue in this PR and add a regression test.

This issue was uncovered by `test_raft_recovery_entry_loss`, which
became flaky recently. We skipped this test for now. We will unskip
it in a following PR because it's skipped only on master, while we
want to backport this PR.

Fixes #26534

This PR contains an important bugfix, so we should backport it
to all branches with the Raft-based recovery procedure (2025.2
and newer).

- (cherry picked from commit 1d09b9c8d0)

- (cherry picked from commit 6b2e003994)

- (cherry picked from commit c57f097630)

Parent PR: #26612

Closes scylladb/scylladb#26682

* https://github.com/scylladb/scylladb:
  test: test group0 tombstone GC in the Raft-based recovery procedure
  group0_state_id_handler: remove unused group0_server_accessor
  group0_state_id_handler: consider state IDs of all non-ignored topology members
2025-10-27 10:15:49 +01:00
Avi Kivity
e85ab70054 Merge '[Backport 2025.4] tablet_metadata_guard: fix split/merge handling' from Petr Gusev
The guard should stop refreshing the ERM when the number of tablets changes. Tablet splits or merges invalidate the tablet_id field (_tablet), which means the guard can no longer correctly protect ongoing operations from tablet migrations.

The problem is specific to LWT, since tablet_metadata_guard is used mostly for heavy topology operations, which are mutually exclusive with split and merge. The guard was used for LWT as an optimization -- we don't need to block topology operations or migrations of unrelated tablets. In the future, we could use the guard for regular reads/writes as well (via the token_metadata_guard wrapper).

Fixes https://github.com/scylladb/scylladb/issues/26437

backports: need to backport to 2025.4 since the bug is relevant to LWT over tablets.

(cherry picked from commit e1667afa50)

(cherry picked from commit 6f4558ed4b)

(cherry picked from commit 64ba427b85)

(cherry picked from commit ec6fba35aa)

(cherry picked from commit b23f2a2425)

(cherry picked from commit 33e9ea4a0f)

(cherry picked from commit 03d6829783)

Parent PR: https://github.com/scylladb/scylladb/pull/26619

Closes scylladb/scylladb#26700

* github.com:scylladb/scylladb:
  test_tablets_lwt: add test_tablets_merge_waits_for_lwt
  test.py: add universalasync_typed_wrap
  tablet_metadata_guard: fix split/merge handling
  tablet_metadata_guard: add debug logs
  paxos_state: shards_for_writes: improve the error message
  storage_service: barrier_and_drain – change log level to info
  topology_coordinator: fix log message
2025-10-24 21:22:49 +03:00
Petr Gusev
41f8f6b571 test_tablets_lwt: add test_tablets_merge_waits_for_lwt
(cherry picked from commit 03d6829783)
2025-10-24 12:22:20 +02:00
Petr Gusev
31e4bb1bc3 test.py: add universalasync_typed_wrap
The universalasync.wrap function doesn't preserve the
type information, which confuses the VS Code Pylance
plugin and makes code navigation hard.

In this commit we fix the problem by adding a typed
wrapper around universalasync.wrap.

Fixes: scylladb/scylladb#26639
(cherry picked from commit 33e9ea4a0f)
2025-10-24 12:21:21 +02:00
Petr Gusev
be94aab207 tablet_metadata_guard: fix split/merge handling
The guard should stop refreshing the ERM when the number of tablets
changes. Tablet splits or merges invalidate the tablet_id field
(_tablet), which means the guard can no longer correctly protect
ongoing operations from tablet migrations.

Fixes scylladb/scylladb#26437

(cherry picked from commit b23f2a2425)
2025-10-24 12:21:21 +02:00
Petr Gusev
a5be65785c tablet_metadata_guard: add debug logs
(cherry picked from commit ec6fba35aa)
2025-10-24 12:21:21 +02:00
Petr Gusev
5720dd52b8 paxos_state: shards_for_writes: improve the error message
Add the current token and tablet info, remove 'this_shard_id'
since it's always written by the logging infrastructure.

(cherry picked from commit 64ba427b85)
2025-10-24 12:21:21 +02:00
Petr Gusev
aa2021888c storage_service: barrier_and_drain – change log level to info
Debugging global barrier issues is difficult without these logs.
Since barriers do not occur frequently, increasing the log level should not produce excessive output.

(cherry picked from commit 6f4558ed4b)
2025-10-24 12:21:21 +02:00
Petr Gusev
a09c1b355e topology_coordinator: fix log message
(cherry picked from commit e1667afa50)
2025-10-24 12:21:21 +02:00
Pawel Pery
67e0c8e4b0 vector_search: fix flaky dns_refresh_aborted test
The test proceeds as follows:
- run a long dns refresh process
- request to resolve a hostname with a short abort_source timer - the result
  should be an empty list, because the request is aborted

The test sometimes finishes the long dns refresh before abort_source fires, and
the result list is not empty.

There are two issues. First, as.reset() changes the abort_source timeout. The
patch adds a get() method to the abort_source_timeout class, so there is no
change in the abort_source timeout. Second, a sleep is not reliable. The
patch replaces the long sleep inside the dns refresh lambda with
condition_variable handling, to properly signal the end of the dns refresh
process.

Fixes: #26561
Fixes: VECTOR-268

It needs to be backported to 2025.4

Closes scylladb/scylladb#26566

(cherry picked from commit 10208c83ca)

Closes scylladb/scylladb#26598
2025-10-23 11:24:32 +02:00
Piotr Dulikowski
03d57bae80 Merge '[Backport 2025.4] storage_proxy: wait for write handlers destruction' from Scylladb[bot]
`shared_ptr<abstract_write_response_handler>` instances are captured in the `lmutate` and `rmutate` lambdas of `send_to_live_endpoints()`. As a result, an `abstract_write_response_handler` object may outlive its removal from the `storage_proxy::_response_handlers` map -> `cancel_all_write_response_handlers()` doesn't actually wait for requests completion -> `sp::drain_on_shutdown()` doesn't guarantee all requests are drained -> `sp::stop_remote()` completes too early and `paxos_store` is destroyed while LWT local writes might still be in progress. In this PR we introduce a `write_handler_destroy_promise` to wait for such pending instances in `cancel_write_handlers()` and `cancel_all_write_response_handlers()` to prevent the `use-after-free`.

A better long-term solution might be to replace `shared_ptr` with `unique_ptr` for `abstract_write_response_handler` and use a separate gate to track the `lmutate/rmutate` lambdas. We do not actually need to wait for these lambdas to finish before sending a timeout or error response to the client, as we currently do in `~abstract_write_response_handler`.

Fixes scylladb/scylladb#26355

backport: need to be backported to 2025.4 since #26355 is reproduced on LWT over tablets

- (cherry picked from commit bf2ac7ee8b)

- (cherry picked from commit b269f78fa6)

- (cherry picked from commit bbcf3f6eff)

- (cherry picked from commit 8925f31596)

Parent PR: #26408

Closes scylladb/scylladb#26658

* github.com:scylladb/scylladb:
  test_tablets_lwt: add test_lwt_shutdown
  storage_proxy: wait for write handler destruction
  storage_proxy: coroutinize cancel_write_handlers
  storage_proxy: cancel_write_handlers: don't hold a strong pointer to handler
2025-10-23 10:49:52 +02:00
Patryk Jędrzejczak
76560ca095 test: test group0 tombstone GC in the Raft-based recovery procedure
We add a regression test for the bug fixed in the previous commits.

(cherry picked from commit c57f097630)
2025-10-22 17:13:34 +00:00
Patryk Jędrzejczak
8a11535a12 group0_state_id_handler: remove unused group0_server_accessor
It became unused in the previous commit.

(cherry picked from commit 6b2e003994)
2025-10-22 17:13:34 +00:00
Patryk Jędrzejczak
d727a086c5 group0_state_id_handler: consider state IDs of all non-ignored topology members
It's not enough to consider only the current group 0 members. In the
Raft-based recovery procedure, there can be nodes that haven't joined
the current group 0 yet, but they have belonged to a different group 0
and thus have a non-empty group 0 state ID.

We fix this issue in this commit by considering topology members
instead.

We don't consider ignored nodes as an optimization. When some nodes are
dead, the group 0 state ID handler won't have to wait until all these
nodes leave the cluster. It will only have to wait until all these nodes
are ignored, which happens at the beginning of the first
removenode/replace. As a result, tombstones of group 0 tables will be
purged much sooner.

We don't rename the `group0_members` variable to keep the change
minimal. There seems to be no precise and succinct name for the used set
of nodes anyway.

We use `std::ranges::join_view` in one place because:
- `std::ranges::concat` will become available in C++26,
- `boost::range::join` is not a good option, as there is an ongoing
  effort to minimize external dependencies in Scylla.

(cherry picked from commit 1d09b9c8d0)
2025-10-22 17:13:34 +00:00
Andrei Chekun
d1274f01aa test.py: rewrite the wait_for_first_completed
Rewrite wait_for_first_completed to return only the first completed task, with
a guarantee that all cancelled and finished tasks are awaited (and disappear).
Use wait_for_first_completed to avoid falsely passing tests in the future and
issues like #26148.
Use gather_safely to await tasks and remove the warning that a coroutine was
not awaited.
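
A minimal sketch of such a helper using only the standard asyncio API. The
helper in test.py may have a different signature and behavior; this version
only illustrates "return the first result, cancel and await the rest".

```python
import asyncio


async def wait_for_first_completed(coros):
    """Run `coros` concurrently and return the result of the first one to finish.

    The remaining tasks are cancelled and awaited, so none is left pending.
    `coros` must not be empty (asyncio.wait rejects an empty set).
    """
    tasks = [asyncio.ensure_future(c) for c in coros]
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_COMPLETED)
    for task in pending:
        task.cancel()
    # Await the cancelled tasks so that their cancellation is actually processed.
    await asyncio.gather(*pending, return_exceptions=True)
    first, *others = done
    for task in others:
        if not task.cancelled():
            task.exception()  # mark any exception as retrieved
    return first.result()
```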

Closes scylladb/scylladb#26435

(cherry picked from commit 24d17c3ce5)

Closes scylladb/scylladb#26663
2025-10-22 18:12:52 +02:00
Michael Litvak
aa2065fe2e storage_service: improve colocated repair error to show table names
When requesting repair for tablets of a colocated table, the request
fails with an error. Improve the error message to show the table names
instead of table IDs, because the table names are more useful for users.

Fixes scylladb/scylladb#26567

Closes scylladb/scylladb#26568

(cherry picked from commit b808d84d63)

Closes scylladb/scylladb#26624
2025-10-22 15:25:15 +02:00
Asias He
5c7eb2ac61 repair: Fix uuid and nodes_down order in the log
Fixes #26536

Closes scylladb/scylladb#26547

(cherry picked from commit 33bc1669c4)

Closes scylladb/scylladb#26630
2025-10-22 14:25:18 +02:00
Tomasz Grabiec
0621a8aee5 Merge '[Backport 2025.4] Synchronize tablet split and load-and-stream' from Scylladb[bot]
Load-and-stream is broken when running concurrently with the finalization step of tablet split.

Consider this:
1) split starts
2) split finalization executes barrier and succeeds
3) load-and-stream runs now, starts writing sstable (pre-split)
4) split finalization publishes changes to tablet metadata
5) load-and-stream finishes writing sstable
6) sstable cannot be loaded since it spans two tablets

two possible fixes (maybe both):

1) load-and-stream waits for topology to quiesce
2) perform split compaction on sstable that spans both sibling tablets

This patch implements # 1. By waiting for topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance coordinator is handling some topology operation like
split finalization.

Fixes https://github.com/scylladb/scylladb/issues/26455.

- (cherry picked from commit 3abc66da5a)

- (cherry picked from commit 4654cdc6fd)

Parent PR: #26456

Closes scylladb/scylladb#26651

* github.com:scylladb/scylladb:
  test: Add reproducer for l-a-s and split synchronization issue
  sstables_loader: Synchronize tablet split and load-and-stream
2025-10-22 14:23:04 +02:00
Jenkins Promoter
10db3f7c85 Update ScyllaDB version to: 2025.4.0-rc3 2025-10-22 14:11:52 +03:00
Michał Jadwiszczak
f6dde0aa4b test/alternator/test_tablets: add test for GSI backfill with tablets
The test should pass without the fix for scylladb/scylladb#26615,
because `executor::update_table()` uses
`service::prepare_new_view_announcement()`, which creates view building
tasks for the view.

But it's better to add this test.

(cherry picked from commit 34503f43a1)
2025-10-22 10:51:55 +00:00
Michał Jadwiszczak
207c273b29 test/alternator/test_tablets: add reproducer for GSI with tablets
(cherry picked from commit bdab455cbb)
2025-10-22 10:51:54 +00:00
Michał Jadwiszczak
6df48aacd7 alternator/executor: instantly mark view as built when creating it with base table
When a `CreateTable` request creates a GSI/LSI together with the base table,
the base table is empty and we don't need to actually build the view.

In tablet-based keyspaces we can simply skip creating view building tasks
and mark the view build status as SUCCESS on all nodes. Then, the view
building worker on each node will mark the view as built in
`system.built_views` (`view_building_worker::update_built_views()`).

Vnode-based keyspaces will use the "old" logic of view builder, which
will process the view and mark it as built.

Fixes scylladb/scylladb#26615

(cherry picked from commit 8fbf122277)
2025-10-22 10:51:54 +00:00
Pavel Emelyanov
45341ca246 Merge '[Backport 2025.4] s3_client: handle failures which require http::request updating' from Scylladb[bot]
Apply two main changes to the s3_client error handling
1. Add a loop to s3_client's `make_request` for the case when the retry strategy will not help since the request itself has to be updated. For example, authentication token expiration or timestamp on the request header
2. Refine the way we handle exceptions in the `chunked_download_source` background fiber, now we carry the original `exception_ptr` and also we wrap EVERY exception in `filler_exception` to prevent retry strategy trying to retry the request altogether
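
The first change can be pictured as a bounded outer loop that rebuilds the
request before each attempt. This Python sketch is not the s3_client code:
`RequestNeedsRefresh`, `build_request`, `send` and `max_attempts` are
hypothetical names chosen for illustration.

```python
class RequestNeedsRefresh(Exception):
    """Hypothetical error: the request itself is stale and must be rebuilt."""


def make_request(build_request, send, max_attempts=3):
    """Send a request, rebuilding it from scratch whenever it has gone stale.

    `build_request` returns a fresh request (new credentials, new timestamp).
    `send` performs it and may raise RequestNeedsRefresh.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    last_error = None
    for _ in range(max_attempts):
        request = build_request()
        try:
            return send(request)
        except RequestNeedsRefresh as e:
            # Retrying the same request would fail again: loop to rebuild it.
            last_error = e
    raise last_error
```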

Fixes: https://github.com/scylladb/scylladb/issues/26483

Should be ported back to 2025.3 and 2025.4 to prevent deadlocks and failures in these versions

- (cherry picked from commit 55fb2223b6)

- (cherry picked from commit db1ca8d011)

- (cherry picked from commit 185d5cd0c6)

- (cherry picked from commit 116823a6bc)

- (cherry picked from commit 43acc0d9b9)

- (cherry picked from commit 58a1cff3db)

- (cherry picked from commit 1d34657b14)

- (cherry picked from commit 4497325cd6)

- (cherry picked from commit fdd0d66f6e)

Parent PR: #26527

Closes scylladb/scylladb#26650

* github.com:scylladb/scylladb:
  s3_client: tune logging level
  s3_client: add logging
  s3_client: improve exception handling for chunked downloads
  s3_client: fix indentation
  s3_client: add max for client level retries
  s3_client: remove `s3_retry_strategy`
  s3_client: support high-level request retries
  s3_client: just reformat `make_request`
  s3_client: unify `make_request` implementation
2025-10-22 11:33:53 +03:00
Piotr Dulikowski
1efb2eb174 view_building_worker: access tablet map through erm on sstable discovery
Currently, the data returned by `database::get_tables_metadata()` and
`database::get_token_metadata()` may not be consistent. Specifically,
the tables metadata may contain some tablet-based tables before their
tablet maps appear in the token metadata. This is going to be fixed
after issue scylladb/scylladb#24414 is closed, but for the time being
work around it by accessing the token metadata via
`table`->effective_replication_map() - that token metadata is guaranteed
to have the tablet map of the `table`.

Fixes: scylladb/scylladb#26403

Closes scylladb/scylladb#26588

(cherry picked from commit f76917956c)

Closes scylladb/scylladb#26631
2025-10-22 11:33:22 +03:00
Pavel Emelyanov
320ef84367 Merge '[Backport 2025.4] compaction/twcs: fix use after free issues' from Scylladb[bot]
The `compaction_strategy_state` class holds strategy specific state via
a `std::variant` containing different state types. When a compaction
strategy performs compaction, it retrieves a reference to its state from
the `compaction_strategy_state` object. If the table's compaction
strategy is ALTERed while a compaction is in progress, the
`compaction_strategy_state` object gets replaced, destroying the old
state. This leaves the ongoing compaction holding a dangling reference,
resulting in a use after free.

Fix this by using `seastar::shared_ptr` for the state variant
alternatives(`leveled_compaction_strategy_state_ptr` and
`time_window_compaction_strategy_state_ptr`). The compaction strategies
now hold a copy of the shared_ptr, ensuring the state remains valid for
the duration of the compaction even if the strategy is altered.

The `compaction_strategy_state` itself is still passed by reference and
only the variant alternatives use shared_ptrs. This allows ongoing
compactions to retain ownership of the state independently of the
wrapper's lifetime.

The method `maybe_wait_for_sstable_count_reduction()`, when retrieving
the list of sstables for a possible compaction, holds a reference to the
compaction strategy. If the strategy is updated during execution, it can
cause a use after free issue. To prevent this, hold a copy of the
compaction strategy so it isn’t yanked away during the method’s
execution.

Fixes #25913

Issue probably started after 9d3755f276, so backport to 2025.4

- (cherry picked from commit 1cd43bce0e)

- (cherry picked from commit 35159e5b02)

- (cherry picked from commit 18c071c94b)

Parent PR: #26593

Closes scylladb/scylladb#26625

* github.com:scylladb/scylladb:
  compaction: fix use after free when strategy is altered during compaction
  compaction/twcs: pass compaction_strategy_state to internal methods
  compaction_manager: hold a copy to compaction strategy in maybe_wait_for_sstable_count_reduction
2025-10-22 11:32:28 +03:00
Petr Gusev
01658f9fcb test_tablets_lwt: add test_lwt_shutdown
(cherry picked from commit 8925f31596)
2025-10-22 00:10:59 +00:00
Petr Gusev
e56f14b9c5 storage_proxy: wait for write handler destruction
shared_ptr<abstract_write_response_handler> instances are captured in
the lmutate/rmutate lambdas of send_to_live_endpoints(). As a result,
an abstract_write_response_handler object may outlive its removal from
the _response_handlers map. We use write_handler_destroy_promise to
wait for such pending instances in cancel_write_handlers() and
cancel_all_write_response_handlers() to prevent use-after-free.

A better long-term solution might be to replace shared_ptr with
unique_ptr for abstract_write_response_handler and use a separate gate
to track the lmutate/rmutate lambdas. We do not actually need to wait
for these lambdas to finish before sending a timeout or error response
to the client, as we currently do in ~abstract_write_response_handler.

Fixes scylladb/scylladb#26355

(cherry picked from commit bbcf3f6eff)
2025-10-22 00:10:59 +00:00
Petr Gusev
5865dad0c9 storage_proxy: coroutinize cancel_write_handlers
The cancel_write_handlers() method was assumed to be called in a thread
context, likely because it was first used from gossiper events, where a
thread context already existed. Later, this method was reused in
abort_view_writes() and abort_batch_writes(), where threads are created
on the fly and appear redundant.

The drain_on_shutdown() method also used a thread, justified by some
"delicate lifetime issues", but it is unclear what that actually means.
It seems that a straightforward co_await should work just fine.

(cherry picked from commit b269f78fa6)
2025-10-22 00:10:59 +00:00
Petr Gusev
388dfbe3ee storage_proxy: cancel_write_handlers: don't hold a strong pointer to handler
A strong pointer was held for the duration of thread::yield(),
preventing abstract_write_response_handler destruction and possibly
delaying the sending of timeout or error responses to the client.

This commit removes the strong pointer. Instead, we compute the
next iterator before calling timeout_cb(), so if the handler is
destroyed inside timeout_cb(), we already have a valid next iterator.

(cherry picked from commit bf2ac7ee8b)
2025-10-22 00:10:59 +00:00
Raphael S. Carvalho
92a603699e test: Add reproducer for l-a-s and split synchronization issue
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 4654cdc6fd)
2025-10-21 12:26:55 +00:00
Raphael S. Carvalho
d998d9d418 sstables_loader: Synchronize tablet split and load-and-stream
Load-and-stream is broken when running concurrently with the
finalization step of tablet split.

Consider this:
1) split starts
2) split finalization executes barrier and succeeds
3) load-and-stream runs now, starts writing sstable (pre-split)
4) split finalization publishes changes to tablet metadata
5) load-and-stream finishes writing sstable
6) sstable cannot be loaded since it spans two tablets

Two possible fixes (maybe both):

1) load-and-stream waits for topology to quiesce
2) perform split compaction on sstable that spans both sibling tablets

This patch implements fix 1. By waiting for the topology to quiesce,
we guarantee that load-and-stream only starts when there's no
chance the coordinator is handling some topology operation like
split finalization.

Fixes #26455.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
(cherry picked from commit 3abc66da5a)
2025-10-21 12:26:54 +00:00
Ernest Zaslavsky
6f6b3a26c4 s3_client: tune logging level
Change all logging related to errors in `chunked_download_source` background download fiber to `info` to make it visible right away in logs.

(cherry picked from commit fdd0d66f6e)
2025-10-21 12:26:50 +00:00
Ernest Zaslavsky
4eb427976b s3_client: add logging
Add logging for the case when we encounter expired credentials. This shouldn't happen, but just in case.

(cherry picked from commit 4497325cd6)
2025-10-21 12:26:50 +00:00
Ernest Zaslavsky
94d49da8ec s3_client: improve exception handling for chunked downloads
Refactor the wrapping exception used in `chunked_download_source` to
prevent the retry strategy from reattempting failed requests. The new
implementation preserves the original `exception_ptr`, making the root
cause clearer and easier to diagnose.

(cherry picked from commit 1d34657b14)
2025-10-21 12:26:50 +00:00
Ernest Zaslavsky
f9bc211966 s3_client: fix indentation
Reformat `client::make_request` to fix the indentation of `if` block

(cherry picked from commit 58a1cff3db)
2025-10-21 12:26:50 +00:00
Ernest Zaslavsky
4aff338282 s3_client: add max for client level retries
To prevent the client from retrying indefinitely on time skew and authentication errors, add `max_attempts` to `client::make_request`.

(cherry picked from commit 43acc0d9b9)
2025-10-21 12:26:50 +00:00
Ernest Zaslavsky
8b7dce8334 s3_client: remove s3_retry_strategy
It never worked as intended, so the credentials handling is moving to the same place where we handle time skew, since we have to reauthenticate the request

(cherry picked from commit 116823a6bc)
2025-10-21 12:26:50 +00:00
Ernest Zaslavsky
2afd323838 s3_client: support high-level request retries
Add an option to retry S3 requests at the highest level, including
reinitializing headers and reauthenticating. This addresses cases
where retrying the same request fails, such as when the S3 server
rejects a timestamp older than 15 minutes.

(cherry picked from commit 185d5cd0c6)
2025-10-21 12:26:50 +00:00
Ernest Zaslavsky
f2f415a742 s3_client: just reformat make_request
Just reformat previously changed methods to improve readability

(cherry picked from commit db1ca8d011)
2025-10-21 12:26:50 +00:00
Ernest Zaslavsky
5c2d8bd273 s3_client: unify make_request implementation
Refactor `make_request` to use a single core implementation that
handles authentication and issues the HTTP request. All overloads now
delegate to this unified method.

(cherry picked from commit 55fb2223b6)
2025-10-21 12:26:49 +00:00
Piotr Dulikowski
5ae41b59f3 tablet_allocator: allow merges in base tables if rf-rack-valid=true
Tablet merge of base tables is only safe if there is at most one replica
in each rack. For more details on why this is the case, please see
scylladb/scylladb#17265. If rf-rack-valid-keyspaces is turned on,
this condition is satisfied, so allow merges in that case.

Fixes: scylladb/scylladb#26273
(cherry picked from commit 189ad96728)
2025-10-21 04:02:56 +00:00
Lakshmi Narayanan Sreethar
45b9675d28 compaction: fix use after free when strategy is altered during compaction
The `compaction_strategy_state` class holds strategy specific state via
a `std::variant` containing different state types. When a compaction
strategy performs compaction, it retrieves a reference to its state from
the `compaction_strategy_state` object. If the table's compaction
strategy is ALTERed while a compaction is in progress, the
`compaction_strategy_state` object gets replaced, destroying the old
state. This leaves the ongoing compaction holding a dangling reference,
resulting in a use after free.

Fix this by using `seastar::shared_ptr` for the state variant
alternatives(`leveled_compaction_strategy_state_ptr` and
`time_window_compaction_strategy_state_ptr`). The compaction strategies
now hold a copy of the shared_ptr, ensuring the state remains valid for
the duration of the compaction even if the strategy is altered.

The `compaction_strategy_state` itself is still passed by reference and
only the variant alternatives use shared_ptrs. This allows ongoing
compactions to retain ownership of the state independently of the
wrapper's lifetime.
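
A minimal sketch of the ownership idea, assuming simplified types: `std::shared_ptr` stands in for `seastar::shared_ptr`, and `leveled_state`, `time_window_state`, `strategy_state` and `acquire_twcs_state()` are hypothetical names, not the real classes. The running compaction copies the pointer out of the variant, so replacing the wrapper does not destroy the state it is using.

```cpp
#include <cassert>
#include <memory>
#include <variant>

// Hypothetical, simplified state types.
struct leveled_state { int last_level = 0; };
struct time_window_state { int windows = 0; };

using leveled_state_ptr = std::shared_ptr<leveled_state>;
using time_window_state_ptr = std::shared_ptr<time_window_state>;

// Wrapper whose variant alternatives are shared pointers.
struct strategy_state {
    std::variant<leveled_state_ptr, time_window_state_ptr> value;
};

// A compaction copies the shared pointer, becoming a co-owner of the state.
inline time_window_state_ptr acquire_twcs_state(const strategy_state& s) {
    return std::get<time_window_state_ptr>(s.value);
}

// Simulates an ALTER of the strategy while a compaction holds the state.
inline int demo_state_survives_alter() {
    strategy_state table_state{std::make_shared<time_window_state>()};
    auto held = acquire_twcs_state(table_state);  // compaction starts
    held->windows = 7;
    // ALTER replaces the wrapper's contents; the old state loses one owner.
    table_state = strategy_state{std::make_shared<leveled_state>()};
    return held->windows;  // still valid: `held` is now the only owner
}
```

With a plain reference instead of `held`, the last line would read freed memory, which is the use after free described above.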

Fixes #25913

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 18c071c94b)
2025-10-21 00:59:33 +00:00
Lakshmi Narayanan Sreethar
f1e1c7db4c compaction/twcs: pass compaction_strategy_state to internal methods
During TWCS compaction, multiple methods independently fetch the
compaction_strategy_state using get_state(). This can lead to
inconsistencies if the compaction strategy is ALTERed while the
compaction is in progress.

This patch fixes a part of this issue by passing down the state to the
lower level methods as parameters instead of fetching it repeatedly.

Refs #25913

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 35159e5b02)
2025-10-21 00:59:33 +00:00
Lakshmi Narayanan Sreethar
5e1f32b3d4 compaction_manager: hold a copy to compaction strategy in maybe_wait_for_sstable_count_reduction
The method `maybe_wait_for_sstable_count_reduction()`, when retrieving
the list of sstables for a possible compaction, holds a reference to the
compaction strategy. If the strategy is updated during execution, it can
cause a use after free issue. To prevent this, hold a copy of the
compaction strategy so it isn’t yanked away during the method’s
execution.

Refs #26546
Refs #25913

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
(cherry picked from commit 1cd43bce0e)
2025-10-21 00:59:32 +00:00
Botond Dénes
99f2dd02bf Merge '[Backport 2025.4] raft topology: disable schema pulls in the Raft-based recovery procedure' from Scylladb[bot]
Schema pulls should always be disabled when group 0 is used. However,
`migration_manager::disable_schema_pulls()` is never called during
a restart with `recovery_leader` set in the Raft-based recovery
procedure, which causes schema pulls to be re-enabled on all live nodes
(excluding the nodes replacing the dead nodes). Moreover, schema pulls
remain enabled on each node until the node is restarted, which could
be a very long time.

We fix this issue and add a regression test in this PR.

Fixes #26569

This is an important bug fix, so it should be backported to all branches
with the Raft-based recovery procedure (2025.2 and newer branches).

- (cherry picked from commit ec3a35303d)

- (cherry picked from commit da8748e2b1)

- (cherry picked from commit 71de01cd41)

Parent PR: #26572

Closes scylladb/scylladb#26599

* github.com:scylladb/scylladb:
  test: test_raft_recovery_entry_loss: fix the typo in the test case name
  test: verify that schema pulls are disabled in the Raft-based recovery procedure
  raft topology: disable schema pulls in the Raft-based recovery procedure
2025-10-20 10:39:52 +03:00
Botond Dénes
76a6a059c8 Merge '[Backport 2025.4] Fix vector store client flaky test' from Scylladb[bot]
This series of patches improves the stability of the vector_store_client_test test. The primary issue with flaky connections was discovered while working on PR #26308.

Key Changes:
- Fixes premature connection closures in the mock server:
The mock HTTP server was not consuming request payloads, causing it to close connections immediately after a response. Subsequent tests attempting to reuse these closed connections would fail intermittently, leading to flakiness. The server has been updated to handle payloads correctly.

- Removes a retry workaround:
With the underlying connection issue resolved, the retry logic in the vector_store_client_test_ann_request test is no longer needed and has been removed.

- Mocks the DNS resolver in tests:
The vector_store_client_uri_update_to_invalid test has been corrected to mock DNS lookups, preventing it from making real network requests.

- Corrects request timeout handling:
A bug has been fixed where the request timeout was not being reset between consecutive requests.

- Unifies test timeouts:
Timeouts have been standardized across the test suite for consistency.

Fixes: #26468

It is recommended to backport this series to the 2025.4 branch. Since these changes only affect test code and do not alter any production logic, the backport is safe. Addressing this test flakiness will improve the stability of the CI pipeline and prevent it from blocking unrelated patches.

- (cherry picked from commit ac5e9c34b6)

- (cherry picked from commit 2eb752e582)

- (cherry picked from commit d99a4c3bad)

- (cherry picked from commit 0de1fb8706)

- (cherry picked from commit 62deea62a4)

Parent PR: #26374

Closes scylladb/scylladb#26551

* github.com:scylladb/scylladb:
  vector_search: Unify test timeouts
  vector_search: Fix missing timeout reset
  vector_search: Refactor ANN request test
  vector_search: Fix flaky connection in tests
  vector_search: Fix flaky test by mocking DNS queries
2025-10-20 10:35:45 +03:00
Michał Chojnowski
6ff4910d96 test/cluster/test_bti_index.py: avoid a race with CQL tracing
The test uses CQL tracing to check which files were read by a query.
This is flaky if the coordinator and the replica are different shards,
because the Python driver only waits for the coordinator, and not
for replicas, to finish writing their traces.
(So it might happen that the Python driver returns a result
with only coordinator events and no replica events).

Let's just dodge the issue by using --smp=1.

Fixes scylladb/scylladb#26432

Closes scylladb/scylladb#26434

(cherry picked from commit c35b82b860)

Closes scylladb/scylladb#26492
2025-10-20 10:32:58 +03:00
Botond Dénes
d213953d0a Merge '[Backport 2025.4] tools: fix documentation links after change to source-available' from Scylladb[bot]
Some tools commands have links to online documentation in their help output. These links were left behind in the source-available change; they still point to the old open-source docs. Furthermore, the links in the scylla-sstable help output always point to the latest stable release's documentation, instead of the appropriate one for the branch the tool was built from. Fix both of these.

Fixes: scylladb/scylladb#26320

This fixes broken documentation links in the tool help output and needs a backport to all live source-available versions.

- (cherry picked from commit 5a69838d06)

- (cherry picked from commit 15a4a9936b)

- (cherry picked from commit fe73c90df9)

Parent PR: #26322

Closes scylladb/scylladb#26390

* github.com:scylladb/scylladb:
  tools/scylla-sstable: fix doc links
  release: adjust doc_link() for the post source-available world
  tools/scylla-nodetool: remove trailing " from doc urls
2025-10-20 10:31:05 +03:00
Michał Jadwiszczak
f5e76d0fcb test/cluster/test_view_building_coordinator: skip reproducer instead of xfail
The reproducer for issue scylladb/scylladb#26244 takes some time
and since the test is failing, there is no point in wasting resources on
it.
We can change the xfail mark to skip.

Refs scylladb/scylladb#26244

Closes scylladb/scylladb#26350

(cherry picked from commit d92628e3bd)

Closes scylladb/scylladb#26365
2025-10-20 10:30:34 +03:00
Aleksandra Martyniuk
2819b8b755 test: wait for cql in test_two_tablets_concurrent_repair_and_migration_repair_writer_level
In test_two_tablets_concurrent_repair_and_migration_repair_writer_level
safe_rolling_restart returns ready cql. However, get_all_tablet_replicas
uses the cql reference from manager that isn't ready. Wait for cql.

Fixes: #26328

Closes scylladb/scylladb#26349

(cherry picked from commit 0e73ce202e)

Closes scylladb/scylladb#26362
2025-10-20 10:29:56 +03:00
Avi Kivity
245d27347b dht, sstables: replace vector with chunked_vector when computing sstable shards
sstable::compute_shards_for_this_sstable() has a temporary of type
std::vector<dht::token_range> (aka dht::partition_range_vector), which
allocates a contiguous 300k when loading an sstable from disk. This
causes large allocation warnings (it doesn't really stress the allocator
since this typically happens during startup, but best to clear the warning
anyway).

Fix this by changing the container to be chunked_vector. It is passed
to dht::ring_position_range_vector_sharder, but since we're the only
user, we can change that class to accept the new type.
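
To illustrate why a chunked container avoids the large contiguous allocation, here is a rough sketch. `sketch_chunked_vector` is a hypothetical toy, not the real `chunked_vector` implementation, and the chunk capacity of 1024 elements is an arbitrary assumption. Elements live in fixed-capacity chunks, so no single allocation grows with the total element count, apart from the small table of chunks.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Toy chunked vector: stores elements in fixed-capacity chunks.
template <typename T, std::size_t ChunkCapacity = 1024>
class sketch_chunked_vector {
    std::vector<std::vector<T>> _chunks;  // small table of chunks
    std::size_t _size = 0;

public:
    void push_back(T value) {
        if (_size % ChunkCapacity == 0) {
            // The current chunk is full (or there is none yet): start a new one.
            _chunks.emplace_back();
            _chunks.back().reserve(ChunkCapacity);
        }
        _chunks.back().push_back(std::move(value));
        ++_size;
    }

    // Element i lives in chunk i / ChunkCapacity at offset i % ChunkCapacity.
    T& operator[](std::size_t i) {
        return _chunks[i / ChunkCapacity][i % ChunkCapacity];
    }

    std::size_t size() const { return _size; }
    std::size_t chunk_count() const { return _chunks.size(); }
};
```

Storing 2500 elements in this toy uses three chunks of at most 1024 elements each, rather than one allocation sized for all 2500.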

Fixes #24198.

Closes scylladb/scylladb#26353

(cherry picked from commit 7230a04799)

Closes scylladb/scylladb#26360
2025-10-20 10:28:59 +03:00
Patryk Jędrzejczak
323a7b8c55 test: test_raft_recovery_entry_loss: fix the typo in the test case name
(cherry picked from commit 71de01cd41)
2025-10-17 10:27:33 +00:00
Patryk Jędrzejczak
cd0bb11eef test: verify that schema pulls are disabled in the Raft-based recovery procedure
We do this at the end of `test_raft_recovery_entry_loss`. It's not worth
adding a separate regression test, as tests of the recovery procedure
are complicated and have a long running time. Also, we choose
`test_raft_recovery_entry_loss` out of all tests of the recovery
procedure because it does some schema changes.

(cherry picked from commit da8748e2b1)
2025-10-17 10:27:32 +00:00
Patryk Jędrzejczak
95d4206585 raft topology: disable schema pulls in the Raft-based recovery procedure
Schema pulls should always be disabled when group 0 is used. However,
`migration_manager::disable_schema_pulls()` is never called during
a restart with `recovery_leader` set in the Raft-based recovery
procedure, which causes schema pulls to be re-enabled on all live nodes
(excluding the nodes replacing the dead nodes). Moreover, schema pulls
remain enabled on each node until the node is restarted, which could
be a very long time.

The old gossip-based recovery procedure doesn't have this problem
because we disable schema pulls after completing the upgrade-to-group0
procedure, which is a part of the old recovery procedure.

Fixes #26569

(cherry picked from commit ec3a35303d)
2025-10-17 10:27:32 +00:00
Botond Dénes
2bc0c9c45b Update seastar submodule
* seastar 37983cd0...60e4b3b9 (1):
  > http: add "Connection: close" header to final server response.

Refs #26298
2025-10-17 10:22:45 +03:00
Artsiom Mishuta
de5a13db28 test.py: reintroducing sudo in resource_gather.py
conditionally reintroducing sudo for resource gathering
when running under docker

related: https://github.com/scylladb/scylladb/pull/26294#issuecomment-3346968097

fixes: https://github.com/scylladb/scylladb/issues/26312

Closes scylladb/scylladb#26401

(cherry picked from commit 99455833bd)

Closes scylladb/scylladb#26473
2025-10-17 09:27:13 +03:00
Pavel Emelyanov
dc3c6c3090 Update seastar submodule (iotune fixes for i7i/i4i)
* seastar c8a3515f9...37983cd04 (2):
  > iotune: fix very long warm up duration on systems with high cpu count
  > Merge '[Backport 2025.4] iotune: Add warmup period to measurements' from Robert Bindar
    iotune: Ignore measurements during warmup period
    iotune: Fix warmup calculation bug and botched rebase

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Fixes #26530

Closes scylladb/scylladb#26583
2025-10-16 20:30:37 +03:00
Jenkins Promoter
83babc20e3 Update ScyllaDB version to: 2025.4.0-rc2 2025-10-15 15:43:09 +03:00
Ernest Zaslavsky
04b9e98ef8 s3_client: track memory starvation in background filling fiber
Introduce a counter metric to monitor instances where the background
filling fiber is blocked due to insufficient memory in the S3 client.

Closes scylladb/scylladb#26466

(cherry picked from commit 413739824f)

Closes scylladb/scylladb#26555
2025-10-15 12:03:09 +02:00
Michał Chojnowski
de8c2a8196 test/boost/sstable_compressor_factory_test: fix thread-unsafe usage of Boost.Test
It turns out that Boost assertions are thread-unsafe
(and can't be used from multiple threads concurrently).
This causes the test to fail with cryptic log corruptions sometimes.
Fix that by switching to thread-safe checks.

Fixes scylladb/scylladb#24982

Closes scylladb/scylladb#26472

(cherry picked from commit 7c6e84e2ec)

Closes scylladb/scylladb#26554
2025-10-15 12:08:54 +03:00
Jenkins Promoter
dd2e8a2105 Update pgo profiles - aarch64 2025-10-15 05:03:22 +03:00
Jenkins Promoter
90fd618967 Update pgo profiles - x86_64 2025-10-15 04:31:33 +03:00
Karol Nowacki
da8bd30a5b vector_search: Unify test timeouts
The test previously used separate timeouts for requests (5s) and the
overall test case (10s).

This change unifies both timeouts to 10 seconds.

(cherry picked from commit 62deea62a4)
2025-10-14 22:49:42 +00:00
Karol Nowacki
4e9a42f343 vector_search: Fix missing timeout reset
The `vector_store_client_test` could be flaky because the request timeout
was not consistently reset in all code paths. This could lead to a
timeout from a previous operation firing prematurely and failing the
test.

The fix ensures `abort_source_timeout` is reset before each request.
The implementation is also simplified by changing
`abort_source_timeout::reset` so that it combines the reset and arm
operations into a single invocation.

(cherry picked from commit 0de1fb8706)
2025-10-14 22:49:42 +00:00
Karol Nowacki
6db7481c7a vector_search: Refactor ANN request test
Refactor the `vector_store_client_test_ann_request` test to use the
`vs_mock_server` class, unifying the structure of the test cases.

This change also removes retry logic that waited for the server to be ready.
This is no longer necessary because the handler now exists for all index names
and consumes the entire request payload, preventing connection closures.

Previously, the server did not handle requests for unconfigured
indexes, which caused the connection to close. This could lead to a
race condition where the client would attempt to reuse a closed
connection.

(cherry picked from commit d99a4c3bad)
2025-10-14 22:49:42 +00:00
Karol Nowacki
62a5d4f932 vector_search: Fix flaky connection in tests
The vector store mock server was not reading the ANN request body,
which could cause it to prematurely close the connection.

This could lead to a race condition where the client attempts to reuse a
closed connection from its pool, resulting in a flaky test.

The fix is to always read the request body in the mock server.

(cherry picked from commit 2eb752e582)
2025-10-14 22:49:42 +00:00
Karol Nowacki
f5319b06ae vector_search: Fix flaky test by mocking DNS queries
The `vector_store_client_uri_update_to_invalid` test was flaky because
it performed real DNS lookups, making it dependent on the network
environment.

This commit replaces the live DNS queries with a mock to make the test
hermetic and prevent intermittent failures.

The `vector_search_metrics_test` test did not call configure{vs};
as a consequence, it performed real DNS queries, which made it
flaky.

The refreshes counter increment has been moved before the call to the resolver.
In tests, the resolver is mocked, which leads to a lack of increments in production code.
Without this change, there is no way to test DNS counter increments.

The change also simplifies the test making it more readable.

(cherry picked from commit ac5e9c34b6)
2025-10-14 22:49:42 +00:00
Piotr Wieczorek
c191c31682 alternator: Correct RCU undercount in BatchGetItem
The `describe_multi_item` function treated the last reference-captured
argument as the number of used RCU half units. The caller
`batch_get_item`, however, expected this parameter to hold an item size.
This RCU value was then passed to
`rcu_consumed_capacity_counter::get_half_units`, treating the
already-calculated RCU integer as if it were a size in bytes.

This caused a second conversion that undercounted the true RCU. During
conversion, the number of bytes is divided by `RCU_BLOCK_SIZE_LENGTH`
(=4KB), so the double conversion divided the number of bytes by 16 MB.

The fix removes the second conversion in `describe_multi_item` and
changes the API of `describe_multi_item`.
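
The effect of the double conversion can be checked with a small worked example. This is a simplified model, not the real `get_half_units()`: `to_units()` and `block_size` are hypothetical names, the 4 KB block size is taken from the text above, and half-unit handling is omitted.

```cpp
#include <cassert>
#include <cstdint>

// Assumed block size, matching the 4 KB figure in the commit message.
constexpr std::uint64_t block_size = 4096;

// Simplified conversion from a size in bytes to capacity units, rounding up.
constexpr std::uint64_t to_units(std::uint64_t bytes) {
    return (bytes + block_size - 1) / block_size;
}

// A 1 MiB item converted once, as intended.
constexpr std::uint64_t item_bytes = 1024 * 1024;
constexpr std::uint64_t correct_units = to_units(item_bytes);

// The bug: the already-converted value is treated as a byte count again.
constexpr std::uint64_t undercounted_units = to_units(correct_units);

static_assert(correct_units == 256, "1 MiB is 256 blocks of 4 KB");
static_assert(undercounted_units == 1, "second conversion collapses to 1 unit");
```

Under this model a 1 MiB item that should be charged 256 units is charged only 1, which matches the description of dividing by 4 KB twice.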

Fixes: https://github.com/scylladb/scylladb/pull/25847

Closes scylladb/scylladb#25842

(cherry picked from commit a55c5e9ec7)

Closes scylladb/scylladb#26539
2025-10-14 11:53:09 +03:00
Dawid Mędrek
a4fd7019e3 replica/database: Fix description of validate_tablet_views_indexes
The current description is not accurate: the function doesn't throw
an exception if there's an invalid materialized view. Instead, it
simply logs the keyspaces that violate the requirement.

Furthermore, the experimental feature `views-with-tablets` is no longer
necessary for considering a materialized view as valid. It was dropped
in scylladb/scylladb@b409e85c20. The
replacement for it is the cluster feature `VIEWS_WITH_TABLETS`.

Fixes scylladb/scylladb#26420

Closes scylladb/scylladb#26421

(cherry picked from commit a9577e4d52)

Closes scylladb/scylladb#26476
2025-10-14 11:52:34 +03:00
Pavel Emelyanov
e18072d4b8 Merge '[Backport 2025.4] service/qos: set long timeout for auth queries on SL cache update' from Scylladb[bot]
Pass an appropriate query state for auth queries called from the service
level cache reload. We use the function qos_query_state to select a
query_state based on caller context; for internal queries, we set a
very long timeout.

The service level cache reload is called from group0 reload. We want it
to have a long timeout instead of the default 5 seconds for auth
queries because, on the one hand, we don't have a strict latency
requirement and, on the other hand, a timeout exception is undesired in
the group0 reload logic and can break group0 on the node.

Fixes https://github.com/scylladb/scylladb/issues/25290

backport possible to improve stability

- (cherry picked from commit a1161c156f)

- (cherry picked from commit 3c3dd4cf9d)

- (cherry picked from commit ad1a5b7e42)

Parent PR: #26180

Closes scylladb/scylladb#26479

* github.com:scylladb/scylladb:
  service/qos: set long timeout for auth queries on SL cache update
  auth: add query_state parameter to query functions
  auth: refactor query_all_directly_granted
2025-10-13 15:26:21 +03:00
Robert Bindar
7353aa5aa5 Make scylla_io_setup detect request size for best write IOPS
While working on scylladb/seastar#2802, we noticed that on the i7i family
(and, as later shown, on the i4i family as well) the disks incorrectly
report a physical sector size of 512 bytes, whereas we showed that
4096-byte requests yield much better write IOPS.

This is not the case on the AWS i3en family, where the reported 512-byte
physical sector size is also the size at which we achieve the best write IOPS.

This patch works around this issue by changing `scylla_io_setup` to parse
the instance type out of `/sys/devices/virtual/dmi/id/product_name`
and run iotune with the correct request size based on the instance type.
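
The selection logic might look roughly like the sketch below. It is an illustration only: the real `scylla_io_setup` is a script and its code is not shown here, `request_size_for()` is a hypothetical helper, and falling back to the reported sector size for all other families is an assumption. Only the i7i and i4i cases come from the text above.

```cpp
#include <cassert>
#include <string>

// Pick the request size for iotune from an instance type such as
// "i4i.large", as it would be read from
// /sys/devices/virtual/dmi/id/product_name.
// `reported_sector_size` is what the disk itself reports.
inline unsigned request_size_for(const std::string& product_name,
                                 unsigned reported_sector_size) {
    // The family is the part before the first dot ("i4i" in "i4i.large").
    const std::string family = product_name.substr(0, product_name.find('.'));
    if (family == "i7i" || family == "i4i") {
        // These families report 512 bytes but write best with 4096 bytes.
        return 4096;
    }
    // Assumption: trust the reported size everywhere else (e.g. i3en).
    return reported_sector_size;
}
```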

Signed-off-by: Robert Bindar <robert.bindar@scylladb.com>

Closes scylladb/scylladb#25315

(cherry picked from commit 2c74a6981b)

Closes scylladb/scylladb#26474
2025-10-13 15:25:16 +03:00
Michał Chojnowski
ec0b31b193 docs: fix a parameter name in API calls in sstable-dictionary-compression.rst
The correct argument name is `cf`, not `table`.

Fixes scylladb/scylladb#25275

Closes scylladb/scylladb#26447

(cherry picked from commit 87e3027c81)

Closes scylladb/scylladb#26495
2025-10-12 21:10:25 +03:00
Patryk Jędrzejczak
b5c3e2465f test: test_raft_no_quorum: test_can_restart: deflake the read barrier call
Expecting the group 0 read barrier to succeed with a timeout of 1s, just
after restarting 3 out of 5 voters, turned out to be flaky. In some
unlikely scenarios, such as multiple vote splits, the Raft leader
election could finish after the read barrier times out.

To deflake the test, we increase the timeout of Raft operations back to
300s for read barriers we expect to succeed.

Fixes #26457

Closes scylladb/scylladb#26489

(cherry picked from commit 5f68b9dc6b)

Closes scylladb/scylladb#26522
2025-10-12 21:02:02 +03:00
Asias He
3cae4a21ab repair: Rename incremental mode name
Using the name regular as the incremental mode could be confusing, since
regular might be interpreted as the non-incremental repair. It is better
to use incremental directly.

Before:

- regular (standard incremental repair)
- full (full incremental repair)
- disabled (incremental repair disabled)

After:

- incremental (standard incremental repair)
- full (full incremental repair)
- disabled (incremental repair disabled)

Fixes #26503

Closes scylladb/scylladb#26504

(cherry picked from commit 13dd88b010)

Closes scylladb/scylladb#26521
2025-10-12 21:01:05 +03:00
Ernest Zaslavsky
5c6335e029 s3_client: fix when condition to prevent infinite locking
Refine condition variable predicate in filling fiber to avoid
indefinite waiting when `close` is invoked.
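
The general shape of such a predicate fix can be sketched as below. This is an assumption about the kind of change involved, not the actual patch: `std::condition_variable` stands in for the Seastar primitive, and `fill_gate` with its members is a hypothetical class. The point is that the wait predicate also tests the closed flag, so a waiter cannot sleep forever once `close()` has run.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical stand-in for the state shared with the filling fiber.
class fill_gate {
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _has_room = false;
    bool _closed = false;

public:
    // Waits until there is room to fill or the gate has been closed.
    // Returns true if filling may continue and false after close().
    bool wait_for_room() {
        std::unique_lock<std::mutex> lock(_mutex);
        // Testing _closed here is what prevents an indefinite wait.
        _cv.wait(lock, [this] { return _has_room || _closed; });
        return !_closed;
    }

    void make_room() {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _has_room = true;
        }
        _cv.notify_all();
    }

    void close() {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _closed = true;
        }
        _cv.notify_all();
    }
};
```

If the predicate tested only `_has_room`, a waiter woken by `close()` would find the predicate false and go back to sleep with nobody left to wake it.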

Closes scylladb/scylladb#26449

(cherry picked from commit c2bab430d7)

Closes scylladb/scylladb#26497
2025-10-12 16:19:48 +03:00
Avi Kivity
de4975d181 dist: scylla_raid_setup: don't override XFS block size on modern kernels
In 6977064693 ("dist: scylla_raid_setup:
reduce xfs block size to 1k"), we reduced the XFS block size to 1k when
possible. This is because commitlog wants to write the smallest amount
of padding it can, and older Linux could only write a multiple of the
block size. Modern Linux [1] can O_DIRECT overwrite a range smaller than
a filesystem block.

However, this doesn't play well with some SSDs that have 512 byte
logical sector size and 4096 byte physical sector size - it causes them
to issue read-modify-writes.

To improve the situation, if we detect that the kernel is recent enough,
format the filesystem with its default block size, which should be optimal.

Note that commitlog will still issue sub-4k writes, which can translate
to RMW. There, we believe that the amplification is reduced since
sequential sub-physical-sector writes can be merged, and that the overhead
from commitlog space amplification is worse than the RMW overhead.

Tested on AWS i4i.large. fsqual report:

```
memory DMA alignment:    512
disk DMA alignment:      512
filesystem block size:   4096
context switch per write io (size-changing, append, blocksize 4096, iodepth 1): 0.0003 (GOOD)
context switch per write io (size-changing, append, blocksize 4096, iodepth 3): 0.7961 (BAD)
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 3): 0 (GOOD)
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 7): 0.0001 (GOOD)
context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.125 (BAD)
context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per write io (size-changing, append, blocksize 4096, iodepth 1): 0 (GOOD)
context switch per write io (size-changing, append, blocksize 4096, iodepth 3): 0.8006 (BAD)
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 3): 0.0001 (GOOD)
context switch per write io (size-unchanging, append, blocksize 4096, iodepth 7): 0 (GOOD)
context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.125 (BAD)
context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per read io (size-changing, append, blocksize 512, iodepth 30): 0 (GOOD)
```

The sub-block overwrite cases are GOOD.

In comparison, the fsqual report for 1k (similar):

```
memory DMA alignment:    512
disk DMA alignment:      512
filesystem block size:   1024
context switch per write io (size-changing, append, blocksize 1024, iodepth 1): 0.0005 (GOOD)
context switch per write io (size-changing, append, blocksize 1024, iodepth 3): 0.7948 (BAD)
context switch per write io (size-unchanging, append, blocksize 1024, iodepth 3): 0.0015 (GOOD)
context switch per write io (size-unchanging, append, blocksize 1024, iodepth 7): 0.0022 (GOOD)
context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.4999 (BAD)
context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per write io (size-changing, append, blocksize 1024, iodepth 1): 0 (GOOD)
context switch per write io (size-changing, append, blocksize 1024, iodepth 3): 0.798 (BAD)
context switch per write io (size-unchanging, append, blocksize 1024, iodepth 3): 0.0012 (GOOD)
context switch per write io (size-unchanging, append, blocksize 1024, iodepth 7): 0.0019 (GOOD)
context switch per write io (size-unchanging, append, blocksize 512, iodepth 1): 0.5 (BAD)
context switch per write io (size-unchanging, overwrite, blocksize 512, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 1): 0 (GOOD)
context switch per write io (size-unchanging, overwrite, blocksize 512, O_DSYNC, iodepth 3): 0 (GOOD)
context switch per read io (size-changing, append, blocksize 512, iodepth 30): 0 (GOOD)
```

Fixes #25441.

[1] ed1128c2d0

Closes scylladb/scylladb#25445

(cherry picked from commit 5d1846d783)

Closes scylladb/scylladb#26471
2025-10-12 16:17:16 +03:00
Piotr Dulikowski
1f73e18eaf Merge '[Backport 2025.4] db/view: Require rf_rack_valid_keyspaces when creating materialized view' from Scylladb[bot]
Materialized views are currently in the experimental phase and using them
in tablet-based keyspaces requires starting Scylla with an experimental feature,
`views-with-tablets`. Any attempts to create a materialized view or secondary
index when it's not enabled will fail with an appropriate error.

After considerable effort, we're drawing close to bringing views out of the
experimental phase, and the experimental feature will no longer be needed.
However, materialized views in tablet-based keyspaces will still be restricted,
and creating them will only be possible after enabling the configuration option
`rf_rack_valid_keyspaces`. That's what we do in this PR.

In this patch, we adjust existing tests in the tree to work with the new
restriction. That shouldn't have been necessary because we've already seemingly
adjusted all of them to work with the configuration option, but some tests hid
well. We fix that mistake now.

After that, we introduce the new restriction. What's more, when starting Scylla,
we verify that there is no materialized view that would violate the contract.
If there are some that do, we list them, notify the user, and refuse to start.

High-level implementation strategy:

1. Name the restrictions in the form of a function.
2. Adjust existing tests.
3. Restrict materialized views by both the experimental feature
   and the configuration option. Add validation test.
4. Drop the requirement for the experimental feature. Adjust the added test
   and add a new one.
5. Update the user documentation.

Fixes scylladb/scylladb#23030

Backport: 2025.4, as we are aiming to support materialized views for tablets from that version.

- (cherry picked from commit a1254fb6f3)

- (cherry picked from commit d6fcd18540)

- (cherry picked from commit 994f09530f)

- (cherry picked from commit 6322b5996d)

- (cherry picked from commit 71606ffdda)

- (cherry picked from commit 00222070cd)

- (cherry picked from commit 288be6c82d)

- (cherry picked from commit b409e85c20)

Parent PR: #25802

Closes scylladb/scylladb#26416

* github.com:scylladb/scylladb:
  view: Stop requiring experimental feature
  db/view: Verify valid configuration for tablet-based views
  db/view: Require rf_rack_valid_keyspaces when creating view
  test/cluster/random_failures: Skip creating secondary indexes
  test/cluster/mv: Mark test_mv_rf_change as skipped
  test/cluster: Adjust MV tests to RF-rack-validity
  test/boost/schema_loader_test.cc: Explicitly enable rf_rack_valid_keyspaces
  db/view: Name requirement for views with tablets
2025-10-12 08:20:20 +02:00
Michał Jadwiszczak
931f9ca3db db/view/view_building_worker: update state again if some batch was finished during the update
There was a race between the loop in `view_building_worker::run_view_building_state_observer()`
and the moment when a batch finishes its work (`.finally()` callback
in `view_building_worker::batch::start()`).

State observer waits on `_vb_state_machine.event` CV and when it's
awoken, it takes group0 read apply mutex and updates its state. While
updating the state, the observer looks at `batch::state` field and
reacts to it accordingly.
On the other hand, when a batch finishes its work, it sets `state` field
to `batch_state::finished` and does a broadcast on
`_vb_state_machine.event` CV.
So if the batch executes the `.finally()` callback while the
observer is updating its state, the observer may miss the event on the
CV and never notice that the batch has finished.

This patch fixes this by adding a `some_batch_finished` flag. Even if
the worker doesn't see an event on the CV, it will notice that the flag
was set and run another iteration.
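
The lost-wakeup scenario and the flag-based fix can be sketched in Python with `asyncio`. This is only an illustration under assumptions: the `Worker` class, its fields, and the way the batch is made to finish in the middle of the first update are all hypothetical, and the real code is C++ on Seastar.

```python
import asyncio


class Worker:
    """Toy observer; names echo the commit message, the code is illustrative."""

    def __init__(self):
        self.event = asyncio.Condition()   # stands in for _vb_state_machine.event
        self.batch_finished = False        # stands in for batch::state == finished
        self.some_batch_finished = False   # the flag added by the patch
        self.noticed = False
        self.updates = 0

    async def finish_batch(self):
        # What the batch's .finally() callback does: set state, set flag, broadcast.
        self.batch_finished = True
        self.some_batch_finished = True
        async with self.event:
            self.event.notify_all()        # lost if nobody is waiting right now

    async def update_state(self):
        self.updates += 1
        finished = self.batch_finished     # the observer reads the batch state...
        if self.updates == 1:
            await self.finish_batch()      # ...and the batch finishes mid-update
        self.noticed = self.noticed or finished

    async def observe(self):
        while True:
            self.some_batch_finished = False
            await self.update_state()
            if self.noticed:
                return
            if self.some_batch_finished:
                continue                   # a batch finished during the update
            async with self.event:
                await self.event.wait()    # without the flag, this waits forever


async def main():
    worker = Worker()
    await asyncio.wait_for(worker.observe(), timeout=1)
    return worker
```

Running `asyncio.run(main())` completes after two updates. If the `some_batch_finished` check is removed, the observer reaches `wait()` after the broadcast has already happened and the call times out.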

Fixes scylladb/scylladb#26204

Closes scylladb/scylladb#26289

(cherry picked from commit 8d0d53016c)

Closes scylladb/scylladb#26500
2025-10-10 09:53:22 +02:00
Piotr Dulikowski
3775e8e49a Merge '[Backport 2025.4] db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground' from Scylladb[bot]
db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground
This patch moves `discover_existing_staging_sstables()` to be executed
from main level, instead of running it on the background fiber.

This method needs to be run only once during the startup to collect
existing staging sstables, so there is no need to do it in the
background. This change will increase debuggability of any further issues
related to it (like https://github.com/scylladb/scylladb/issues/26403).

Fixes https://github.com/scylladb/scylladb/issues/26417

The patch should be backported to 2025.4

- (cherry picked from commit 575dce765e)

- (cherry picked from commit 84e4e34d81)

Parent PR: #26446

Closes scylladb/scylladb#26501

* github.com:scylladb/scylladb:
  db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground
  db/view/view_building_worker: futurize and rename `start_background_fibers()`
2025-10-10 09:52:45 +02:00
Michał Jadwiszczak
f4d9513e0f db/view/view_building_worker: move discover_existing_staging_sstables() to the foreground
This patch moves `discover_existing_staging_sstables()` to be executed
from main level, instead of running it on the background fiber.

This method needs to be run only once during the startup to collect
existing staging sstables, so there is no need to do it in the
background. This change will increase debuggability of any further issues
related to it (like scylladb/scylladb#26403).

Fixes scylladb/scylladb#26417

(cherry picked from commit 84e4e34d81)
2025-10-09 22:39:33 +00:00
Michał Jadwiszczak
5eeb1e3e76 db/view/view_building_worker: futurize and rename start_background_fibers()
Next commit will move `discover_existing_staging_sstables()`
to the foreground, so to prepare for this we need to futurize
`start_background_fibers()` method and change its name to better reflect
its purpose.

(cherry picked from commit 575dce765e)
2025-10-09 22:39:32 +00:00
Patryk Jędrzejczak
989aa0b237 raft topology: make the voter handler consider only group 0 members
In the Raft-based recovery procedure, we create a new group 0 and add
live nodes to it one by one. This means that for some time there are
nodes which belong to the topology, but not to the new group 0. The
voter handler running on the recovery leader incorrectly considers these
nodes while choosing voters.

The consequences:
- misleading logs, for example, "making servers {<ID of a non-member>}
  voters", where the non-member won't become a voter anyway,
- increased chance of majority loss during the recovery procedure, for
  example, all 3 nodes that first joined the new group 0 are in the same
  dc and rack, but only one of them becomes a voter because the voter
  handler tries to make non-members in other dcs/racks voters.
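
A small Python sketch of the second consequence, assuming a toy voter policy that prefers one voter per rack. The `choose_voters` function, the node names, and the rack layout are hypothetical and are not the actual voter handler.

```python
def choose_voters(candidates, rack_of, max_voters=3):
    """Toy policy: prefer one voter per distinct rack, then fill up."""
    chosen, seen_racks = [], set()
    for node in candidates:
        if len(chosen) < max_voters and rack_of[node] not in seen_racks:
            chosen.append(node)
            seen_racks.add(rack_of[node])
    for node in candidates:
        if len(chosen) < max_voters and node not in chosen:
            chosen.append(node)
    return chosen


# Five nodes in the topology; only a, b, c have joined the new group 0 so far.
TOPOLOGY = ["a", "b", "c", "d", "e"]
RACK_OF = {"a": "r1", "b": "r1", "c": "r1", "d": "r2", "e": "r3"}
GROUP0 = {"a", "b", "c"}

# Before the fix: non-members are considered, but they cannot become voters.
picked_before = choose_voters(TOPOLOGY, RACK_OF)
effective_before = [n for n in picked_before if n in GROUP0]

# After the fix: only group 0 members are candidates.
picked_after = choose_voters([n for n in TOPOLOGY if n in GROUP0], RACK_OF)
```

In this toy setup, considering non-members leaves a single effective voter, while filtering by group 0 membership first yields three.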

Fixes #26321

Closes scylladb/scylladb#26327

(cherry picked from commit 67d48a459f)

Closes scylladb/scylladb#26428
2025-10-09 18:17:49 +02:00
Michael Litvak
eba0a2cf72 service/qos: set long timeout for auth queries on SL cache update
pass an appropriate query state for auth queries called from service
level cache reload. we use the function qos_query_state to select a
query_state based on caller context - for internal queries, we set a
very long timeout.

the service level cache reload is called from group0 reload. we want it
to have a long timeout instead of the default 5 seconds for auth
queries, because we don't have strict latency requirement on the one
hand, and on the other hand a timeout exception is undesired in the
group0 reload logic and can break group0 on the node.
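
The idea of selecting a query state by caller context might look roughly like this in Python. The 5-second default comes from the commit message; the one-hour value and the `QueryState` type are assumptions made for illustration, since the message only says "a very long timeout".

```python
from dataclasses import dataclass
from datetime import timedelta

DEFAULT_AUTH_TIMEOUT = timedelta(seconds=5)   # default mentioned in the commit message
INTERNAL_TIMEOUT = timedelta(hours=1)         # hypothetical "very long" value


@dataclass(frozen=True)
class QueryState:
    timeout: timedelta


def qos_query_state(internal: bool) -> QueryState:
    """Pick a query state based on who is asking (illustrative only)."""
    return QueryState(INTERNAL_TIMEOUT if internal else DEFAULT_AUTH_TIMEOUT)
```

A caller on the group0 reload path would pass `internal=True`, so a slow auth query is unlikely to surface as a timeout exception there.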

Fixes scylladb/scylladb#25290

(cherry picked from commit ad1a5b7e42)
2025-10-09 12:48:45 +00:00
Michael Litvak
3a9eb9b65f auth: add query_state parameter to query functions
add a query_state parameter to several auth functions that execute
internal queries. currently the queries use the
internal_distributed_query_state() query state, and we maintain this as
default, but we want also to be able to pass a query state from the
caller.

in particular, the auth queries currently use a timeout of 5 seconds,
and we will want to set a different timeout when executed in some
different context.

(cherry picked from commit 3c3dd4cf9d)
2025-10-09 12:48:45 +00:00
Michael Litvak
f75541b7b3 auth: refactor query_all_directly_granted
rewrite query_all_directly_granted to use execute_internal instead of
query_internal in a style that is more consistent with the rest of the
module.

This will also be useful for a later change because execute_internal
accepts an additional parameter of query_state.

(cherry picked from commit a1161c156f)
2025-10-09 12:48:45 +00:00
Michał Chojnowski
879db5855d utils/config_file: fix a missing allowed_values propagation in one of named_value constructors
In one of the constructors of `named_value`, the `allowed_values`
argument isn't used.

(This means that if some config entry uses this constructor,
the values aren't validated on the config layer,
and might give some lower layer a bad surprise).

Fix that.
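
The class of bug described above can be shown with a short Python sketch. `NamedValue`, its `alias` field, and the two alternative constructors are hypothetical; they only illustrate an argument that is accepted but never forwarded.

```python
class NamedValue:
    def __init__(self, name, default, allowed_values=()):
        self.name = name
        self.alias = None
        self.allowed_values = tuple(allowed_values)
        self.value = default

    @classmethod
    def with_alias_buggy(cls, name, alias, default, allowed_values=()):
        obj = cls(name, default)            # bug: allowed_values is dropped here
        obj.alias = alias
        return obj

    @classmethod
    def with_alias_fixed(cls, name, alias, default, allowed_values=()):
        obj = cls(name, default, allowed_values)
        obj.alias = alias
        return obj

    def set(self, value):
        # Validation on the config layer: reject values outside the allowed set.
        if self.allowed_values and value not in self.allowed_values:
            raise ValueError(f"{value!r} is not allowed for {self.name}")
        self.value = value
```

With the buggy constructor, `set("bogus")` succeeds silently and the invalid value reaches lower layers; with the fixed one it raises `ValueError`.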

Fixes scylladb/scylladb#26371

Closes scylladb/scylladb#26196

(cherry picked from commit 3b338e36c2)

Closes scylladb/scylladb#26425
2025-10-09 13:19:41 +03:00
Michał Chojnowski
22d3ee5670 sstables/trie: actually apply BYPASS CACHE to index reads
BYPASS CACHE is implemented for `bti_index_reader` by
giving it its own private `cached_file` wrappers over
Partitions.db and Rows.db, instead of passing it
the shared `cached_file` owned by the sstable.

But due to an oversight, the private `cached_file`s aren't
constructed on top of the raw Partitions.db and Rows.db
files, but on top of `cached_file_impl` wrappers around
those files. Which means that BYPASS CACHE doesn't
actually do its job.

Tests based on `scylla_index_page_cache_*` metrics
and on CQL tracing still see the reads from the private
files as "cache misses", but those misses are served
from the shared cached files anyway, so the tests don't see
the problem. In this commit we extend `test_bti_index.py`
with a check that looks at reactor's `io_queue` metrics
instead, and catches the problem.
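
The layering mistake, and why a cache-miss counter does not reveal it, can be sketched in Python. `RawFile` and `CachedFile` here are simplified stand-ins and not the real Scylla classes.

```python
class RawFile:
    def __init__(self):
        self.disk_reads = 0                 # plays the role of the io_queue metric

    def read(self, page):
        self.disk_reads += 1
        return f"page-{page}"


class CachedFile:
    def __init__(self, underlying):
        self.underlying = underlying
        self.pages = {}
        self.misses = 0                     # plays the role of the page cache metric

    def read(self, page):
        if page not in self.pages:
            self.misses += 1
            self.pages[page] = self.underlying.read(page)
        return self.pages[page]


raw = RawFile()
shared = CachedFile(raw)
shared.read(0)                              # warm the shared cache: one disk read

private_buggy = CachedFile(shared)          # built on top of the cached wrapper
private_buggy.read(0)
reads_after_buggy = raw.disk_reads          # the "miss" was served by the shared cache

private_fixed = CachedFile(raw)             # built on top of the raw file
private_fixed.read(0)
reads_after_fixed = raw.disk_reads          # the read actually reached the disk
```

Both private wrappers report one miss, which is why miss-based tests pass either way; only the disk read counter tells the two setups apart.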

Fixes scylladb/scylladb#26372

Closes scylladb/scylladb#26373

(cherry picked from commit dbddba0794)

Closes scylladb/scylladb#26424
2025-10-09 13:17:29 +03:00
Dawid Mędrek
2bdf792f8e view: Stop requiring experimental feature
We modify the requirements for using materialized views in tablet-based
keyspaces. Before, it was necessary to enable the configuration option
`rf_rack_valid_keyspaces`, having the cluster feature `VIEWS_WITH_TABLETS`
enabled, and using the experimental feature `views-with-tablets`.
We drop the last requirement.

We adjust code to that change and provide a new validation test.
We also update the user documentation to reflect the changes.
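
The change in requirements can be written down as a pair of predicates. This Python sketch assumes that keyspaces not based on tablets are unaffected; the `Config` type and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Config:
    rf_rack_valid_keyspaces: bool        # configuration option
    views_with_tablets_feature: bool     # cluster feature VIEWS_WITH_TABLETS
    views_with_tablets_experimental: bool  # experimental feature views-with-tablets


def can_create_view_before(cfg: Config, uses_tablets: bool) -> bool:
    if not uses_tablets:
        return True
    return (cfg.rf_rack_valid_keyspaces
            and cfg.views_with_tablets_feature
            and cfg.views_with_tablets_experimental)


def can_create_view_after(cfg: Config, uses_tablets: bool) -> bool:
    if not uses_tablets:
        return True
    # The experimental feature is no longer part of the requirement.
    return cfg.rf_rack_valid_keyspaces and cfg.views_with_tablets_feature
```

Keeping the rule in one function mirrors the "named requirement" approach of the series: callers ask one question instead of repeating the individual checks.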

Fixes scylladb/scylladb#23030

(cherry picked from commit b409e85c20)
2025-10-06 13:19:54 +00:00
Dawid Mędrek
2e2d1f17bb db/view: Verify valid configuration for tablet-based views
Creating a materialized view or a secondary index in a tablet-based
keyspace requires that the user enabled two options:

* experimental feature `views-with-tablets`,
* configuration option `rf_rack_valid_keyspaces`.

Because the latter has only become a necessity recently (in this series),
it's possible that there are already existing materialized views that
violate it.

We add a new check at start-up that iterates over existing views and
makes sure that that is not the case. Otherwise, Scylla notifies the user
of the problem.

(cherry picked from commit 288be6c82d)
2025-10-06 13:19:54 +00:00
Dawid Mędrek
e9aba62cc5 db/view: Require rf_rack_valid_keyspaces when creating view
We extend the requirements for being able to create materialized views
and secondary indexes in tablet-based keyspaces. It's now necessary to
enable the configuration option `rf_rack_valid_keyspaces`. This is
a stepping stone towards bringing materialized views and secondary
indexes with tablets out of the experimental phase.

We add a validation test to verify the changes.

Refs scylladb/scylladb#23030

(cherry picked from commit 00222070cd)
2025-10-06 13:19:54 +00:00
Dawid Mędrek
a7d0cf6dd0 test/cluster/random_failures: Skip creating secondary indexes
Materialized views are going to require the configuration option
`rf_rack_valid_keyspaces` when being created in tablet-based keyspaces.
Since random-failure tests still haven't been adjusted to work with it,
and because it's not trivial, we skip the cases when we end up creating
or dropping an index.

(cherry picked from commit 71606ffdda)
2025-10-06 13:19:54 +00:00
Dawid Mędrek
6e94c075e3 test/cluster/mv: Mark test_mv_rf_change as skipped
The test will not work with `rf_rack_valid_keyspaces`. Since the option
is going to become a requirement for using views with tablets, the test
will need to be rewritten to take that into consideration. Since that
adjustment doesn't seem trivial, we mark the test as skipped for the
time being.

(cherry picked from commit 6322b5996d)
2025-10-06 13:19:54 +00:00
Dawid Mędrek
f90ca413a0 test/cluster: Adjust MV tests to RF-rack-validity
Some of the new tests covering materialized views explicitly disabled
the configuration option `rf_rack_valid_keyspaces`. It's going to become
a new requirement for views with tablets, so we adjust those tests and
enable the option. There is one exception, the test:

`cluster/mv/test_mv_topology_change.py::test_mv_rf_change`

We handle it separately in the following commit.

(cherry picked from commit 994f09530f)
2025-10-06 13:19:54 +00:00
Dawid Mędrek
5e0f5f4b44 test/boost/schema_loader_test.cc: Explicitly enable rf_rack_valid_keyspaces
The test cases in the file aren't run via an existing interface like
`do_with_cql_env`, but they rely on a more direct approach -- calling
one of the schema loader tools. Because of that, they manage the
`db::config` object on their own and don't enable the configuration
option `rf_rack_valid_keyspaces`.

That hasn't been a problem so far since the test doesn't attempt to
create RF-rack-invalid keyspaces anyway. However, in an upcoming commit,
we're going to further restrict views with tablets and require that the
option is enabled.

To prepare for that, we enable the option in all test cases. It's only
necessary in a small subset of them, but it won't hurt to enforce it
everywhere, so let's do that.

Refs scylladb/scylladb#23958

(cherry picked from commit d6fcd18540)
2025-10-06 13:19:54 +00:00
Dawid Mędrek
5d32fef3ae db/view: Name requirement for views with tablets
We add a named requirement, a function, for materialized views with tablets.
It decides whether we can create views and secondary indexes in a given
keyspace. It's a stepping stone towards modifying the requirements for it.

This way, we keep the code in one place, so it's not possible to forget
to modify it somewhere. It also makes it more organized and concise.

(cherry picked from commit a1254fb6f3)
2025-10-06 13:19:53 +00:00
Botond Dénes
1b5c46a796 Merge '[Backport 2025.4] test: dtest: test_limits.py: migrate from dtest' from Dario Mirovic
Backport motivation:
scylla-dtest PR [limits_test.py: remove tests already ported to scylladb repo](https://github.com/scylladb/scylla-dtest/pull/6232) that removes migrated tests got merged before branch-2025.4 separation
scylladb PR [test: dtest: test_limits.py: migrate from dtest](https://github.com/scylladb/scylladb/pull/26077) got merged after branch-2025.4 separation
This caused the tests to be fully removed from branch-2025.4. This backport PR makes sure the tests are present in scylladb branch-2025.4.

This PR migrates limits tests from dtest to this repository.

One reason is that there is an ongoing effort to migrate tests from dtest to here.

Debug logs are enabled on `test_max_cells` for `lsa-timing` logger, to have more information about memory reclaim operation times and memory chunk sizes. This will allow analysis of their value distributions, which can be helpful with debugging if the issue reoccurs.

Also, scylladb keeps sql files with metrics which, with some modifications, can be used to track metrics over time for some tests. This would show if there are pauses and spikes or the test performance is more or less consistent over time.

scylla-dtest PR that removes migrated tests:
[limits_test.py: remove tests already ported to scylladb repo #6232](https://github.com/scylladb/scylla-dtest/pull/6232)

Fixes #25097

- (cherry picked from commit 82e9623911)
- (cherry picked from commit 70128fd5c7)
- (cherry picked from commit 554fd5e801)
- (cherry picked from commit b3347bcf84)

Parent PR: #26077

Closes scylladb/scylladb#26359

* github.com:scylladb/scylladb:
  test: dtest: limits_test.py: test_max_cells log level
  test: dtest: limits_test.py: make the tests work
  test: dtest: test_limits.py: remove test that are not being migrated
  test: dtest: copy unmodified limits_test.py
2025-10-06 15:46:49 +03:00
Michał Chojnowski
29b5319bc6 sstables/sstable_directory: don't forget to delete other components when deleting TemporaryHashes.db
TemporaryHashes.db is a temporary sstable component used during ms
sstable writes. It's different from other sstable components in that
it's not included in the TOC. Because of this, it has a special case in
the logic that deletes unfinished sstables on boot
(after Scylla dies in the middle of an sstable write).

But there's a bug in that special case,
which causes Scylla to forget to delete other components from the same unfinished sstable.

The code intends only to delete the TemporaryHashes.db file from the
`_state->generations_found` multimap, but it accidentally also deletes
the file's sibling components from the multimap. Fix that.
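
One plausible shape of such a bug, sketched in Python with a dict of lists standing in for the multimap. The commit message does not show the actual mechanism, so erasing by key is an assumption here, and the function names and component list are hypothetical.

```python
from collections import defaultdict


def scan(files):
    """Group (generation, component) pairs like a multimap keyed by generation."""
    found = defaultdict(list)
    for generation, component in files:
        found[generation].append(component)
    return found


def drop_temporary_hashes_buggy(found, generation):
    # Erasing by key drops every component of that generation from the map,
    # so the later cleanup no longer knows about the sibling files.
    found.pop(generation, None)


def drop_temporary_hashes_fixed(found, generation):
    # Erase only the one entry; siblings stay tracked for later cleanup.
    found[generation] = [c for c in found[generation] if c != "TemporaryHashes.db"]


FILES = [(7, "Data.db"), (7, "Partitions.db"), (7, "TemporaryHashes.db")]
```

After the buggy variant, generation 7 has no tracked components left; after the fixed variant, `Data.db` and `Partitions.db` are still tracked and can be deleted with the rest of the unfinished sstable.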

Fixes scylladb/scylladb#26393

(cherry picked from commit 6efb807c1a)
2025-10-06 10:22:50 +00:00
Michał Chojnowski
a2e620712c test/boost/database_test: fix two no-op distributed loader tests
There are two tests which effectively check nothing.

They intend to check that the distributed loader removes "leftover" sstable
files. So they create some incomplete sstables, run the test env
on the directory, and check that the files disappeared.
But the test env completely clears the test directory before
the distributed loader looks at the files, so the tests succeed trivially.

Fix that by adding a config knob to the test env which instructs it
not to clear the directory before the test.

(cherry picked from commit 16cb223d7f)
2025-10-06 10:22:49 +00:00
Jenkins Promoter
f2c5874fa9 Update ScyllaDB version to: 2025.4.0-rc1
2025-10-03 21:26:12 +03:00
Botond Dénes
4049dae0b2 tools/scylla-sstable: fix doc links
The doc links in scylla-sstable help output are static, so they always
point to the documentation of the latest stable release, not to the
documentation of the release the tool binary is from. On top of that,
the links point to old open-source documentation, which is now EOL.
Fix both problems: point the links at the new source-available documentation
pages and make them version-aware.

(cherry picked from commit fe73c90df9)
2025-10-03 14:29:19 +00:00
Botond Dénes
8b83294c0f release: adjust doc_link() for the post source-available world
There is no more separate enterprise product and the doc urls are
slightly different.

(cherry picked from commit 15a4a9936b)
2025-10-03 14:29:19 +00:00
Botond Dénes
5930726b38 tools/scylla-nodetool: remove trailing " from doc urls
They are an accidental leftover from a previous way of storing command
descriptions.

(cherry picked from commit 5a69838d06)
2025-10-03 14:29:19 +00:00
Dario Mirovic
664cdd3d99 test: dtest: limits_test.py: test_max_cells log level
Set `lsa-timing` logger log level to `debug`. This will help with
the analysis of the whole spectrum of memory reclaim operation
times and memory sizes.

Refs #25097

(cherry picked from commit b3347bcf84)
2025-10-01 22:40:34 +02:00
Dario Mirovic
4ea6c51fb1 test: dtest: limits_test.py: make the tests work
Remove unused imports and markers.
Remove Apache license header.

Enable the test in suite.yaml for `dev` and `debug` modes.

Refs #25097

(cherry picked from commit 554fd5e801)
2025-10-01 22:40:29 +02:00
Dario Mirovic
eb9babfd4a test: dtest: test_limits.py: remove test that are not being migrated
Refs #25097

(cherry picked from commit 70128fd5c7)
2025-10-01 22:40:24 +02:00
Dario Mirovic
558f460517 test: dtest: copy unmodified limits_test.py
Copy limits_test.py from scylla-dtest to test/cluster/dtest/limits_test.py.
Add license header.

Disable it for `debug`, `dev`, and `release` mode.

Refs #25097

(cherry picked from commit 82e9623911)
2025-10-01 22:40:16 +02:00
Jenkins Promoter
a9f4024c1b Update pgo profiles - aarch64
2025-10-01 04:42:23 +03:00
Jenkins Promoter
6969918d31 Update pgo profiles - x86_64
2025-10-01 04:20:49 +03:00
Luis Freitas
d69edfcd34 Update ScyllaDB version to: 2025.4.0-rc0
2025-09-30 18:51:59 +03:00
457 changed files with 16162 additions and 5165 deletions

View File

@@ -62,7 +62,7 @@ def create_pull_request(repo, new_branch_name, base_branch_name, pr, backport_pr
if is_draft:
labels_to_add.append("conflicts")
pr_comment = f"@{pr.user.login} - This PR was marked as draft because it has conflicts\n"
-pr_comment += "Please resolve them and mark this PR as ready for review"
+pr_comment += "Please resolve them and remove the 'conflicts' label. The PR will be made ready for review automatically."
backport_pr.create_issue_comment(pr_comment)
# Apply all labels at once if we have any
@@ -142,20 +142,31 @@ def backport(repo, pr, version, commits, backport_base_branch, is_collaborator):
def with_github_keyword_prefix(repo, pr):
pattern = rf"(?:fix(?:|es|ed))\s*:?\s*(?:(?:(?:{repo.full_name})?#)|https://github\.com/{repo.full_name}/issues/)(\d+)"
match = re.findall(pattern, pr.body, re.IGNORECASE)
if not match:
for commit in pr.get_commits():
match = re.findall(pattern, commit.commit.message, re.IGNORECASE)
if match:
print(f'{pr.number} has a valid close reference in commit message {commit.sha}')
break
if not match:
print(f'No valid close reference for {pr.number}')
return False
else:
# GitHub issue pattern: #123, scylladb/scylladb#123, or full GitHub URLs
github_pattern = rf"(?:fix(?:|es|ed))\s*:?\s*(?:(?:(?:{repo.full_name})?#)|https://github\.com/{repo.full_name}/issues/)(\d+)"
# JIRA issue pattern: PKG-92 or https://scylladb.atlassian.net/browse/PKG-92
jira_pattern = r"(?:fix(?:|es|ed))\s*:?\s*(?:(?:https://scylladb\.atlassian\.net/browse/)?([A-Z]+-\d+))"
# Check PR body for GitHub issues
github_match = re.findall(github_pattern, pr.body, re.IGNORECASE)
# Check PR body for JIRA issues
jira_match = re.findall(jira_pattern, pr.body, re.IGNORECASE)
match = github_match or jira_match
if match:
return True
for commit in pr.get_commits():
github_match = re.findall(github_pattern, commit.commit.message, re.IGNORECASE)
jira_match = re.findall(jira_pattern, commit.commit.message, re.IGNORECASE)
if github_match or jira_match:
print(f'{pr.number} has a valid close reference in commit message {commit.sha}')
return True
print(f'No valid close reference for {pr.number}')
return False
def main():
args = parse_args()
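
The two patterns introduced in this hunk can be tried in isolation. The sketch below copies them from the diff and substitutes a fixed example value for `repo.full_name`; the `close_references` helper is hypothetical and only exists to exercise the patterns.

```python
import re

REPO = "scylladb/scylladb"  # example value standing in for repo.full_name

# GitHub issue pattern: #123, scylladb/scylladb#123, or a full issue URL.
GITHUB_PATTERN = rf"(?:fix(?:|es|ed))\s*:?\s*(?:(?:(?:{REPO})?#)|https://github\.com/{REPO}/issues/)(\d+)"
# JIRA issue pattern: PKG-92 or https://scylladb.atlassian.net/browse/PKG-92.
JIRA_PATTERN = r"(?:fix(?:|es|ed))\s*:?\s*(?:(?:https://scylladb\.atlassian\.net/browse/)?([A-Z]+-\d+))"


def close_references(text):
    """Return every issue id that the text closes, GitHub ids first."""
    github = re.findall(GITHUB_PATTERN, text, re.IGNORECASE)
    jira = re.findall(JIRA_PATTERN, text, re.IGNORECASE)
    return github + jira
```

For example, `close_references("Fixes https://scylladb.atlassian.net/browse/SCYLLADB-1103")` yields the JIRA key, while `"Refs #25097"` yields nothing. Because `re.IGNORECASE` applies to the whole pattern, a lowercase key such as `pkg-92` also matches.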

View File

@@ -18,7 +18,7 @@ jobs:
// Regular expression pattern to check for "Fixes" prefix
// Adjusted to dynamically insert the repository full name
-const pattern = `Fixes:? (?:#|${repo.replace('/', '\\/')}#|https://github\\.com/${repo.replace('/', '\\/')}/issues/)(\\d+)`;
+const pattern = `Fixes:? ((?:#|${repo.replace('/', '\\/')}#|https://github\\.com/${repo.replace('/', '\\/')}/issues/)(\\d+)|(?:https://scylladb\\.atlassian\\.net/browse/)?([A-Z]+-\\d+))`;
const regex = new RegExp(pattern);
if (!regex.test(body)) {

View File

@@ -0,0 +1,53 @@
name: Backport with Jira Integration
on:
push:
branches:
- master
- next-*.*
- branch-*.*
pull_request_target:
types: [labeled, closed]
branches:
- master
- next
- next-*.*
- branch-*.*
jobs:
backport-on-push:
if: github.event_name == 'push'
uses: scylladb/github-automation/.github/workflows/backport-with-jira.yaml@main
with:
event_type: 'push'
base_branch: ${{ github.ref }}
commits: ${{ github.event.before }}..${{ github.sha }}
secrets:
gh_token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
backport-on-label:
if: github.event_name == 'pull_request_target' && github.event.action == 'labeled'
uses: scylladb/github-automation/.github/workflows/backport-with-jira.yaml@main
with:
event_type: 'labeled'
base_branch: refs/heads/${{ github.event.pull_request.base.ref }}
pull_request_number: ${{ github.event.pull_request.number }}
head_commit: ${{ github.event.pull_request.base.sha }}
label_name: ${{ github.event.label.name }}
pr_state: ${{ github.event.pull_request.state }}
secrets:
gh_token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}
backport-chain:
if: github.event_name == 'pull_request_target' && github.event.action == 'closed' && github.event.pull_request.merged == true
uses: scylladb/github-automation/.github/workflows/backport-with-jira.yaml@main
with:
event_type: 'chain'
base_branch: refs/heads/${{ github.event.pull_request.base.ref }}
pull_request_number: ${{ github.event.pull_request.number }}
pr_body: ${{ github.event.pull_request.body }}
secrets:
gh_token: ${{ secrets.AUTO_BACKPORT_TOKEN }}
jira_auth: ${{ secrets.USER_AND_KEY_FOR_JIRA_AUTOMATION }}

View File

@@ -3,19 +3,63 @@ name: Trigger Scylla CI Route
on:
issue_comment:
types: [created]
pull_request_target:
types:
- unlabeled
jobs:
trigger-jenkins:
if: github.event.comment.user.login != 'scylladbbot' && contains(github.event.comment.body, '@scylladbbot') && contains(github.event.comment.body, 'trigger-ci')
if: (github.event_name == 'issue_comment' && github.event.comment.user.login != 'scylladbbot') || github.event.label.name == 'conflicts'
runs-on: ubuntu-latest
steps:
- name: Verify Org Membership
id: verify_author
env:
EVENT_NAME: ${{ github.event_name }}
PR_AUTHOR: ${{ github.event.pull_request.user.login }}
PR_ASSOCIATION: ${{ github.event.pull_request.author_association }}
COMMENT_AUTHOR: ${{ github.event.comment.user.login }}
COMMENT_ASSOCIATION: ${{ github.event.comment.author_association }}
shell: bash
run: |
if [[ "$EVENT_NAME" == "pull_request_target" ]]; then
AUTHOR="$PR_AUTHOR"
ASSOCIATION="$PR_ASSOCIATION"
else
AUTHOR="$COMMENT_AUTHOR"
ASSOCIATION="$COMMENT_ASSOCIATION"
fi
ORG="scylladb"
if gh api "/orgs/${ORG}/members/${AUTHOR}" --silent 2>/dev/null; then
echo "member=true" >> $GITHUB_OUTPUT
else
echo "::warning::${AUTHOR} is not a member of ${ORG}; skipping CI trigger."
echo "member=false" >> $GITHUB_OUTPUT
fi
- name: Validate Comment Trigger
if: github.event_name == 'issue_comment'
id: verify_comment
env:
COMMENT_BODY: ${{ github.event.comment.body }}
shell: bash
run: |
CLEAN_BODY=$(echo "$COMMENT_BODY" | grep -v '^[[:space:]]*>')
if echo "$CLEAN_BODY" | grep -qi '@scylladbbot' && echo "$CLEAN_BODY" | grep -qi 'trigger-ci'; then
echo "trigger=true" >> $GITHUB_OUTPUT
else
echo "trigger=false" >> $GITHUB_OUTPUT
fi
- name: Trigger Scylla-CI-Route Jenkins Job
if: steps.verify_author.outputs.member == 'true' && (github.event_name == 'pull_request_target' || steps.verify_comment.outputs.trigger == 'true')
env:
JENKINS_USER: ${{ secrets.JENKINS_USERNAME }}
JENKINS_API_TOKEN: ${{ secrets.JENKINS_TOKEN }}
JENKINS_URL: "https://jenkins.scylladb.com"
PR_NUMBER: "${{ github.event.issue.number || github.event.pull_request.number }}"
PR_REPO_NAME: "${{ github.event.repository.full_name }}"
run: |
PR_NUMBER=${{ github.event.issue.number }}
PR_REPO_NAME=${{ github.event.repository.full_name }}
curl -X POST "$JENKINS_URL/job/releng/job/Scylla-CI-Route/buildWithParameters?PR_NUMBER=$PR_NUMBER&PR_REPO_NAME=$PR_REPO_NAME" \
--user "$JENKINS_USER:$JENKINS_API_TOKEN" --fail -i -v
--user "$JENKINS_USER:$JENKINS_API_TOKEN" --fail
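
The "Validate Comment Trigger" step in this workflow drops quoted lines before looking for the trigger words, so quoting an earlier trigger comment does not start CI again. A rough Python equivalent of that shell logic follows; the function name is hypothetical, and `\s` is used as an approximation of `[[:space:]]`.

```python
import re


def comment_requests_ci(body: str) -> bool:
    # Like `grep -v '^[[:space:]]*>'`: drop Markdown-quoted lines.
    clean_lines = [line for line in body.splitlines() if not re.match(r"\s*>", line)]
    text = "\n".join(clean_lines).lower()
    # Like the two `grep -qi` calls: both tokens must appear, case-insensitively,
    # though not necessarily on the same line.
    return "@scylladbbot" in text and "trigger-ci" in text
```

As in the workflow, this check only decides whether the comment asks for CI; the separate organization membership step still has to pass before the Jenkins job is triggered.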

2
.gitmodules vendored
View File

@@ -1,6 +1,6 @@
[submodule "seastar"]
path = seastar
-url = ../seastar
+url = ../scylla-seastar
ignore = dirty
[submodule "swagger-ui"]
path = swagger-ui

View File

@@ -78,7 +78,7 @@ fi
# Default scylla product/version tags
PRODUCT=scylla
-VERSION=2025.4.0-dev
+VERSION=2025.4.6
if test -f version
then

View File

@@ -136,6 +136,7 @@ future<> controller::start_server() {
[this, addr, alternator_port, alternator_https_port, creds = std::move(creds)] (server& server) mutable {
return server.init(addr, alternator_port, alternator_https_port, creds,
_config.alternator_enforce_authorization,
_config.alternator_warn_authorization,
&_memory_limiter.local().get_semaphore(),
_config.max_concurrent_requests_per_shard);
}).handle_exception([this, addr, alternator_port, alternator_https_port] (std::exception_ptr ep) {

View File

@@ -16,6 +16,7 @@
#include "cdc/cdc_options.hh"
#include "auth/service.hh"
#include "db/config.hh"
#include "db/view/view_build_status.hh"
#include "mutation/tombstone.hh"
#include "utils/log.hh"
#include "schema/schema_builder.hh"
@@ -107,6 +108,20 @@ extern const sstring TTL_TAG_KEY("system:ttl_attribute");
// following ones are base table's keys added as needed or range key list will be empty.
static const sstring SPURIOUS_RANGE_KEY_ADDED_TO_GSI_AND_USER_DIDNT_SPECIFY_RANGE_KEY_TAG_KEY("system:spurious_range_key_added_to_gsi_and_user_didnt_specify_range_key");
// The following tags also have the "system:" prefix but are NOT used
// by Alternator to store table properties - only the user ever writes to
// them, as a way to configure the table. As such, these tags are writable
// (and readable) by the user, and not hidden by tag_key_is_internal().
// The reason why both hidden (internal) and user-configurable tags share the
// same "system:" prefix is historic.
// Setting the tag with a numeric value will enable a specific initial number
// of tablets (setting the value to 0 means enabling tablets with
// an automatic selection of the best number of tablets).
// Setting this tag to any non-numeric value (e.g., an empty string or the
// word "none") will ask to disable tablets.
static constexpr auto INITIAL_TABLETS_TAG_KEY = "system:initial_tablets";
enum class table_status {
active = 0,
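
The semantics described in the comment on `system:initial_tablets` can be summarized with a small Python sketch. The function name and return shape are hypothetical, and treating only ASCII digit strings as "numeric" is an assumption; the C++ code may parse the value differently.

```python
def parse_initial_tablets_tag(value: str):
    """Return (tablets_enabled, initial_tablets); None means automatic selection."""
    if value.isascii() and value.isdigit():
        count = int(value)
        return (True, None if count == 0 else count)
    # Any non-numeric value (empty string, "none", ...) asks to disable tablets.
    return (False, None)
```

For instance, `"8"` requests eight initial tablets, `"0"` enables tablets with an automatically chosen count, and `"none"` or an empty string disables tablets.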
@@ -129,7 +144,8 @@ static std::string_view table_status_to_sstring(table_status tbl_status) {
return "UNKNOWN";
}
-static lw_shared_ptr<keyspace_metadata> create_keyspace_metadata(std::string_view keyspace_name, service::storage_proxy& sp, gms::gossiper& gossiper, api::timestamp_type, const std::map<sstring, sstring>& tags_map, const gms::feature_service& feat);
+static lw_shared_ptr<keyspace_metadata> create_keyspace_metadata(std::string_view keyspace_name, service::storage_proxy& sp, gms::gossiper& gossiper, api::timestamp_type,
+const std::map<sstring, sstring>& tags_map, const gms::feature_service& feat, const db::tablets_mode_t::mode tablets_mode);
static map_type attrs_type() {
static thread_local auto t = map_type_impl::get_instance(utf8_type, bytes_type, true);
@@ -243,7 +259,8 @@ executor::executor(gms::gossiper& gossiper,
_mm(mm),
_sdks(sdks),
_cdc_metadata(cdc_metadata),
-_enforce_authorization(_proxy.data_dictionary().get_config().alternator_enforce_authorization()),
+_enforce_authorization(_proxy.data_dictionary().get_config().alternator_enforce_authorization),
+_warn_authorization(_proxy.data_dictionary().get_config().alternator_warn_authorization),
_ssg(ssg),
_parsed_expression_cache(std::make_unique<parsed::expression_cache>(
parsed::expression_cache::config{_proxy.data_dictionary().get_config().alternator_max_expression_cache_entries_per_shard},
@@ -879,15 +896,37 @@ future<executor::request_return_type> executor::describe_table(client_state& cli
co_return rjson::print(std::move(response));
}
// This function increments the authorization_failures counter, and may also
// log a warn-level message and/or throw an access_denied exception, depending
// on what enforce_authorization and warn_authorization are set to.
// Note that if enforce_authorization is false, this function will return
// without throwing. So a caller that doesn't want to continue after an
// authorization_error must explicitly return after calling this function.
static void authorization_error(alternator::stats& stats, bool enforce_authorization, bool warn_authorization, std::string msg) {
stats.authorization_failures++;
if (enforce_authorization) {
if (warn_authorization) {
elogger.warn("alternator_warn_authorization=true: {}", msg);
}
throw api_error::access_denied(std::move(msg));
} else {
if (warn_authorization) {
elogger.warn("If you set alternator_enforce_authorization=true the following will be enforced: {}", msg);
}
}
}
// Check CQL's Role-Based Access Control (RBAC) permission_to_check (MODIFY,
// SELECT, DROP, etc.) on the given table. When permission is denied an
// appropriate user-readable api_error::access_denied is thrown.
future<> verify_permission(
bool enforce_authorization,
bool warn_authorization,
const service::client_state& client_state,
const schema_ptr& schema,
auth::permission permission_to_check) {
if (!enforce_authorization) {
auth::permission permission_to_check,
alternator::stats& stats) {
if (!enforce_authorization && !warn_authorization) {
co_return;
}
// Unfortunately, the fix for issue #23218 did not modify the function
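
The behavior of the new `authorization_error()` helper for the combinations of the two options can be modeled in Python. This is a sketch of the logic in the hunk above, not the C++ code itself: `AccessDenied` and `Stats` are stand-ins, and the logger name is made up.

```python
import logging
from dataclasses import dataclass

log = logging.getLogger("alternator-sketch")


class AccessDenied(Exception):
    """Stand-in for api_error::access_denied."""


@dataclass
class Stats:
    authorization_failures: int = 0


def authorization_error(stats, enforce_authorization, warn_authorization, msg):
    stats.authorization_failures += 1          # always counted
    if enforce_authorization:
        if warn_authorization:
            log.warning("alternator_warn_authorization=true: %s", msg)
        raise AccessDenied(msg)                # enforcement: the request fails
    if warn_authorization:
        # Warn-only mode: log what would have been rejected, then return.
        log.warning("If you set alternator_enforce_authorization=true "
                    "the following will be enforced: %s", msg)
```

Note that in the diff `verify_permission()` returns early when both options are false, so in that configuration the helper is never reached and the counter is not incremented.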
@@ -902,31 +941,33 @@ future<> verify_permission(
if (client_state.user() && client_state.user()->name) {
username = client_state.user()->name.value();
}
throw api_error::access_denied(fmt::format(
authorization_error(stats, enforce_authorization, warn_authorization, fmt::format(
"Write access denied on internal table {}.{} to role {} because it is not a superuser",
schema->ks_name(), schema->cf_name(), username));
co_return;
}
}
auto resource = auth::make_data_resource(schema->ks_name(), schema->cf_name());
if (!co_await client_state.check_has_permission(auth::command_desc(permission_to_check, resource))) {
if (!client_state.user() || !client_state.user()->name ||
!co_await client_state.check_has_permission(auth::command_desc(permission_to_check, resource))) {
sstring username = "<anonymous>";
if (client_state.user() && client_state.user()->name) {
username = client_state.user()->name.value();
}
// Using exceptions for errors makes this function faster in the
// success path (when the operation is allowed).
throw api_error::access_denied(format(
"{} access on table {}.{} is denied to role {}",
authorization_error(stats, enforce_authorization, warn_authorization, fmt::format(
"{} access on table {}.{} is denied to role {}, client address {}",
auth::permissions::to_string(permission_to_check),
schema->ks_name(), schema->cf_name(), username));
schema->ks_name(), schema->cf_name(), username, client_state.get_client_address()));
}
}
// Similar to verify_permission() above, but just for CREATE operations.
// Those do not operate on any specific table, so require permissions on
// ALL KEYSPACES instead of any specific table.
future<> verify_create_permission(bool enforce_authorization, const service::client_state& client_state) {
if (!enforce_authorization) {
static future<> verify_create_permission(bool enforce_authorization, bool warn_authorization, const service::client_state& client_state, alternator::stats& stats) {
if (!enforce_authorization && !warn_authorization) {
co_return;
}
auto resource = auth::resource(auth::resource_kind::data);
@@ -935,7 +976,7 @@ future<> verify_create_permission(bool enforce_authorization, const service::cli
if (client_state.user() && client_state.user()->name) {
username = client_state.user()->name.value();
}
throw api_error::access_denied(format(
authorization_error(stats, enforce_authorization, warn_authorization, fmt::format(
"CREATE access on ALL KEYSPACES is denied to role {}", username));
}
}
@@ -952,7 +993,7 @@ future<executor::request_return_type> executor::delete_table(client_state& clien
schema_ptr schema = get_table(_proxy, request);
rjson::value table_description = co_await fill_table_description(schema, table_status::deleting, _proxy, client_state, trace_state, permit);
co_await verify_permission(_enforce_authorization, client_state, schema, auth::permission::DROP);
co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, schema, auth::permission::DROP, _stats);
co_await _mm.container().invoke_on(0, [&, cs = client_state.move_to_other_shard()] (service::migration_manager& mm) -> future<> {
size_t retries = mm.get_concurrent_ddl_retries();
for (;;) {
@@ -966,8 +1007,8 @@ future<executor::request_return_type> executor::delete_table(client_state& clien
throw api_error::resource_not_found(fmt::format("Requested resource not found: Table: {} not found", table_name));
}
auto m = co_await service::prepare_column_family_drop_announcement(_proxy, keyspace_name, table_name, group0_guard.write_timestamp(), service::drop_views::yes);
auto m2 = co_await service::prepare_keyspace_drop_announcement(_proxy, keyspace_name, group0_guard.write_timestamp());
auto m = co_await service::prepare_column_family_drop_announcement(p.local(), keyspace_name, table_name, group0_guard.write_timestamp(), service::drop_views::yes);
auto m2 = co_await service::prepare_keyspace_drop_announcement(p.local(), keyspace_name, group0_guard.write_timestamp());
std::move(m2.begin(), m2.end(), std::back_inserter(m));
@@ -1204,12 +1245,13 @@ void rmw_operation::set_default_write_isolation(std::string_view value) {
// Alternator uses tags whose keys start with the "system:" prefix for
// internal purposes. Those should not be readable by ListTagsOfResource,
// nor writable with TagResource or UntagResource (see #24098).
// Only a few specific system tags, currently only system:write_isolation,
// are deliberately intended to be set and read by the user, so are not
// considered "internal".
// Only a few specific system tags, currently only "system:write_isolation"
// and "system:initial_tablets", are deliberately intended to be set and read
// by the user, so are not considered "internal".
static bool tag_key_is_internal(std::string_view tag_key) {
return tag_key.starts_with("system:") &&
tag_key != rmw_operation::WRITE_ISOLATION_TAG_KEY;
return tag_key.starts_with("system:")
&& tag_key != rmw_operation::WRITE_ISOLATION_TAG_KEY
&& tag_key != INITIAL_TABLETS_TAG_KEY;
}
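The effect of the new `tag_key_is_internal()` can be checked in isolation. The sketch below assumes the two user-visible keys have the string values named in the comment above (`system:write_isolation` and `system:initial_tablets`); the `_sketch` names and the sample tag `system:some_internal_tag` are hypothetical.

```cpp
#include <string_view>

// Assumed values, taken from the comment in the diff.
constexpr std::string_view SYSTEM_PREFIX_SKETCH = "system:";
constexpr std::string_view WRITE_ISOLATION_TAG_KEY_SKETCH = "system:write_isolation";
constexpr std::string_view INITIAL_TABLETS_TAG_KEY_SKETCH = "system:initial_tablets";

// A key is internal if it has the "system:" prefix and is not one of the
// keys that users are deliberately allowed to set and read.
constexpr bool tag_key_is_internal_sketch(std::string_view tag_key) {
    return tag_key.substr(0, SYSTEM_PREFIX_SKETCH.size()) == SYSTEM_PREFIX_SKETCH
        && tag_key != WRITE_ISOLATION_TAG_KEY_SKETCH
        && tag_key != INITIAL_TABLETS_TAG_KEY_SKETCH;
}

// Compile-time checks of the three interesting cases.
static_assert(tag_key_is_internal_sketch("system:some_internal_tag"));
static_assert(!tag_key_is_internal_sketch("system:initial_tablets"));
static_assert(!tag_key_is_internal_sketch("user_tag"));
```

The sketch uses `substr()` rather than `starts_with()` only so that it also compiles as C++17.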
enum class update_tags_action { add_tags, delete_tags };
@@ -1290,7 +1332,7 @@ future<executor::request_return_type> executor::tag_resource(client_state& clien
if (tags->Size() < 1) {
co_return api_error::validation("The number of tags must be at least 1") ;
}
co_await verify_permission(_enforce_authorization, client_state, schema, auth::permission::ALTER);
co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, schema, auth::permission::ALTER, _stats);
co_await db::modify_tags(_mm, schema->ks_name(), schema->cf_name(), [tags](std::map<sstring, sstring>& tags_map) {
update_tags_map(*tags, tags_map, update_tags_action::add_tags);
});
@@ -1311,7 +1353,7 @@ future<executor::request_return_type> executor::untag_resource(client_state& cli
schema_ptr schema = get_table_from_arn(_proxy, rjson::to_string_view(*arn));
get_stats_from_schema(_proxy, *schema)->api_operations.untag_resource++;
co_await verify_permission(_enforce_authorization, client_state, schema, auth::permission::ALTER);
co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, schema, auth::permission::ALTER, _stats);
co_await db::modify_tags(_mm, schema->ks_name(), schema->cf_name(), [tags](std::map<sstring, sstring>& tags_map) {
update_tags_map(*tags, tags_map, update_tags_action::delete_tags);
});
@@ -1496,8 +1538,26 @@ bytes extract_from_attrs_column_computation::compute_value(const schema&, const
on_internal_error(elogger, "extract_from_attrs_column_computation::compute_value called without row");
}
// Because `CreateTable` request creates GSI/LSI together with the base table (so the base table is empty),
// we can skip view building process and immediately mark the view as built on all nodes.
//
// However, we can do this only for tablet-based views because `view_building_worker` will automatically propagate
// this information to `system.built_views` table (see `view_building_worker::update_built_views()`).
// For vnode-based views, `view_builder` will process the view and mark it as built.
static future<> mark_view_schemas_as_built(utils::chunked_vector<mutation>& out, std::vector<schema_ptr> schemas, api::timestamp_type ts, service::storage_proxy& sp) {
auto token_metadata = sp.get_token_metadata_ptr();
for (auto& schema: schemas) {
if (schema->is_view()) {
for (auto& host_id: token_metadata->get_topology().get_all_host_ids()) {
auto view_status_mut = co_await sp.system_keyspace().make_view_build_status_mutation(ts, {schema->ks_name(), schema->cf_name()}, host_id, db::view::build_status::SUCCESS);
out.push_back(std::move(view_status_mut));
}
}
}
}
static future<executor::request_return_type> create_table_on_shard0(service::client_state&& client_state, tracing::trace_state_ptr trace_state, rjson::value request, service::storage_proxy& sp, service::migration_manager& mm, gms::gossiper& gossiper, bool enforce_authorization) {
static future<executor::request_return_type> create_table_on_shard0(service::client_state&& client_state, tracing::trace_state_ptr trace_state, rjson::value request,
service::storage_proxy& sp, service::migration_manager& mm, gms::gossiper& gossiper, bool enforce_authorization, bool warn_authorization, stats& stats, const db::tablets_mode_t::mode tablets_mode) {
SCYLLA_ASSERT(this_shard_id() == 0);
// We begin by parsing and validating the content of the CreateTable
@@ -1703,7 +1763,7 @@ static future<executor::request_return_type> create_table_on_shard0(service::cli
set_table_creation_time(tags_map, db_clock::now());
builder.add_extension(db::tags_extension::NAME, ::make_shared<db::tags_extension>(tags_map));
co_await verify_create_permission(enforce_authorization, client_state);
co_await verify_create_permission(enforce_authorization, warn_authorization, client_state, stats);
schema_ptr schema = builder.build();
for (auto& view_builder : view_builders) {
@@ -1724,7 +1784,7 @@ static future<executor::request_return_type> create_table_on_shard0(service::cli
auto group0_guard = co_await mm.start_group0_operation();
auto ts = group0_guard.write_timestamp();
utils::chunked_vector<mutation> schema_mutations;
auto ksm = create_keyspace_metadata(keyspace_name, sp, gossiper, ts, tags_map, sp.features());
auto ksm = create_keyspace_metadata(keyspace_name, sp, gossiper, ts, tags_map, sp.features(), tablets_mode);
// Alternator Streams doesn't yet work when the table uses tablets (#23838)
if (stream_specification && stream_specification->IsObject()) {
auto stream_enabled = rjson::find(*stream_specification, "StreamEnabled");
@@ -1733,10 +1793,15 @@ static future<executor::request_return_type> create_table_on_shard0(service::cli
auto rs = locator::abstract_replication_strategy::create_replication_strategy(ksm->strategy_name(), params);
if (rs->uses_tablets()) {
co_return api_error::validation("Streams not yet supported on a table using tablets (issue #23838). "
"If you want to use streams, create a table with vnodes by setting the tag 'experimental:initial_tablets' set to 'none'.");
"If you want to use streams, create a table with vnodes by setting the tag 'system:initial_tablets' set to 'none'.");
}
}
}
// Creating an index in tablets mode requires the rf_rack_valid_keyspaces option to be enabled.
// GSI and LSI indexes are based on materialized views which require this option to avoid consistency issues.
if (!view_builders.empty() && ksm->uses_tablets() && !sp.data_dictionary().get_config().rf_rack_valid_keyspaces()) {
co_return api_error::validation("GlobalSecondaryIndexes and LocalSecondaryIndexes with tablets require the rf_rack_valid_keyspaces option to be enabled.");
}
try {
schema_mutations = service::prepare_new_keyspace_announcement(sp.local_db(), ksm, ts);
} catch (exceptions::already_exists_exception&) {
@@ -1754,6 +1819,9 @@ static future<executor::request_return_type> create_table_on_shard0(service::cli
schemas.push_back(view_builder.build());
}
co_await service::prepare_new_column_families_announcement(schema_mutations, sp, *ksm, schemas, ts);
if (ksm->uses_tablets()) {
co_await mark_view_schemas_as_built(schema_mutations, schemas, ts, sp);
}
// If a role is allowed to create a table, we must give it permissions to
// use (and eventually delete) the specific table it just created (and
@@ -1800,9 +1868,10 @@ future<executor::request_return_type> executor::create_table(client_state& clien
_stats.api_operations.create_table++;
elogger.trace("Creating table {}", request);
co_return co_await _mm.container().invoke_on(0, [&, tr = tracing::global_trace_state_ptr(trace_state), request = std::move(request), &sp = _proxy.container(), &g = _gossiper.container(), client_state_other_shard = client_state.move_to_other_shard(), enforce_authorization = bool(_enforce_authorization)]
co_return co_await _mm.container().invoke_on(0, [&, tr = tracing::global_trace_state_ptr(trace_state), request = std::move(request), &sp = _proxy.container(), &g = _gossiper.container(), client_state_other_shard = client_state.move_to_other_shard(), enforce_authorization = bool(_enforce_authorization), warn_authorization = bool(_warn_authorization)]
(service::migration_manager& mm) mutable -> future<executor::request_return_type> {
co_return co_await create_table_on_shard0(client_state_other_shard.get(), tr, std::move(request), sp.local(), mm, g.local(), enforce_authorization);
const db::tablets_mode_t::mode tablets_mode = _proxy.data_dictionary().get_config().tablets_mode_for_new_keyspaces(); // type cast
co_return co_await create_table_on_shard0(client_state_other_shard.get(), tr, std::move(request), sp.local(), mm, g.local(), enforce_authorization, warn_authorization, _stats, std::move(tablets_mode));
});
}
@@ -1855,7 +1924,7 @@ future<executor::request_return_type> executor::update_table(client_state& clien
verify_billing_mode(request);
}
co_return co_await _mm.container().invoke_on(0, [&p = _proxy.container(), request = std::move(request), gt = tracing::global_trace_state_ptr(std::move(trace_state)), enforce_authorization = bool(_enforce_authorization), client_state_other_shard = client_state.move_to_other_shard(), empty_request]
co_return co_await _mm.container().invoke_on(0, [&p = _proxy.container(), request = std::move(request), gt = tracing::global_trace_state_ptr(std::move(trace_state)), enforce_authorization = bool(_enforce_authorization), warn_authorization = bool(_warn_authorization), client_state_other_shard = client_state.move_to_other_shard(), empty_request, &e = this->container()]
(service::migration_manager& mm) mutable -> future<executor::request_return_type> {
schema_ptr schema;
size_t retries = mm.get_concurrent_ddl_retries();
@@ -1886,7 +1955,7 @@ future<executor::request_return_type> executor::update_table(client_state& clien
if (stream_enabled->GetBool()) {
if (p.local().local_db().find_keyspace(tab->ks_name()).get_replication_strategy().uses_tablets()) {
co_return api_error::validation("Streams not yet supported on a table using tablets (issue #23838). "
"If you want to enable streams, re-create this table with vnodes (with the tag 'experimental:initial_tablets' set to 'none').");
"If you want to enable streams, re-create this table with vnodes (with the tag 'system:initial_tablets' set to 'none').");
}
if (tab->cdc_options().enabled()) {
co_return api_error::validation("Table already has an enabled stream: TableName: " + tab->cf_name());
@@ -1953,6 +2022,10 @@ future<executor::request_return_type> executor::update_table(client_state& clien
co_return api_error::validation(fmt::format(
"LSI {} already exists in table {}, can't use same name for GSI", index_name, table_name));
}
if (p.local().local_db().find_keyspace(keyspace_name).get_replication_strategy().uses_tablets() &&
!p.local().data_dictionary().get_config().rf_rack_valid_keyspaces()) {
co_return api_error::validation("GlobalSecondaryIndexes with tablets require the rf_rack_valid_keyspaces option to be enabled.");
}
elogger.trace("Adding GSI {}", index_name);
// FIXME: read and handle "Projection" parameter. This will
@@ -2026,7 +2099,7 @@ future<executor::request_return_type> executor::update_table(client_state& clien
co_return api_error::validation("UpdateTable requires one of GlobalSecondaryIndexUpdates, StreamSpecification or BillingMode to be specified");
}
co_await verify_permission(enforce_authorization, client_state_other_shard.get(), schema, auth::permission::ALTER);
co_await verify_permission(enforce_authorization, warn_authorization, client_state_other_shard.get(), schema, auth::permission::ALTER, e.local()._stats);
auto m = co_await service::prepare_column_family_update_announcement(p.local(), schema, std::vector<view_ptr>(), group0_guard.write_timestamp());
for (view_ptr view : new_views) {
auto m2 = co_await service::prepare_new_view_announcement(p.local(), view, group0_guard.write_timestamp());
@@ -2789,7 +2862,7 @@ future<executor::request_return_type> executor::put_item(client_state& client_st
tracing::add_table_name(trace_state, op->schema()->ks_name(), op->schema()->cf_name());
const bool needs_read_before_write = op->needs_read_before_write();
co_await verify_permission(_enforce_authorization, client_state, op->schema(), auth::permission::MODIFY);
co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, op->schema(), auth::permission::MODIFY, _stats);
auto cas_shard = op->shard_for_execute(needs_read_before_write);
@@ -2892,7 +2965,7 @@ future<executor::request_return_type> executor::delete_item(client_state& client
tracing::add_table_name(trace_state, op->schema()->ks_name(), op->schema()->cf_name());
const bool needs_read_before_write = _proxy.data_dictionary().get_config().alternator_force_read_before_write() || op->needs_read_before_write();
co_await verify_permission(_enforce_authorization, client_state, op->schema(), auth::permission::MODIFY);
co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, op->schema(), auth::permission::MODIFY, _stats);
auto cas_shard = op->shard_for_execute(needs_read_before_write);
@@ -2954,12 +3027,15 @@ struct primary_key_equal {
// done is known prior to starting the operation). Nevertheless, we want to
// do this mutation via LWT to ensure that it is serialized with other LWT
// mutations to the same partition.
//
// The std::vector<put_or_delete_item> must remain alive until the
// storage_proxy::cas() future is resolved.
class put_or_delete_item_cas_request : public service::cas_request {
schema_ptr schema;
std::vector<put_or_delete_item> _mutation_builders;
const std::vector<put_or_delete_item>& _mutation_builders;
public:
put_or_delete_item_cas_request(schema_ptr s, std::vector<put_or_delete_item>&& b) :
schema(std::move(s)), _mutation_builders(std::move(b)) { }
put_or_delete_item_cas_request(schema_ptr s, const std::vector<put_or_delete_item>& b) :
schema(std::move(s)), _mutation_builders(b) { }
virtual ~put_or_delete_item_cas_request() = default;
virtual std::optional<mutation> apply(foreign_ptr<lw_shared_ptr<query::result>> qr, const query::partition_slice& slice, api::timestamp_type ts) override {
std::optional<mutation> ret;
@@ -2975,11 +3051,38 @@ public:
}
};
static future<> cas_write(service::storage_proxy& proxy, schema_ptr schema, service::cas_shard cas_shard, dht::decorated_key dk, std::vector<put_or_delete_item>&& mutation_builders,
service::client_state& client_state, tracing::trace_state_ptr trace_state, service_permit permit) {
future<> executor::cas_write(schema_ptr schema, service::cas_shard cas_shard, const dht::decorated_key& dk,
const std::vector<put_or_delete_item>& mutation_builders, service::client_state& client_state,
tracing::trace_state_ptr trace_state, service_permit permit)
{
if (!cas_shard.this_shard()) {
_stats.shard_bounce_for_lwt++;
return container().invoke_on(cas_shard.shard(), _ssg,
[cs = client_state.move_to_other_shard(),
&mb = mutation_builders,
&dk,
ks = schema->ks_name(),
cf = schema->cf_name(),
gt = tracing::global_trace_state_ptr(trace_state),
permit = std::move(permit)]
(executor& self) mutable {
return do_with(cs.get(), [&mb, &dk, ks = std::move(ks), cf = std::move(cf),
trace_state = tracing::trace_state_ptr(gt), &self]
(service::client_state& client_state) mutable {
auto schema = self._proxy.data_dictionary().find_schema(ks, cf);
service::cas_shard cas_shard(*schema, dk.token());
//FIXME: Instead of passing empty_service_permit() to the background operation,
// the current permit's lifetime should be prolonged, so that it's destructed
// only after all background operations are finished as well.
return self.cas_write(schema, std::move(cas_shard), dk, mb, client_state, std::move(trace_state), empty_service_permit());
});
});
}
auto timeout = executor::default_timeout();
auto op = seastar::make_shared<put_or_delete_item_cas_request>(schema, std::move(mutation_builders));
return proxy.cas(schema, std::move(cas_shard), op, nullptr, to_partition_ranges(dk),
auto op = seastar::make_shared<put_or_delete_item_cas_request>(schema, mutation_builders);
return _proxy.cas(schema, std::move(cas_shard), op, nullptr, to_partition_ranges(dk),
{timeout, std::move(permit), client_state, trace_state},
db::consistency_level::LOCAL_SERIAL, db::consistency_level::LOCAL_QUORUM,
timeout, timeout).discard_result();
@@ -3005,13 +3108,11 @@ struct schema_decorated_key_equal {
// FIXME: if we failed writing some of the mutations, need to return a list
// of these failed mutations rather than fail the whole write (issue #5650).
static future<> do_batch_write(service::storage_proxy& proxy,
smp_service_group ssg,
future<> executor::do_batch_write(
std::vector<std::pair<schema_ptr, put_or_delete_item>> mutation_builders,
service::client_state& client_state,
tracing::trace_state_ptr trace_state,
service_permit permit,
stats& stats) {
service_permit permit) {
if (mutation_builders.empty()) {
return make_ready_future<>();
}
@@ -3031,7 +3132,7 @@ static future<> do_batch_write(service::storage_proxy& proxy,
for (auto& b : mutation_builders) {
mutations.push_back(b.second.build(b.first, now));
}
return proxy.mutate(std::move(mutations),
return _proxy.mutate(std::move(mutations),
db::consistency_level::LOCAL_QUORUM,
executor::default_timeout(),
trace_state,
@@ -3042,48 +3143,41 @@ static future<> do_batch_write(service::storage_proxy& proxy,
// Multiple mutations may be destined for the same partition, adding
// or deleting different items of one partition. Join them together
// because we can do them in one cas() call.
std::unordered_map<schema_decorated_key, std::vector<put_or_delete_item>, schema_decorated_key_hash, schema_decorated_key_equal>
key_builders(1, schema_decorated_key_hash{}, schema_decorated_key_equal{});
for (auto& b : mutation_builders) {
auto dk = dht::decorate_key(*b.first, b.second.pk());
auto [it, added] = key_builders.try_emplace(schema_decorated_key{b.first, dk});
using map_type = std::unordered_map<schema_decorated_key,
std::vector<put_or_delete_item>,
schema_decorated_key_hash,
schema_decorated_key_equal>;
auto key_builders = std::make_unique<map_type>(1, schema_decorated_key_hash{}, schema_decorated_key_equal{});
for (auto&& b : std::move(mutation_builders)) {
auto [it, added] = key_builders->try_emplace(schema_decorated_key {
.schema = b.first,
.dk = dht::decorate_key(*b.first, b.second.pk())
});
it->second.push_back(std::move(b.second));
}
return parallel_for_each(std::move(key_builders), [&proxy, &client_state, &stats, trace_state, ssg, permit = std::move(permit)] (auto& e) {
stats.write_using_lwt++;
auto* key_builders_ptr = key_builders.get();
return parallel_for_each(*key_builders_ptr, [this, &client_state, trace_state, permit = std::move(permit)] (const auto& e) {
_stats.write_using_lwt++;
auto desired_shard = service::cas_shard(*e.first.schema, e.first.dk.token());
if (desired_shard.this_shard()) {
return cas_write(proxy, e.first.schema, std::move(desired_shard), e.first.dk, std::move(e.second), client_state, trace_state, permit);
} else {
stats.shard_bounce_for_lwt++;
return proxy.container().invoke_on(desired_shard.shard(), ssg,
[cs = client_state.move_to_other_shard(),
mb = e.second,
dk = e.first.dk,
ks = e.first.schema->ks_name(),
cf = e.first.schema->cf_name(),
gt = tracing::global_trace_state_ptr(trace_state),
permit = std::move(permit)]
(service::storage_proxy& proxy) mutable {
return do_with(cs.get(), [&proxy, mb = std::move(mb), dk = std::move(dk), ks = std::move(ks), cf = std::move(cf),
trace_state = tracing::trace_state_ptr(gt)]
(service::client_state& client_state) mutable {
auto schema = proxy.data_dictionary().find_schema(ks, cf);
auto s = e.first.schema;
// The desired_shard on the original shard remains alive for the duration
// of cas_write on this shard and prevents any tablet operations.
// However, we need a local instance of cas_shard on this shard
// to pass it to sp::cas, so we just create a new one.
service::cas_shard cas_shard(*schema, dk.token());
//FIXME: Instead of passing empty_service_permit() to the background operation,
// the current permit's lifetime should be prolonged, so that it's destructed
// only after all background operations are finished as well.
return cas_write(proxy, schema, std::move(cas_shard), dk, std::move(mb), client_state, std::move(trace_state), empty_service_permit());
});
}).finally([desired_shard = std::move(desired_shard)]{});
}
});
static const auto* injection_name = "alternator_executor_batch_write_wait";
return utils::get_local_injector().inject(injection_name, [s = std::move(s)] (auto& handler) -> future<> {
const auto ks = handler.get("keyspace");
const auto cf = handler.get("table");
const auto shard = std::atoll(handler.get("shard")->data());
if (ks == s->ks_name() && cf == s->cf_name() && shard == this_shard_id()) {
elogger.info("{}: hit", injection_name);
co_await handler.wait_for_message(std::chrono::steady_clock::now() + std::chrono::minutes{5});
elogger.info("{}: continue", injection_name);
}
}).then([&e, desired_shard = std::move(desired_shard),
&client_state, trace_state = std::move(trace_state), permit = std::move(permit), this]() mutable
{
return cas_write(e.first.schema, std::move(desired_shard), e.first.dk,
std::move(e.second), client_state, std::move(trace_state), std::move(permit));
});
}).finally([key_builders = std::move(key_builders)]{});
}
}
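The grouping step in `do_batch_write()` above joins all mutations destined for the same partition so that each partition needs only one `cas()` call. A simplified sketch of that grouping follows, with plain strings standing in for schemas, decorated keys and `put_or_delete_item`; every type and function name here is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for schema_decorated_key: a (table, partition key) pair.
struct partition_key_sketch {
    std::string table;
    std::string pk;
};

struct partition_key_hash {
    std::size_t operator()(const partition_key_sketch& k) const {
        return std::hash<std::string>{}(k.table) ^ (std::hash<std::string>{}(k.pk) << 1);
    }
};

struct partition_key_equal {
    bool operator()(const partition_key_sketch& a, const partition_key_sketch& b) const {
        return a.table == b.table && a.pk == b.pk;
    }
};

using grouped_ops = std::unordered_map<partition_key_sketch, std::vector<std::string>,
                                       partition_key_hash, partition_key_equal>;

// Collect all operations on the same partition into one vector, so each
// map entry can later be handled by a single conditional write.
inline grouped_ops group_by_partition(
        std::vector<std::pair<partition_key_sketch, std::string>> ops) {
    grouped_ops out;
    for (auto& [key, op] : ops) {
        auto it = out.try_emplace(std::move(key)).first;
        it->second.push_back(std::move(op));
    }
    return out;
}

inline void grouping_example() {
    std::vector<std::pair<partition_key_sketch, std::string>> ops;
    ops.emplace_back(partition_key_sketch{"t1", "a"}, "put x");
    ops.emplace_back(partition_key_sketch{"t1", "a"}, "delete y");
    auto grouped = group_by_partition(std::move(ops));
    assert(grouped.size() == 1);
}
```

In the diff the map is additionally held in a `std::unique_ptr` and moved into the trailing `.finally()`, which keeps the vectors alive for as long as `cas_write()` holds references to them.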
@@ -3163,7 +3257,7 @@ future<executor::request_return_type> executor::batch_write_item(client_state& c
per_table_wcu.emplace_back(std::make_pair(per_table_stats, schema));
}
for (const auto& b : mutation_builders) {
co_await verify_permission(_enforce_authorization, client_state, b.first, auth::permission::MODIFY);
co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, b.first, auth::permission::MODIFY, _stats);
}
// If alternator_force_read_before_write is true we will first get the previous item size
// and only then send the mutation.
@@ -3228,7 +3322,7 @@ future<executor::request_return_type> executor::batch_write_item(client_state& c
_stats.wcu_total[stats::DELETE_ITEM] += wcu_delete_units;
_stats.api_operations.batch_write_item_batch_total += total_items;
_stats.api_operations.batch_write_item_histogram.add(total_items);
co_await do_batch_write(_proxy, _ssg, std::move(mutation_builders), client_state, trace_state, std::move(permit), _stats);
co_await do_batch_write(std::move(mutation_builders), client_state, trace_state, std::move(permit));
// FIXME: Issue #5650: If we failed writing some of the updates,
// need to return a list of these failed updates in UnprocessedItems
// rather than fail the whole write (issue #5650).
@@ -3636,16 +3730,16 @@ future<std::vector<rjson::value>> executor::describe_multi_item(schema_ptr schem
shared_ptr<cql3::selection::selection> selection,
foreign_ptr<lw_shared_ptr<query::result>> query_result,
shared_ptr<const std::optional<attrs_to_get>> attrs_to_get,
uint64_t& rcu_half_units) {
noncopyable_function<void(uint64_t)> item_callback) {
cql3::selection::result_set_builder builder(*selection, gc_clock::now());
query::result_view::consume(*query_result, slice, cql3::selection::result_set_builder::visitor(builder, *schema, *selection));
auto result_set = builder.build();
std::vector<rjson::value> ret;
for (auto& result_row : result_set->rows()) {
rjson::value item = rjson::empty_object();
rcu_consumed_capacity_counter consumed_capacity;
describe_single_item(*selection, result_row, *attrs_to_get, item, &consumed_capacity._total_bytes);
rcu_half_units += consumed_capacity.get_half_units();
uint64_t item_length_in_bytes = 0;
describe_single_item(*selection, result_row, *attrs_to_get, item, &item_length_in_bytes);
item_callback(item_length_in_bytes);
ret.push_back(std::move(item));
co_await coroutine::maybe_yield();
}
@@ -4365,7 +4459,7 @@ future<executor::request_return_type> executor::update_item(client_state& client
tracing::add_table_name(trace_state, op->schema()->ks_name(), op->schema()->cf_name());
const bool needs_read_before_write = _proxy.data_dictionary().get_config().alternator_force_read_before_write() || op->needs_read_before_write();
co_await verify_permission(_enforce_authorization, client_state, op->schema(), auth::permission::MODIFY);
co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, op->schema(), auth::permission::MODIFY, _stats);
auto cas_shard = op->shard_for_execute(needs_read_before_write);
@@ -4475,7 +4569,7 @@ future<executor::request_return_type> executor::get_item(client_state& client_st
const rjson::value* expression_attribute_names = rjson::find(request, "ExpressionAttributeNames");
verify_all_are_used(expression_attribute_names, used_attribute_names, "ExpressionAttributeNames", "GetItem");
rcu_consumed_capacity_counter add_capacity(request, cl == db::consistency_level::LOCAL_QUORUM);
co_await verify_permission(_enforce_authorization, client_state, schema, auth::permission::SELECT);
co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, schema, auth::permission::SELECT, _stats);
service::storage_proxy::coordinator_query_result qr =
co_await _proxy.query(
schema, std::move(command), std::move(partition_ranges), cl,
@@ -4584,7 +4678,6 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
}
};
std::vector<table_requests> requests;
std::vector<std::vector<uint64_t>> responses_sizes;
uint batch_size = 0;
for (auto it = request_items.MemberBegin(); it != request_items.MemberEnd(); ++it) {
table_requests rs(get_table_from_batch_request(_proxy, it));
@@ -4604,7 +4697,7 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
}
for (const table_requests& tr : requests) {
co_await verify_permission(_enforce_authorization, client_state, tr.schema, auth::permission::SELECT);
co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, tr.schema, auth::permission::SELECT, _stats);
}
_stats.api_operations.batch_get_item_batch_total += batch_size;
@@ -4612,11 +4705,10 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
// If we got here, all "requests" are valid, so let's start the
// requests for the different partitions all in parallel.
std::vector<future<std::vector<rjson::value>>> response_futures;
responses_sizes.resize(requests.size());
size_t responses_sizes_pos = 0;
for (const auto& rs : requests) {
responses_sizes[responses_sizes_pos].resize(rs.requests.size());
size_t pos = 0;
std::vector<uint64_t> consumed_rcu_half_units_per_table(requests.size());
for (size_t i = 0; i < requests.size(); i++) {
const table_requests& rs = requests[i];
bool is_quorum = rs.cl == db::consistency_level::LOCAL_QUORUM;
lw_shared_ptr<stats> per_table_stats = get_stats_from_schema(_proxy, *rs.schema);
per_table_stats->api_operations.batch_get_item_histogram.add(rs.requests.size());
for (const auto &r : rs.requests) {
@@ -4639,16 +4731,17 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
auto command = ::make_lw_shared<query::read_command>(rs.schema->id(), rs.schema->version(), partition_slice, _proxy.get_max_result_size(partition_slice),
query::tombstone_limit(_proxy.get_tombstone_limit()));
command->allow_limit = db::allow_per_partition_rate_limit::yes;
const auto item_callback = [is_quorum, &rcus_per_table = consumed_rcu_half_units_per_table[i]](uint64_t size) {
rcus_per_table += rcu_consumed_capacity_counter::get_half_units(size, is_quorum);
};
future<std::vector<rjson::value>> f = _proxy.query(rs.schema, std::move(command), std::move(partition_ranges), rs.cl,
service::storage_proxy::coordinator_query_options(executor::default_timeout(), permit, client_state, trace_state)).then(
[schema = rs.schema, partition_slice = std::move(partition_slice), selection = std::move(selection), attrs_to_get = rs.attrs_to_get, &response_size = responses_sizes[responses_sizes_pos][pos]] (service::storage_proxy::coordinator_query_result qr) mutable {
[schema = rs.schema, partition_slice = std::move(partition_slice), selection = std::move(selection), attrs_to_get = rs.attrs_to_get, item_callback = std::move(item_callback)] (service::storage_proxy::coordinator_query_result qr) mutable {
utils::get_local_injector().inject("alternator_batch_get_item", [] { throw std::runtime_error("batch_get_item injection"); });
return describe_multi_item(std::move(schema), std::move(partition_slice), std::move(selection), std::move(qr.query_result), std::move(attrs_to_get), response_size);
return describe_multi_item(std::move(schema), std::move(partition_slice), std::move(selection), std::move(qr.query_result), std::move(attrs_to_get), std::move(item_callback));
});
pos++;
response_futures.push_back(std::move(f));
}
responses_sizes_pos++;
}
// Wait for all requests to complete, and then return the response.
@@ -4660,14 +4753,11 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
rjson::value response = rjson::empty_object();
rjson::add(response, "Responses", rjson::empty_object());
rjson::add(response, "UnprocessedKeys", rjson::empty_object());
size_t rcu_half_units;
auto fut_it = response_futures.begin();
responses_sizes_pos = 0;
rjson::value consumed_capacity = rjson::empty_array();
for (const auto& rs : requests) {
for (size_t i = 0; i < requests.size(); i++) {
const table_requests& rs = requests[i];
std::string table = table_name(*rs.schema);
size_t pos = 0;
rcu_half_units = 0;
for (const auto &r : rs.requests) {
auto& pk = r.first;
auto& cks = r.second;
@@ -4682,7 +4772,6 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
for (rjson::value& json : results) {
rjson::push_back(response["Responses"][table], std::move(json));
}
rcu_half_units += rcu_consumed_capacity_counter::get_half_units(responses_sizes[responses_sizes_pos][pos], rs.cl == db::consistency_level::LOCAL_QUORUM);
} catch(...) {
eptr = std::current_exception();
// This read of potentially several rows in one partition,
@@ -4706,8 +4795,8 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
rjson::push_back(response["UnprocessedKeys"][table]["Keys"], std::move(*ck.second));
}
}
pos++;
}
uint64_t rcu_half_units = consumed_rcu_half_units_per_table[i];
_stats.rcu_half_units_total += rcu_half_units;
lw_shared_ptr<stats> per_table_stats = get_stats_from_schema(_proxy, *rs.schema);
per_table_stats->rcu_half_units_total += rcu_half_units;
@@ -4717,7 +4806,6 @@ future<executor::request_return_type> executor::batch_get_item(client_state& cli
rjson::add(entry, "CapacityUnits", rcu_half_units*0.5);
rjson::push_back(consumed_capacity, std::move(entry));
}
responses_sizes_pos++;
}
if (should_add_rcu) {
@@ -5029,13 +5117,15 @@ static rjson::value encode_paging_state(const schema& schema, const service::pag
}
auto pos = paging_state.get_position_in_partition();
if (pos.has_key()) {
auto exploded_ck = pos.key().explode();
auto exploded_ck_it = exploded_ck.begin();
for (const column_definition& cdef : schema.clustering_key_columns()) {
rjson::add_with_string_name(last_evaluated_key, std::string_view(cdef.name_as_text()), rjson::empty_object());
rjson::value& key_entry = last_evaluated_key[cdef.name_as_text()];
rjson::add_with_string_name(key_entry, type_to_string(cdef.type), json_key_column_value(*exploded_ck_it, cdef));
++exploded_ck_it;
// Alternator itself allows at most one column in clustering key, but
// user can use Alternator api to access system tables which might have
// multiple clustering key columns. So we need to handle that case here.
auto cdef_it = schema.clustering_key_columns().begin();
for(const auto &exploded_ck : pos.key().explode()) {
rjson::add_with_string_name(last_evaluated_key, std::string_view(cdef_it->name_as_text()), rjson::empty_object());
rjson::value& key_entry = last_evaluated_key[cdef_it->name_as_text()];
rjson::add_with_string_name(key_entry, type_to_string(cdef_it->type), json_key_column_value(exploded_ck, *cdef_it));
++cdef_it;
}
}
// To avoid possible conflicts (and thus having to reserve these names) we
@@ -5069,10 +5159,11 @@ static future<executor::request_return_type> do_query(service::storage_proxy& pr
filter filter,
query::partition_slice::option_set custom_opts,
service::client_state& client_state,
cql3::cql_stats& cql_stats,
alternator::stats& stats,
tracing::trace_state_ptr trace_state,
service_permit permit,
bool enforce_authorization) {
bool enforce_authorization,
bool warn_authorization) {
lw_shared_ptr<service::pager::paging_state> old_paging_state = nullptr;
tracing::trace(trace_state, "Performing a database query");
@@ -5099,7 +5190,7 @@ static future<executor::request_return_type> do_query(service::storage_proxy& pr
old_paging_state = make_lw_shared<service::pager::paging_state>(pk, pos, query::max_partitions, query_id::create_null_id(), service::pager::paging_state::replicas_per_token_range{}, std::nullopt, 0);
}
co_await verify_permission(enforce_authorization, client_state, table_schema, auth::permission::SELECT);
co_await verify_permission(enforce_authorization, warn_authorization, client_state, table_schema, auth::permission::SELECT, stats);
auto regular_columns =
table_schema->regular_columns() | std::views::transform(&column_definition::id)
@@ -5134,10 +5225,10 @@ static future<executor::request_return_type> do_query(service::storage_proxy& pr
if (paging_state) {
rjson::add(items_descr, "LastEvaluatedKey", encode_paging_state(*table_schema, *paging_state));
}
if (has_filter){
cql_stats.filtered_rows_read_total += p->stats().rows_read_total;
if (has_filter) {
stats.cql_stats.filtered_rows_read_total += p->stats().rows_read_total;
// update our "filtered_row_matched_total" for all the rows matched, despite the filter
cql_stats.filtered_rows_matched_total += size;
stats.cql_stats.filtered_rows_matched_total += size;
}
if (opt_items) {
if (opt_items->size() >= max_items_for_rapidjson_array) {
@@ -5261,7 +5352,7 @@ future<executor::request_return_type> executor::scan(client_state& client_state,
verify_all_are_used(expression_attribute_values, used_attribute_values, "ExpressionAttributeValues", "Scan");
return do_query(_proxy, schema, exclusive_start_key, std::move(partition_ranges), std::move(ck_bounds), std::move(attrs_to_get), limit, cl,
std::move(filter), query::partition_slice::option_set(), client_state, _stats.cql_stats, trace_state, std::move(permit), _enforce_authorization);
std::move(filter), query::partition_slice::option_set(), client_state, _stats, trace_state, std::move(permit), _enforce_authorization, _warn_authorization);
}
static dht::partition_range calculate_pk_bound(schema_ptr schema, const column_definition& pk_cdef, const rjson::value& comp_definition, const rjson::value& attrs) {
@@ -5742,7 +5833,7 @@ future<executor::request_return_type> executor::query(client_state& client_state
query::partition_slice::option_set opts;
opts.set_if<query::partition_slice::option::reversed>(!forward);
return do_query(_proxy, schema, exclusive_start_key, std::move(partition_ranges), std::move(ck_bounds), std::move(attrs_to_get), limit, cl,
std::move(filter), opts, client_state, _stats.cql_stats, std::move(trace_state), std::move(permit), _enforce_authorization);
std::move(filter), opts, client_state, _stats, std::move(trace_state), std::move(permit), _enforce_authorization, _warn_authorization);
}
future<executor::request_return_type> executor::list_tables(client_state& client_state, service_permit permit, rjson::value request) {
@@ -5870,7 +5961,8 @@ future<executor::request_return_type> executor::describe_continuous_backups(clie
// of nodes in the cluster: A cluster with 3 or more live nodes, gets RF=3.
// A smaller cluster (presumably, a test only), gets RF=1. The user may
// manually create the keyspace to override this predefined behavior.
static lw_shared_ptr<keyspace_metadata> create_keyspace_metadata(std::string_view keyspace_name, service::storage_proxy& sp, gms::gossiper& gossiper, api::timestamp_type ts, const std::map<sstring, sstring>& tags_map, const gms::feature_service& feat) {
static lw_shared_ptr<keyspace_metadata> create_keyspace_metadata(std::string_view keyspace_name, service::storage_proxy& sp, gms::gossiper& gossiper, api::timestamp_type ts,
const std::map<sstring, sstring>& tags_map, const gms::feature_service& feat, const db::tablets_mode_t::mode tablets_mode) {
int endpoint_count = gossiper.num_endpoints();
int rf = 3;
if (endpoint_count < rf) {
@@ -5880,21 +5972,18 @@ static lw_shared_ptr<keyspace_metadata> create_keyspace_metadata(std::string_vie
}
auto opts = get_network_topology_options(sp, gossiper, rf);
// Even if the "tablets" experimental feature is available, we currently
// do not enable tablets by default on Alternator tables because LWT is
// not yet fully supported with tablets.
// The user can override the choice of whether or not to use tablets at
// table-creation time by supplying the following tag with a numeric value
// (setting the value to 0 means enabling tablets with automatic selection
// of the best number of tablets).
// Whether to use tablets for the table (actually for the keyspace of the
// table) is determined by tablets_mode (taken from the configuration
// option "tablets_mode_for_new_keyspaces"), as well as the presence and
// the value of a per-table tag system:initial_tablets
// (INITIAL_TABLETS_TAG_KEY).
// Setting the tag with a numeric value will enable a specific initial number
// of tablets (setting the value to 0 means enabling tablets with
// an automatic selection of the best number of tablets).
// Setting this tag to any non-numeric value (e.g., an empty string or the
// word "none") will ask to disable tablets.
// If we make this tag a permanent feature, it will get a "system:" prefix -
// until then we give it the "experimental:" prefix to not commit to it.
static constexpr auto INITIAL_TABLETS_TAG_KEY = "experimental:initial_tablets";
// initial_tablets currently defaults to unset, so tablets will not be
// used by default on new Alternator tables. Change this initialization
// to 0 to enable tablets by default, with automatic number of tablets.
// When vnodes are asked for by the tag value, but tablets are enforced by config,
// throw an exception to the client.
std::optional<unsigned> initial_tablets;
if (feat.tablets) {
auto it = tags_map.find(INITIAL_TABLETS_TAG_KEY);
@@ -5904,8 +5993,21 @@ static lw_shared_ptr<keyspace_metadata> create_keyspace_metadata(std::string_vie
// initial_tablets to a disengaged optional.
try {
initial_tablets = std::stol(tags_map.at(INITIAL_TABLETS_TAG_KEY));
} catch(...) {
} catch (...) {
if (tablets_mode == db::tablets_mode_t::mode::enforced) {
throw api_error::validation(format("Tag {} containing non-numerical value requests vnodes, but vnodes are forbidden by configuration option `tablets_mode_for_new_keyspaces: enforced`", INITIAL_TABLETS_TAG_KEY));
}
initial_tablets = std::nullopt;
elogger.trace("Following {} tag containing non-numerical value, Alternator will attempt to create a keyspace {} with vnodes.", INITIAL_TABLETS_TAG_KEY, keyspace_name);
}
} else {
// No per-table tag present, use the value from config
if (tablets_mode == db::tablets_mode_t::mode::enabled || tablets_mode == db::tablets_mode_t::mode::enforced) {
initial_tablets = 0;
elogger.trace("Following the `tablets_mode_for_new_keyspaces` flag from the settings, Alternator will attempt to create a keyspace {} with tablets.", keyspace_name);
} else {
initial_tablets = std::nullopt;
elogger.trace("Following the `tablets_mode_for_new_keyspaces` flag from the settings, Alternator will attempt to create a keyspace {} with vnodes.", keyspace_name);
}
}
}
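The tag-versus-configuration decision made in this hunk can be sketched in isolation. This is a minimal illustration, not the code from the patch: `tablets_mode` and `choose_initial_tablets` are hypothetical stand-ins for `db::tablets_mode_t::mode` and the logic inside `create_keyspace_metadata`, the `feat.tablets` check is omitted, and a plain `std::invalid_argument` stands in for `api_error::validation`.

```cpp
#include <cassert>
#include <optional>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for db::tablets_mode_t::mode.
enum class tablets_mode { disabled, enabled, enforced };

// Returns the initial tablet count (0 means "pick automatically"), or
// std::nullopt when the keyspace should use vnodes.
// `tag` models the value of the per-table initial-tablets tag, if present.
std::optional<unsigned> choose_initial_tablets(const std::optional<std::string>& tag,
                                               tablets_mode mode) {
    if (tag) {
        try {
            // A numeric tag value requests tablets with that initial count.
            return static_cast<unsigned>(std::stol(*tag));
        } catch (const std::logic_error&) {
            // std::stol reports failures as invalid_argument or out_of_range,
            // both derived from logic_error. A non-numeric tag asks for vnodes,
            // which is rejected when the configuration enforces tablets.
            if (mode == tablets_mode::enforced) {
                throw std::invalid_argument("tag requests vnodes, but tablets are enforced");
            }
            return std::nullopt;
        }
    }
    // No tag present: follow the configured mode.
    if (mode == tablets_mode::enabled || mode == tablets_mode::enforced) {
        return 0u;
    }
    return std::nullopt;
}
```

With these assumptions, a tag of "4" yields 4 tablets regardless of mode, a tag of "none" yields vnodes unless the mode is `enforced`, and a missing tag defers entirely to the mode.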


@@ -40,6 +40,7 @@ namespace cql3::selection {
namespace service {
class storage_proxy;
class cas_shard;
}
namespace cdc {
@@ -57,6 +58,7 @@ class schema_builder;
namespace alternator {
class rmw_operation;
class put_or_delete_item;
schema_ptr get_table(service::storage_proxy& proxy, const rjson::value& request);
bool is_alternator_keyspace(const sstring& ks_name);
@@ -139,6 +141,7 @@ class executor : public peering_sharded_service<executor> {
db::system_distributed_keyspace& _sdks;
cdc::metadata& _cdc_metadata;
utils::updateable_value<bool> _enforce_authorization;
utils::updateable_value<bool> _warn_authorization;
// An smp_service_group to be used for limiting the concurrency when
// forwarding Alternator requests between shards - if necessary for LWT.
smp_service_group _ssg;
@@ -218,6 +221,16 @@ private:
static void describe_key_schema(rjson::value& parent, const schema&, std::unordered_map<std::string,std::string> * = nullptr, const std::map<sstring, sstring> *tags = nullptr);
future<> do_batch_write(
std::vector<std::pair<schema_ptr, put_or_delete_item>> mutation_builders,
service::client_state& client_state,
tracing::trace_state_ptr trace_state,
service_permit permit);
future<> cas_write(schema_ptr schema, service::cas_shard cas_shard, const dht::decorated_key& dk,
const std::vector<put_or_delete_item>& mutation_builders, service::client_state& client_state,
tracing::trace_state_ptr trace_state, service_permit permit);
public:
static void describe_key_schema(rjson::value& parent, const schema& schema, std::unordered_map<std::string,std::string>&, const std::map<sstring, sstring> *tags = nullptr);
@@ -228,12 +241,15 @@ public:
const std::optional<attrs_to_get>&,
uint64_t* = nullptr);
// Converts a multi-row selection result to JSON compatible with DynamoDB.
// For each row, this method calls item_callback, which takes the size of
// the item as the parameter.
static future<std::vector<rjson::value>> describe_multi_item(schema_ptr schema,
const query::partition_slice&& slice,
shared_ptr<cql3::selection::selection> selection,
foreign_ptr<lw_shared_ptr<query::result>> query_result,
shared_ptr<const std::optional<attrs_to_get>> attrs_to_get,
uint64_t& rcu_half_units);
noncopyable_function<void(uint64_t)> item_callback = {});
static void describe_single_item(const cql3::selection::selection&,
const std::vector<managed_bytes_opt>&,
@@ -261,7 +277,7 @@ bool is_big(const rjson::value& val, int big_size = 100'000);
// Check CQL's Role-Based Access Control (RBAC) permission (MODIFY,
// SELECT, DROP, etc.) on the given table. When permission is denied an
// appropriate user-readable api_error::access_denied is thrown.
future<> verify_permission(bool enforce_authorization, const service::client_state&, const schema_ptr&, auth::permission);
future<> verify_permission(bool enforce_authorization, bool warn_authorization, const service::client_state&, const schema_ptr&, auth::permission, alternator::stats& stats);
/**
* Make return type for serializing the object "streamed",


@@ -282,15 +282,23 @@ std::string type_to_string(data_type type) {
return it->second;
}
bytes get_key_column_value(const rjson::value& item, const column_definition& column) {
std::optional<bytes> try_get_key_column_value(const rjson::value& item, const column_definition& column) {
std::string column_name = column.name_as_text();
const rjson::value* key_typed_value = rjson::find(item, column_name);
if (!key_typed_value) {
throw api_error::validation(fmt::format("Key column {} not found", column_name));
return std::nullopt;
}
return get_key_from_typed_value(*key_typed_value, column);
}
bytes get_key_column_value(const rjson::value& item, const column_definition& column) {
auto value = try_get_key_column_value(item, column);
if (!value) {
throw api_error::validation(fmt::format("Key column {} not found", column.name_as_text()));
}
return std::move(*value);
}
// Parses the JSON encoding for a key value, which is a map with a single
// entry whose key is the type and the value is the encoded value.
// If this type does not match the desired "type_str", an api_error::validation
@@ -380,20 +388,38 @@ clustering_key ck_from_json(const rjson::value& item, schema_ptr schema) {
return clustering_key::make_empty();
}
std::vector<bytes> raw_ck;
// FIXME: this is a loop, but we really allow only one clustering key column.
// Note: it's possible to get more than one clustering column here, as
// Alternator can be used to read scylla internal tables.
for (const column_definition& cdef : schema->clustering_key_columns()) {
bytes raw_value = get_key_column_value(item, cdef);
auto raw_value = get_key_column_value(item, cdef);
raw_ck.push_back(std::move(raw_value));
}
return clustering_key::from_exploded(raw_ck);
}
position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema) {
auto ck = ck_from_json(item, schema);
if (is_alternator_keyspace(schema->ks_name())) {
return position_in_partition::for_key(std::move(ck));
clustering_key_prefix ck_prefix_from_json(const rjson::value& item, schema_ptr schema) {
if (schema->clustering_key_size() == 0) {
return clustering_key_prefix::make_empty();
}
std::vector<bytes> raw_ck;
for (const column_definition& cdef : schema->clustering_key_columns()) {
auto raw_value = try_get_key_column_value(item, cdef);
if (!raw_value) {
break;
}
raw_ck.push_back(std::move(*raw_value));
}
return clustering_key_prefix::from_exploded(raw_ck);
}
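The difference between `ck_from_json` (every clustering column is required) and `ck_prefix_from_json` (stop at the first missing column) can be shown with a simplified model. This is a sketch under stated assumptions: `item`, `try_get` and `key_prefix` are hypothetical names, and plain strings replace `rjson::value`, `column_definition` and `bytes`.

```cpp
#include <cassert>
#include <map>
#include <optional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical simplified item: column name -> serialized value.
using item = std::map<std::string, std::string>;

// Non-throwing lookup, in the spirit of try_get_key_column_value():
// a missing column yields std::nullopt instead of a validation error.
std::optional<std::string> try_get(const item& it, const std::string& column) {
    auto found = it.find(column);
    if (found == it.end()) {
        return std::nullopt;
    }
    return found->second;
}

// Collects values for the leading key columns that are present and stops at
// the first missing one, so the result is always a prefix of the full key.
std::vector<std::string> key_prefix(const item& it, const std::vector<std::string>& columns) {
    std::vector<std::string> prefix;
    for (const auto& column : columns) {
        auto value = try_get(it, column);
        if (!value) {
            break;
        }
        prefix.push_back(std::move(*value));
    }
    return prefix;
}
```

Note that a column appearing after a gap is ignored: with columns `a, b, c` and an item holding only `a` and `c`, the prefix contains just the value of `a`.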
position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema) {
const bool is_alternator_ks = is_alternator_keyspace(schema->ks_name());
if (is_alternator_ks) {
return position_in_partition::for_key(ck_from_json(item, schema));
}
const auto region_item = rjson::find(item, scylla_paging_region);
const auto weight_item = rjson::find(item, scylla_paging_weight);
if (bool(region_item) != bool(weight_item)) {
@@ -413,8 +439,9 @@ position_in_partition pos_from_json(const rjson::value& item, schema_ptr schema)
} else {
throw std::runtime_error(fmt::format("Invalid value for weight: {}", weight_view));
}
return position_in_partition(region, weight, region == partition_region::clustered ? std::optional(std::move(ck)) : std::nullopt);
return position_in_partition(region, weight, region == partition_region::clustered ? std::optional(ck_prefix_from_json(item, schema)) : std::nullopt);
}
auto ck = ck_from_json(item, schema);
if (ck.is_empty()) {
return position_in_partition::for_partition_start();
}


@@ -31,6 +31,7 @@
#include "utils/overloaded_functor.hh"
#include "utils/aws_sigv4.hh"
#include "client_data.hh"
#include "utils/updateable_value.hh"
static logging::logger slogger("alternator-server");
@@ -270,24 +271,57 @@ protected:
}
};
// This function increments the authentication_failures counter, and may also
// log a warn-level message and/or throw an exception, depending on what
// enforce_authorization and warn_authorization are set to.
// The username and client address are only used for logging purposes -
// they are not included in the error message returned to the client, since
// the client knows who it is.
// Note that if enforce_authorization is false, this function will return
// without throwing. So a caller that doesn't want to continue after an
// authentication_error must explicitly return after calling this function.
template<typename Exception>
static void authentication_error(alternator::stats& stats, bool enforce_authorization, bool warn_authorization, Exception&& e, std::string_view user, gms::inet_address client_address) {
stats.authentication_failures++;
if (enforce_authorization) {
if (warn_authorization) {
slogger.warn("alternator_warn_authorization=true: {} for user {}, client address {}", e.what(), user, client_address);
}
throw std::move(e);
} else {
if (warn_authorization) {
slogger.warn("If you set alternator_enforce_authorization=true the following will be enforced: {} for user {}, client address {}", e.what(), user, client_address);
}
}
}
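The interaction of the two flags in `authentication_error` reduces to a small policy: always count, log when warning is on, throw only when enforcement is on. The following is a minimal sketch of that policy, not the patch's code: `auth_log` and `report_auth_failure` are hypothetical names, a vector of strings stands in for the logger, and `std::runtime_error` stands in for the `api_error` types.

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the failure counter plus the warn-level logger.
struct auth_log {
    uint64_t failures = 0;
    std::vector<std::string> warnings;
};

// Counts every failure, records a warning when `warn` is set, and throws only
// when `enforce` is set. When `enforce` is false the function returns
// normally, so the caller must decide how to proceed.
void report_auth_failure(auth_log& log, bool enforce, bool warn, const std::string& what) {
    ++log.failures;
    if (warn) {
        log.warnings.push_back(what);
    }
    if (enforce) {
        throw std::runtime_error(what);
    }
}
```

This mirrors the caveat in the comment above: in warn-only mode nothing is thrown, which is why each call site in `verify_signature` is followed by an explicit `return`.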
future<std::string> server::verify_signature(const request& req, const chunked_content& content) {
if (!_enforce_authorization) {
if (!_enforce_authorization.get() && !_warn_authorization.get()) {
slogger.debug("Skipping authorization");
return make_ready_future<std::string>();
}
auto host_it = req._headers.find("Host");
if (host_it == req._headers.end()) {
throw api_error::invalid_signature("Host header is mandatory for signature verification");
authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),
api_error::invalid_signature("Host header is mandatory for signature verification"),
"", req.get_client_address());
return make_ready_future<std::string>();
}
auto authorization_it = req._headers.find("Authorization");
if (authorization_it == req._headers.end()) {
throw api_error::missing_authentication_token("Authorization header is mandatory for signature verification");
authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),
api_error::missing_authentication_token("Authorization header is mandatory for signature verification"),
"", req.get_client_address());
return make_ready_future<std::string>();
}
std::string host = host_it->second;
std::string_view authorization_header = authorization_it->second;
auto pos = authorization_header.find_first_of(' ');
if (pos == std::string_view::npos || authorization_header.substr(0, pos) != "AWS4-HMAC-SHA256") {
throw api_error::invalid_signature(fmt::format("Authorization header must use AWS4-HMAC-SHA256 algorithm: {}", authorization_header));
authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),
api_error::invalid_signature(fmt::format("Authorization header must use AWS4-HMAC-SHA256 algorithm: {}", authorization_header)),
"", req.get_client_address());
return make_ready_future<std::string>();
}
authorization_header.remove_prefix(pos+1);
std::string credential;
@@ -322,7 +356,9 @@ future<std::string> server::verify_signature(const request& req, const chunked_c
std::vector<std::string_view> credential_split = split(credential, '/');
if (credential_split.size() != 5) {
throw api_error::validation(fmt::format("Incorrect credential information format: {}", credential));
authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),
api_error::validation(fmt::format("Incorrect credential information format: {}", credential)), "", req.get_client_address());
return make_ready_future<std::string>();
}
std::string user(credential_split[0]);
std::string datestamp(credential_split[1]);
@@ -346,7 +382,7 @@ future<std::string> server::verify_signature(const request& req, const chunked_c
auto cache_getter = [&proxy = _proxy, &as = _auth_service] (std::string username) {
return get_key_from_roles(proxy, as, std::move(username));
};
return _key_cache.get_ptr(user, cache_getter).then([this, &req, &content,
return _key_cache.get_ptr(user, cache_getter).then_wrapped([this, &req, &content,
user = std::move(user),
host = std::move(host),
datestamp = std::move(datestamp),
@@ -354,18 +390,32 @@ future<std::string> server::verify_signature(const request& req, const chunked_c
signed_headers_map = std::move(signed_headers_map),
region = std::move(region),
service = std::move(service),
user_signature = std::move(user_signature)] (key_cache::value_ptr key_ptr) {
user_signature = std::move(user_signature)] (future<key_cache::value_ptr> key_ptr_fut) {
key_cache::value_ptr key_ptr(nullptr);
try {
key_ptr = key_ptr_fut.get();
} catch (const api_error& e) {
authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),
e, user, req.get_client_address());
return std::string();
}
std::string signature;
try {
signature = utils::aws::get_signature(user, *key_ptr, std::string_view(host), "/", req._method,
datestamp, signed_headers_str, signed_headers_map, &content, region, service, "");
} catch (const std::exception& e) {
throw api_error::invalid_signature(e.what());
authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),
api_error::invalid_signature(fmt::format("invalid signature: {}", e.what())),
user, req.get_client_address());
return std::string();
}
if (signature != std::string_view(user_signature)) {
_key_cache.remove(user);
throw api_error::unrecognized_client("The security token included in the request is invalid.");
authentication_error(_executor._stats, _enforce_authorization.get(), _warn_authorization.get(),
api_error::unrecognized_client("wrong signature"),
user, req.get_client_address());
return std::string();
}
return user;
});
@@ -597,9 +647,11 @@ server::server(executor& exec, service::storage_proxy& proxy, gms::gossiper& gos
}
future<> server::init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,
utils::updateable_value<bool> enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests) {
utils::updateable_value<bool> enforce_authorization, utils::updateable_value<bool> warn_authorization,
semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests) {
_memory_limiter = memory_limiter;
_enforce_authorization = std::move(enforce_authorization);
_warn_authorization = std::move(warn_authorization);
_max_concurrent_requests = std::move(max_concurrent_requests);
if (!port && !https_port) {
return make_exception_future<>(std::runtime_error("Either regular port or TLS port"


@@ -43,6 +43,7 @@ class server : public peering_sharded_service<server> {
key_cache _key_cache;
utils::updateable_value<bool> _enforce_authorization;
utils::updateable_value<bool> _warn_authorization;
utils::small_vector<std::reference_wrapper<seastar::httpd::http_server>, 2> _enabled_servers;
named_gate _pending_requests;
// In some places we will need a CQL updateable_timeout_config object even
@@ -94,7 +95,8 @@ public:
server(executor& executor, service::storage_proxy& proxy, gms::gossiper& gossiper, auth::service& service, qos::service_level_controller& sl_controller);
future<> init(net::inet_address addr, std::optional<uint16_t> port, std::optional<uint16_t> https_port, std::optional<tls::credentials_builder> creds,
utils::updateable_value<bool> enforce_authorization, semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests);
utils::updateable_value<bool> enforce_authorization, utils::updateable_value<bool> warn_authorization,
semaphore* memory_limiter, utils::updateable_value<uint32_t> max_concurrent_requests);
future<> stop();
// get_client_data() is called (on each shard separately) when the virtual
// table "system.clients" is read. It is expected to generate a list of


@@ -176,6 +176,16 @@ static void register_metrics_with_optional_table(seastar::metrics::metric_groups
seastar::metrics::make_total_operations("expression_cache_misses", stats.expression_cache.requests[stats::expression_types::PROJECTION_EXPRESSION].misses,
seastar::metrics::description("Counts number of misses of cached expressions"), labels)(expression_label("ProjectionExpression")).aggregate(aggregate_labels).set_skip_when_empty()
});
// Only register the following metrics for the global metrics, not per-table
if (!has_table) {
metrics.add_group("alternator", {
seastar::metrics::make_counter("authentication_failures", stats.authentication_failures,
seastar::metrics::description("total number of authentication failures"), labels).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
seastar::metrics::make_counter("authorization_failures", stats.authorization_failures,
seastar::metrics::description("total number of authorization failures"), labels).aggregate({seastar::metrics::shard_label}).set_skip_when_empty(),
});
}
}
void register_metrics(seastar::metrics::metric_groups& metrics, const stats& stats) {


@@ -79,6 +79,17 @@ public:
utils::estimated_histogram batch_get_item_histogram{22}; // a histogram that covers the range 1 - 100
utils::estimated_histogram batch_write_item_histogram{22}; // a histogram that covers the range 1 - 100
} api_operations;
// Count of authentication and authorization failures, counted if either
// alternator_enforce_authorization or alternator_warn_authorization is
// set to true. If both are false, no authentication or authorization
// checks are performed, so failures are not recognized or counted.
// "authentication" failure means the request was not signed with a valid
// user and key combination. "authorization" failure means the request was
// authenticated to a valid user - but this user did not have permissions
// to perform the operation (considering RBAC settings and the user's
// superuser status).
uint64_t authentication_failures = 0;
uint64_t authorization_failures = 0;
// Miscellaneous event counters
uint64_t total_operations = 0;
uint64_t unsupported_operations = 0;


@@ -828,7 +828,7 @@ future<executor::request_return_type> executor::get_records(client_state& client
tracing::add_table_name(trace_state, schema->ks_name(), schema->cf_name());
co_await verify_permission(_enforce_authorization, client_state, schema, auth::permission::SELECT);
co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, schema, auth::permission::SELECT, _stats);
db::consistency_level cl = db::consistency_level::LOCAL_QUORUM;
partition_key pk = iter.shard.id.to_partition_key(*schema);


@@ -94,7 +94,7 @@ future<executor::request_return_type> executor::update_time_to_live(client_state
}
sstring attribute_name(v->GetString(), v->GetStringLength());
co_await verify_permission(_enforce_authorization, client_state, schema, auth::permission::ALTER);
co_await verify_permission(_enforce_authorization, _warn_authorization, client_state, schema, auth::permission::ALTER, _stats);
co_await db::modify_tags(_mm, schema->ks_name(), schema->cf_name(), [&](std::map<sstring, sstring>& tags_map) {
if (enabled) {
if (tags_map.contains(TTL_TAG_KEY)) {
@@ -747,7 +747,7 @@ static future<bool> scan_table(
auto my_host_id = erm->get_topology().my_host_id();
const auto &tablet_map = erm->get_token_metadata().tablets().get_tablet_map(s->id());
for (std::optional tablet = tablet_map.first_tablet(); tablet; tablet = tablet_map.next_tablet(*tablet)) {
auto tablet_primary_replica = tablet_map.get_primary_replica(*tablet);
auto tablet_primary_replica = tablet_map.get_primary_replica(*tablet, erm->get_topology());
// check if this is the primary replica for the current tablet
if (tablet_primary_replica.host == my_host_id && tablet_primary_replica.shard == this_shard_id()) {
co_await scan_tablet(*tablet, proxy, abort_source, page_sem, expiration_stats, scan_ctx, tablet_map);


@@ -898,6 +898,14 @@
"type":"string",
"paramType":"query",
"enum": ["all", "dc", "rack", "node"]
},
{
"name":"primary_replica_only",
"description":"Load the sstables and stream to the primary replica node within the scope, if one is specified. If not, stream to the global primary replica.",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
@@ -984,7 +992,7 @@
]
},
{
"path":"/storage_service/cleanup_all",
"path":"/storage_service/cleanup_all/",
"operations":[
{
"method":"POST",
@@ -994,6 +1002,30 @@
"produces":[
"application/json"
],
"parameters":[
{
"name":"global",
"description":"true if cleanup of entire cluster is requested",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}
]
},
{
"path":"/storage_service/mark_node_as_clean",
"operations":[
{
"method":"POST",
"summary":"Mark the node as clean. After that the node will not be considered as needing cleanup during automatic cleanup which is triggered by some topology operations",
"type":"void",
"nickname":"reset_cleanup_needed",
"produces":[
"application/json"
],
"parameters":[]
}
]
@@ -2924,7 +2956,7 @@
},
{
"name":"incremental_mode",
"description":"Set the incremental repair mode. Can be 'disabled', 'regular', or 'full'. 'regular': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to regular.",
"description":"Set the incremental repair mode. Can be 'disabled', 'incremental', or 'full'. 'incremental': The incremental repair logic is enabled. Unrepaired sstables will be included for repair. Repaired sstables will be skipped. The incremental repair states will be updated after repair. 'full': The incremental repair logic is enabled. Both repaired and unrepaired sstables will be included for repair. The incremental repair states will be updated after repair. 'disabled': The incremental repair logic is disabled completely. The incremental repair states, e.g., repaired_at in sstables and sstables_repaired_at in the system.tablets table, will not be updated after repair. When the option is not provided, it defaults to 'disabled' mode.",
"required":false,
"allowMultiple":false,
"type":"string",


@@ -42,6 +42,14 @@
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
},
{
"name":"consider_only_existing_data",
"description":"Set to \"true\" to flush all memtables and force tombstone garbage collection to check only the sstables being compacted (false by default). The memtable, commitlog and other uncompacted sstables will not be checked during tombstone garbage collection.",
"required":false,
"allowMultiple":false,
"type":"boolean",
"paramType":"query"
}
]
}


@@ -20,6 +20,7 @@
#include "utils/hash.hh"
#include <optional>
#include <sstream>
#include <stdexcept>
#include <time.h>
#include <algorithm>
#include <functional>
@@ -496,6 +497,7 @@ void set_sstables_loader(http_context& ctx, routes& r, sharded<sstables_loader>&
auto bucket = req->get_query_param("bucket");
auto prefix = req->get_query_param("prefix");
auto scope = parse_stream_scope(req->get_query_param("scope"));
auto primary_replica_only = validate_bool_x(req->get_query_param("primary_replica_only"), false);
// TODO: the http_server backing the API does not use content streaming
// should use it for better performance
@@ -506,7 +508,7 @@ void set_sstables_loader(http_context& ctx, routes& r, sharded<sstables_loader>&
auto sstables = parsed.GetArray() |
std::views::transform([] (const auto& s) { return sstring(rjson::to_string_view(s)); }) |
std::ranges::to<std::vector>();
auto task_id = co_await sst_loader.local().download_new_sstables(keyspace, table, prefix, std::move(sstables), endpoint, bucket, scope);
auto task_id = co_await sst_loader.local().download_new_sstables(keyspace, table, prefix, std::move(sstables), endpoint, bucket, scope, primary_replica_only);
co_return json::json_return_type(fmt::to_string(task_id));
});
@@ -723,8 +725,14 @@ rest_cdc_streams_check_and_repair(sharded<service::storage_service>& ss, std::un
static
future<json::json_return_type>
rest_cleanup_all(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
apilog.info("cleanup_all");
auto done = co_await ss.invoke_on(0, [] (service::storage_service& ss) -> future<bool> {
bool global = true;
if (auto global_param = req->get_query_param("global"); !global_param.empty()) {
global = validate_bool(global_param);
}
apilog.info("cleanup_all global={}", global);
auto done = !global ? false : co_await ss.invoke_on(0, [] (service::storage_service& ss) -> future<bool> {
if (!ss.is_topology_coordinator_enabled()) {
co_return false;
}
@@ -734,14 +742,35 @@ rest_cleanup_all(http_context& ctx, sharded<service::storage_service>& ss, std::
if (done) {
co_return json::json_return_type(0);
}
// fall back to the local global cleanup if topology coordinator is not enabled
// fall back to the local cleanup if topology coordinator is not enabled or local cleanup is requested
auto& db = ctx.db;
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<compaction::global_cleanup_compaction_task_impl>({}, db);
co_await task->done();
// Mark this node as clean
co_await ss.invoke_on(0, [] (service::storage_service& ss) -> future<> {
if (ss.is_topology_coordinator_enabled()) {
co_await ss.reset_cleanup_needed();
}
});
co_return json::json_return_type(0);
}
static
future<json::json_return_type>
rest_reset_cleanup_needed(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
apilog.info("reset_cleanup_needed");
co_await ss.invoke_on(0, [] (service::storage_service& ss) {
if (!ss.is_topology_coordinator_enabled()) {
throw std::runtime_error("mark_node_as_clean is only supported when topology over raft is enabled");
}
return ss.reset_cleanup_needed();
});
co_return json_void();
}
static
future<json::json_return_type>
rest_force_flush(http_context& ctx, std::unique_ptr<http::request> req) {
@@ -1723,6 +1752,7 @@ void set_storage_service(http_context& ctx, routes& r, sharded<service::storage_
ss::get_natural_endpoints.set(r, rest_bind(rest_get_natural_endpoints, ctx, ss));
ss::cdc_streams_check_and_repair.set(r, rest_bind(rest_cdc_streams_check_and_repair, ss));
ss::cleanup_all.set(r, rest_bind(rest_cleanup_all, ctx, ss));
ss::reset_cleanup_needed.set(r, rest_bind(rest_reset_cleanup_needed, ctx, ss));
ss::force_flush.set(r, rest_bind(rest_force_flush, ctx));
ss::force_keyspace_flush.set(r, rest_bind(rest_force_keyspace_flush, ctx));
ss::decommission.set(r, rest_bind(rest_decommission, ss));
@@ -1800,6 +1830,7 @@ void unset_storage_service(http_context& ctx, routes& r) {
ss::get_natural_endpoints.unset(r);
ss::cdc_streams_check_and_repair.unset(r);
ss::cleanup_all.unset(r);
ss::reset_cleanup_needed.unset(r);
ss::force_flush.unset(r);
ss::force_keyspace_flush.unset(r);
ss::decommission.unset(r);
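
The control flow that the new `global` query parameter adds to `rest_cleanup_all` can be sketched in isolation. This is a minimal sketch, not the handler itself: `parse_bool`, `cleanup_path` and `choose_cleanup_path` are hypothetical names, the accepted boolean spellings are an assumption, and it assumes the part of the lambda elided by the hunk reports success once the coordinator-driven cleanup has been requested.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the API's validate_bool(); the accepted
// spellings are an assumption, not the real helper's contract.
inline bool parse_bool(const std::string& s) {
    if (s == "true" || s == "1") { return true; }
    if (s == "false" || s == "0") { return false; }
    throw std::invalid_argument("invalid boolean: " + s);
}

// Which branch of the handler ends up doing the work (illustrative only).
enum class cleanup_path { coordinator, local };

// 'global' defaults to true when the parameter is absent. The coordinator
// path is taken only for a global request on a node where the topology
// coordinator is enabled; every other combination falls back to local cleanup.
inline cleanup_path choose_cleanup_path(const std::string& global_param, bool coordinator_enabled) {
    bool global = true;
    if (!global_param.empty()) {
        global = parse_bool(global_param);
    }
    return (global && coordinator_enabled) ? cleanup_path::coordinator : cleanup_path::local;
}

// Usage: the three interesting combinations.
inline void check_cleanup_paths() {
    assert(choose_cleanup_path("", true) == cleanup_path::coordinator);
    assert(choose_cleanup_path("false", true) == cleanup_path::local);
    assert(choose_cleanup_path("", false) == cleanup_path::local);
}
```

In the diff, the local path additionally marks the node as clean afterwards when the topology coordinator is enabled; the sketch leaves that step out.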


@@ -38,76 +38,78 @@ static auto wrap_ks_cf(http_context &ctx, ks_cf_func f) {
};
}
static future<shared_ptr<compaction::major_keyspace_compaction_task_impl>> force_keyspace_compaction(http_context& ctx, std::unique_ptr<http::request> req) {
auto& db = ctx.db;
auto [ keyspace, table_infos ] = parse_table_infos(ctx, *req, "cf");
auto flush = validate_bool_x(req->get_query_param("flush_memtables"), true);
auto consider_only_existing_data = validate_bool_x(req->get_query_param("consider_only_existing_data"), false);
apilog.info("force_keyspace_compaction: keyspace={} tables={}, flush={} consider_only_existing_data={}", keyspace, table_infos, flush, consider_only_existing_data);
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
std::optional<compaction::flush_mode> fmopt;
if (!flush && !consider_only_existing_data) {
fmopt = compaction::flush_mode::skip;
}
return compaction_module.make_and_start_task<compaction::major_keyspace_compaction_task_impl>({}, std::move(keyspace), tasks::task_id::create_null_id(), db, table_infos, fmopt, consider_only_existing_data);
}
static future<shared_ptr<compaction::upgrade_sstables_compaction_task_impl>> upgrade_sstables(http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) {
auto& db = ctx.db;
bool exclude_current_version = req_param<bool>(*req, "exclude_current_version", false);
apilog.info("upgrade_sstables: keyspace={} tables={} exclude_current_version={}", keyspace, table_infos, exclude_current_version);
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
return compaction_module.make_and_start_task<compaction::upgrade_sstables_compaction_task_impl>({}, std::move(keyspace), db, table_infos, exclude_current_version);
}
static future<shared_ptr<compaction::cleanup_keyspace_compaction_task_impl>> force_keyspace_cleanup(http_context& ctx, sharded<service::storage_service>& ss, std::unique_ptr<http::request> req) {
auto& db = ctx.db;
auto [keyspace, table_infos] = parse_table_infos(ctx, *req);
const auto& rs = db.local().find_keyspace(keyspace).get_replication_strategy();
if (rs.is_local() || !rs.is_vnode_based()) {
auto reason = rs.is_local() ? "require" : "support";
apilog.info("Keyspace {} does not {} cleanup", keyspace, reason);
co_return nullptr;
}
apilog.info("force_keyspace_cleanup: keyspace={} tables={}", keyspace, table_infos);
if (!co_await ss.local().is_cleanup_allowed(keyspace)) {
auto msg = "Can not perform cleanup operation when topology changes";
apilog.warn("force_keyspace_cleanup: keyspace={} tables={}: {}", keyspace, table_infos, msg);
co_await coroutine::return_exception(std::runtime_error(msg));
}
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
co_return co_await compaction_module.make_and_start_task<compaction::cleanup_keyspace_compaction_task_impl>(
{}, std::move(keyspace), db, table_infos, compaction::flush_mode::all_tables, tasks::is_user_task::yes);
}
void set_tasks_compaction_module(http_context& ctx, routes& r, sharded<service::storage_service>& ss, sharded<db::snapshot_ctl>& snap_ctl) {
t::force_keyspace_compaction_async.set(r, [&ctx](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto& db = ctx.db;
auto [ keyspace, table_infos ] = parse_table_infos(ctx, *req, "cf");
auto flush = validate_bool_x(req->get_query_param("flush_memtables"), true);
apilog.debug("force_keyspace_compaction_async: keyspace={} tables={}, flush={}", keyspace, table_infos, flush);
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
std::optional<compaction::flush_mode> fmopt;
if (!flush) {
fmopt = compaction::flush_mode::skip;
}
auto task = co_await compaction_module.make_and_start_task<compaction::major_keyspace_compaction_task_impl>({}, std::move(keyspace), tasks::task_id::create_null_id(), db, table_infos, fmopt);
auto task = co_await force_keyspace_compaction(ctx, std::move(req));
co_return json::json_return_type(task->get_status().id.to_sstring());
});
ss::force_keyspace_compaction.set(r, [&ctx](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto& db = ctx.db;
auto [ keyspace, table_infos ] = parse_table_infos(ctx, *req, "cf");
auto flush = validate_bool_x(req->get_query_param("flush_memtables"), true);
auto consider_only_existing_data = validate_bool_x(req->get_query_param("consider_only_existing_data"), false);
apilog.info("force_keyspace_compaction: keyspace={} tables={}, flush={} consider_only_existing_data={}", keyspace, table_infos, flush, consider_only_existing_data);
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
std::optional<compaction::flush_mode> fmopt;
if (!flush && !consider_only_existing_data) {
fmopt = compaction::flush_mode::skip;
}
auto task = co_await compaction_module.make_and_start_task<compaction::major_keyspace_compaction_task_impl>({}, std::move(keyspace), tasks::task_id::create_null_id(), db, table_infos, fmopt, consider_only_existing_data);
auto task = co_await force_keyspace_compaction(ctx, std::move(req));
co_await task->done();
co_return json_void();
});
t::force_keyspace_cleanup_async.set(r, [&ctx, &ss](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto& db = ctx.db;
auto [keyspace, table_infos] = parse_table_infos(ctx, *req);
apilog.info("force_keyspace_cleanup_async: keyspace={} tables={}", keyspace, table_infos);
if (!co_await ss.local().is_cleanup_allowed(keyspace)) {
auto msg = "Can not perform cleanup operation when topology changes";
apilog.warn("force_keyspace_cleanup_async: keyspace={} tables={}: {}", keyspace, table_infos, msg);
co_await coroutine::return_exception(std::runtime_error(msg));
tasks::task_id id = tasks::task_id::create_null_id();
auto task = co_await force_keyspace_cleanup(ctx, ss, std::move(req));
if (task) {
id = task->get_status().id;
}
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<compaction::cleanup_keyspace_compaction_task_impl>({}, std::move(keyspace), db, table_infos, compaction::flush_mode::all_tables, tasks::is_user_task::yes);
co_return json::json_return_type(task->get_status().id.to_sstring());
co_return json::json_return_type(id.to_sstring());
});
ss::force_keyspace_cleanup.set(r, [&ctx, &ss](std::unique_ptr<http::request> req) -> future<json::json_return_type> {
auto& db = ctx.db;
auto [keyspace, table_infos] = parse_table_infos(ctx, *req);
const auto& rs = db.local().find_keyspace(keyspace).get_replication_strategy();
if (rs.is_local() || !rs.is_vnode_based()) {
auto reason = rs.is_local() ? "require" : "support";
apilog.info("Keyspace {} does not {} cleanup", keyspace, reason);
co_return json::json_return_type(0);
auto task = co_await force_keyspace_cleanup(ctx, ss, std::move(req));
if (task) {
co_await task->done();
}
apilog.info("force_keyspace_cleanup: keyspace={} tables={}", keyspace, table_infos);
if (!co_await ss.local().is_cleanup_allowed(keyspace)) {
auto msg = "Can not perform cleanup operation when topology changes";
apilog.warn("force_keyspace_cleanup: keyspace={} tables={}: {}", keyspace, table_infos, msg);
co_await coroutine::return_exception(std::runtime_error(msg));
}
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<compaction::cleanup_keyspace_compaction_task_impl>(
{}, std::move(keyspace), db, table_infos, compaction::flush_mode::all_tables, tasks::is_user_task::yes);
co_await task->done();
co_return json::json_return_type(0);
});
@@ -129,25 +131,12 @@ void set_tasks_compaction_module(http_context& ctx, routes& r, sharded<service::
}));
t::upgrade_sstables_async.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) -> future<json::json_return_type> {
auto& db = ctx.db;
bool exclude_current_version = req_param<bool>(*req, "exclude_current_version", false);
apilog.info("upgrade_sstables: keyspace={} tables={} exclude_current_version={}", keyspace, table_infos, exclude_current_version);
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<compaction::upgrade_sstables_compaction_task_impl>({}, std::move(keyspace), db, table_infos, exclude_current_version);
auto task = co_await upgrade_sstables(ctx, std::move(req), std::move(keyspace), std::move(table_infos));
co_return json::json_return_type(task->get_status().id.to_sstring());
}));
ss::upgrade_sstables.set(r, wrap_ks_cf(ctx, [] (http_context& ctx, std::unique_ptr<http::request> req, sstring keyspace, std::vector<table_info> table_infos) -> future<json::json_return_type> {
auto& db = ctx.db;
bool exclude_current_version = req_param<bool>(*req, "exclude_current_version", false);
apilog.info("upgrade_sstables: keyspace={} tables={} exclude_current_version={}", keyspace, table_infos, exclude_current_version);
auto& compaction_module = db.local().get_compaction_manager().get_task_manager_module();
auto task = co_await compaction_module.make_and_start_task<compaction::upgrade_sstables_compaction_task_impl>({}, std::move(keyspace), db, table_infos, exclude_current_version);
auto task = co_await upgrade_sstables(ctx, std::move(req), std::move(keyspace), std::move(table_infos));
co_await task->done();
co_return json::json_return_type(0);
}));
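
The flush decision in the shared `force_keyspace_compaction` helper depends on two query parameters. The sketch below isolates it, assuming only what the hunk shows: `flush_mode::skip` is chosen when both flags are false, and otherwise the optional is left empty so the task picks its own default. `pick_flush_mode` is a hypothetical name, and the real `compaction::flush_mode` enum may have more values than the two that appear in the diff.

```cpp
#include <cassert>
#include <optional>

// Only the two enumerators that appear in the diff; the real enum may differ.
enum class flush_mode { skip, all_tables };

// Flushing is skipped only when the caller disabled flush_memtables AND did
// not ask for consider_only_existing_data; the latter implies a flush even
// if flush_memtables=false was passed.
inline std::optional<flush_mode> pick_flush_mode(bool flush_memtables, bool consider_only_existing_data) {
    if (!flush_memtables && !consider_only_existing_data) {
        return flush_mode::skip;
    }
    return std::nullopt; // leave the default to the compaction task
}

// Usage: all four flag combinations.
inline void check_flush_modes() {
    assert(pick_flush_mode(false, false) == flush_mode::skip);
    assert(!pick_flush_mode(false, true).has_value());
    assert(!pick_flush_mode(true, false).has_value());
    assert(!pick_flush_mode(true, true).has_value());
}
```

Because both the synchronous and the `_async` endpoint now call the same helper, they presumably read `consider_only_existing_data` the same way.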


@@ -233,9 +233,9 @@ future<role_set> ldap_role_manager::query_granted(std::string_view grantee_name,
}
future<role_to_directly_granted_map>
ldap_role_manager::query_all_directly_granted() {
ldap_role_manager::query_all_directly_granted(::service::query_state& qs) {
role_to_directly_granted_map result;
auto roles = co_await query_all();
auto roles = co_await query_all(qs);
for (auto& role: roles) {
auto granted_set = co_await query_granted(role, recursive_role_query::no);
for (auto& granted: granted_set) {
@@ -247,8 +247,8 @@ ldap_role_manager::query_all_directly_granted() {
co_return result;
}
future<role_set> ldap_role_manager::query_all() {
return _std_mgr.query_all();
future<role_set> ldap_role_manager::query_all(::service::query_state& qs) {
return _std_mgr.query_all(qs);
}
future<> ldap_role_manager::create_role(std::string_view role_name) {
@@ -311,12 +311,12 @@ future<bool> ldap_role_manager::can_login(std::string_view role_name) {
}
future<std::optional<sstring>> ldap_role_manager::get_attribute(
std::string_view role_name, std::string_view attribute_name) {
return _std_mgr.get_attribute(role_name, attribute_name);
std::string_view role_name, std::string_view attribute_name, ::service::query_state& qs) {
return _std_mgr.get_attribute(role_name, attribute_name, qs);
}
future<role_manager::attribute_vals> ldap_role_manager::query_attribute_for_all(std::string_view attribute_name) {
return _std_mgr.query_attribute_for_all(attribute_name);
future<role_manager::attribute_vals> ldap_role_manager::query_attribute_for_all(std::string_view attribute_name, ::service::query_state& qs) {
return _std_mgr.query_attribute_for_all(attribute_name, qs);
}
future<> ldap_role_manager::set_attribute(


@@ -75,9 +75,9 @@ class ldap_role_manager : public role_manager {
future<role_set> query_granted(std::string_view, recursive_role_query) override;
future<role_to_directly_granted_map> query_all_directly_granted() override;
future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state&) override;
future<role_set> query_all() override;
future<role_set> query_all(::service::query_state&) override;
future<bool> exists(std::string_view) override;
@@ -85,9 +85,9 @@ class ldap_role_manager : public role_manager {
future<bool> can_login(std::string_view) override;
future<std::optional<sstring>> get_attribute(std::string_view, std::string_view) override;
future<std::optional<sstring>> get_attribute(std::string_view, std::string_view, ::service::query_state&) override;
future<role_manager::attribute_vals> query_attribute_for_all(std::string_view) override;
future<role_manager::attribute_vals> query_attribute_for_all(std::string_view, ::service::query_state&) override;
future<> set_attribute(std::string_view, std::string_view, std::string_view, ::service::group0_batch& mc) override;


@@ -78,11 +78,11 @@ future<role_set> maintenance_socket_role_manager::query_granted(std::string_view
return operation_not_supported_exception<role_set>("QUERY GRANTED");
}
future<role_to_directly_granted_map> maintenance_socket_role_manager::query_all_directly_granted() {
future<role_to_directly_granted_map> maintenance_socket_role_manager::query_all_directly_granted(::service::query_state&) {
return operation_not_supported_exception<role_to_directly_granted_map>("QUERY ALL DIRECTLY GRANTED");
}
future<role_set> maintenance_socket_role_manager::query_all() {
future<role_set> maintenance_socket_role_manager::query_all(::service::query_state&) {
return operation_not_supported_exception<role_set>("QUERY ALL");
}
@@ -98,11 +98,11 @@ future<bool> maintenance_socket_role_manager::can_login(std::string_view role_na
return make_ready_future<bool>(true);
}
future<std::optional<sstring>> maintenance_socket_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name) {
future<std::optional<sstring>> maintenance_socket_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state&) {
return operation_not_supported_exception<std::optional<sstring>>("GET ATTRIBUTE");
}
future<role_manager::attribute_vals> maintenance_socket_role_manager::query_attribute_for_all(std::string_view attribute_name) {
future<role_manager::attribute_vals> maintenance_socket_role_manager::query_attribute_for_all(std::string_view attribute_name, ::service::query_state&) {
return operation_not_supported_exception<role_manager::attribute_vals>("QUERY ATTRIBUTE");
}


@@ -53,9 +53,9 @@ public:
virtual future<role_set> query_granted(std::string_view grantee_name, recursive_role_query) override;
virtual future<role_to_directly_granted_map> query_all_directly_granted() override;
virtual future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state&) override;
virtual future<role_set> query_all() override;
virtual future<role_set> query_all(::service::query_state&) override;
virtual future<bool> exists(std::string_view role_name) override;
@@ -63,9 +63,9 @@ public:
virtual future<bool> can_login(std::string_view role_name) override;
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name) override;
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state&) override;
virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name) override;
virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name, ::service::query_state&) override;
virtual future<> set_attribute(std::string_view role_name, std::string_view attribute_name, std::string_view attribute_value, ::service::group0_batch& mc) override;


@@ -36,7 +36,8 @@ static const std::unordered_map<sstring, auth::permission> permission_names({
{"MODIFY", auth::permission::MODIFY},
{"AUTHORIZE", auth::permission::AUTHORIZE},
{"DESCRIBE", auth::permission::DESCRIBE},
{"EXECUTE", auth::permission::EXECUTE}});
{"EXECUTE", auth::permission::EXECUTE},
{"VECTOR_SEARCH_INDEXING", auth::permission::VECTOR_SEARCH_INDEXING}});
const sstring& auth::permissions::to_string(permission p) {
for (auto& v : permission_names) {


@@ -33,6 +33,7 @@ enum class permission {
// data access
SELECT, // required for SELECT.
MODIFY, // required for INSERT, UPDATE, DELETE, TRUNCATE.
VECTOR_SEARCH_INDEXING, // required for SELECT from tables with vector indexes if SELECT permission is not granted.
// permission management
AUTHORIZE, // required for GRANT and REVOKE.
@@ -54,7 +55,8 @@ typedef enum_set<
permission::MODIFY,
permission::AUTHORIZE,
permission::DESCRIBE,
permission::EXECUTE>> permission_set;
permission::EXECUTE,
permission::VECTOR_SEARCH_INDEXING>> permission_set;
bool operator<(const permission_set&, const permission_set&);


@@ -41,22 +41,26 @@ static const std::unordered_map<resource_kind, std::size_t> max_parts{
{resource_kind::functions, 2}};
static permission_set applicable_permissions(const data_resource_view& dv) {
if (dv.table()) {
return permission_set::of<
// We only support VECTOR_SEARCH_INDEXING permission for ALL KEYSPACES.
auto set = permission_set::of<
permission::ALTER,
permission::DROP,
permission::SELECT,
permission::MODIFY,
permission::AUTHORIZE>();
if (!dv.table()) {
set.add(permission_set::of<permission::CREATE>());
}
return permission_set::of<
permission::CREATE,
permission::ALTER,
permission::DROP,
permission::SELECT,
permission::MODIFY,
permission::AUTHORIZE>();
if (!dv.table() && !dv.keyspace()) {
set.add(permission_set::of<permission::VECTOR_SEARCH_INDEXING>());
}
return set;
}
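
The reworked `applicable_permissions` for data resources builds one base set and then widens it by resource level. The following is a self-contained sketch of that rule, using strings in place of the `auth::permission` enum; `data_level` and `applicable_data_permissions` are hypothetical names introduced only for illustration.

```cpp
#include <cassert>
#include <set>
#include <string>

// Hypothetical model of the three data-resource levels.
enum class data_level { all_keyspaces, keyspace, table };

// Base set applies everywhere; CREATE is added for anything above a table;
// VECTOR_SEARCH_INDEXING is added only at the ALL KEYSPACES level.
inline std::set<std::string> applicable_data_permissions(data_level level) {
    std::set<std::string> set{"ALTER", "DROP", "SELECT", "MODIFY", "AUTHORIZE"};
    if (level != data_level::table) {
        set.insert("CREATE");
    }
    if (level == data_level::all_keyspaces) {
        set.insert("VECTOR_SEARCH_INDEXING");
    }
    return set;
}

// Usage: the set grows by one permission per level.
inline void check_permission_levels() {
    assert(applicable_data_permissions(data_level::table).size() == 5);
    assert(applicable_data_permissions(data_level::keyspace).size() == 6);
    assert(applicable_data_permissions(data_level::all_keyspaces).size() == 7);
}
```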
static permission_set applicable_permissions(const role_resource_view& rv) {


@@ -17,12 +17,17 @@
#include <seastar/core/format.hh>
#include <seastar/core/sstring.hh>
#include "auth/common.hh"
#include "auth/resource.hh"
#include "cql3/description.hh"
#include "seastarx.hh"
#include "exceptions/exceptions.hh"
#include "service/raft/raft_group0_client.hh"
namespace service {
class query_state;
};
namespace auth {
struct role_config final {
@@ -167,9 +172,9 @@ public:
/// (role2, role3)
/// }
///
virtual future<role_to_directly_granted_map> query_all_directly_granted() = 0;
virtual future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state& = internal_distributed_query_state()) = 0;
virtual future<role_set> query_all() = 0;
virtual future<role_set> query_all(::service::query_state& = internal_distributed_query_state()) = 0;
virtual future<bool> exists(std::string_view role_name) = 0;
@@ -186,12 +191,12 @@ public:
///
/// \returns the value of the named attribute, if one is set.
///
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name) = 0;
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state& = internal_distributed_query_state()) = 0;
///
/// \returns a mapping of each role's value for the named attribute, if one is set for the role.
///
virtual future<attribute_vals> query_attribute_for_all(std::string_view attribute_name) = 0;
virtual future<attribute_vals> query_attribute_for_all(std::string_view attribute_name, ::service::query_state& = internal_distributed_query_state()) = 0;
/// Sets `attribute_name` with `attribute_value` for `role_name`.
/// \returns an exceptional future with nonexistant_role if the role does not exist.
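
The base class declares the new `query_state&` parameter with a default argument, while the overrides in the other hunks declare it without one. In C++, default arguments are taken from the static type used at the call site and are not inherited by overrides. The sketch below illustrates that language rule with hypothetical types (`query_state`, `default_state`, `base`, `derived`); it is not the `role_manager` interface itself.

```cpp
#include <cassert>

struct query_state {};

// Hypothetical stand-in for internal_distributed_query_state().
inline query_state& default_state() {
    static query_state qs;
    return qs;
}

struct base {
    virtual ~base() = default;
    // Default argument declared on the base only.
    virtual bool query_all(query_state& qs = default_state()) = 0;
};

struct derived : base {
    // No default here; reports whether it received the default state.
    bool query_all(query_state& qs) override { return &qs == &default_state(); }
};

// Through a base reference the default from base applies.
inline bool call_via_base(base& b) { return b.query_all(); }

// Through the derived type the argument must be spelled out:
// d.query_all() would not compile, because derived declares no default.
inline bool call_via_derived(derived& d) { return d.query_all(default_state()); }

// Usage: both calls reach derived::query_all with the default state.
inline void check_default_arguments() {
    derived d;
    assert(call_via_base(d));
    assert(call_via_derived(d));
}
```

This would be consistent with `describe_role_grants` passing `internal_distributed_query_state()` explicitly in the `standard_role_manager` hunk, since that call is made on the derived class.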


@@ -231,6 +231,17 @@ struct command_desc {
} type_ = type::OTHER;
};
/// Similar to command_desc, but used in cases where multiple permissions allow the access to the resource.
struct command_desc_with_permission_set {
permission_set permission;
const ::auth::resource& resource;
enum class type {
ALTER_WITH_OPTS,
ALTER_SYSTEM_WITH_ALLOWED_OPTS,
OTHER
} type_ = type::OTHER;
};
///
/// Protected resources cannot be modified even if the performer has permissions to do so.
///


@@ -663,21 +663,30 @@ future<role_set> standard_role_manager::query_granted(std::string_view grantee_n
});
}
future<role_to_directly_granted_map> standard_role_manager::query_all_directly_granted() {
future<role_to_directly_granted_map> standard_role_manager::query_all_directly_granted(::service::query_state& qs) {
const sstring query = seastar::format("SELECT * FROM {}.{}",
get_auth_ks_name(_qp),
meta::role_members_table::name);
const auto results = co_await _qp.execute_internal(
query,
db::consistency_level::ONE,
qs,
cql3::query_processor::cache_internal::yes);
role_to_directly_granted_map roles_map;
co_await _qp.query_internal(query, [&roles_map] (const cql3::untyped_result_set_row& row) -> future<stop_iteration> {
roles_map.insert({row.get_as<sstring>("member"), row.get_as<sstring>("role")});
co_return stop_iteration::no;
});
std::transform(
results->begin(),
results->end(),
std::inserter(roles_map, roles_map.begin()),
[] (const cql3::untyped_result_set_row& row) {
return std::make_pair(row.get_as<sstring>("member"), row.get_as<sstring>("role")); }
);
co_return roles_map;
}
future<role_set> standard_role_manager::query_all() {
future<role_set> standard_role_manager::query_all(::service::query_state& qs) {
const sstring query = seastar::format("SELECT {} FROM {}.{}",
meta::roles_table::role_col_name,
get_auth_ks_name(_qp),
@@ -695,7 +704,7 @@ future<role_set> standard_role_manager::query_all() {
const auto results = co_await _qp.execute_internal(
query,
db::consistency_level::QUORUM,
internal_distributed_query_state(),
qs,
cql3::query_processor::cache_internal::yes);
role_set roles;
@@ -727,11 +736,11 @@ future<bool> standard_role_manager::can_login(std::string_view role_name) {
});
}
future<std::optional<sstring>> standard_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name) {
future<std::optional<sstring>> standard_role_manager::get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state& qs) {
const sstring query = seastar::format("SELECT name, value FROM {}.{} WHERE role = ? AND name = ?",
get_auth_ks_name(_qp),
meta::role_attributes_table::name);
const auto result_set = co_await _qp.execute_internal(query, {sstring(role_name), sstring(attribute_name)}, cql3::query_processor::cache_internal::yes);
const auto result_set = co_await _qp.execute_internal(query, db::consistency_level::ONE, qs, {sstring(role_name), sstring(attribute_name)}, cql3::query_processor::cache_internal::yes);
if (!result_set->empty()) {
const cql3::untyped_result_set_row &row = result_set->one();
co_return std::optional<sstring>(row.get_as<sstring>("value"));
@@ -739,11 +748,11 @@ future<std::optional<sstring>> standard_role_manager::get_attribute(std::string_
co_return std::optional<sstring>{};
}
future<role_manager::attribute_vals> standard_role_manager::query_attribute_for_all (std::string_view attribute_name) {
return query_all().then([this, attribute_name] (role_set roles) {
return do_with(attribute_vals{}, [this, attribute_name, roles = std::move(roles)] (attribute_vals &role_to_att_val) {
return parallel_for_each(roles.begin(), roles.end(), [this, &role_to_att_val, attribute_name] (sstring role) {
return get_attribute(role, attribute_name).then([&role_to_att_val, role] (std::optional<sstring> att_val) {
future<role_manager::attribute_vals> standard_role_manager::query_attribute_for_all (std::string_view attribute_name, ::service::query_state& qs) {
return query_all(qs).then([this, attribute_name, &qs] (role_set roles) {
return do_with(attribute_vals{}, [this, attribute_name, roles = std::move(roles), &qs] (attribute_vals &role_to_att_val) {
return parallel_for_each(roles.begin(), roles.end(), [this, &role_to_att_val, attribute_name, &qs] (sstring role) {
return get_attribute(role, attribute_name, qs).then([&role_to_att_val, role] (std::optional<sstring> att_val) {
if (att_val) {
role_to_att_val.emplace(std::move(role), std::move(*att_val));
}
@@ -788,7 +797,7 @@ future<> standard_role_manager::remove_attribute(std::string_view role_name, std
future<std::vector<cql3::description>> standard_role_manager::describe_role_grants() {
std::vector<cql3::description> result{};
const auto grants = co_await query_all_directly_granted();
const auto grants = co_await query_all_directly_granted(internal_distributed_query_state());
result.reserve(grants.size());
for (const auto& [grantee_role, granted_role] : grants) {


@@ -66,9 +66,9 @@ public:
virtual future<role_set> query_granted(std::string_view grantee_name, recursive_role_query) override;
virtual future<role_to_directly_granted_map> query_all_directly_granted() override;
virtual future<role_to_directly_granted_map> query_all_directly_granted(::service::query_state&) override;
virtual future<role_set> query_all() override;
virtual future<role_set> query_all(::service::query_state&) override;
virtual future<bool> exists(std::string_view role_name) override;
@@ -76,9 +76,9 @@ public:
virtual future<bool> can_login(std::string_view role_name) override;
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name) override;
virtual future<std::optional<sstring>> get_attribute(std::string_view role_name, std::string_view attribute_name, ::service::query_state&) override;
virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name) override;
virtual future<role_manager::attribute_vals> query_attribute_for_all(std::string_view attribute_name, ::service::query_state&) override;
virtual future<> set_attribute(std::string_view role_name, std::string_view attribute_name, std::string_view attribute_value, ::service::group0_batch& mc) override;


@@ -1209,6 +1209,23 @@ future<mutation> create_table_streams_mutation(table_id table, db_clock::time_po
co_return std::move(m);
}
future<mutation> create_table_streams_mutation(table_id table, db_clock::time_point stream_ts, const utils::chunked_vector<cdc::stream_id>& stream_ids, api::timestamp_type ts) {
auto s = db::system_keyspace::cdc_streams_state();
mutation m(s, partition_key::from_single_value(*s,
data_value(table.uuid()).serialize_nonnull()
));
m.set_static_cell("timestamp", stream_ts, ts);
for (const auto& sid : stream_ids) {
auto ck = clustering_key::from_singular(*s, dht::token::to_int64(sid.token()));
m.set_cell(ck, "stream_id", data_value(sid.to_bytes()), ts);
co_await coroutine::maybe_yield();
}
co_return std::move(m);
}
utils::chunked_vector<mutation>
make_drop_table_streams_mutations(table_id table, api::timestamp_type ts) {
utils::chunked_vector<mutation> mutations;
@@ -1235,32 +1252,50 @@ future<> generation_service::load_cdc_tablet_streams(std::optional<std::unordere
tables_to_process = _cdc_metadata.get_tables_with_cdc_tablet_streams() | std::ranges::to<std::unordered_set<table_id>>();
}
auto read_streams_state = [this] (const std::optional<std::unordered_set<table_id>>& tables, noncopyable_function<future<>(table_id, db_clock::time_point, std::vector<cdc::stream_id>)> f) -> future<> {
auto read_streams_state = [this] (const std::optional<std::unordered_set<table_id>>& tables, noncopyable_function<future<>(table_id, db_clock::time_point, utils::chunked_vector<cdc::stream_id>)> f) -> future<> {
if (tables) {
for (auto table : *tables) {
co_await _sys_ks.local().read_cdc_streams_state(table, [&] (table_id table, db_clock::time_point base_ts, std::vector<cdc::stream_id> base_stream_set) -> future<> {
co_await _sys_ks.local().read_cdc_streams_state(table, [&] (table_id table, db_clock::time_point base_ts, utils::chunked_vector<cdc::stream_id> base_stream_set) -> future<> {
return f(table, base_ts, std::move(base_stream_set));
});
}
} else {
co_await _sys_ks.local().read_cdc_streams_state(std::nullopt, [&] (table_id table, db_clock::time_point base_ts, std::vector<cdc::stream_id> base_stream_set) -> future<> {
co_await _sys_ks.local().read_cdc_streams_state(std::nullopt, [&] (table_id table, db_clock::time_point base_ts, utils::chunked_vector<cdc::stream_id> base_stream_set) -> future<> {
return f(table, base_ts, std::move(base_stream_set));
});
}
};
co_await read_streams_state(changed_tables, [this, &tables_to_process] (table_id table, db_clock::time_point base_ts, std::vector<cdc::stream_id> base_stream_set) -> future<> {
co_await read_streams_state(changed_tables, [this, &tables_to_process] (table_id table, db_clock::time_point base_ts, utils::chunked_vector<cdc::stream_id> base_stream_set) -> future<> {
table_streams new_table_map;
auto append_stream = [&new_table_map] (db_clock::time_point stream_tp, std::vector<cdc::stream_id> stream_set) {
auto append_stream = [&new_table_map] (db_clock::time_point stream_tp, utils::chunked_vector<cdc::stream_id> stream_set) {
auto ts = std::chrono::duration_cast<api::timestamp_clock::duration>(stream_tp.time_since_epoch()).count();
new_table_map[ts] = committed_stream_set {stream_tp, std::move(stream_set)};
};
append_stream(base_ts, std::move(base_stream_set));
// if we already have a loaded streams map, and the base timestamp is unchanged, then read
// the history entries starting from the latest one we have and append it to the existing map.
// we can do it because we only append new rows with higher timestamps to the history table.
std::optional<std::reference_wrapper<const committed_stream_set>> from_streams;
std::optional<db_clock::time_point> from_ts;
const auto& all_streams = _cdc_metadata.get_all_tablet_streams();
if (auto it = all_streams.find(table); it != all_streams.end()) {
const auto& current_map = *it->second;
if (current_map.cbegin()->second.ts == base_ts) {
const auto& latest_entry = current_map.crbegin()->second;
from_streams = std::cref(latest_entry);
from_ts = latest_entry.ts;
}
}
co_await _sys_ks.local().read_cdc_streams_history(table, [&] (table_id tid, db_clock::time_point ts, cdc_stream_diff diff) -> future<> {
const auto& prev_stream_set = std::crbegin(new_table_map)->second.streams;
if (!from_ts) {
append_stream(base_ts, std::move(base_stream_set));
}
co_await _sys_ks.local().read_cdc_streams_history(table, from_ts, [&] (table_id tid, db_clock::time_point ts, cdc_stream_diff diff) -> future<> {
const auto& prev_stream_set = new_table_map.empty() ?
from_streams->get().streams : std::crbegin(new_table_map)->second.streams;
append_stream(ts, co_await cdc::metadata::construct_next_stream_set(
prev_stream_set, std::move(diff.opened_streams), diff.closed_streams));
@@ -1272,7 +1307,11 @@ future<> generation_service::load_cdc_tablet_streams(std::optional<std::unordere
new_table_map_copy[ts] = entry;
co_await coroutine::maybe_yield();
}
svc._cdc_metadata.load_tablet_streams_map(table, std::move(new_table_map_copy));
if (!from_ts) {
svc._cdc_metadata.load_tablet_streams_map(table, std::move(new_table_map_copy));
} else {
svc._cdc_metadata.append_tablet_streams_map(table, std::move(new_table_map_copy));
}
}));
tables_to_process.erase(table);
@@ -1306,7 +1345,7 @@ future<> generation_service::query_cdc_timestamps(table_id table, bool ascending
}
}
future<> generation_service::query_cdc_streams(table_id table, noncopyable_function<future<>(db_clock::time_point, const std::vector<cdc::stream_id>& current, cdc::cdc_stream_diff)> f) {
future<> generation_service::query_cdc_streams(table_id table, noncopyable_function<future<>(db_clock::time_point, const utils::chunked_vector<cdc::stream_id>& current, cdc::cdc_stream_diff)> f) {
const auto& all_tables = _cdc_metadata.get_all_tablet_streams();
auto table_it = all_tables.find(table);
if (table_it == all_tables.end()) {
@@ -1363,8 +1402,8 @@ future<> generation_service::generate_tablet_resize_update(utils::chunked_vector
co_return;
}
std::vector<cdc::stream_id> new_streams;
new_streams.reserve(new_tablet_map.tablet_count());
utils::chunked_vector<cdc::stream_id> new_streams;
co_await utils::reserve_gently(new_streams, new_tablet_map.tablet_count());
for (auto tid : new_tablet_map.tablet_ids()) {
new_streams.emplace_back(new_tablet_map.get_last_token(tid), 0);
co_await coroutine::maybe_yield();
@@ -1386,4 +1425,113 @@ future<> generation_service::generate_tablet_resize_update(utils::chunked_vector
muts.emplace_back(std::move(mut));
}
future<utils::chunked_vector<mutation>> get_cdc_stream_gc_mutations(table_id table, db_clock::time_point base_ts, const utils::chunked_vector<cdc::stream_id>& base_stream_set, api::timestamp_type ts) {
utils::chunked_vector<mutation> muts;
muts.reserve(2);
auto gc_now = gc_clock::now();
auto tombstone_ts = ts - 1;
{
// write the new base stream set to cdc_streams_state
auto s = db::system_keyspace::cdc_streams_state();
mutation m(s, partition_key::from_single_value(*s,
data_value(table.uuid()).serialize_nonnull()
));
m.partition().apply(tombstone(tombstone_ts, gc_now));
m.set_static_cell("timestamp", data_value(base_ts), ts);
for (const auto& sid : base_stream_set) {
co_await coroutine::maybe_yield();
auto ck = clustering_key::from_singular(*s, dht::token::to_int64(sid.token()));
m.set_cell(ck, "stream_id", data_value(sid.to_bytes()), ts);
}
muts.emplace_back(std::move(m));
}
{
// remove all entries from cdc_streams_history up to the new base
auto s = db::system_keyspace::cdc_streams_history();
mutation m(s, partition_key::from_single_value(*s,
data_value(table.uuid()).serialize_nonnull()
));
auto range = query::clustering_range::make_ending_with({
clustering_key_prefix::from_single_value(*s, timestamp_type->decompose(base_ts)), true});
auto bv = bound_view::from_range(range);
m.partition().apply_delete(*s, range_tombstone{bv.first, bv.second, tombstone{ts, gc_now}});
muts.emplace_back(std::move(m));
}
co_return std::move(muts);
}
table_streams::const_iterator get_new_base_for_gc(const table_streams& streams_map, std::chrono::seconds ttl) {
// find the most recent timestamp that is older than ttl_seconds, which will become the new base.
// all streams with older timestamps can be removed because they are closed for more than ttl_seconds
// (they are all replaced by streams with the newer timestamp).
auto ts_upper_bound = db_clock::now() - ttl;
auto it = streams_map.begin();
while (it != streams_map.end()) {
auto next_it = std::next(it);
if (next_it == streams_map.end()) {
break;
}
auto next_tp = next_it->second.ts;
if (next_tp <= ts_upper_bound) {
// the next timestamp is older than ttl_seconds, so the current one is obsolete
it = next_it;
} else {
break;
}
}
return it;
}
future<utils::chunked_vector<mutation>> generation_service::garbage_collect_cdc_streams_for_table(table_id table, std::optional<std::chrono::seconds> ttl, api::timestamp_type ts) {
const auto& table_streams = *_cdc_metadata.get_all_tablet_streams().at(table);
// if TTL is not provided by the caller then use the table's CDC TTL
auto base_schema = cdc::get_base_table(_db, *_db.find_schema(table));
ttl = ttl.or_else([&] -> std::optional<std::chrono::seconds> {
auto ttl_seconds = base_schema->cdc_options().ttl();
if (ttl_seconds > 0) {
return std::chrono::seconds(ttl_seconds);
} else {
// ttl=0 means no ttl
return std::nullopt;
}
});
if (!ttl) {
co_return utils::chunked_vector<mutation>{};
}
auto new_base_it = get_new_base_for_gc(table_streams, *ttl);
if (new_base_it == table_streams.begin() || new_base_it == table_streams.end()) {
// nothing to gc
co_return utils::chunked_vector<mutation>{};
}
for (auto it = table_streams.begin(); it != new_base_it; ++it) {
cdc_log.info("Garbage collecting CDC stream metadata for table {}: removing generation {} because it is older than the CDC TTL of {} seconds",
table, it->second.ts, *ttl);
}
co_return co_await get_cdc_stream_gc_mutations(table, new_base_it->second.ts, new_base_it->second.streams, ts);
}
future<> generation_service::garbage_collect_cdc_streams(utils::chunked_vector<canonical_mutation>& muts, api::timestamp_type ts) {
for (auto table : _cdc_metadata.get_tables_with_cdc_tablet_streams()) {
co_await coroutine::maybe_yield();
auto table_muts = co_await garbage_collect_cdc_streams_for_table(table, std::nullopt, ts);
for (auto&& m : table_muts) {
muts.emplace_back(std::move(m));
}
}
}
} // namespace cdc
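
Reviewer note: the selection rule in `get_new_base_for_gc` is easier to check in isolation. The sketch below is a minimal stand-in, not the code from this change: `streams_map`, `pick_new_base`, and the use of plain integer seconds for `now` and `ttl` are assumptions made for illustration. It keeps the same walk as the hunk above: start at the oldest entry and step forward only while the next entry is itself already older than the TTL.

```cpp
#include <cassert>
#include <cstdint>
#include <iterator>
#include <map>

// Hypothetical stand-in for table_streams: key = generation timestamp in
// seconds, value unused in this sketch.
using streams_map = std::map<std::int64_t, int>;

// Returns the entry that would become the new base: the newest entry whose
// timestamp is <= now - ttl, or the first entry if no later entry qualifies.
// Returns end() only for an empty map.
streams_map::const_iterator pick_new_base(const streams_map& m, std::int64_t now, std::int64_t ttl) {
    assert(ttl >= 0);
    const std::int64_t upper_bound = now - ttl;
    auto it = m.cbegin();
    while (it != m.cend()) {
        auto next = std::next(it);
        if (next == m.cend() || next->first > upper_bound) {
            break; // the successor is missing or still within the TTL
        }
        it = next; // the successor is old enough, so the current entry is obsolete
    }
    return it;
}
```

When the result equals `begin()`, nothing precedes the base, which matches the "nothing to gc" branch in `garbage_collect_cdc_streams_for_table`.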

View File

@@ -143,12 +143,12 @@ stream_state read_stream_state(int8_t val);
struct committed_stream_set {
db_clock::time_point ts;
std::vector<cdc::stream_id> streams;
utils::chunked_vector<cdc::stream_id> streams;
};
struct cdc_stream_diff {
std::vector<stream_id> closed_streams;
std::vector<stream_id> opened_streams;
utils::chunked_vector<stream_id> closed_streams;
utils::chunked_vector<stream_id> opened_streams;
};
using table_streams = std::map<api::timestamp_type, committed_stream_set>;
@@ -220,8 +220,11 @@ future<utils::chunked_vector<mutation>> get_cdc_generation_mutations_v3(
size_t mutation_size_threshold, api::timestamp_type mutation_timestamp);
future<mutation> create_table_streams_mutation(table_id, db_clock::time_point, const locator::tablet_map&, api::timestamp_type);
future<mutation> create_table_streams_mutation(table_id, db_clock::time_point, const utils::chunked_vector<cdc::stream_id>&, api::timestamp_type);
utils::chunked_vector<mutation> make_drop_table_streams_mutations(table_id, api::timestamp_type ts);
future<mutation> get_switch_streams_mutation(table_id table, db_clock::time_point stream_ts, cdc_stream_diff diff, api::timestamp_type ts);
future<utils::chunked_vector<mutation>> get_cdc_stream_gc_mutations(table_id table, db_clock::time_point base_ts, const utils::chunked_vector<cdc::stream_id>& base_stream_set, api::timestamp_type ts);
table_streams::const_iterator get_new_base_for_gc(const table_streams&, std::chrono::seconds ttl);
} // namespace cdc

View File

@@ -149,10 +149,13 @@ public:
future<> load_cdc_tablet_streams(std::optional<std::unordered_set<table_id>> changed_tables);
future<> query_cdc_timestamps(table_id table, bool ascending, noncopyable_function<future<>(db_clock::time_point)> f);
future<> query_cdc_streams(table_id table, noncopyable_function<future<>(db_clock::time_point, const std::vector<cdc::stream_id>& current, cdc::cdc_stream_diff)> f);
future<> query_cdc_streams(table_id table, noncopyable_function<future<>(db_clock::time_point, const utils::chunked_vector<cdc::stream_id>& current, cdc::cdc_stream_diff)> f);
future<> generate_tablet_resize_update(utils::chunked_vector<canonical_mutation>& muts, table_id table, const locator::tablet_map& new_tablet_map, api::timestamp_type ts);
future<utils::chunked_vector<mutation>> garbage_collect_cdc_streams_for_table(table_id table, std::optional<std::chrono::seconds> ttl, api::timestamp_type ts);
future<> garbage_collect_cdc_streams(utils::chunked_vector<canonical_mutation>& muts, api::timestamp_type ts);
private:
/* Retrieve the CDC generation which starts at the given timestamp (from a distributed table created for this purpose)
* and start using it for CDC log writes if it's not obsolete.

View File

@@ -67,10 +67,15 @@ shared_ptr<locator::abstract_replication_strategy> generate_replication_strategy
return locator::abstract_replication_strategy::create_replication_strategy(ksm.strategy_name(), params);
}
// When dropping a column from a CDC log table, we set the drop timestamp
// `column_drop_leeway` seconds into the future to ensure that for writes concurrent
// with column drop, the write timestamp is before the column drop timestamp.
constexpr auto column_drop_leeway = std::chrono::seconds(5);
} // anonymous namespace
namespace cdc {
static schema_ptr create_log_schema(const schema&, const replica::database&, const keyspace_metadata&,
static schema_ptr create_log_schema(const schema&, const replica::database&, const keyspace_metadata&, api::timestamp_type,
std::optional<table_id> = {}, schema_ptr = nullptr);
}
@@ -182,7 +187,7 @@ public:
muts.emplace_back(std::move(mut));
}
void on_pre_create_column_families(const keyspace_metadata& ksm, std::vector<schema_ptr>& cfms) override {
void on_pre_create_column_families(const keyspace_metadata& ksm, std::vector<schema_ptr>& cfms, api::timestamp_type ts) override {
std::vector<schema_ptr> new_cfms;
for (auto sp : cfms) {
@@ -201,7 +206,7 @@ public:
}
// in seastar thread
auto log_schema = create_log_schema(schema, db, ksm);
auto log_schema = create_log_schema(schema, db, ksm, ts);
new_cfms.push_back(std::move(log_schema));
}
@@ -248,7 +253,7 @@ public:
}
std::optional<table_id> maybe_id = log_schema ? std::make_optional(log_schema->id()) : std::nullopt;
auto new_log_schema = create_log_schema(new_schema, db, *keyspace.metadata(), std::move(maybe_id), log_schema);
auto new_log_schema = create_log_schema(new_schema, db, *keyspace.metadata(), timestamp, std::move(maybe_id), log_schema);
auto log_mut = log_schema
? db::schema_tables::make_update_table_mutations(_ctxt._proxy, keyspace.metadata(), log_schema, new_log_schema, timestamp)
@@ -580,7 +585,7 @@ bytes log_data_column_deleted_elements_name_bytes(const bytes& column_name) {
}
static schema_ptr create_log_schema(const schema& s, const replica::database& db,
const keyspace_metadata& ksm, std::optional<table_id> uuid, schema_ptr old)
const keyspace_metadata& ksm, api::timestamp_type timestamp, std::optional<table_id> uuid, schema_ptr old)
{
schema_builder b(s.ks_name(), log_name(s.cf_name()));
b.with_partitioner(cdc::cdc_partitioner::classname);
@@ -616,6 +621,28 @@ static schema_ptr create_log_schema(const schema& s, const replica::database& db
b.with_column(log_meta_column_name_bytes("ttl"), long_type);
b.with_column(log_meta_column_name_bytes("end_of_batch"), boolean_type);
b.set_caching_options(caching_options::get_disabled_caching_options());
auto validate_new_column = [&] (const sstring& name) {
// When dropping a column from a CDC log table, we set the drop timestamp to be
// `column_drop_leeway` seconds into the future (see `create_log_schema`).
// Therefore, when recreating a column with the same name, we need to validate
// that it's not recreated too soon and that the drop timestamp has passed.
if (old && old->dropped_columns().contains(name)) {
const auto& drop_info = old->dropped_columns().at(name);
auto create_time = api::timestamp_clock::time_point(api::timestamp_clock::duration(timestamp));
auto drop_time = api::timestamp_clock::time_point(api::timestamp_clock::duration(drop_info.timestamp));
if (drop_time > create_time) {
throw exceptions::invalid_request_exception(format("Cannot add column {} because a column with the same name was dropped too recently. Please retry after {} seconds",
name, std::chrono::duration_cast<std::chrono::seconds>(drop_time - create_time).count() + 1));
}
}
};
auto add_column = [&] (sstring name, data_type type) {
validate_new_column(name);
b.with_column(to_bytes(name), type);
};
auto add_columns = [&] (const schema::const_iterator_range_type& columns, bool is_data_col = false) {
for (const auto& column : columns) {
auto type = column.type;
@@ -637,9 +664,9 @@ static schema_ptr create_log_schema(const schema& s, const replica::database& db
}
));
}
b.with_column(log_data_column_name_bytes(column.name()), type);
add_column(log_data_column_name(column.name_as_text()), type);
if (is_data_col) {
b.with_column(log_data_column_deleted_name_bytes(column.name()), boolean_type);
add_column(log_data_column_deleted_name(column.name_as_text()), boolean_type);
}
if (column.type->is_multi_cell()) {
auto dtype = visit(*type, make_visitor(
@@ -655,7 +682,7 @@ static schema_ptr create_log_schema(const schema& s, const replica::database& db
throw std::invalid_argument("Should not reach");
}
));
b.with_column(log_data_column_deleted_elements_name_bytes(column.name()), dtype);
add_column(log_data_column_deleted_elements_name(column.name_as_text()), dtype);
}
}
};
@@ -669,7 +696,7 @@ static schema_ptr create_log_schema(const schema& s, const replica::database& db
}
auto rs = generate_replication_strategy(ksm);
auto tombstone_gc_ext = seastar::make_shared<tombstone_gc_extension>(get_default_tombstone_gc_mode(*rs, db.get_token_metadata()));
auto tombstone_gc_ext = seastar::make_shared<tombstone_gc_extension>(get_default_tombstone_gc_mode(*rs, db.get_token_metadata(), false));
b.add_extension(tombstone_gc_extension::NAME, std::move(tombstone_gc_ext));
/**
@@ -681,7 +708,8 @@ static schema_ptr create_log_schema(const schema& s, const replica::database& db
// not super efficient, but we don't do this often.
for (auto& col : old->all_columns()) {
if (!b.has_column({col.name(), col.name_as_text() })) {
b.without_column(col.name_as_text(), col.type, api::new_timestamp());
auto drop_ts = api::timestamp_clock::now() + column_drop_leeway;
b.without_column(col.name_as_text(), col.type, drop_ts.time_since_epoch().count());
}
}
}
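
Reviewer note: the interaction between the future-dated drop timestamp and `validate_new_column` can be sketched with plain `std::chrono` types. This is an illustration under assumptions, not the code from this change: `micros`, `drop_timestamp`, and `retry_after_seconds` are hypothetical names, and timestamps are modelled as microsecond durations since the epoch.

```cpp
#include <cassert>
#include <chrono>
#include <optional>

using micros = std::chrono::microseconds;

// A dropped column is stamped `leeway` into the future, so writes racing with
// the drop still carry an earlier timestamp.
micros drop_timestamp(micros now, std::chrono::seconds leeway) {
    assert(leeway.count() >= 0);
    return now + leeway;
}

// Returns std::nullopt when a column with the same name may be re-created at
// `create_ts`; otherwise the number of whole seconds to wait, rounded up by
// adding one to the truncated difference (as the error message in the diff does).
std::optional<long long> retry_after_seconds(micros create_ts, micros drop_ts) {
    if (drop_ts <= create_ts) {
        return std::nullopt;
    }
    const auto wait = std::chrono::duration_cast<std::chrono::seconds>(drop_ts - create_ts);
    return static_cast<long long>(wait.count()) + 1;
}
```

With a leeway of 5 seconds, a re-create attempted 2.5 seconds after the drop would be rejected with a suggested wait of 3 seconds.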

View File

@@ -54,7 +54,7 @@ cdc::stream_id get_stream(
}
static cdc::stream_id get_stream(
const std::vector<cdc::stream_id>& streams,
const utils::chunked_vector<cdc::stream_id>& streams,
dht::token tok) {
if (streams.empty()) {
on_internal_error(cdc_log, "get_stream: streams empty");
@@ -159,7 +159,7 @@ cdc::stream_id cdc::metadata::get_vnode_stream(api::timestamp_type ts, dht::toke
return ret;
}
const std::vector<cdc::stream_id>& cdc::metadata::get_tablet_stream_set(table_id tid, api::timestamp_type ts) const {
const utils::chunked_vector<cdc::stream_id>& cdc::metadata::get_tablet_stream_set(table_id tid, api::timestamp_type ts) const {
auto now = api::new_timestamp();
if (ts > now + get_generation_leeway().count()) {
throw exceptions::invalid_request_exception(seastar::format(
@@ -259,10 +259,10 @@ bool cdc::metadata::prepare(db_clock::time_point tp) {
return !it->second;
}
future<std::vector<cdc::stream_id>> cdc::metadata::construct_next_stream_set(
const std::vector<cdc::stream_id>& prev_stream_set,
std::vector<cdc::stream_id> opened,
const std::vector<cdc::stream_id>& closed) {
future<utils::chunked_vector<cdc::stream_id>> cdc::metadata::construct_next_stream_set(
const utils::chunked_vector<cdc::stream_id>& prev_stream_set,
utils::chunked_vector<cdc::stream_id> opened,
const utils::chunked_vector<cdc::stream_id>& closed) {
if (closed.size() == prev_stream_set.size()) {
// all previous streams are closed, so the next stream set is just the opened streams.
@@ -273,8 +273,8 @@ future<std::vector<cdc::stream_id>> cdc::metadata::construct_next_stream_set(
// streams and removing the closed streams. we assume each stream set is
// sorted by token, and the result is sorted as well.
std::vector<cdc::stream_id> next_stream_set;
next_stream_set.reserve(prev_stream_set.size() + opened.size() - closed.size());
utils::chunked_vector<cdc::stream_id> next_stream_set;
co_await utils::reserve_gently(next_stream_set, prev_stream_set.size() + opened.size() - closed.size());
auto next_prev = prev_stream_set.begin();
auto next_closed = closed.begin();
@@ -306,6 +306,10 @@ void cdc::metadata::load_tablet_streams_map(table_id tid, table_streams new_tabl
_tablet_streams[tid] = make_lw_shared(std::move(new_table_map));
}
void cdc::metadata::append_tablet_streams_map(table_id tid, table_streams new_table_map) {
_tablet_streams[tid]->insert(std::make_move_iterator(new_table_map.begin()), std::make_move_iterator(new_table_map.end()));
}
void cdc::metadata::remove_tablet_streams_map(table_id tid) {
_tablet_streams.erase(tid);
}
@@ -314,8 +318,8 @@ std::vector<table_id> cdc::metadata::get_tables_with_cdc_tablet_streams() const
return _tablet_streams | std::views::keys | std::ranges::to<std::vector<table_id>>();
}
future<cdc::cdc_stream_diff> cdc::metadata::generate_stream_diff(const std::vector<stream_id>& before, const std::vector<stream_id>& after) {
std::vector<stream_id> closed, opened;
future<cdc::cdc_stream_diff> cdc::metadata::generate_stream_diff(const utils::chunked_vector<stream_id>& before, const utils::chunked_vector<stream_id>& after) {
utils::chunked_vector<stream_id> closed, opened;
auto before_it = before.begin();
auto after_it = after.begin();
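
Reviewer note: the contract described in `construct_next_stream_set` (sorted inputs, result is the previous set minus the closed streams plus the opened streams, still sorted) can be sketched with standard algorithms. This is not the implementation in the diff, which walks the ranges by hand and reserves gently; here `int` stands in for `cdc::stream_id`, `next_stream_set` is a hypothetical name, and `std::set_difference` plus `std::merge` replace the manual loop.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Assumes all three inputs are sorted and that `closed` is a subset of `prev`.
std::vector<int> next_stream_set(const std::vector<int>& prev,
                                 std::vector<int> opened,
                                 const std::vector<int>& closed) {
    assert(std::is_sorted(prev.begin(), prev.end()));
    assert(std::is_sorted(opened.begin(), opened.end()));
    assert(std::is_sorted(closed.begin(), closed.end()));

    if (closed.size() == prev.size()) {
        // Every previous stream is closed: the next set is just the opened streams.
        return opened;
    }

    // Keep the previous streams that were not closed...
    std::vector<int> kept;
    kept.reserve(prev.size() - closed.size());
    std::set_difference(prev.begin(), prev.end(), closed.begin(), closed.end(),
                        std::back_inserter(kept));

    // ...and merge in the opened streams, preserving sort order.
    std::vector<int> next;
    next.reserve(kept.size() + opened.size());
    std::merge(kept.begin(), kept.end(), opened.begin(), opened.end(),
               std::back_inserter(next));
    return next;
}
```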

View File

@@ -37,7 +37,9 @@ class metadata final {
using container_t = std::map<api::timestamp_type, std::optional<topology_description>>;
container_t _gens;
using table_streams_ptr = lw_shared_ptr<const table_streams>;
// per-table streams map for tables in tablets-based keyspaces.
// the streams map is shared with the virtual tables reader, hence we can only insert new entries to it, not erase.
using table_streams_ptr = lw_shared_ptr<table_streams>;
using tablet_streams_map = std::unordered_map<table_id, table_streams_ptr>;
tablet_streams_map _tablet_streams;
@@ -47,7 +49,7 @@ class metadata final {
container_t::const_iterator gen_used_at(api::timestamp_type ts) const;
const std::vector<stream_id>& get_tablet_stream_set(table_id tid, api::timestamp_type ts) const;
const utils::chunked_vector<stream_id>& get_tablet_stream_set(table_id tid, api::timestamp_type ts) const;
public:
/* Is a generation with the given timestamp already known or obsolete? It is obsolete if and only if
@@ -100,6 +102,7 @@ public:
bool prepare(db_clock::time_point ts);
void load_tablet_streams_map(table_id tid, table_streams new_table_map);
void append_tablet_streams_map(table_id tid, table_streams new_table_map);
void remove_tablet_streams_map(table_id tid);
const tablet_streams_map& get_all_tablet_streams() const {
@@ -108,14 +111,14 @@ public:
std::vector<table_id> get_tables_with_cdc_tablet_streams() const;
static future<std::vector<stream_id>> construct_next_stream_set(
const std::vector<cdc::stream_id>& prev_stream_set,
std::vector<cdc::stream_id> opened,
const std::vector<cdc::stream_id>& closed);
static future<utils::chunked_vector<stream_id>> construct_next_stream_set(
const utils::chunked_vector<cdc::stream_id>& prev_stream_set,
utils::chunked_vector<cdc::stream_id> opened,
const utils::chunked_vector<cdc::stream_id>& closed);
static future<cdc_stream_diff> generate_stream_diff(
const std::vector<stream_id>& before,
const std::vector<stream_id>& after);
const utils::chunked_vector<stream_id>& before,
const utils::chunked_vector<stream_id>& after);
};
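
Reviewer note: the comment on `table_streams_ptr` says the map is shared with a reader and may only grow. A small sketch of that append-only pattern follows, using `std::shared_ptr` and `std::map` in place of `lw_shared_ptr` and `table_streams`; `table_map` and `append_entries` are hypothetical names chosen for illustration.

```cpp
#include <cassert>
#include <iterator>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for table_streams.
using table_map = std::map<long, std::string>;

// Moves the entries of `fresh` into the shared map. Range insert never
// overwrites or erases an existing key, so iterators and references held by
// other owners of `existing` remain valid.
void append_entries(const std::shared_ptr<table_map>& existing, table_map fresh) {
    assert(existing);
    existing->insert(std::make_move_iterator(fresh.begin()),
                     std::make_move_iterator(fresh.end()));
}
```

A reader that captured the pointer before the append sees the new entries afterwards, because both sides refer to the same map object.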

View File

@@ -1506,13 +1506,15 @@ future<> compaction_manager::maybe_wait_for_sstable_count_reduction(compaction_g
co_return;
}
auto num_runs_for_compaction = [&, this] -> future<size_t> {
auto& cs = t.get_compaction_strategy();
auto cs = t.get_compaction_strategy();
auto desc = co_await cs.get_sstables_for_compaction(t, get_strategy_control());
co_return std::ranges::size(desc.sstables
| std::views::transform(std::mem_fn(&sstables::sstable::run_identifier))
| std::ranges::to<std::unordered_set>());
};
const auto threshold = size_t(std::max(schema->max_compaction_threshold(), 32));
const auto threshold = utils::get_local_injector().inject_parameter<size_t>("set_sstable_count_reduction_threshold")
.value_or(size_t(std::max(schema->max_compaction_threshold(), 32)));
auto count = co_await num_runs_for_compaction();
if (count <= threshold) {
cmlog.trace("No need to wait for sstable count reduction in {}: {} <= {}",
@@ -1527,9 +1529,7 @@ future<> compaction_manager::maybe_wait_for_sstable_count_reduction(compaction_g
auto& cstate = get_compaction_state(&t);
try {
while (can_perform_regular_compaction(t) && co_await num_runs_for_compaction() > threshold) {
co_await cstate.compaction_done.wait([this, &t] {
return !can_perform_regular_compaction(t);
});
co_await cstate.compaction_done.wait();
}
} catch (const broken_condition_variable&) {
co_return;
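
Reviewer note: the new threshold expression prefers an injected test value and otherwise falls back to the schema-derived default. The sketch below isolates that fallback; `effective_threshold` is a hypothetical helper, and the `std::optional` argument stands in for whatever `inject_parameter<size_t>` returns in the real code.

```cpp
#include <algorithm>
#include <cstddef>
#include <optional>

// Uses the injected value when present; otherwise max(schema threshold, 32).
std::size_t effective_threshold(std::optional<std::size_t> injected, int max_compaction_threshold) {
    return injected.value_or(std::size_t(std::max(max_compaction_threshold, 32)));
}
```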

View File

@@ -804,9 +804,9 @@ compaction_strategy_state compaction_strategy_state::make(const compaction_strat
case compaction_strategy_type::incremental:
return compaction_strategy_state(default_empty_state{});
case compaction_strategy_type::leveled:
return compaction_strategy_state(leveled_compaction_strategy_state{});
return compaction_strategy_state(seastar::make_shared<leveled_compaction_strategy_state>());
case compaction_strategy_type::time_window:
return compaction_strategy_state(time_window_compaction_strategy_state{});
return compaction_strategy_state(seastar::make_shared<time_window_compaction_strategy_state>());
default:
throw std::runtime_error("strategy not supported");
}

View File

@@ -18,7 +18,7 @@ namespace compaction {
class compaction_strategy_state {
public:
struct default_empty_state {};
using states_variant = std::variant<default_empty_state, leveled_compaction_strategy_state, time_window_compaction_strategy_state>;
using states_variant = std::variant<default_empty_state, leveled_compaction_strategy_state_ptr, time_window_compaction_strategy_state_ptr>;
private:
states_variant _state;
public:

View File

@@ -14,12 +14,12 @@
namespace compaction {
leveled_compaction_strategy_state& leveled_compaction_strategy::get_state(compaction_group_view& table_s) const {
return table_s.get_compaction_strategy_state().get<leveled_compaction_strategy_state>();
leveled_compaction_strategy_state_ptr leveled_compaction_strategy::get_state(compaction_group_view& table_s) const {
return table_s.get_compaction_strategy_state().get<leveled_compaction_strategy_state_ptr>();
}
future<compaction_descriptor> leveled_compaction_strategy::get_sstables_for_compaction(compaction_group_view& table_s, strategy_control& control) {
auto& state = get_state(table_s);
auto state = get_state(table_s);
auto candidates = co_await control.candidates(table_s);
// NOTE: leveled_manifest creation may be slightly expensive, so later on,
// we may want to store it in the strategy itself. However, the sstable
@@ -27,10 +27,10 @@ future<compaction_descriptor> leveled_compaction_strategy::get_sstables_for_comp
// sstable in it may be marked for deletion after compacted.
// Currently, we create a new manifest whenever it's time for compaction.
leveled_manifest manifest = leveled_manifest::create(table_s, candidates, _max_sstable_size_in_mb, _stcs_options);
if (!state.last_compacted_keys) {
generate_last_compacted_keys(state, manifest);
if (!state->last_compacted_keys) {
generate_last_compacted_keys(*state, manifest);
}
auto candidate = manifest.get_compaction_candidates(*state.last_compacted_keys, state.compaction_counter);
auto candidate = manifest.get_compaction_candidates(*state->last_compacted_keys, state->compaction_counter);
if (!candidate.sstables.empty()) {
auto main_set = co_await table_s.main_sstable_set();
@@ -78,12 +78,12 @@ compaction_descriptor leveled_compaction_strategy::get_major_compaction_job(comp
}
void leveled_compaction_strategy::notify_completion(compaction_group_view& table_s, const std::vector<sstables::shared_sstable>& removed, const std::vector<sstables::shared_sstable>& added) {
auto& state = get_state(table_s);
auto state = get_state(table_s);
// All the update here is only relevant for regular compaction's round-robin picking policy, and if
// last_compacted_keys wasn't generated by regular, it means regular is disabled since last restart,
// therefore we can skip the updates here until regular runs for the first time. Once it runs,
// it will be able to generate last_compacted_keys correctly by looking at metadata of files.
if (removed.empty() || added.empty() || !state.last_compacted_keys) {
if (removed.empty() || added.empty() || !state->last_compacted_keys) {
return;
}
auto min_level = std::numeric_limits<uint32_t>::max();
@@ -99,16 +99,16 @@ void leveled_compaction_strategy::notify_completion(compaction_group_view& table
}
target_level = std::max(target_level, int(candidate->get_sstable_level()));
}
state.last_compacted_keys.value().at(min_level) = last->get_last_decorated_key();
state->last_compacted_keys.value().at(min_level) = last->get_last_decorated_key();
for (int i = leveled_manifest::MAX_LEVELS - 1; i > 0; i--) {
state.compaction_counter[i]++;
state->compaction_counter[i]++;
}
state.compaction_counter[target_level] = 0;
state->compaction_counter[target_level] = 0;
if (leveled_manifest::logger.level() == logging::log_level::debug) {
for (auto j = 0U; j < state.compaction_counter.size(); j++) {
leveled_manifest::logger.debug("CompactionCounter: {}: {}", j, state.compaction_counter[j]);
for (auto j = 0U; j < state->compaction_counter.size(); j++) {
leveled_manifest::logger.debug("CompactionCounter: {}: {}", j, state->compaction_counter[j]);
}
}
}
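
Reviewer note: `get_state` now returns a shared pointer by value instead of a reference. One plausible reading, not stated in the hunks shown here, is that a coroutine keeps the state alive across suspension points even if the owning object replaces it in the meantime. The sketch below shows that lifetime property without coroutines; `strategy_state`, `group`, `reset_state`, and `bump_after_reset` are hypothetical names, and `std::shared_ptr` stands in for `seastar::shared_ptr`.

```cpp
#include <cassert>
#include <memory>

struct strategy_state {
    int compaction_counter = 0;
};

class group {
    std::shared_ptr<strategy_state> _state = std::make_shared<strategy_state>();
public:
    // Returning the pointer by value gives the caller its own reference count.
    std::shared_ptr<strategy_state> get_state() const { return _state; }
    // Stands in for the owner swapping its state, e.g. on a strategy change.
    void reset_state() { _state = std::make_shared<strategy_state>(); }
};

int bump_after_reset(group& g) {
    auto state = g.get_state(); // owning copy, like `auto state = get_state(table_s);`
    g.reset_state();            // would have invalidated a plain reference
    assert(state);
    return ++state->compaction_counter; // still safe: we own the old state
}
```

The caller's update lands on the old state object; the owner's fresh state is untouched.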

View File

@@ -36,6 +36,8 @@ struct leveled_compaction_strategy_state {
leveled_compaction_strategy_state();
};
using leveled_compaction_strategy_state_ptr = seastar::shared_ptr<leveled_compaction_strategy_state>;
class leveled_compaction_strategy : public compaction_strategy_impl {
static constexpr int32_t DEFAULT_MAX_SSTABLE_SIZE_IN_MB = 160;
static constexpr auto SSTABLE_SIZE_OPTION = "sstable_size_in_mb";
@@ -45,7 +47,7 @@ class leveled_compaction_strategy : public compaction_strategy_impl {
private:
int32_t calculate_max_sstable_size_in_mb(std::optional<sstring> option_value) const;
leveled_compaction_strategy_state& get_state(compaction_group_view& table_s) const;
leveled_compaction_strategy_state_ptr get_state(compaction_group_view& table_s) const;
public:
static unsigned ideal_level_for_input(const std::vector<sstables::shared_sstable>& input, uint64_t max_sstable_size);
static void validate_options(const std::map<sstring, sstring>& options, std::map<sstring, sstring>& unchecked_options);

View File

@@ -13,6 +13,7 @@
#include "sstables/sstables.hh"
#include "sstables/sstable_set_impl.hh"
#include "compaction_strategy_state.hh"
#include "utils/error_injection.hh"
#include <ranges>
@@ -22,8 +23,8 @@ extern logging::logger clogger;
using timestamp_type = api::timestamp_type;
time_window_compaction_strategy_state& time_window_compaction_strategy::get_state(compaction_group_view& table_s) const {
return table_s.get_compaction_strategy_state().get<time_window_compaction_strategy_state>();
time_window_compaction_strategy_state_ptr time_window_compaction_strategy::get_state(compaction_group_view& table_s) const {
return table_s.get_compaction_strategy_state().get<time_window_compaction_strategy_state_ptr>();
}
const std::unordered_map<sstring, std::chrono::seconds> time_window_compaction_strategy_options::valid_window_units = {
@@ -335,7 +336,7 @@ time_window_compaction_strategy::get_reshaping_job(std::vector<sstables::shared_
future<compaction_descriptor>
time_window_compaction_strategy::get_sstables_for_compaction(compaction_group_view& table_s, strategy_control& control) {
auto& state = get_state(table_s);
auto state = get_state(table_s);
auto compaction_time = gc_clock::now();
auto candidates = co_await control.candidates(table_s);
@@ -344,7 +345,7 @@ time_window_compaction_strategy::get_sstables_for_compaction(compaction_group_vi
}
auto now = db_clock::now();
if (now - state.last_expired_check > _options.expired_sstable_check_frequency) {
if (now - state->last_expired_check > _options.expired_sstable_check_frequency) {
clogger.debug("[{}] TWCS expired check sufficiently far in the past, checking for fully expired SSTables", fmt::ptr(this));
// Find fully expired SSTables. Those will be included no matter what.
@@ -356,12 +357,14 @@ time_window_compaction_strategy::get_sstables_for_compaction(compaction_group_vi
// Keep checking for fully_expired_sstables until we don't find
// any among the candidates, meaning they are either already compacted
// or registered for compaction.
state.last_expired_check = now;
state->last_expired_check = now;
} else {
clogger.debug("[{}] TWCS skipping check for fully expired SSTables", fmt::ptr(this));
}
auto compaction_candidates = get_next_non_expired_sstables(table_s, control, std::move(candidates), compaction_time);
co_await utils::get_local_injector().inject("twcs_get_sstables_for_compaction", utils::wait_for_message(30s));
auto compaction_candidates = get_next_non_expired_sstables(table_s, control, std::move(candidates), compaction_time, *state);
clogger.debug("[{}] Going to compact {} non-expired sstables", fmt::ptr(this), compaction_candidates.size());
co_return compaction_descriptor(std::move(compaction_candidates));
}
@@ -384,8 +387,8 @@ time_window_compaction_strategy::compaction_mode(const time_window_compaction_st
std::vector<sstables::shared_sstable>
time_window_compaction_strategy::get_next_non_expired_sstables(compaction_group_view& table_s, strategy_control& control,
std::vector<sstables::shared_sstable> non_expiring_sstables, gc_clock::time_point compaction_time) {
auto most_interesting = get_compaction_candidates(table_s, control, non_expiring_sstables);
std::vector<sstables::shared_sstable> non_expiring_sstables, gc_clock::time_point compaction_time, time_window_compaction_strategy_state& state) {
auto most_interesting = get_compaction_candidates(table_s, control, non_expiring_sstables, state);
if (!most_interesting.empty()) {
return most_interesting;
@@ -410,14 +413,14 @@ time_window_compaction_strategy::get_next_non_expired_sstables(compaction_group_
}
std::vector<sstables::shared_sstable>
time_window_compaction_strategy::get_compaction_candidates(compaction_group_view& table_s, strategy_control& control, std::vector<sstables::shared_sstable> candidate_sstables) {
auto& state = get_state(table_s);
time_window_compaction_strategy::get_compaction_candidates(compaction_group_view& table_s, strategy_control& control,
std::vector<sstables::shared_sstable> candidate_sstables, time_window_compaction_strategy_state& state) {
auto [buckets, max_timestamp] = get_buckets(std::move(candidate_sstables), _options);
// Update the highest window seen, if necessary
state.highest_window_seen = std::max(state.highest_window_seen, max_timestamp);
return newest_bucket(table_s, control, std::move(buckets), table_s.min_compaction_threshold(), table_s.schema()->max_compaction_threshold(),
state.highest_window_seen);
state.highest_window_seen, state);
}
timestamp_type
@@ -465,8 +468,7 @@ namespace compaction {
std::vector<sstables::shared_sstable>
time_window_compaction_strategy::newest_bucket(compaction_group_view& table_s, strategy_control& control, std::map<timestamp_type, std::vector<sstables::shared_sstable>> buckets,
int min_threshold, int max_threshold, timestamp_type now) {
auto& state = get_state(table_s);
int min_threshold, int max_threshold, timestamp_type now, time_window_compaction_strategy_state& state) {
clogger.debug("time_window_compaction_strategy::newest_bucket:\n now {}\n{}", now, buckets);
for (auto&& [key, bucket] : buckets | std::views::reverse) {
@@ -517,7 +519,7 @@ time_window_compaction_strategy::trim_to_threshold(std::vector<sstables::shared_
}
future<int64_t> time_window_compaction_strategy::estimated_pending_compactions(compaction_group_view& table_s) const {
auto& state = get_state(table_s);
auto state = get_state(table_s);
auto min_threshold = table_s.min_compaction_threshold();
auto max_threshold = table_s.schema()->max_compaction_threshold();
auto main_set = co_await table_s.main_sstable_set();
@@ -526,7 +528,7 @@ future<int64_t> time_window_compaction_strategy::estimated_pending_compactions(c
int64_t n = 0;
for (auto& [bucket_key, bucket] : buckets) {
switch (compaction_mode(state, bucket, bucket_key, max_timestamp, min_threshold)) {
switch (compaction_mode(*state, bucket, bucket_key, max_timestamp, min_threshold)) {
case bucket_compaction_mode::size_tiered:
n += size_tiered_compaction_strategy::estimated_pending_compactions(bucket, min_threshold, max_threshold, _stcs_options);
break;
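
Note on the hunks above: `get_state()` now returns a `seastar::shared_ptr` by value instead of a reference, and the state is passed down explicitly rather than looked up again in each helper. The patch does not state its motivation here. One plausible reading is that a coroutine holding a plain reference across a `co_await` (such as the new injection point) could be left with a dangling reference if the owner replaces the state in the meantime. The sketch below illustrates that reading only. It uses `std::shared_ptr` in place of `seastar::shared_ptr`, and `strategy_state`, `state_registry` and `update_after_reset` are hypothetical names that do not appear in the patch.

```cpp
#include <cassert>
#include <memory>
#include <unordered_map>

// Hypothetical stand-in for time_window_compaction_strategy_state.
struct strategy_state {
    long last_expired_check = 0;
};

// Hypothetical owner of the per-group state (not the real compaction_group_view).
class state_registry {
    std::unordered_map<int, std::shared_ptr<strategy_state>> _states;
public:
    // Returned by value, so the caller co-owns the state.
    std::shared_ptr<strategy_state> get(int group) {
        auto& slot = _states[group];
        if (!slot) {
            slot = std::make_shared<strategy_state>();
        }
        return slot;
    }
    // Models the owner replacing its state while a caller is suspended.
    void reset(int group) {
        _states[group] = std::make_shared<strategy_state>();
    }
};

// Simulates a task that fetches the state, is "suspended" while the owner
// resets it, then resumes and writes through its own copy of the pointer.
inline long update_after_reset(state_registry& reg, int group, long now) {
    auto state = reg.get(group);      // shared ownership, not a reference
    reg.reset(group);                 // the owner drops its reference here
    state->last_expired_check = now;  // still valid: our copy keeps it alive
    return state->last_expired_check;
}
```

With a `strategy_state&` instead of the pointer copy, the write after `reset()` would touch a destroyed object. The trade-off is that the late write lands in the old state, which the owner no longer sees.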

@@ -67,6 +67,8 @@ struct time_window_compaction_strategy_state {
std::unordered_set<api::timestamp_type> recent_active_windows;
};
using time_window_compaction_strategy_state_ptr = seastar::shared_ptr<time_window_compaction_strategy_state>;
class time_window_compaction_strategy : public compaction_strategy_impl {
time_window_compaction_strategy_options _options;
size_tiered_compaction_strategy_options _stcs_options;
@@ -87,7 +89,7 @@ public:
static void validate_options(const std::map<sstring, sstring>& options, std::map<sstring, sstring>& unchecked_options);
private:
time_window_compaction_strategy_state& get_state(compaction_group_view& table_s) const;
time_window_compaction_strategy_state_ptr get_state(compaction_group_view& table_s) const;
static api::timestamp_type
to_timestamp_type(time_window_compaction_strategy_options::timestamp_resolutions resolution, int64_t timestamp_from_sstable) {
@@ -110,9 +112,11 @@ private:
compaction_mode(const time_window_compaction_strategy_state&, const bucket_t& bucket, api::timestamp_type bucket_key, api::timestamp_type now, size_t min_threshold) const;
std::vector<sstables::shared_sstable>
get_next_non_expired_sstables(compaction_group_view& table_s, strategy_control& control, std::vector<sstables::shared_sstable> non_expiring_sstables, gc_clock::time_point compaction_time);
get_next_non_expired_sstables(compaction_group_view& table_s, strategy_control& control, std::vector<sstables::shared_sstable> non_expiring_sstables,
gc_clock::time_point compaction_time, time_window_compaction_strategy_state& state);
std::vector<sstables::shared_sstable> get_compaction_candidates(compaction_group_view& table_s, strategy_control& control, std::vector<sstables::shared_sstable> candidate_sstables);
std::vector<sstables::shared_sstable> get_compaction_candidates(compaction_group_view& table_s, strategy_control& control,
std::vector<sstables::shared_sstable> candidate_sstables, time_window_compaction_strategy_state& state);
public:
// Find the lowest timestamp for window of given size
static api::timestamp_type
@@ -126,7 +130,7 @@ public:
std::vector<sstables::shared_sstable>
newest_bucket(compaction_group_view& table_s, strategy_control& control, std::map<api::timestamp_type, std::vector<sstables::shared_sstable>> buckets,
int min_threshold, int max_threshold, api::timestamp_type now);
int min_threshold, int max_threshold, api::timestamp_type now, time_window_compaction_strategy_state& state);
static std::vector<sstables::shared_sstable>
trim_to_threshold(std::vector<sstables::shared_sstable> bucket, int max_threshold);

@@ -855,7 +855,7 @@ maintenance_socket: ignore
# enable_create_table_with_compact_storage: false
# Control tablets for new keyspaces.
# Can be set to: disabled|enabled
# Can be set to: disabled|enabled|enforced
#
# When enabled, newly created keyspaces will have tablets enabled by default.
# That can be explicitly disabled in the CREATE KEYSPACE query
@@ -888,9 +888,18 @@ rf_rack_valid_keyspaces: false
#
# Vector Store options
#
# A comma-separated list of URIs for the vector store using DNS name. Only HTTP schema is supported. Port number is mandatory.
# Default is empty, which means that the vector store is not used.
# HTTP and HTTPS schemes are supported. Port number is mandatory.
# If both `vector_store_primary_uri` and `vector_store_secondary_uri` are unset or empty, vector search is disabled.
#
# A comma-separated list of primary vector store node URIs. These nodes are preferred for vector search operations.
# vector_store_primary_uri: http://vector-store.dns.name:{port}
#
# A comma-separated list of secondary vector store node URIs. These nodes are used as a fallback when all primary nodes are unavailable, and are typically located in a different availability zone for high availability.
# vector_store_secondary_uri: http://vector-store.dns.name:{port}
#
# Options for encrypted connections to the vector store. These options are used for HTTPS URIs in vector_store_primary_uri and vector_store_secondary_uri.
# vector_store_encryption_options:
# truststore: <not set, use system trust>
#
# io-streaming rate limiting
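
Note on the vector store options above: the comments describe secondary URIs as a fallback used only when all primary nodes are unavailable, and say that vector search is disabled when both lists are empty. A minimal sketch of that selection rule follows, assuming availability is known per URI. `pick_candidates` and the `is_available` callback are hypothetical and are not taken from the vector store client in this patch.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical helper: returns the URIs a request may be sent to.
// Primaries are preferred; secondaries are considered only when no
// primary is available. An empty result means vector search is unusable.
inline std::vector<std::string> pick_candidates(
        const std::vector<std::string>& primary,
        const std::vector<std::string>& secondary,
        const std::function<bool(const std::string&)>& is_available) {
    std::vector<std::string> out;
    for (const auto& uri : primary) {
        if (is_available(uri)) {
            out.push_back(uri);
        }
    }
    if (!out.empty()) {
        return out;
    }
    for (const auto& uri : secondary) {
        if (is_available(uri)) {
            out.push_back(uri);
        }
    }
    return out;
}
```

The real client may well retry, load-balance or track node health differently; the sketch only captures the "primary first, secondary as fallback" ordering stated in the configuration comments.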

@@ -640,7 +640,8 @@ raft_tests = set([
vector_search_tests = set([
'test/vector_search/vector_store_client_test',
'test/vector_search/load_balancer_test'
'test/vector_search/load_balancer_test',
'test/vector_search/client_test'
])
wasms = set([
@@ -1078,7 +1079,6 @@ scylla_core = (['message/messaging_service.cc',
'utils/s3/client.cc',
'utils/s3/retryable_http_client.cc',
'utils/s3/retry_strategy.cc',
'utils/s3/s3_retry_strategy.cc',
'utils/s3/credentials_providers/aws_credentials_provider.cc',
'utils/s3/credentials_providers/environment_aws_credentials_provider.cc',
'utils/s3/credentials_providers/instance_profile_credentials_provider.cc',
@@ -1263,6 +1263,9 @@ scylla_core = (['message/messaging_service.cc',
'utils/disk_space_monitor.cc',
'vector_search/vector_store_client.cc',
'vector_search/dns.cc',
'vector_search/client.cc',
'vector_search/clients.cc',
'vector_search/truststore.cc'
] + [Antlr3Grammar('cql3/Cql.g')] \
+ scylla_raft_core
)
@@ -1570,6 +1573,7 @@ deps['test/boost/combined_tests'] += [
'test/boost/query_processor_test.cc',
'test/boost/reader_concurrency_semaphore_test.cc',
'test/boost/repair_test.cc',
'test/boost/replicator_test.cc',
'test/boost/restrictions_test.cc',
'test/boost/role_manager_test.cc',
'test/boost/row_cache_test.cc',
@@ -1657,6 +1661,7 @@ deps['test/raft/discovery_test'] = ['test/raft/discovery_test.cc',
deps['test/vector_search/vector_store_client_test'] = ['test/vector_search/vector_store_client_test.cc'] + scylla_tests_dependencies
deps['test/vector_search/load_balancer_test'] = ['test/vector_search/load_balancer_test.cc'] + scylla_tests_dependencies
deps['test/vector_search/client_test'] = ['test/vector_search/client_test.cc'] + scylla_tests_dependencies
wasm_deps = {}

@@ -1224,7 +1224,7 @@ listPermissionsStatement returns [std::unique_ptr<list_permissions_statement> st
;
permission returns [auth::permission perm = auth::permission{}]
: p=(K_CREATE | K_ALTER | K_DROP | K_SELECT | K_MODIFY | K_AUTHORIZE | K_DESCRIBE | K_EXECUTE)
: p=(K_CREATE | K_ALTER | K_DROP | K_SELECT | K_MODIFY | K_AUTHORIZE | K_DESCRIBE | K_EXECUTE | K_VECTOR_SEARCH_INDEXING)
{ $perm = auth::permissions::from_string($p.text); }
;
@@ -2398,6 +2398,8 @@ K_EXECUTE: E X E C U T E;
K_MUTATION_FRAGMENTS: M U T A T I O N '_' F R A G M E N T S;
K_VECTOR_SEARCH_INDEXING: V E C T O R '_' S E A R C H '_' I N D E X I N G;
// Case-insensitive alpha characters
fragment A: ('a'|'A');
fragment B: ('b'|'B');

@@ -1349,7 +1349,7 @@ static managed_bytes reserialize_value(View value_bytes,
if (type.is_map()) {
std::vector<std::pair<managed_bytes, managed_bytes>> elements = partially_deserialize_map(value_bytes);
const map_type_impl mapt = dynamic_cast<const map_type_impl&>(type);
const map_type_impl& mapt = dynamic_cast<const map_type_impl&>(type);
const abstract_type& key_type = mapt.get_keys_type()->without_reversed();
const abstract_type& value_type = mapt.get_values_type()->without_reversed();
@@ -1391,7 +1391,7 @@ static managed_bytes reserialize_value(View value_bytes,
const vector_type_impl& vtype = dynamic_cast<const vector_type_impl&>(type);
std::vector<managed_bytes> elements = vtype.split_fragmented(value_bytes);
auto elements_type = vtype.get_elements_type()->without_reversed();
const auto& elements_type = vtype.get_elements_type()->without_reversed();
if (elements_type.bound_value_needs_to_be_reserialized()) {
for (size_t i = 0; i < elements.size(); i++) {
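
Note on the two hunks above: both replace a by-value declaration with a reference (`const map_type_impl&` and `const auto&`). Initializing a non-reference variable from an expression of reference type makes a copy, and plain `auto` drops both `const` and the reference. The sketch below counts copies to show the difference. It assumes the accessors involved return references; `counted`, `get_ref` and the `copies_made_*` helpers are hypothetical names used only for illustration.

```cpp
#include <cassert>

// Global copy counter, wrapped in a function to keep the sketch header-safe.
inline int& copy_counter() {
    static int n = 0;
    return n;
}

// Hypothetical type that records every copy construction.
struct counted {
    counted() = default;
    counted(const counted&) { ++copy_counter(); }
};

// Hypothetical accessor returning a reference, like a type getter might.
inline const counted& get_ref() {
    static counted instance;
    return instance;
}

inline int copies_made_by_value() {
    int before = copy_counter();
    const counted by_value = get_ref();  // copy-constructs a new object
    (void)by_value;
    return copy_counter() - before;
}

inline int copies_made_by_plain_auto() {
    int before = copy_counter();
    auto deduced = get_ref();            // auto deduces `counted`: also a copy
    (void)deduced;
    return copy_counter() - before;
}

inline int copies_made_by_reference() {
    int before = copy_counter();
    const auto& by_ref = get_ref();      // binds to the existing object
    (void)by_ref;
    return copy_counter() - before;
}
```

For polymorphic types, the by-value form can also slice the object down to its declared type, which is a second reason to prefer the reference.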

@@ -1322,6 +1322,10 @@ const std::vector<expr::expression>& statement_restrictions::index_restrictions(
return _index_restrictions;
}
bool statement_restrictions::is_empty() const {
return !_where.has_value();
}
// Current score table:
// local and restrictions include full partition key: 2
// global: 1

@@ -408,6 +408,8 @@ public:
/// Checks that the primary key restrictions don't contain null values, throws invalid_request_exception otherwise.
void validate_primary_key(const query_options& options) const;
bool is_empty() const;
};
statement_restrictions analyze_statement_restrictions(

@@ -422,7 +422,14 @@ std::pair<schema_ptr, std::vector<view_ptr>> alter_table_statement::prepare_sche
throw exceptions::invalid_request_exception(format("The synchronous_updates option is only applicable to materialized views, not to base tables"));
}
_properties->apply_to_builder(cfm, std::move(schema_extensions), db, keyspace());
if (is_cdc_log_table) {
auto gc_opts = _properties->get_tombstone_gc_options(schema_extensions);
if (gc_opts && gc_opts->mode() == tombstone_gc_mode::repair) {
throw exceptions::invalid_request_exception("The 'repair' mode for tombstone_gc is not allowed on CDC log tables.");
}
}
_properties->apply_to_builder(cfm, std::move(schema_extensions), db, keyspace(), !is_cdc_log_table);
}
break;

@@ -55,8 +55,29 @@ view_ptr alter_view_statement::prepare_view(data_dictionary::database db) const
auto schema_extensions = _properties->make_schema_extensions(db.extensions());
_properties->validate(db, keyspace(), schema_extensions);
bool is_colocated = [&] {
if (!db.find_keyspace(keyspace()).get_replication_strategy().uses_tablets()) {
return false;
}
auto base_schema = db.find_schema(schema->view_info()->base_id());
if (!base_schema) {
return false;
}
return std::ranges::equal(
schema->partition_key_columns(),
base_schema->partition_key_columns(),
[](const column_definition& a, const column_definition& b) { return a.name() == b.name(); });
}();
if (is_colocated) {
auto gc_opts = _properties->get_tombstone_gc_options(schema_extensions);
if (gc_opts && gc_opts->mode() == tombstone_gc_mode::repair) {
throw exceptions::invalid_request_exception("The 'repair' mode for tombstone_gc is not allowed on co-located materialized view tables.");
}
}
auto builder = schema_builder(schema);
_properties->apply_to_builder(builder, std::move(schema_extensions), db, keyspace());
_properties->apply_to_builder(builder, std::move(schema_extensions), db, keyspace(), !is_colocated);
if (builder.get_gc_grace_seconds() == 0) {
throw exceptions::invalid_request_exception(
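
Note on the `is_colocated` lambda above: a view is treated as co-located when the keyspace uses tablets and the view's partition key columns match the base table's, in order. The patch compares column names with `std::ranges::equal`; the sketch below expresses the same rule with the four-iterator `std::equal` so it also builds before C++20. `column`, `same_partition_key` and this free-standing `is_colocated` are hypothetical simplifications, not the schema types used by the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for column_definition; only the name matters here.
struct column {
    std::string name;
};

// True when both keys have the same length and the same names in the same order.
inline bool same_partition_key(const std::vector<column>& view_pk,
                               const std::vector<column>& base_pk) {
    return std::equal(view_pk.begin(), view_pk.end(),
                      base_pk.begin(), base_pk.end(),
                      [](const column& a, const column& b) { return a.name == b.name; });
}

// Co-location additionally requires a tablets-based keyspace.
inline bool is_colocated(bool uses_tablets,
                         const std::vector<column>& view_pk,
                         const std::vector<column>& base_pk) {
    return uses_tablets && same_partition_key(view_pk, base_pk);
}
```

The four-iterator overload matters: the three-iterator form of `std::equal` does not check the second range's length. In the patch, the result feeds both the rejection of `tombstone_gc` mode `repair` and the `supports_repair` argument of `apply_to_builder`.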

@@ -136,9 +136,7 @@ void cf_prop_defs::validate(const data_dictionary::database db, sstring ks_name,
throw exceptions::configuration_exception(sstring("Missing sub-option '") + compression_parameters::SSTABLE_COMPRESSION + "' for the '" + KW_COMPRESSION + "' option.");
}
compression_parameters cp(*compression_options);
cp.validate(
compression_parameters::dicts_feature_enabled(bool(db.features().sstable_compression_dicts)),
compression_parameters::dicts_usage_allowed(db.get_config().sstable_compression_dictionaries_allow_in_ddl()));
cp.validate(compression_parameters::dicts_feature_enabled(bool(db.features().sstable_compression_dicts)));
}
auto per_partition_rate_limit_options = get_per_partition_rate_limit_options(schema_extensions);
@@ -286,7 +284,7 @@ std::optional<db::tablet_options::map_type> cf_prop_defs::get_tablet_options() c
return std::nullopt;
}
void cf_prop_defs::apply_to_builder(schema_builder& builder, schema::extensions_map schema_extensions, const data_dictionary::database& db, sstring ks_name) const {
void cf_prop_defs::apply_to_builder(schema_builder& builder, schema::extensions_map schema_extensions, const data_dictionary::database& db, sstring ks_name, bool supports_repair) const {
if (has_property(KW_COMMENT)) {
builder.set_comment(get_string(KW_COMMENT, ""));
}
@@ -372,7 +370,7 @@ void cf_prop_defs::apply_to_builder(schema_builder& builder, schema::extensions_
}
// Set default tombstone_gc mode.
if (!schema_extensions.contains(tombstone_gc_extension::NAME)) {
auto ext = seastar::make_shared<tombstone_gc_extension>(get_default_tombstone_gc_mode(db, ks_name));
auto ext = seastar::make_shared<tombstone_gc_extension>(get_default_tombstone_gc_mode(db, ks_name, supports_repair));
schema_extensions.emplace(tombstone_gc_extension::NAME, std::move(ext));
}
builder.set_extensions(std::move(schema_extensions));

@@ -110,7 +110,7 @@ public:
bool get_synchronous_updates_flag() const;
std::optional<db::tablet_options::map_type> get_tablet_options() const;
void apply_to_builder(schema_builder& builder, schema::extensions_map schema_extensions, const data_dictionary::database& db, sstring ks_name) const;
void apply_to_builder(schema_builder& builder, schema::extensions_map schema_extensions, const data_dictionary::database& db, sstring ks_name, bool supports_repair) const;
void validate_minimum_int(const sstring& field, int32_t minimum_value, int32_t default_value) const;
};

@@ -10,7 +10,10 @@
#include <seastar/core/coroutine.hh>
#include "create_index_statement.hh"
#include "db/config.hh"
#include "db/view/view.hh"
#include "exceptions/exceptions.hh"
#include "index/vector_index.hh"
#include "prepared_statement.hh"
#include "types/types.hh"
#include "validation.hh"
@@ -92,9 +95,17 @@ std::vector<::shared_ptr<index_target>> create_index_statement::validate_while_e
throw exceptions::invalid_request_exception(format("index names shouldn't be more than {:d} characters long (got \"{}\")", schema::NAME_LENGTH, _index_name.c_str()));
}
if (!db.features().views_with_tablets && db.find_keyspace(keyspace()).get_replication_strategy().uses_tablets()) {
throw exceptions::invalid_request_exception(format("Secondary indexes are not supported on base tables with tablets (keyspace '{}')", keyspace()));
// Regular secondary indexes require rf-rack-validity.
// Custom indexes need to validate this property themselves, if they need it.
if (!_properties || !_properties->custom_class) {
try {
db::view::validate_view_keyspace(db, keyspace());
} catch (const std::exception& e) {
// The type of the thrown exception is not specified, so we need to wrap it here.
throw exceptions::invalid_request_exception(e.what());
}
}
validate_for_local_index(*schema);
std::vector<::shared_ptr<index_target>> targets;
@@ -108,7 +119,7 @@ std::vector<::shared_ptr<index_target>> create_index_statement::validate_while_e
throw exceptions::invalid_request_exception(format("Non-supported custom class \'{}\' provided", *(_properties->custom_class)));
}
auto custom_index = (*custom_index_factory)();
custom_index->validate(*schema, *_properties, targets, db.features());
custom_index->validate(*schema, *_properties, targets, db.features(), db);
_properties->index_version = custom_index->index_version(*schema);
}
@@ -375,6 +386,15 @@ std::optional<create_index_statement::base_schema_with_new_index> create_index_s
format("Index {} is a duplicate of existing index {}", index.name(), existing_index.value().name()));
}
}
bool existing_vector_index = _properties->custom_class && _properties->custom_class == "vector_index" && secondary_index::vector_index::has_vector_index_on_column(*schema, targets[0]->column_name());
bool custom_index_with_same_name = _properties->custom_class && db.existing_index_names(keyspace()).contains(_index_name);
if (existing_vector_index || custom_index_with_same_name) {
if (_if_not_exists) {
return {};
} else {
throw exceptions::invalid_request_exception("There exists a duplicate custom index");
}
}
auto index_table_name = secondary_index::index_table_name(accepted_name);
if (db.has_schema(keyspace(), index_table_name)) {
// We print this error even if _if_not_exists - in this case the user
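
Note on the `validate_view_keyspace` call above: the in-code comment says the type of the thrown exception is not specified, so the statement catches `std::exception` and rethrows the message as `invalid_request_exception`. A minimal sketch of that translation pattern follows. `invalid_request` and `rethrow_as_invalid_request` are hypothetical names standing in for the real exception class and the inline `try`/`catch` block.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for exceptions::invalid_request_exception.
struct invalid_request : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Runs `validate` and converts any std::exception it throws into
// invalid_request, preserving the original message.
template <typename Func>
void rethrow_as_invalid_request(Func&& validate) {
    try {
        validate();
    } catch (const std::exception& e) {
        throw invalid_request(e.what());
    }
}
```

Only the message survives the translation; the original exception type is lost, which is acceptable when the caller's sole goal is to report a client-side error.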

@@ -113,8 +113,7 @@ future<std::tuple<::shared_ptr<cql_transport::event::schema_change>, utils::chun
if (rs->uses_tablets()) {
warnings.push_back(
"Tables in this keyspace will be replicated using Tablets "
"and will not support Materialized Views, Secondary Indexes and counters features. "
"To use Materialized Views, Secondary Indexes or counters, drop this keyspace and re-create it "
"and will not support counters features. To use counters, drop this keyspace and re-create it "
"without tablets by adding AND TABLETS = {'enabled': false} to the CREATE KEYSPACE statement.");
if (ksm->initial_tablets().value()) {
warnings.push_back("Keyspace `initial` tablets option is deprecated. Use per-table tablet options instead.");

@@ -31,8 +31,6 @@
#include "db/config.hh"
#include "compaction/time_window_compaction_strategy.hh"
bool is_internal_keyspace(std::string_view name);
namespace cql3 {
namespace statements {
@@ -124,11 +122,7 @@ void create_table_statement::apply_properties_to(schema_builder& builder, const
addColumnMetadataFromAliases(cfmd, Collections.singletonList(valueAlias), defaultValidator, ColumnDefinition.Kind.COMPACT_VALUE);
#endif
if (!_properties->get_compression_options() && !is_internal_keyspace(keyspace())) {
builder.set_compressor_params(db.get_config().sstable_compression_user_table_options());
}
_properties->apply_to_builder(builder, _properties->make_schema_extensions(db.extensions()), db, keyspace());
_properties->apply_to_builder(builder, _properties->make_schema_extensions(db.extensions()), db, keyspace(), true);
}
void create_table_statement::add_column_metadata_from_aliases(schema_builder& builder, std::vector<bytes> aliases, const std::vector<data_type>& types, column_kind kind) const

@@ -152,9 +152,13 @@ std::pair<view_ptr, cql3::cql_warnings_vec> create_view_statement::prepare_view(
schema_ptr schema = validation::validate_column_family(db, _base_name.get_keyspace(), _base_name.get_column_family());
if (!db.features().views_with_tablets && db.find_keyspace(keyspace()).get_replication_strategy().uses_tablets()) {
throw exceptions::invalid_request_exception(format("Materialized views are not supported on base tables with tablets"));
try {
db::view::validate_view_keyspace(db, keyspace());
} catch (const std::exception& e) {
// The type of the thrown exception is not specified, so we need to wrap it here.
throw exceptions::invalid_request_exception(e.what());
}
if (schema->is_counter()) {
throw exceptions::invalid_request_exception(format("Materialized views are not supported on counter tables"));
}
@@ -369,7 +373,30 @@ std::pair<view_ptr, cql3::cql_warnings_vec> create_view_statement::prepare_view(
db::view::create_virtual_column(builder, def->name(), def->type);
}
}
_properties.properties()->apply_to_builder(builder, std::move(schema_extensions), db, keyspace());
bool is_colocated = [&] {
if (!db.find_keyspace(keyspace()).get_replication_strategy().uses_tablets()) {
return false;
}
if (target_partition_keys.size() != schema->partition_key_columns().size()) {
return false;
}
for (size_t i = 0; i < target_partition_keys.size(); ++i) {
if (target_partition_keys[i] != &schema->partition_key_columns()[i]) {
return false;
}
}
return true;
}();
if (is_colocated) {
auto gc_opts = _properties.properties()->get_tombstone_gc_options(schema_extensions);
if (gc_opts && gc_opts->mode() == tombstone_gc_mode::repair) {
throw exceptions::invalid_request_exception("The 'repair' mode for tombstone_gc is not allowed on co-located materialized view tables.");
}
}
_properties.properties()->apply_to_builder(builder, std::move(schema_extensions), db, keyspace(), !is_colocated);
if (builder.default_time_to_live().count() > 0) {
throw exceptions::invalid_request_exception(

@@ -23,6 +23,7 @@
#include "index/vector_index.hh"
#include "schema/schema.hh"
#include "service/client_state.hh"
#include "service/paxos/paxos_state.hh"
#include "types/types.hh"
#include "cql3/query_processor.hh"
#include "cql3/cql_statement.hh"
@@ -329,6 +330,19 @@ future<std::vector<description>> table(const data_dictionary::database& db, cons
"*/",
*table_desc.create_statement);
table_desc.create_statement = std::move(os).to_managed_string();
} else if (service::paxos::paxos_store::try_get_base_table(name)) {
// Paxos state table is internally managed by Scylla and it shouldn't be exposed to the user.
// The table is allowed to be described as a comment to ease administrative work but it's hidden from all listings.
fragmented_ostringstream os{};
fmt::format_to(os.to_iter(),
"/* Do NOT execute this statement! It's only for informational purposes.\n"
" A paxos state table is created automatically when enabling LWT on a base table.\n"
"\n{}\n"
"*/",
*table_desc.create_statement);
table_desc.create_statement = std::move(os).to_managed_string();
}
result.push_back(std::move(table_desc));
@@ -364,7 +378,7 @@ future<std::vector<description>> table(const data_dictionary::database& db, cons
future<std::vector<description>> tables(const data_dictionary::database& db, const lw_shared_ptr<keyspace_metadata>& ks, std::optional<bool> with_internals = std::nullopt) {
auto& replica_db = db.real_database();
auto tables = ks->tables() | std::views::filter([&replica_db] (const schema_ptr& s) {
return !cdc::is_log_for_some_table(replica_db, s->ks_name(), s->cf_name());
return !cdc::is_log_for_some_table(replica_db, s->ks_name(), s->cf_name()) && !service::paxos::paxos_store::try_get_base_table(s->cf_name());
}) | std::ranges::to<std::vector<schema_ptr>>();
std::ranges::sort(tables, std::ranges::less(), std::mem_fn(&schema::cf_name));

@@ -21,6 +21,7 @@
#include "exceptions/exceptions.hh"
#include <seastar/core/future.hh>
#include <seastar/coroutine/exception.hh>
#include "index/vector_index.hh"
#include "service/broadcast_tables/experimental/lang.hh"
#include "service/qos/qos_common.hh"
#include "vector_search/vector_store_client.hh"
@@ -245,7 +246,9 @@ future<> select_statement::check_access(query_processor& qp, const service::clie
auto& cf_name = s->is_view()
? s->view_info()->base_name()
: (cdc ? cdc->cf_name() : column_family());
co_await state.has_column_family_access(keyspace(), cf_name, auth::permission::SELECT);
const schema_ptr& base_schema = cdc ? cdc : _schema;
bool is_vector_indexed = secondary_index::vector_index::has_vector_index(*base_schema);
co_await state.has_column_family_access(keyspace(), cf_name, auth::permission::SELECT, auth::command_desc::type::OTHER, is_vector_indexed);
} catch (const data_dictionary::no_such_column_family& e) {
// Will be validated afterwards.
co_return;
@@ -1026,7 +1029,7 @@ indexed_table_select_statement::prepare(data_dictionary::database db,
if (it == indexes.end()) {
throw exceptions::invalid_request_exception("ANN ordering by vector requires the column to be indexed using 'vector_index'");
} else {
if (index_opt || parameters->allow_filtering() || restrictions->need_filtering() || check_needs_allow_filtering_anyway(*restrictions)) {
if (index_opt || parameters->allow_filtering() || !(restrictions->is_empty()) || check_needs_allow_filtering_anyway(*restrictions)) {
throw exceptions::invalid_request_exception("ANN ordering by vector does not support filtering");
}
index_opt = *it;
@@ -1182,6 +1185,11 @@ future<shared_ptr<cql_transport::messages::result_message>> indexed_table_select
if (stats) {
stats->add_latency(duration);
}
auto limit = get_limit(options, _limit);
auto page_size = options.get_page_size();
if (_prepared_ann_ordering.has_value() && page_size > 0 && (uint64_t) page_size < limit) {
result->add_warning("Paging is not supported for Vector Search queries. The entire result set has been returned.");
}
co_return result;
}
@@ -1217,11 +1225,18 @@ indexed_table_select_statement::actually_do_execute(query_processor& qp,
auto [ann_column, ann_vector_expr] = _prepared_ann_ordering.value();
auto values = value_cast<vector_type_impl::native_type>(ann_column->type->deserialize(expr::evaluate(ann_vector_expr, options).to_bytes()));
auto expr_value = expr::evaluate(ann_vector_expr, options);
if (expr_value.is_null()) {
throw exceptions::invalid_request_exception(fmt::format("Unsupported null value for column {}", _prepared_ann_ordering->first->name_as_text()));
}
auto values = value_cast<vector_type_impl::native_type>(ann_column->type->deserialize(std::move(expr_value).to_bytes()));
auto ann_vector = util::to_vector<float>(values);
auto as = abort_source();
auto pkeys = co_await qp.vector_store_client().ann(_schema->ks_name(), _index.metadata().name(), _schema , std::move(ann_vector), limit, as);
auto timeout = db::timeout_clock::now() + get_timeout(state.get_client_state(), options);
auto aoe = abort_on_expiry(timeout);
auto pkeys = co_await qp.vector_store_client().ann(_schema->ks_name(), _index.metadata().name(), _schema , std::move(ann_vector), limit, aoe.abort_source());
if (!pkeys.has_value()) {
co_await coroutine::return_exception(
exceptions::invalid_request_exception(std::visit(vector_search::vector_store_client::ann_error_visitor{}, pkeys.error())));
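
Note on the paging warning added above: the warning is attached only for ANN queries where a positive page size is smaller than the limit, that is, where the client asked for paging that cannot be honored. The predicate can be sketched in isolation as below. `should_warn_paging_ignored` is a hypothetical name, and the parameter types are assumptions modeled on the cast in the patch.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper mirroring the condition in the patch.
// page_size <= 0 means the client did not request paging.
inline bool should_warn_paging_ignored(bool is_ann_query, int32_t page_size, uint64_t limit) {
    // The sign check must come first: page_size == 0 would otherwise
    // compare as smaller than any positive limit after the cast.
    return is_ann_query && page_size > 0 && static_cast<uint64_t>(page_size) < limit;
}
```

When the page size is at least as large as the limit, the whole result fits in one page anyway, so no warning is needed.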

@@ -59,27 +59,31 @@ db::batchlog_manager::batchlog_manager(cql3::query_processor& qp, db::system_key
});
}
future<> db::batchlog_manager::do_batch_log_replay(post_replay_cleanup cleanup) {
return container().invoke_on(0, [cleanup] (auto& bm) -> future<> {
future<db::all_batches_replayed> db::batchlog_manager::do_batch_log_replay(post_replay_cleanup cleanup) {
return container().invoke_on(0, [cleanup] (auto& bm) -> future<db::all_batches_replayed> {
auto gate_holder = bm._gate.hold();
auto sem_units = co_await get_units(bm._sem, 1);
auto dest = bm._cpu++ % smp::count;
blogger.debug("Batchlog replay on shard {}: starts", dest);
auto last_replay = gc_clock::now();
all_batches_replayed all_replayed = all_batches_replayed::yes;
if (dest == 0) {
co_await bm.replay_all_failed_batches(cleanup);
all_replayed = co_await bm.replay_all_failed_batches(cleanup);
} else {
co_await bm.container().invoke_on(dest, [cleanup] (auto& bm) {
all_replayed = co_await bm.container().invoke_on(dest, [cleanup] (auto& bm) {
return with_gate(bm._gate, [&bm, cleanup] {
return bm.replay_all_failed_batches(cleanup);
});
});
}
co_await bm.container().invoke_on_all([last_replay] (auto& bm) {
bm._last_replay = last_replay;
});
if (all_replayed == all_batches_replayed::yes) {
co_await bm.container().invoke_on_all([last_replay] (auto& bm) {
bm._last_replay = last_replay;
});
}
blogger.debug("Batchlog replay on shard {}: done", dest);
co_return all_replayed;
});
}
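
Note on `do_batch_log_replay` above: the replay now reports whether every batch was replayed, and `_last_replay` is advanced only in that case. Presumably this keeps readers of `_last_replay` from assuming that batches skipped in this round were handled. The sketch below models the rule with a plain `enum class`. The real `all_batches_replayed` type is not shown in this excerpt, and `replay_tracker` with its integer timestamps is hypothetical.

```cpp
#include <cassert>
#include <vector>

// Assumed shape of the result type; the actual definition is not in this excerpt.
enum class all_batches_replayed { no, yes };

// Hypothetical model of the bookkeeping around a replay round.
struct replay_tracker {
    long last_replay = 0;

    // batch_ok[i] is false when batch i had to be skipped in this round.
    all_batches_replayed replay(const std::vector<bool>& batch_ok, long now) {
        auto all = all_batches_replayed::yes;
        for (bool ok : batch_ok) {
            if (!ok) {
                all = all_batches_replayed::no;
            }
        }
        // Advance the marker only when nothing was left behind.
        if (all == all_batches_replayed::yes) {
            last_replay = now;
        }
        return all;
    }
};
```

An empty round counts as fully replayed, which matches the patch initializing `all_replayed` to `yes` before iterating.
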
@@ -159,124 +163,127 @@ db_clock::duration db::batchlog_manager::get_batch_log_timeout() const {
return _write_request_timeout * 2;
}
future<> db::batchlog_manager::replay_all_failed_batches(post_replay_cleanup cleanup) {
future<db::all_batches_replayed> db::batchlog_manager::replay_all_failed_batches(post_replay_cleanup cleanup) {
typedef db_clock::rep clock_type;
db::all_batches_replayed all_replayed = all_batches_replayed::yes;
// rate limit is in bytes per second. Uses Double.MAX_VALUE if disabled (set to 0 in cassandra.yaml).
// max rate is scaled by the number of nodes in the cluster (same as for HHOM - see CASSANDRA-5272).
auto throttle = _replay_rate / _qp.proxy().get_token_metadata_ptr()->count_normal_token_owners();
auto limiter = make_lw_shared<utils::rate_limiter>(throttle);
auto batch = [this, limiter](const cql3::untyped_result_set::row& row) -> future<stop_iteration> {
auto schema = _qp.db().find_schema(system_keyspace::NAME, system_keyspace::BATCHLOG);
auto delete_batch = [this, schema = std::move(schema)] (utils::UUID id) {
auto key = partition_key::from_singular(*schema, id);
mutation m(schema, key);
auto now = service::client_state(service::client_state::internal_tag()).get_timestamp();
m.partition().apply_delete(*schema, clustering_key_prefix::make_empty(), tombstone(now, gc_clock::now()));
return _qp.proxy().mutate_locally(m, tracing::trace_state_ptr(), db::commitlog::force_sync::no);
};
auto batch = [this, limiter, delete_batch = std::move(delete_batch), &all_replayed](const cql3::untyped_result_set::row& row) -> future<stop_iteration> {
auto written_at = row.get_as<db_clock::time_point>("written_at");
auto id = row.get_as<utils::UUID>("id");
// enough time for the actual write + batchlog entry mutation delivery (two separate requests).
auto now = db_clock::now();
auto timeout = get_batch_log_timeout();
if (db_clock::now() < written_at + timeout) {
blogger.debug("Skipping replay of {}, too fresh", id);
return make_ready_future<stop_iteration>(stop_iteration::no);
}
if (utils::get_local_injector().is_enabled("skip_batch_replay")) {
blogger.debug("Skipping batch replay due to skip_batch_replay injection");
return make_ready_future<stop_iteration>(stop_iteration::no);
all_replayed = all_batches_replayed::no;
co_return stop_iteration::no;
}
// check version of serialization format
if (!row.has("version")) {
blogger.warn("Skipping logged batch because of unknown version");
return make_ready_future<stop_iteration>(stop_iteration::no);
co_await delete_batch(id);
co_return stop_iteration::no;
}
auto version = row.get_as<int32_t>("version");
if (version != netw::messaging_service::current_version) {
blogger.warn("Skipping logged batch because of incorrect version");
return make_ready_future<stop_iteration>(stop_iteration::no);
blogger.warn("Skipping logged batch because of incorrect version {}; current version = {}", version, netw::messaging_service::current_version);
co_await delete_batch(id);
co_return stop_iteration::no;
}
auto data = row.get_blob_unfragmented("data");
blogger.debug("Replaying batch {}", id);
auto fms = make_lw_shared<std::deque<canonical_mutation>>();
auto in = ser::as_input_stream(data);
while (in.size()) {
fms->emplace_back(ser::deserialize(in, std::type_identity<canonical_mutation>()));
}
auto size = data.size();
return map_reduce(*fms, [this, written_at] (canonical_mutation& fm) {
const auto& cf = _qp.proxy().local_db().find_column_family(fm.column_family_id());
return make_ready_future<canonical_mutation*>(written_at > cf.get_truncation_time() ? &fm : nullptr);
},
utils::chunked_vector<mutation>(),
[this] (utils::chunked_vector<mutation> mutations, canonical_mutation* fm) {
if (fm) {
schema_ptr s = _qp.db().find_schema(fm->column_family_id());
mutations.emplace_back(fm->to_mutation(s));
try {
auto fms = make_lw_shared<std::deque<canonical_mutation>>();
auto in = ser::as_input_stream(data);
while (in.size()) {
fms->emplace_back(ser::deserialize(in, std::type_identity<canonical_mutation>()));
schema_ptr s = _qp.db().find_schema(fms->back().column_family_id());
timeout = std::min(timeout, std::chrono::duration_cast<db_clock::duration>(s->tombstone_gc_options().propagation_delay_in_seconds()));
}
return mutations;
}).then([this, limiter, written_at, size, fms] (utils::chunked_vector<mutation> mutations) {
if (mutations.empty()) {
return make_ready_future<>();
if (now < written_at + timeout) {
blogger.debug("Skipping replay of {}, too fresh", id);
co_return stop_iteration::no;
}
const auto ttl = [written_at]() -> clock_type {
/*
* Calculate ttl for the mutations' hints (and reduce ttl by the time the mutations spent in the batchlog).
* This ensures that deletes aren't "undone" by an old batch replay.
*/
auto unadjusted_ttl = std::numeric_limits<gc_clock::rep>::max();
warn(unimplemented::cause::HINT);
#if 0
for (auto& m : *mutations) {
unadjustedTTL = Math.min(unadjustedTTL, HintedHandOffManager.calculateHintTTL(mutation));
auto size = data.size();
auto mutations = co_await map_reduce(*fms, [this, written_at] (canonical_mutation& fm) {
const auto& cf = _qp.proxy().local_db().find_column_family(fm.column_family_id());
return make_ready_future<canonical_mutation*>(written_at > cf.get_truncation_time() ? &fm : nullptr);
},
utils::chunked_vector<mutation>(),
[this] (utils::chunked_vector<mutation> mutations, canonical_mutation* fm) {
if (fm) {
schema_ptr s = _qp.db().find_schema(fm->column_family_id());
mutations.emplace_back(fm->to_mutation(s));
}
#endif
return unadjusted_ttl - std::chrono::duration_cast<gc_clock::duration>(db_clock::now() - written_at).count();
}();
if (ttl <= 0) {
return make_ready_future<>();
}
// Origin does the send manually, however I can't see a super great reason to do so.
// Our normal write path does not add much redundancy to the dispatch, and rate is handled after send
// in both cases.
// FIXME: verify that the above is reasonably true.
return limiter->reserve(size).then([this, mutations = std::move(mutations)] {
_stats.write_attempts += mutations.size();
// #1222 - change cl level to ALL, emulating origins behaviour of sending/hinting
// to all natural end points.
// Note however that origin uses hints here, and actually allows for this
// send to partially or wholly fail in actually sending stuff. Since we don't
// have hints (yet), send with CL=ALL, and hope we can re-do this soon.
// See below, we use retry on write failure.
auto timeout = db::timeout_clock::now() + write_timeout;
return _qp.proxy().send_batchlog_replay_to_all_replicas(std::move(mutations), timeout);
return mutations;
});
}).then_wrapped([this, id](future<> batch_result) {
try {
batch_result.get();
} catch (data_dictionary::no_such_keyspace& ex) {
// should probably ignore and drop the batch
} catch (const data_dictionary::no_such_column_family&) {
// As above -- we should drop the batch if the table doesn't exist anymore.
} catch (...) {
blogger.warn("Replay failed (will retry): {}", std::current_exception());
// timeout, overload etc.
// Do _not_ remove the batch, assuming we got a node write error.
// Since we don't have hints (which origin is satisfied with),
// we have to resort to keeping this batch to next lap.
return make_ready_future<>();
if (!mutations.empty()) {
const auto ttl = [written_at]() -> clock_type {
/*
* Calculate ttl for the mutations' hints (and reduce ttl by the time the mutations spent in the batchlog).
* This ensures that deletes aren't "undone" by an old batch replay.
*/
auto unadjusted_ttl = std::numeric_limits<gc_clock::rep>::max();
warn(unimplemented::cause::HINT);
#if 0
for (auto& m : *mutations) {
unadjustedTTL = Math.min(unadjustedTTL, HintedHandOffManager.calculateHintTTL(mutation));
}
#endif
return unadjusted_ttl - std::chrono::duration_cast<gc_clock::duration>(db_clock::now() - written_at).count();
}();
if (ttl > 0) {
// Origin does the send manually, however I can't see a super great reason to do so.
// Our normal write path does not add much redundancy to the dispatch, and rate is handled after send
// in both cases.
// FIXME: verify that the above is reasonably true.
co_await limiter->reserve(size);
_stats.write_attempts += mutations.size();
auto timeout = db::timeout_clock::now() + write_timeout;
co_await _qp.proxy().send_batchlog_replay_to_all_replicas(std::move(mutations), timeout);
}
}
// delete batch
auto schema = _qp.db().find_schema(system_keyspace::NAME, system_keyspace::BATCHLOG);
auto key = partition_key::from_singular(*schema, id);
mutation m(schema, key);
auto now = service::client_state(service::client_state::internal_tag()).get_timestamp();
m.partition().apply_delete(*schema, clustering_key_prefix::make_empty(), tombstone(now, gc_clock::now()));
return _qp.proxy().mutate_locally(m, tracing::trace_state_ptr(), db::commitlog::force_sync::no);
}).then([] { return make_ready_future<stop_iteration>(stop_iteration::no); });
} catch (data_dictionary::no_such_keyspace& ex) {
// should probably ignore and drop the batch
} catch (const data_dictionary::no_such_column_family&) {
// As above -- we should drop the batch if the table doesn't exist anymore.
} catch (...) {
blogger.warn("Replay failed (will retry): {}", std::current_exception());
all_replayed = all_batches_replayed::no;
// timeout, overload etc.
// Do _not_ remove the batch, assuming we got a node write error.
// Since we don't have hints (which origin is satisfied with),
// we have to resort to keeping this batch to next lap.
co_return stop_iteration::no;
}
// delete batch
co_await delete_batch(id);
co_return stop_iteration::no;
};
co_await with_gate(_gate, [this, cleanup, batch = std::move(batch)] () mutable -> future<> {
@@ -298,4 +305,6 @@ future<> db::batchlog_manager::replay_all_failed_batches(post_replay_cleanup cle
blogger.debug("Finished replayAllFailedBatches");
});
});
co_return all_replayed;
}
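The reworked replay loop decides whether a batch is "too fresh" only after deserializing its mutations, because the effective timeout is lowered to the smallest tombstone-GC propagation delay among the touched tables. A minimal sketch of that check follows, assuming millisecond resolution; `too_fresh`, `clock_duration` and `time_point` are hypothetical names for illustration, not part of the patch.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <vector>

using clock_duration = std::chrono::milliseconds;
using time_point = std::chrono::time_point<std::chrono::system_clock, clock_duration>;

// Hypothetical helper: returns true when the batch is still too fresh to replay.
// The effective timeout is the smaller of the base batchlog timeout and the
// propagation delay of every table touched by the batch.
inline bool too_fresh(time_point now, time_point written_at, clock_duration base_timeout,
                      const std::vector<std::chrono::seconds>& propagation_delays) {
    clock_duration timeout = base_timeout;
    for (auto d : propagation_delays) {
        timeout = std::min(timeout, std::chrono::duration_cast<clock_duration>(d));
    }
    return now < written_at + timeout;
}
```

With a 4 s base timeout and a table whose propagation delay is 1 s, a batch written 1.5 s ago is already eligible for replay; with a 2 s delay it is still skipped.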

@@ -31,6 +31,8 @@ namespace db {
class system_keyspace;
using all_batches_replayed = bool_class<struct all_batches_replayed_tag>;
struct batchlog_manager_config {
std::chrono::duration<double> write_request_timeout;
uint64_t replay_rate = std::numeric_limits<uint64_t>::max();
@@ -69,7 +71,7 @@ private:
gc_clock::time_point _last_replay;
future<> replay_all_failed_batches(post_replay_cleanup cleanup);
future<all_batches_replayed> replay_all_failed_batches(post_replay_cleanup cleanup);
public:
// Takes a QP, not a distributed. Because this object is supposed
// to be per shard and does no dispatching beyond delegating the
@@ -80,7 +82,7 @@ public:
future<> drain();
future<> stop();
future<> do_batch_log_replay(post_replay_cleanup cleanup);
future<all_batches_replayed> do_batch_log_replay(post_replay_cleanup cleanup);
future<size_t> count_all_batches() const;
db_clock::duration get_batch_log_timeout() const;
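The new `all_batches_replayed` return type is a tagged boolean, so callers cannot confuse it with an unrelated `bool`. The sketch below is a self-contained stand-in, not Seastar's actual `bool_class`; `tagged_bool` and `replay_all` are hypothetical names, and the aggregation rule (one skipped or failed batch turns the result into `no`) is inferred from the hunks above.

```cpp
#include <cassert>
#include <vector>

// Minimal stand-in for a tagged boolean type (illustration only).
template <typename Tag>
class tagged_bool {
    bool _value;
public:
    static const tagged_bool yes;
    static const tagged_bool no;
    constexpr explicit tagged_bool(bool v) noexcept : _value(v) {}
    constexpr explicit operator bool() const noexcept { return _value; }
    friend constexpr bool operator==(tagged_bool a, tagged_bool b) noexcept { return a._value == b._value; }
    friend constexpr bool operator!=(tagged_bool a, tagged_bool b) noexcept { return a._value != b._value; }
};
template <typename Tag> const tagged_bool<Tag> tagged_bool<Tag>::yes{true};
template <typename Tag> const tagged_bool<Tag> tagged_bool<Tag>::no{false};

using all_batches_replayed = tagged_bool<struct all_batches_replayed_tag>;

// Hypothetical aggregation: start at `yes`, flip to `no` on the first batch
// that was skipped or failed, and never flip back.
inline all_batches_replayed replay_all(const std::vector<bool>& batch_succeeded) {
    auto all = all_batches_replayed::yes;
    for (bool ok : batch_succeeded) {
        if (!ok) {
            all = all_batches_replayed::no;
        }
    }
    return all;
}
```

Because the constructor and the conversion to `bool` are both explicit, passing a plain `true` where an `all_batches_replayed` is expected does not compile, which is the point of the tag.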

@@ -502,6 +502,9 @@ public:
void flush_segments(uint64_t size_to_remove);
void check_no_data_older_than_allowed();
// whitebox testing
std::function<future<>()> _oversized_pre_wait_memory_func;
private:
class shutdown_marker{};
@@ -1597,8 +1600,15 @@ future<> db::commitlog::segment_manager::oversized_allocation(entry_writer& writ
scope_increment_counter allocating(totals.active_allocations);
// #27992 - whitebox testing. signal we are trying to lock out
// all allocators
if (_oversized_pre_wait_memory_func) {
co_await _oversized_pre_wait_memory_func();
}
auto permit = co_await std::move(fut);
SCYLLA_ASSERT(_request_controller.available_units() == 0);
// #27992 - task reordering _can_ force the available units to negative. this is ok.
SCYLLA_ASSERT(_request_controller.available_units() <= 0);
decltype(permit) fake_permit; // can't have allocate+sync release semaphore.
bool failed = false;
@@ -1859,13 +1869,15 @@ future<> db::commitlog::segment_manager::oversized_allocation(entry_writer& writ
}
}
}
SCYLLA_ASSERT(_request_controller.available_units() == 0);
auto avail = _request_controller.available_units();
SCYLLA_ASSERT(avail <= 0);
SCYLLA_ASSERT(permit.count() == max_request_controller_units());
auto nw = _request_controller.waiters();
permit.return_all();
// #20633 cannot guarantee controller avail is now full, since we could have had waiters when doing
// return all -> now will be less avail
SCYLLA_ASSERT(nw > 0 || _request_controller.available_units() == ssize_t(max_request_controller_units()));
SCYLLA_ASSERT(nw > 0 || _request_controller.available_units() == (avail + ssize_t(max_request_controller_units())));
if (!failed) {
clogger.trace("Oversized allocation succeeded.");
@@ -1974,13 +1986,13 @@ future<> db::commitlog::segment_manager::replenish_reserve() {
}
continue;
} catch (shutdown_marker&) {
_reserve_segments.abort(std::current_exception());
break;
} catch (...) {
clogger.warn("Exception in segment reservation: {}", std::current_exception());
}
co_await sleep(100ms);
}
_reserve_segments.abort(std::make_exception_ptr(shutdown_marker()));
}
future<std::vector<db::commitlog::descriptor>>
@@ -3624,6 +3636,10 @@ db::commitlog::read_log_file(const replay_state& state, sstring filename, sstrin
auto old = pos;
pos = next_pos(off);
clogger.trace("Pos {} -> {} ({})", old, pos, off);
// #24346 check eof status whenever we move file pos.
if (pos >= file_size) {
eof = true;
}
}
future<> read_entry() {
@@ -3939,6 +3955,9 @@ void db::commitlog::update_max_data_lifetime(std::optional<uint64_t> commitlog_d
_segment_manager->cfg.commitlog_data_max_lifetime_in_seconds = commitlog_data_max_lifetime_in_seconds;
}
void db::commitlog::set_oversized_pre_wait_memory_func(std::function<future<>()> f) {
_segment_manager->_oversized_pre_wait_memory_func = std::move(f);
}
future<std::vector<sstring>> db::commitlog::get_segments_to_replay() const {
return _segment_manager->get_segments_to_replay();
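The relaxed assertions in `oversized_allocation` rest on simple arithmetic: if the controller's available units were `avail <= 0` while the permit held every unit, then returning the permit with no waiters leaves `avail + max` units, not necessarily `max`. The toy model below illustrates only that arithmetic; `toy_controller` and `after_return_all` are hypothetical and do not reproduce the real semaphore.

```cpp
#include <cassert>
#include <cstddef>

// Toy model of a unit counter that is allowed to go negative.
struct toy_controller {
    std::ptrdiff_t available;
    void consume(std::ptrdiff_t n) { available -= n; }  // forced take, may go below zero
    void signal(std::ptrdiff_t n) { available += n; }   // return units
};

// Returns the available units after the oversized permit is handed back,
// assuming no waiters are queued at that point.
inline std::ptrdiff_t after_return_all(std::ptrdiff_t max_units, std::ptrdiff_t extra_consumed) {
    toy_controller c{max_units};
    c.consume(max_units);       // the oversized allocation holds every unit
    c.consume(extra_consumed);  // a reordered task takes units without waiting
    const std::ptrdiff_t avail = c.available;
    assert(avail <= 0);         // matches the relaxed first assertion
    c.signal(max_units);        // models returning the whole permit
    assert(c.available == avail + max_units);
    return c.available;
}
```

With `extra_consumed == 0` this reduces to the old expectation (all units available again); any positive value shows why the strict equality could fire.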

@@ -385,6 +385,9 @@ public:
// (Re-)set data max lifetime.
void update_max_data_lifetime(std::optional<uint64_t> commitlog_data_max_lifetime_in_seconds);
// Whitebox testing. Do not use for production
void set_oversized_pre_wait_memory_func(std::function<future<>()>);
using commit_load_reader_func = std::function<future<>(buffer_and_replay_position)>;
class segment_error : public std::exception {};

@@ -54,12 +54,14 @@ public:
uint64_t applied_mutations = 0;
uint64_t corrupt_bytes = 0;
uint64_t truncated_at = 0;
uint64_t broken_files = 0;
stats& operator+=(const stats& s) {
invalid_mutations += s.invalid_mutations;
skipped_mutations += s.skipped_mutations;
applied_mutations += s.applied_mutations;
corrupt_bytes += s.corrupt_bytes;
broken_files += s.broken_files;
return *this;
}
stats operator+(const stats& s) const {
@@ -192,6 +194,8 @@ db::commitlog_replayer::impl::recover(const commitlog::descriptor& d, const comm
s->corrupt_bytes += e.bytes();
} catch (commitlog::segment_truncation& e) {
s->truncated_at = e.position();
} catch (commitlog::header_checksum_error&) {
++s->broken_files;
} catch (...) {
throw;
}
@@ -370,6 +374,9 @@ future<> db::commitlog_replayer::recover(std::vector<sstring> files, sstring fna
if (stats.truncated_at != 0) {
rlogger.warn("Truncated file: {} at position {}.", f, stats.truncated_at);
}
if (stats.broken_files != 0) {
rlogger.warn("Corrupted file header: {}. Skipped.", f);
}
rlogger.debug("Log replay of {} complete, {} replayed mutations ({} invalid, {} skipped)"
, f
, stats.applied_mutations
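The replayer change turns a corrupted segment header from a fatal error into a counted, skipped file. A rough sketch of that pattern, with simplified stand-in types (`toy_header_checksum_error`, `toy_stats` and `replay_files` are hypothetical, not the real replayer API):

```cpp
#include <cassert>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for a "header checksum mismatch" exception.
struct toy_header_checksum_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

struct toy_stats {
    std::uint64_t applied_mutations = 0;
    std::uint64_t broken_files = 0;
    toy_stats& operator+=(const toy_stats& s) {
        applied_mutations += s.applied_mutations;
        broken_files += s.broken_files;
        return *this;
    }
};

// Replays each file with `replay_one`, which returns the number of applied mutations.
// A corrupted header skips that file and is counted; any other exception propagates.
template <typename ReplayOne>
toy_stats replay_files(const std::vector<std::string>& files, ReplayOne replay_one) {
    toy_stats total;
    for (const auto& f : files) {
        toy_stats s;
        try {
            s.applied_mutations = replay_one(f);
        } catch (const toy_header_checksum_error&) {
            ++s.broken_files;
        }
        total += s;
    }
    return total;
}
```

Only the specific exception type is swallowed, so unrelated failures still abort the replay as before.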

@@ -1171,6 +1171,17 @@ db::config::config(std::shared_ptr<db::extensions> exts)
"* default_weight: (Default: 1 **) How many requests are handled during each turn of the RoundRobin.\n"
"* weights: (Default: Keyspace: 1) Takes a list of keyspaces. It sets how many requests are handled during each turn of the RoundRobin, based on the request_scheduler_id.")
/**
* @Group Vector search settings
* @GroupDescription Settings for configuring and tuning vector search functionality.
*/
, vector_store_primary_uri(this, "vector_store_primary_uri", liveness::LiveUpdate, value_status::Used, "",
"A comma-separated list of primary vector store node URIs. These nodes are preferred for vector search operations.")
, vector_store_secondary_uri(this, "vector_store_secondary_uri", liveness::LiveUpdate, value_status::Used, "",
"A comma-separated list of secondary vector store node URIs. These nodes are used as a fallback when all primary nodes are unavailable, and are typically located in a different availability zone for high availability.")
, vector_store_encryption_options(this, "vector_store_encryption_options", value_status::Used, {},
"Options for encrypted connections to the vector store. These options are used for HTTPS URIs in `vector_store_primary_uri` and `vector_store_secondary_uri`. The available options are:\n"
"* truststore: (Default: <not set, use system truststore>) Location of the truststore containing the trusted certificate for authenticating remote servers.")
/**
* @Group Security properties
* @GroupDescription Server and client security settings.
*/
@@ -1318,15 +1329,15 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, enable_sstables_mc_format(this, "enable_sstables_mc_format", value_status::Unused, true, "Enable SSTables 'mc' format to be used as the default file format. Deprecated, please use \"sstable_format\" instead.")
, enable_sstables_md_format(this, "enable_sstables_md_format", value_status::Unused, true, "Enable SSTables 'md' format to be used as the default file format. Deprecated, please use \"sstable_format\" instead.")
, sstable_format(this, "sstable_format", liveness::LiveUpdate, value_status::Used, "me", "Default sstable file format", {"md", "me", "ms"})
, sstable_compression_user_table_options(this, "sstable_compression_user_table_options", value_status::Used, compression_parameters{},
, sstable_compression_user_table_options(this, "sstable_compression_user_table_options", value_status::Used, compression_parameters{compression_parameters::algorithm::lz4_with_dicts},
"Server-global user table compression options. If enabled, all user tables "
"will be compressed using the provided options, unless overridden "
"by compression options in the table schema. The available options are:\n"
"* sstable_compression: The compression algorithm to use. Supported values: LZ4Compressor (default), LZ4WithDictsCompressor, SnappyCompressor, DeflateCompressor, ZstdCompressor, ZstdWithDictsCompressor, '' (empty string; disables compression).\n"
"by compression options in the table schema. User tables are all tables in non-system keyspaces. The available options are:\n"
"* sstable_compression: The compression algorithm to use. Supported values: LZ4Compressor, LZ4WithDictsCompressor (default), SnappyCompressor, DeflateCompressor, ZstdCompressor, ZstdWithDictsCompressor, '' (empty string; disables compression).\n"
"* chunk_length_in_kb: (Default: 4) The size of chunks to compress in kilobytes. Allowed values are powers of two between 1 and 128.\n"
"* crc_check_chance: (Default: 1.0) Not implemented (option value is ignored).\n"
"* compression_level: (Default: 3) Compression level for ZstdCompressor and ZstdWithDictsCompressor. Higher levels provide better compression ratios at the cost of speed. Allowed values are integers between 1 and 22.")
, sstable_compression_dictionaries_allow_in_ddl(this, "sstable_compression_dictionaries_allow_in_ddl", liveness::LiveUpdate, value_status::Used, true,
, sstable_compression_dictionaries_allow_in_ddl(this, "sstable_compression_dictionaries_allow_in_ddl", liveness::LiveUpdate, value_status::Deprecated, true,
"Allows for configuring tables to use SSTable compression with shared dictionaries. "
"If the option is disabled, Scylla will reject CREATE and ALTER statements which try to set dictionary-based sstable compressors.\n"
"This is only enforced when this node validates a new DDL statement; disabling the option won't disable dictionary-based compression "
@@ -1426,7 +1437,8 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, alternator_port(this, "alternator_port", value_status::Used, 0, "Alternator API port.")
, alternator_https_port(this, "alternator_https_port", value_status::Used, 0, "Alternator API HTTPS port.")
, alternator_address(this, "alternator_address", value_status::Used, "0.0.0.0", "Alternator API listening address.")
, alternator_enforce_authorization(this, "alternator_enforce_authorization", value_status::Used, false, "Enforce checking the authorization header for every request in Alternator.")
, alternator_enforce_authorization(this, "alternator_enforce_authorization", liveness::LiveUpdate, value_status::Used, false, "Enforce checking the authorization header for every request in Alternator.")
, alternator_warn_authorization(this, "alternator_warn_authorization", liveness::LiveUpdate, value_status::Used, false, "Count and log warnings about failed authentication or authorization")
, alternator_write_isolation(this, "alternator_write_isolation", value_status::Used, "", "Default write isolation policy for Alternator.")
, alternator_streams_time_window_s(this, "alternator_streams_time_window_s", value_status::Used, 10, "CDC query confidence window for alternator streams.")
, alternator_timeout_in_ms(this, "alternator_timeout_in_ms", liveness::LiveUpdate, value_status::Used, 10000,
@@ -1448,7 +1460,8 @@ db::config::config(std::shared_ptr<db::extensions> exts)
false,
"Allow writing to system tables using the .scylla.alternator.system prefix")
, alternator_max_expression_cache_entries_per_shard(this, "alternator_max_expression_cache_entries_per_shard", liveness::LiveUpdate, value_status::Used, 2000, "Maximum number of cached parsed request expressions, per shard.")
, vector_store_primary_uri(this, "vector_store_primary_uri", liveness::LiveUpdate, value_status::Used, "", "A comma-separated list of vector store node URIs. If not set, vector search is disabled.")
, alternator_max_users_query_size_in_trace_output(this, "alternator_max_users_query_size_in_trace_output", liveness::LiveUpdate, value_status::Used, uint64_t(4096),
"Maximum size of user's command in trace output (`alternator_op` entry). Larger traces will be truncated and have `<truncated>` message appended - which doesn't count to the maximum limit.")
, abort_on_ebadf(this, "abort_on_ebadf", value_status::Used, true, "Abort the server on incorrect file descriptor access. Throws exception when disabled.")
, sanitizer_report_backtrace(this, "sanitizer_report_backtrace", value_status::Used, false,
"In debug mode, report log-structured allocator sanitizer violations with a backtrace. Slow.")
@@ -1524,9 +1537,9 @@ db::config::config(std::shared_ptr<db::extensions> exts)
, error_injections_at_startup(this, "error_injections_at_startup", error_injection_value_status, {}, "List of error injections that should be enabled on startup.")
, topology_barrier_stall_detector_threshold_seconds(this, "topology_barrier_stall_detector_threshold_seconds", value_status::Used, 2, "Report sites blocking topology barrier if it takes longer than this.")
, enable_tablets(this, "enable_tablets", value_status::Used, false, "Enable tablets for newly created keyspaces. (deprecated)")
, tablets_mode_for_new_keyspaces(this, "tablets_mode_for_new_keyspaces", value_status::Used, tablets_mode_t::mode::unset, "Control tablets for new keyspaces. Can be set to the following values:\n"
, tablets_mode_for_new_keyspaces(this, "tablets_mode_for_new_keyspaces", liveness::LiveUpdate, value_status::Used, tablets_mode_t::mode::unset, "Control tablets for new keyspaces. Can be set to the following values:\n"
"\tdisabled: New keyspaces use vnodes by default, unless enabled by the tablets={'enabled':true} option\n"
"\tenabled: New keyspaces use tablets by default, unless disabled by the tablets={'disabled':true} option\n"
"\tenabled: New keyspaces use tablets by default, unless disabled by the tablets={'enabled':false} option\n"
"\tenforced: New keyspaces must use tablets. Tablets cannot be disabled using the CREATE KEYSPACE option")
, view_flow_control_delay_limit_in_ms(this, "view_flow_control_delay_limit_in_ms", liveness::LiveUpdate, value_status::Used, 1000,
"The maximal amount of time that materialized-view update flow control may delay responses "
@@ -1740,6 +1753,21 @@ const db::extensions& db::config::extensions() const {
return *_extensions;
}
compression_parameters db::config::get_sstable_compression_user_table_options(bool dicts_feature_enabled) const {
if (sstable_compression_user_table_options.is_set()
|| dicts_feature_enabled
|| !sstable_compression_user_table_options().uses_dictionary_compressor()) {
return sstable_compression_user_table_options();
} else {
// Fall back to non-dict if dictionary compression is not enabled cluster-wide.
auto options = sstable_compression_user_table_options();
auto params = options.get_options();
auto algo = compression_parameters::non_dict_equivalent(options.get_algorithm());
params[compression_parameters::SSTABLE_COMPRESSION] = sstring(compression_parameters::algorithm_to_name(algo));
return compression_parameters{params};
}
}
std::map<sstring, db::experimental_features_t::feature> db::experimental_features_t::map() {
// We decided against using the construct-on-first-use idiom here:
// https://github.com/scylladb/scylla/pull/5369#discussion_r353614807
@@ -1756,7 +1784,7 @@ std::map<sstring, db::experimental_features_t::feature> db::experimental_feature
{"broadcast-tables", feature::BROADCAST_TABLES},
{"keyspace-storage-options", feature::KEYSPACE_STORAGE_OPTIONS},
{"tablets", feature::UNUSED},
{"views-with-tablets", feature::VIEWS_WITH_TABLETS}
{"views-with-tablets", feature::UNUSED}
};
}
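The new `get_sstable_compression_user_table_options()` accessor keeps the configured value when the option was set explicitly, when the dictionary feature is enabled cluster-wide, or when the algorithm does not use dictionaries; otherwise it substitutes the non-dictionary equivalent. The sketch below models just that decision. The enum and the `lz4_with_dicts -> lz4`, `zstd_with_dicts -> zstd` mapping are assumptions made for illustration.

```cpp
#include <cassert>

// Simplified, hypothetical set of algorithms.
enum class algorithm { none, lz4, lz4_with_dicts, zstd, zstd_with_dicts };

inline bool uses_dictionary(algorithm a) {
    return a == algorithm::lz4_with_dicts || a == algorithm::zstd_with_dicts;
}

// Assumed mapping from a dictionary-based algorithm to its plain counterpart.
inline algorithm non_dict_equivalent(algorithm a) {
    switch (a) {
    case algorithm::lz4_with_dicts: return algorithm::lz4;
    case algorithm::zstd_with_dicts: return algorithm::zstd;
    default: return a;
    }
}

// Mirrors the condition in the accessor: fall back only for an implicit
// (default) dictionary-based setting while the feature is not yet enabled.
inline algorithm effective_algorithm(algorithm configured, bool explicitly_set, bool dicts_feature_enabled) {
    if (explicitly_set || dicts_feature_enabled || !uses_dictionary(configured)) {
        return configured;
    }
    return non_dict_equivalent(configured);
}
```

Note that an explicitly configured dictionary compressor is returned unchanged even when the feature is disabled; only the implicit default is downgraded.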

@@ -136,8 +136,7 @@ struct experimental_features_t {
UDF,
ALTERNATOR_STREAMS,
BROADCAST_TABLES,
KEYSPACE_STORAGE_OPTIONS,
VIEWS_WITH_TABLETS
KEYSPACE_STORAGE_OPTIONS
};
static std::map<sstring, feature> map(); // See enum_option.
static std::vector<enum_option<experimental_features_t>> all();
@@ -364,6 +363,9 @@ public:
named_value<sstring> request_scheduler;
named_value<sstring> request_scheduler_id;
named_value<string_map> request_scheduler_options;
named_value<sstring> vector_store_primary_uri;
named_value<sstring> vector_store_secondary_uri;
named_value<string_map> vector_store_encryption_options;
named_value<sstring> authenticator;
named_value<sstring> internode_authenticator;
named_value<sstring> authorizer;
@@ -432,7 +434,13 @@ public:
named_value<bool> enable_sstables_mc_format;
named_value<bool> enable_sstables_md_format;
named_value<sstring> sstable_format;
// NOTE: Do not use this option directly.
// Use get_sstable_compression_user_table_options() instead.
named_value<compression_parameters> sstable_compression_user_table_options;
compression_parameters get_sstable_compression_user_table_options(bool dicts_feature_enabled) const;
named_value<bool> sstable_compression_dictionaries_allow_in_ddl;
named_value<bool> sstable_compression_dictionaries_enable_writing;
named_value<float> sstable_compression_dictionaries_memory_budget_fraction;
@@ -478,6 +486,7 @@ public:
named_value<uint16_t> alternator_https_port;
named_value<sstring> alternator_address;
named_value<bool> alternator_enforce_authorization;
named_value<bool> alternator_warn_authorization;
named_value<sstring> alternator_write_isolation;
named_value<uint32_t> alternator_streams_time_window_s;
named_value<uint32_t> alternator_timeout_in_ms;
@@ -486,8 +495,7 @@ public:
named_value<uint32_t> alternator_max_items_in_batch_write;
named_value<bool> alternator_allow_system_table_write;
named_value<uint32_t> alternator_max_expression_cache_entries_per_shard;
named_value<sstring> vector_store_primary_uri;
named_value<uint64_t> alternator_max_users_query_size_in_trace_output;
named_value<bool> abort_on_ebadf;

@@ -248,7 +248,7 @@ future<db::commitlog> hint_endpoint_manager::add_store() noexcept {
// which is larger than the segment ID of the RP of the last written hint.
cfg.base_segment_id = _last_written_rp.base_id();
return commitlog::create_commitlog(std::move(cfg)).then([this] (commitlog l) -> future<commitlog> {
return commitlog::create_commitlog(std::move(cfg)).then([this] (this auto, commitlog l) -> future<commitlog> {
// add_store() is triggered every time hint files are forcefully flushed to I/O (every hints_flush_period).
// When this happens we want to refill _sender's segments only if it has finished with the segments he had before.
if (_sender.have_segments()) {

@@ -643,6 +643,12 @@ future<> manager::drain_for(endpoint_id host_id, gms::inet_address ip) noexcept
co_return;
}
if (!replay_allowed()) {
auto reason = seastar::format("Precondition violated while trying to drain {} / {}: "
"hint replay is not allowed", host_id, ip);
on_internal_error(manager_logger, std::move(reason));
}
manager_logger.info("Draining starts for {}", host_id);
const auto holder = seastar::gate::holder{_draining_eps_gate};

@@ -318,6 +318,10 @@ public:
/// In both cases - removes the corresponding hints' directories after all hints have been drained and erases the
/// corresponding hint_endpoint_manager objects.
///
/// Preconditions:
/// * Hint replay must be allowed (i.e. `replay_allowed()` must be true) throughout
/// the execution of this function.
///
/// \param host_id host ID of the node that left the cluster
/// \param ip the IP of the node that left the cluster
future<> drain_for(endpoint_id host_id, gms::inet_address ip) noexcept;
@@ -342,15 +346,15 @@ public:
return _state.contains(state::started);
}
bool replay_allowed() const noexcept {
return _state.contains(state::replay_allowed);
}
private:
void set_started() noexcept {
_state.set(state::started);
}
bool replay_allowed() const noexcept {
return _state.contains(state::replay_allowed);
}
void set_draining_all() noexcept {
_state.set(state::draining_all);
}

@@ -850,7 +850,7 @@ mutation_reader row_cache::make_nonpopulating_reader(schema_ptr schema, reader_p
std::move(permit),
e.key(),
query::clustering_key_filter_ranges(slice.row_ranges(*schema, e.key().key())),
e.partition().read(_tracker.region(), _tracker.memtable_cleaner(), nullptr, phase_of(pos)),
e.partition().read(_tracker.region(), _tracker.memtable_cleaner(), &_tracker, phase_of(pos)),
false,
_tracker.region(),
_read_section,

@@ -95,16 +95,16 @@ static logging::logger diff_logger("schema_diff");
/** system.schema_* tables used to store keyspace/table/type attributes prior to C* 3.0 */
namespace db {
namespace {
const auto set_use_schema_commitlog = schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
if (ks_name == schema_tables::NAME) {
props.enable_schema_commitlog();
const auto set_use_schema_commitlog = schema_builder::register_schema_initializer([](schema_builder& builder) {
if (builder.ks_name() == schema_tables::NAME) {
builder.enable_schema_commitlog();
}
});
const auto set_group0_table_options =
schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
if (ks_name == schema_tables::NAME) {
schema_builder::register_schema_initializer([](schema_builder& builder) {
if (builder.ks_name() == schema_tables::NAME) {
// all schema tables are group0 tables
props.is_group0_table = true;
builder.set_is_group0_table();
}
});
}
@@ -1911,7 +1911,7 @@ static void make_update_indices_mutations(
if (!view_should_exist(index)) {
return view_ptr(nullptr);
}
auto view = cf.get_index_manager().create_view_for_index(index);
auto view = cf.get_index_manager().create_view_for_index(index, db.as_data_dictionary());
auto view_mutations = make_view_mutations(view, timestamp, true);
view_mutations.copy_to(mutations);
return view;
@@ -1945,7 +1945,7 @@ static void make_update_indices_mutations(
for (auto& replica: tablet_map.get_tablet_info(tid).replicas) {
auto id = utils::UUID_gen::get_time_UUID();
view::view_building_task task {
id, view::view_building_task::task_type::build_range, view::view_building_task::task_state::idle,
id, view::view_building_task::task_type::build_range, false,
new_table->id(), view->id(), replica, last_token
};

@@ -42,11 +42,11 @@ extern logging::logger cdc_log;
namespace db {
namespace {
const auto set_wait_for_sync_to_commitlog = schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
if ((ks_name == system_distributed_keyspace::NAME_EVERYWHERE && cf_name == system_distributed_keyspace::CDC_GENERATIONS_V2) ||
(ks_name == system_distributed_keyspace::NAME && cf_name == system_distributed_keyspace::CDC_TOPOLOGY_DESCRIPTION))
const auto set_wait_for_sync_to_commitlog = schema_builder::register_schema_initializer([](schema_builder& builder) {
if ((builder.ks_name() == system_distributed_keyspace::NAME_EVERYWHERE && builder.cf_name() == system_distributed_keyspace::CDC_GENERATIONS_V2) ||
(builder.ks_name() == system_distributed_keyspace::NAME && builder.cf_name() == system_distributed_keyspace::CDC_TOPOLOGY_DESCRIPTION))
{
props.wait_for_sync_to_commitlog = true;
builder.set_wait_for_sync_to_commitlog(true);
}
});
}
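These hunks replace `register_static_configurator` callbacks, which received keyspace and table names plus a props struct, with `register_schema_initializer` callbacks that receive the builder itself. The following is a simplified model of such a registry, assuming the initializers run when a builder is constructed (the diff does not show when they actually run). `toy_schema_builder` and the names `"ks_example"` / `"table_a"` are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, heavily simplified builder with a global initializer registry.
class toy_schema_builder {
    using initializer = std::function<void(toy_schema_builder&)>;
    std::string _ks;
    std::string _cf;
    bool _wait_for_sync = false;

    // Construct-on-first-use, so registration from other static objects is safe.
    static std::vector<initializer>& initializers() {
        static std::vector<initializer> v;
        return v;
    }
public:
    toy_schema_builder(std::string ks, std::string cf) : _ks(std::move(ks)), _cf(std::move(cf)) {
        // Assumption: every registered initializer sees each new builder.
        for (auto& init : initializers()) {
            init(*this);
        }
    }
    static int register_schema_initializer(initializer f) {
        initializers().push_back(std::move(f));
        return 0;
    }
    const std::string& ks_name() const { return _ks; }
    const std::string& cf_name() const { return _cf; }
    void set_wait_for_sync_to_commitlog(bool v) { _wait_for_sync = v; }
    bool wait_for_sync_to_commitlog() const { return _wait_for_sync; }
};

namespace {
// Registered at static-initialization time, in the style of the hunks above.
const auto set_wait_for_sync = toy_schema_builder::register_schema_initializer([](toy_schema_builder& b) {
    if (b.ks_name() == "ks_example" && b.cf_name() == "table_a") {
        b.set_wait_for_sync_to_commitlog(true);
    }
});
}
```

Passing the builder lets a callback use ordinary setters instead of writing into a separate props struct, which is the visible shape of the change in this diff.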

@@ -55,6 +55,7 @@
#include "utils/shared_dict.hh"
#include "replica/database.hh"
#include "db/compaction_history_entry.hh"
#include "mutation/async_utils.hh"
#include <unordered_map>
@@ -65,59 +66,44 @@ static thread_local auto sstableinfo_type = user_type_impl::get_instance(
namespace db {
namespace {
const auto set_null_sharder = schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
const auto set_null_sharder = schema_builder::register_schema_initializer([](schema_builder& builder) {
// tables in the "system" keyspace which need to use null sharder
static const std::unordered_set<sstring> tables = {
// empty
};
if (ks_name == system_keyspace::NAME && tables.contains(cf_name)) {
props.use_null_sharder = true;
if (builder.ks_name() == system_keyspace::NAME && tables.contains(builder.cf_name())) {
builder.set_use_null_sharder(true);
}
});
const auto set_wait_for_sync_to_commitlog = schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
const auto set_wait_for_sync_to_commitlog = schema_builder::register_schema_initializer([](schema_builder& builder) {
static const std::unordered_set<sstring> tables = {
system_keyspace::PAXOS,
};
if (ks_name == system_keyspace::NAME && tables.contains(cf_name)) {
props.wait_for_sync_to_commitlog = true;
if (builder.ks_name() == system_keyspace::NAME && tables.contains(builder.cf_name())) {
builder.set_wait_for_sync_to_commitlog(true);
}
});
const auto set_use_schema_commitlog = schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
const auto set_use_schema_commitlog = schema_builder::register_schema_initializer([](schema_builder& builder) {
static const std::unordered_set<sstring> tables = {
schema_tables::SCYLLA_TABLE_SCHEMA_HISTORY,
system_keyspace::BROADCAST_KV_STORE,
system_keyspace::CDC_GENERATIONS_V3,
system_keyspace::RAFT,
system_keyspace::RAFT_SNAPSHOTS,
system_keyspace::RAFT_SNAPSHOT_CONFIG,
system_keyspace::GROUP0_HISTORY,
system_keyspace::DISCOVERY,
system_keyspace::TABLETS,
system_keyspace::TOPOLOGY,
system_keyspace::TOPOLOGY_REQUESTS,
system_keyspace::LOCAL,
system_keyspace::PEERS,
system_keyspace::SCYLLA_LOCAL,
system_keyspace::COMMITLOG_CLEANUPS,
system_keyspace::SERVICE_LEVELS_V2,
system_keyspace::VIEW_BUILD_STATUS_V2,
system_keyspace::CDC_STREAMS_STATE,
system_keyspace::CDC_STREAMS_HISTORY,
system_keyspace::ROLES,
system_keyspace::ROLE_MEMBERS,
system_keyspace::ROLE_ATTRIBUTES,
system_keyspace::ROLE_PERMISSIONS,
system_keyspace::v3::CDC_LOCAL,
system_keyspace::DICTS,
system_keyspace::VIEW_BUILDING_TASKS,
};
if (ks_name == system_keyspace::NAME && tables.contains(cf_name)) {
props.enable_schema_commitlog();
if (builder.ks_name() == system_keyspace::NAME && tables.contains(builder.cf_name())) {
builder.enable_schema_commitlog();
}
});
const auto set_group0_table_options =
schema_builder::register_static_configurator([](const sstring& ks_name, const sstring& cf_name, schema_static_props& props) {
schema_builder::register_schema_initializer([](schema_builder& builder) {
static const std::unordered_set<sstring> tables = {
// scylla_local may store a replicated tombstone related to schema
// (see `make_group0_schema_version_mutation`), so we include it in the group0 tables list.
@@ -137,9 +123,10 @@ namespace {
system_keyspace::ROLE_PERMISSIONS,
system_keyspace::DICTS,
system_keyspace::VIEW_BUILDING_TASKS,
system_keyspace::REPAIR_TASKS,
};
if (ks_name == system_keyspace::NAME && tables.contains(cf_name)) {
props.is_group0_table = true;
if (builder.ks_name() == system_keyspace::NAME && tables.contains(builder.cf_name())) {
builder.set_is_group0_table();
}
});
}
@@ -462,6 +449,24 @@ schema_ptr system_keyspace::repair_history() {
return schema;
}
schema_ptr system_keyspace::repair_tasks() {
static thread_local auto schema = [] {
auto id = generate_legacy_id(NAME, REPAIR_TASKS);
return schema_builder(NAME, REPAIR_TASKS, std::optional(id))
.with_column("task_uuid", uuid_type, column_kind::partition_key)
.with_column("operation", utf8_type, column_kind::clustering_key)
// First and last token of the tablet
.with_column("first_token", long_type, column_kind::clustering_key)
.with_column("last_token", long_type, column_kind::clustering_key)
.with_column("timestamp", timestamp_type)
.with_column("table_uuid", uuid_type, column_kind::static_column)
.set_comment("Record tablet repair tasks")
.with_hash_version()
.build();
}();
return schema;
}
schema_ptr system_keyspace::built_indexes() {
static thread_local auto built_indexes = [] {
schema_builder builder(generate_legacy_id(NAME, BUILT_INDEXES), NAME, BUILT_INDEXES,
@@ -1667,7 +1672,7 @@ schema_ptr system_keyspace::view_building_tasks() {
.with_column("key", utf8_type, column_kind::partition_key)
.with_column("id", timeuuid_type, column_kind::clustering_key)
.with_column("type", utf8_type)
.with_column("state", utf8_type)
.with_column("aborted", boolean_type)
.with_column("base_id", uuid_type)
.with_column("view_id", uuid_type)
.with_column("last_token", long_type)
@@ -2463,14 +2468,14 @@ future<bool> system_keyspace::cdc_is_rewritten() {
}
future<> system_keyspace::read_cdc_streams_state(std::optional<table_id> table,
noncopyable_function<future<>(table_id, db_clock::time_point, std::vector<cdc::stream_id>)> f) {
noncopyable_function<future<>(table_id, db_clock::time_point, utils::chunked_vector<cdc::stream_id>)> f) {
static const sstring all_tables_query = format("SELECT table_id, timestamp, stream_id FROM {}.{}", NAME, CDC_STREAMS_STATE);
static const sstring single_table_query = format("SELECT table_id, timestamp, stream_id FROM {}.{} WHERE table_id = ?", NAME, CDC_STREAMS_STATE);
struct cur_t {
table_id tid;
db_clock::time_point ts;
std::vector<cdc::stream_id> streams;
utils::chunked_vector<cdc::stream_id> streams;
};
std::optional<cur_t> cur;
@@ -2487,7 +2492,7 @@ future<> system_keyspace::read_cdc_streams_state(std::optional<table_id> table,
if (cur) {
co_await f(cur->tid, cur->ts, std::move(cur->streams));
}
cur = { tid, ts, std::vector<cdc::stream_id>() };
cur = { tid, ts, utils::chunked_vector<cdc::stream_id>() };
}
cur->streams.push_back(std::move(stream_id));
@@ -2499,9 +2504,10 @@ future<> system_keyspace::read_cdc_streams_state(std::optional<table_id> table,
}
}
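
The `read_cdc_streams_state` hunks above accumulate rows into a `cur` batch and hand the batch to the callback before starting a new one. A minimal, synchronous sketch of that grouping pattern follows. The `row` and `batch` types, plain `std::vector`, and the "key changed" test are assumptions for illustration; the condition that starts a new batch is not visible in the diff.

```cpp
#include <cstdint>
#include <functional>
#include <optional>
#include <utility>
#include <vector>

// Hypothetical flattened row: one stream id per (table, timestamp) pair.
struct row {
    int table;
    std::int64_t ts;
    int stream;
};

// Hypothetical batch handed to the consumer, mirroring `cur_t` in the diff.
struct batch {
    int table;
    std::int64_t ts;
    std::vector<int> streams;
};

// Groups consecutive rows sharing (table, ts) and emits each group once.
// Assumes rows arrive ordered by (table, ts), as a partition scan would return them.
void for_each_batch(const std::vector<row>& rows, const std::function<void(batch)>& f) {
    std::optional<batch> cur;
    for (const row& r : rows) {
        if (!cur || cur->table != r.table || cur->ts != r.ts) {
            if (cur) {
                f(std::move(*cur)); // flush the previous group
            }
            cur = batch{r.table, r.ts, {}};
        }
        cur->streams.push_back(r.stream);
    }
    if (cur) {
        f(std::move(*cur)); // flush the trailing group
    }
}
```

Moving the whole batch into the callback avoids copying the accumulated stream ids, which matters when a batch is large.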
future<> system_keyspace::read_cdc_streams_history(table_id table,
future<> system_keyspace::read_cdc_streams_history(table_id table, std::optional<db_clock::time_point> from,
noncopyable_function<future<>(table_id, db_clock::time_point, cdc::cdc_stream_diff)> f) {
static const sstring query = format("SELECT table_id, timestamp, stream_state, stream_id FROM {}.{} WHERE table_id = ?", NAME, CDC_STREAMS_HISTORY);
static const sstring query_all = format("SELECT table_id, timestamp, stream_state, stream_id FROM {}.{} WHERE table_id = ?", NAME, CDC_STREAMS_HISTORY);
static const sstring query_from = format("SELECT table_id, timestamp, stream_state, stream_id FROM {}.{} WHERE table_id = ? AND timestamp > ?", NAME, CDC_STREAMS_HISTORY);
struct cur_t {
table_id tid;
@@ -2510,7 +2516,11 @@ future<> system_keyspace::read_cdc_streams_history(table_id table,
};
std::optional<cur_t> cur;
co_await _qp.query_internal(query, db::consistency_level::ONE, {table.uuid()}, 1000, [&] (const cql3::untyped_result_set_row& row) -> future<stop_iteration> {
co_await _qp.query_internal(from ? query_from : query_all,
db::consistency_level::ONE,
from ? data_value_list{table.uuid(), *from} : data_value_list{table.uuid()},
1000,
[&] (const cql3::untyped_result_set_row& row) -> future<stop_iteration> {
auto tid = table_id(row.get_as<utils::UUID>("table_id"));
auto ts = row.get_as<db_clock::time_point>("timestamp");
auto stream_state = cdc::read_stream_state(row.get_as<int8_t>("stream_state"));
@@ -2594,6 +2604,7 @@ std::vector<schema_ptr> system_keyspace::all_tables(const db::config& cfg) {
corrupt_data(),
scylla_local(), db::schema_tables::scylla_table_schema_history(),
repair_history(),
repair_tasks(),
v3::views_builds_in_progress(), v3::built_views(),
v3::scylla_views_builds_in_progress(),
v3::truncated(),
@@ -2842,6 +2853,32 @@ future<> system_keyspace::get_repair_history(::table_id table_id, repair_history
});
}
future<utils::chunked_vector<canonical_mutation>> system_keyspace::get_update_repair_task_mutations(const repair_task_entry& entry, api::timestamp_type ts) {
// By default, expire repair task entries after 10 days; this should be enough time for the management tools to query them
constexpr int ttl = 10 * 24 * 3600;
sstring req = format("INSERT INTO system.{} (task_uuid, operation, first_token, last_token, timestamp, table_uuid) VALUES (?, ?, ?, ?, ?, ?) USING TTL {}", REPAIR_TASKS, ttl);
auto muts = co_await _qp.get_mutations_internal(req, internal_system_query_state(), ts,
{entry.task_uuid.uuid(), repair_task_operation_to_string(entry.operation),
entry.first_token, entry.last_token, entry.timestamp, entry.table_uuid.uuid()});
utils::chunked_vector<canonical_mutation> cmuts(muts.begin(), muts.end());
co_return cmuts;
}
future<> system_keyspace::get_repair_task(tasks::task_id task_uuid, repair_task_consumer f) {
sstring req = format("SELECT * from system.{} WHERE task_uuid = {}", REPAIR_TASKS, task_uuid);
co_await _qp.query_internal(req, [&f] (const cql3::untyped_result_set::row& row) mutable -> future<stop_iteration> {
repair_task_entry ent;
ent.task_uuid = tasks::task_id(row.get_as<utils::UUID>("task_uuid"));
ent.operation = repair_task_operation_from_string(row.get_as<sstring>("operation"));
ent.first_token = row.get_as<int64_t>("first_token");
ent.last_token = row.get_as<int64_t>("last_token");
ent.timestamp = row.get_as<db_clock::time_point>("timestamp");
ent.table_uuid = ::table_id(row.get_as<utils::UUID>("table_uuid"));
co_await f(std::move(ent));
co_return stop_iteration::no;
});
}
future<gms::generation_type> system_keyspace::increment_and_get_generation() {
auto req = format("SELECT gossip_generation FROM system.{} WHERE key='{}'", LOCAL, LOCAL);
auto rs = co_await _qp.execute_internal(req, cql3::query_processor::cache_internal::yes);
@@ -3057,14 +3094,14 @@ future<mutation> system_keyspace::make_remove_view_build_status_on_host_mutation
static constexpr auto VIEW_BUILDING_KEY = "view_building";
future<db::view::building_tasks> system_keyspace::get_view_building_tasks() {
static const sstring query = format("SELECT id, type, state, base_id, view_id, last_token, host_id, shard FROM {}.{} WHERE key = '{}'", NAME, VIEW_BUILDING_TASKS, VIEW_BUILDING_KEY);
static const sstring query = format("SELECT id, type, aborted, base_id, view_id, last_token, host_id, shard FROM {}.{} WHERE key = '{}'", NAME, VIEW_BUILDING_TASKS, VIEW_BUILDING_KEY);
using namespace db::view;
building_tasks tasks;
co_await _qp.query_internal(query, [&] (const cql3::untyped_result_set_row& row) -> future<stop_iteration> {
auto id = row.get_as<utils::UUID>("id");
auto type = task_type_from_string(row.get_as<sstring>("type"));
auto state = task_state_from_string(row.get_as<sstring>("state"));
auto aborted = row.get_as<bool>("aborted");
auto base_id = table_id(row.get_as<utils::UUID>("base_id"));
auto view_id = row.get_opt<utils::UUID>("view_id").transform([] (const utils::UUID& uuid) { return table_id(uuid); });
auto last_token = dht::token::from_int64(row.get_as<int64_t>("last_token"));
@@ -3072,7 +3109,7 @@ future<db::view::building_tasks> system_keyspace::get_view_building_tasks() {
auto shard = unsigned(row.get_as<int32_t>("shard"));
locator::tablet_replica replica{host_id, shard};
view_building_task task{id, type, state, base_id, view_id, replica, last_token};
view_building_task task{id, type, aborted, base_id, view_id, replica, last_token};
switch (type) {
case db::view::view_building_task::task_type::build_range:
@@ -3091,7 +3128,7 @@ future<db::view::building_tasks> system_keyspace::get_view_building_tasks() {
}
future<mutation> system_keyspace::make_view_building_task_mutation(api::timestamp_type ts, const db::view::view_building_task& task) {
static const sstring stmt = format("INSERT INTO {}.{}(key, id, type, state, base_id, view_id, last_token, host_id, shard) VALUES ('{}', ?, ?, ?, ?, ?, ?, ?, ?)", NAME, VIEW_BUILDING_TASKS, VIEW_BUILDING_KEY);
static const sstring stmt = format("INSERT INTO {}.{}(key, id, type, aborted, base_id, view_id, last_token, host_id, shard) VALUES ('{}', ?, ?, ?, ?, ?, ?, ?, ?)", NAME, VIEW_BUILDING_TASKS, VIEW_BUILDING_KEY);
using namespace db::view;
data_value_or_unset view_id = unset_value{};
@@ -3102,7 +3139,7 @@ future<mutation> system_keyspace::make_view_building_task_mutation(api::timestam
view_id = data_value(task.view_id->uuid());
}
auto muts = co_await _qp.get_mutations_internal(stmt, internal_system_query_state(), ts, {
task.id, task_type_to_sstring(task.type), task_state_to_sstring(task.state),
task.id, task_type_to_sstring(task.type), task.aborted,
task.base_id.uuid(), view_id, dht::token::to_int64(task.last_token),
task.replica.host.uuid(), int32_t(task.replica.shard)
});
@@ -3112,18 +3149,6 @@ future<mutation> system_keyspace::make_view_building_task_mutation(api::timestam
co_return std::move(muts[0]);
}
future<mutation> system_keyspace::make_update_view_building_task_state_mutation(api::timestamp_type ts, utils::UUID id, db::view::view_building_task::task_state state) {
static const sstring stmt = format("UPDATE {}.{} SET state = ? WHERE key = '{}' AND id = ?", NAME, VIEW_BUILDING_TASKS, VIEW_BUILDING_KEY);
auto muts = co_await _qp.get_mutations_internal(stmt, internal_system_query_state(), ts, {
task_state_to_sstring(state), id
});
if (muts.size() != 1) {
on_internal_error(slogger, fmt::format("expected 1 mutation got {}", muts.size()));
}
co_return std::move(muts[0]);
}
future<mutation> system_keyspace::make_remove_view_building_task_mutation(api::timestamp_type ts, utils::UUID id) {
static const sstring stmt = format("DELETE FROM {}.{} WHERE key = '{}' AND id = ?", NAME, VIEW_BUILDING_TASKS, VIEW_BUILDING_KEY);
@@ -3255,7 +3280,9 @@ future<mutation> system_keyspace::get_group0_history(sharded<replica::database>&
SCYLLA_ASSERT(rs);
auto& ps = rs->partitions();
for (auto& p: ps) {
auto mut = p.mut().unfreeze(s);
// Note: we could decorate the frozen_mutation's key to check if it's the expected one
// but since this is a single partition table, we can just check after unfreezing the whole mutation.
auto mut = co_await unfreeze_gently(p.mut(), s);
auto partition_key = value_cast<sstring>(utf8_type->deserialize(mut.key().get_component(*s, 0)));
if (partition_key == GROUP0_HISTORY_KEY) {
co_return mut;
@@ -3479,7 +3506,7 @@ future<service::topology> system_keyspace::load_topology_state(const std::unorde
supported_features = decode_features(deserialize_set_column(*topology(), row, "supported_features"));
}
if (row.has("topology_request")) {
if (row.has("topology_request") && nstate != service::node_state::left) {
auto req = service::topology_request_from_string(row.get_as<sstring>("topology_request"));
ret.requests.emplace(host_id, req);
switch(req) {
@@ -4000,4 +4027,35 @@ future<> system_keyspace::apply_mutation(mutation m) {
return _qp.proxy().mutate_locally(m, {}, db::commitlog::force_sync(m.schema()->static_props().wait_for_sync_to_commitlog), db::no_timeout);
}
// The names are persisted in system tables so should not be changed.
static const std::unordered_map<system_keyspace::repair_task_operation, sstring> repair_task_operation_to_name = {
{system_keyspace::repair_task_operation::requested, "requested"},
{system_keyspace::repair_task_operation::finished, "finished"},
};
static const std::unordered_map<sstring, system_keyspace::repair_task_operation> repair_task_operation_from_name = std::invoke([] {
std::unordered_map<sstring, system_keyspace::repair_task_operation> result;
for (auto&& [v, s] : repair_task_operation_to_name) {
result.emplace(s, v);
}
return result;
});
sstring system_keyspace::repair_task_operation_to_string(system_keyspace::repair_task_operation op) {
auto i = repair_task_operation_to_name.find(op);
if (i == repair_task_operation_to_name.end()) {
on_internal_error(slogger, format("Invalid repair task operation: {}", static_cast<int>(op)));
}
return i->second;
}
system_keyspace::repair_task_operation system_keyspace::repair_task_operation_from_string(const sstring& name) {
return repair_task_operation_from_name.at(name);
}
} // namespace db
auto fmt::formatter<db::system_keyspace::repair_task_operation>::format(const db::system_keyspace::repair_task_operation& op, fmt::format_context& ctx) const
-> decltype(ctx.out()) {
return fmt::format_to(ctx.out(), "{}", db::system_keyspace::repair_task_operation_to_string(op));
}
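
The `repair_task_operation` name tables above build the reverse map from the forward one, so the persisted names are written down only once. A self-contained sketch of the same pattern is shown here. `repair_op`, `op_to_string` and `op_from_string` are hypothetical stand-ins, and `std::string` and `std::logic_error` replace `sstring` and `on_internal_error`.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for system_keyspace::repair_task_operation.
enum class repair_op { requested, finished };

// Forward table: the only place where the persisted names are spelled out.
static const std::unordered_map<repair_op, std::string> op_to_name = {
    {repair_op::requested, "requested"},
    {repair_op::finished, "finished"},
};

// Reverse table, derived from the forward one so the two cannot drift apart.
static const std::unordered_map<std::string, repair_op> name_to_op = [] {
    std::unordered_map<std::string, repair_op> result;
    for (const auto& [op, name] : op_to_name) {
        result.emplace(name, op);
    }
    return result;
}();

std::string op_to_string(repair_op op) {
    auto it = op_to_name.find(op);
    if (it == op_to_name.end()) {
        throw std::logic_error("invalid repair operation");
    }
    return it->second;
}

repair_op op_from_string(const std::string& name) {
    // at() throws std::out_of_range for a name that was never registered.
    return name_to_op.at(name);
}

// Small self-check: every registered operation survives a round trip.
void check_round_trip() {
    for (const auto& [op, name] : op_to_name) {
        assert(op_from_string(op_to_string(op)) == op);
    }
}
```

Note that the lookup by name throws on unknown input, so a caller reading names back from storage may want to handle `std::out_of_range` explicitly.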

@@ -57,6 +57,8 @@ namespace paxos {
struct topology_request_state;
class group0_guard;
class raft_group0_client;
}
namespace netw {
@@ -184,6 +186,7 @@ public:
static constexpr auto RAFT_SNAPSHOTS = "raft_snapshots";
static constexpr auto RAFT_SNAPSHOT_CONFIG = "raft_snapshot_config";
static constexpr auto REPAIR_HISTORY = "repair_history";
static constexpr auto REPAIR_TASKS = "repair_tasks";
static constexpr auto GROUP0_HISTORY = "group0_history";
static constexpr auto DISCOVERY = "discovery";
static constexpr auto BROADCAST_KV_STORE = "broadcast_kv_store";
@@ -198,6 +201,15 @@ public:
static constexpr auto VIEW_BUILD_STATUS_V2 = "view_build_status_v2";
static constexpr auto DICTS = "dicts";
static constexpr auto VIEW_BUILDING_TASKS = "view_building_tasks";
static constexpr auto VERSIONS = "versions";
static constexpr auto BATCHES = "batches";
static constexpr auto AVAILABLE_RANGES = "available_ranges";
static constexpr auto VIEWS_BUILDS_IN_PROGRESS = "views_builds_in_progress";
static constexpr auto BUILT_VIEWS = "built_views";
static constexpr auto SCYLLA_VIEWS_BUILDS_IN_PROGRESS = "scylla_views_builds_in_progress";
static constexpr auto CDC_LOCAL = "cdc_local";
static constexpr auto CDC_TIMESTAMPS = "cdc_timestamps";
static constexpr auto CDC_STREAMS = "cdc_streams";
// auth
static constexpr auto ROLES = "roles";
@@ -282,6 +294,7 @@ public:
static schema_ptr raft();
static schema_ptr raft_snapshots();
static schema_ptr repair_history();
static schema_ptr repair_tasks();
static schema_ptr group0_history();
static schema_ptr discovery();
static schema_ptr broadcast_kv_store();
@@ -420,6 +433,22 @@ public:
int64_t range_end;
};
enum class repair_task_operation {
requested,
finished,
};
static sstring repair_task_operation_to_string(repair_task_operation op);
static repair_task_operation repair_task_operation_from_string(const sstring& name);
struct repair_task_entry {
tasks::task_id task_uuid;
repair_task_operation operation;
int64_t first_token;
int64_t last_token;
db_clock::time_point timestamp;
table_id table_uuid;
};
struct topology_requests_entry {
utils::UUID id;
utils::UUID initiating_host;
@@ -441,6 +470,10 @@ public:
using repair_history_consumer = noncopyable_function<future<>(const repair_history_entry&)>;
future<> get_repair_history(table_id, repair_history_consumer f);
future<utils::chunked_vector<canonical_mutation>> get_update_repair_task_mutations(const repair_task_entry& entry, api::timestamp_type ts);
using repair_task_consumer = noncopyable_function<future<>(const repair_task_entry&)>;
future<> get_repair_task(tasks::task_id task_uuid, repair_task_consumer f);
future<> save_truncation_record(const replica::column_family&, db_clock::time_point truncated_at, db::replay_position);
future<replay_positions> get_truncated_positions(table_id);
future<> drop_truncation_rp_records();
@@ -576,7 +609,6 @@ public:
// system.view_building_tasks
future<db::view::building_tasks> get_view_building_tasks();
future<mutation> make_view_building_task_mutation(api::timestamp_type ts, const db::view::view_building_task& task);
future<mutation> make_update_view_building_task_state_mutation(api::timestamp_type ts, utils::UUID id, db::view::view_building_task::task_state state);
future<mutation> make_remove_view_building_task_mutation(api::timestamp_type ts, utils::UUID id);
// system.scylla_local, view_building_processing_base key
@@ -601,8 +633,8 @@ public:
future<bool> cdc_is_rewritten();
future<> cdc_set_rewritten(std::optional<cdc::generation_id_v1>);
future<> read_cdc_streams_state(std::optional<table_id> table, noncopyable_function<future<>(table_id, db_clock::time_point, std::vector<cdc::stream_id>)> f);
future<> read_cdc_streams_history(table_id table, noncopyable_function<future<>(table_id, db_clock::time_point, cdc::cdc_stream_diff)> f);
future<> read_cdc_streams_state(std::optional<table_id> table, noncopyable_function<future<>(table_id, db_clock::time_point, utils::chunked_vector<cdc::stream_id>)> f);
future<> read_cdc_streams_history(table_id table, std::optional<db_clock::time_point> from, noncopyable_function<future<>(table_id, db_clock::time_point, cdc::cdc_stream_diff)> f);
// Load Raft Group 0 id from scylla.local
future<utils::UUID> get_raft_group0_id();
@@ -746,3 +778,8 @@ public:
}; // class system_keyspace
} // namespace db
template <>
struct fmt::formatter<db::system_keyspace::repair_task_operation> : fmt::formatter<string_view> {
auto format(const db::system_keyspace::repair_task_operation&, fmt::format_context& ctx) const -> decltype(ctx.out());
};

@@ -26,6 +26,7 @@
#include <seastar/coroutine/maybe_yield.hh>
#include <flat_map>
#include "db/config.hh"
#include "db/view/base_info.hh"
#include "db/view/view_build_status.hh"
#include "db/view/view_consumer.hh"
@@ -929,8 +930,7 @@ bool view_updates::can_skip_view_updates(const clustering_or_static_row& update,
const row& existing_row = existing.cells();
const row& updated_row = update.cells();
const bool base_has_nonexpiring_marker = update.marker().is_live() && !update.marker().is_expiring();
return std::ranges::all_of(_base->regular_columns(), [this, &updated_row, &existing_row, base_has_nonexpiring_marker] (const column_definition& cdef) {
return std::ranges::all_of(_base->regular_columns(), [this, &updated_row, &existing_row] (const column_definition& cdef) {
const auto view_it = _view->columns_by_name().find(cdef.name());
const bool column_is_selected = view_it != _view->columns_by_name().end();
@@ -938,49 +938,29 @@ bool view_updates::can_skip_view_updates(const clustering_or_static_row& update,
// as part of its PK, there are NO virtual columns corresponding to the unselected columns in the view.
// Because of that, we don't generate view updates when the value in an unselected column is created
// or changes.
if (!column_is_selected && _base_info.has_base_non_pk_columns_in_view_pk) {
if (!column_is_selected) {
return true;
}
//TODO(sarna): Optimize collections case - currently they do not go under optimization
if (!cdef.is_atomic()) {
return false;
}
// We cannot skip if the value was created or deleted, unless we have a non-expiring marker
// We cannot skip if the value was created or deleted
const auto* existing_cell = existing_row.find_cell(cdef.id);
const auto* updated_cell = updated_row.find_cell(cdef.id);
if (existing_cell == nullptr || updated_cell == nullptr) {
return existing_cell == updated_cell || (!column_is_selected && base_has_nonexpiring_marker);
return existing_cell == updated_cell;
}
if (!cdef.is_atomic()) {
return existing_cell->as_collection_mutation().data == updated_cell->as_collection_mutation().data;
}
atomic_cell_view existing_cell_view = existing_cell->as_atomic_cell(cdef);
atomic_cell_view updated_cell_view = updated_cell->as_atomic_cell(cdef);
// We cannot skip when a selected column is changed
if (column_is_selected) {
if (view_it->second->is_view_virtual()) {
return atomic_cells_liveness_equal(existing_cell_view, updated_cell_view);
}
return compare_atomic_cell_for_merge(existing_cell_view, updated_cell_view) == 0;
if (view_it->second->is_view_virtual()) {
return atomic_cells_liveness_equal(existing_cell_view, updated_cell_view);
}
// With non-expiring row marker, liveness checks below are not relevant
if (base_has_nonexpiring_marker) {
return true;
}
if (existing_cell_view.is_live() != updated_cell_view.is_live()) {
return false;
}
// We cannot skip if the change updates TTL
const bool existing_has_ttl = existing_cell_view.is_live_and_has_ttl();
const bool updated_has_ttl = updated_cell_view.is_live_and_has_ttl();
if (existing_has_ttl || updated_has_ttl) {
return existing_has_ttl == updated_has_ttl && existing_cell_view.expiry() == updated_cell_view.expiry();
}
return true;
return compare_atomic_cell_for_merge(existing_cell_view, updated_cell_view) == 0;
});
}
@@ -3305,15 +3285,6 @@ public:
_step.base->schema()->cf_name(), _step.current_token(), view_names);
}
if (_step.reader.is_end_of_stream() && _step.reader.is_buffer_empty()) {
if (_step.current_key.key().is_empty()) {
// consumer got end-of-stream without consuming a single partition
vlogger.debug("Reader didn't produce anything, marking views as built");
while (!_step.build_status.empty()) {
_built_views.views.push_back(std::move(_step.build_status.back()));
_step.build_status.pop_back();
}
}
// before going back to the minimum token, advance current_key to the end
// and check for built views in that range.
_step.current_key = { _step.prange.end().value_or(dht::ring_position::max()).value().token(), partition_key::make_empty()};
@@ -3332,6 +3303,7 @@ public:
// Called in the context of a seastar::thread.
void view_builder::execute(build_step& step, exponential_backoff_retry r) {
inject_failure("dont_start_build_step");
gc_clock::time_point now = gc_clock::now();
auto compaction_state = make_lw_shared<compact_for_query_state>(
*step.reader.schema(),
@@ -3365,6 +3337,7 @@ void view_builder::execute(build_step& step, exponential_backoff_retry r) {
seastar::when_all_succeed(bookkeeping_ops.begin(), bookkeeping_ops.end()).handle_exception([] (std::exception_ptr ep) {
vlogger.warn("Failed to update materialized view bookkeeping ({}), continuing anyway.", ep);
}).get();
utils::get_local_injector().inject("delay_finishing_build_step", utils::wait_for_message(60s)).get();
}
future<> view_builder::mark_as_built(view_ptr view) {
@@ -3715,5 +3688,22 @@ sstring build_status_to_sstring(build_status status) {
on_internal_error(vlogger, fmt::format("Unknown view build status: {}", (int)status));
}
void validate_view_keyspace(const data_dictionary::database& db, std::string_view keyspace_name) {
const bool tablet_views_enabled = db.features().views_with_tablets;
// Note: if the configuration option `rf_rack_valid_keyspaces` is enabled, we can be
// sure that all tablet-based keyspaces are RF-rack-valid. We check that
// at start-up and then we don't allow for creating RF-rack-invalid keyspaces.
const bool rf_rack_valid_keyspaces = db.get_config().rf_rack_valid_keyspaces();
const bool required_config = tablet_views_enabled && rf_rack_valid_keyspaces;
const bool uses_tablets = db.find_keyspace(keyspace_name).get_replication_strategy().uses_tablets();
if (!required_config && uses_tablets) {
throw std::logic_error("Materialized views and secondary indexes are not supported on base tables with tablets. "
"To be able to use them, enable the configuration option `rf_rack_valid_keyspaces` and make sure "
"that the cluster feature `VIEWS_WITH_TABLETS` is enabled.");
}
}
} // namespace view
} // namespace db
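
The eligibility rule in `validate_view_keyspace` above reduces to a small boolean condition: a tablet-based keyspace is accepted only when both the cluster feature and the configuration option are enabled. The sketch below isolates that condition. The `keyspace_view_inputs` struct and the function names are hypothetical; in the diff the three inputs come from `db.features()`, `db.get_config()` and the keyspace's replication strategy.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical flattened view of the three inputs the real function reads.
struct keyspace_view_inputs {
    bool views_with_tablets_feature; // cluster feature VIEWS_WITH_TABLETS
    bool rf_rack_valid_keyspaces;    // configuration option
    bool uses_tablets;               // replication strategy of the keyspace
};

// Vnode-based keyspaces are always eligible; tablet-based ones need both switches.
bool view_keyspace_is_eligible(const keyspace_view_inputs& in) {
    const bool required_config = in.views_with_tablets_feature && in.rf_rack_valid_keyspaces;
    return required_config || !in.uses_tablets;
}

// Throwing wrapper, mirroring the shape of the function in the diff.
void validate_view_keyspace_sketch(const keyspace_view_inputs& in) {
    if (!view_keyspace_is_eligible(in)) {
        throw std::logic_error("views are not supported on this tablet-based keyspace");
    }
}
```

Splitting the predicate from the throwing wrapper is a choice made for this sketch only; it makes the truth table easy to check in isolation.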

@@ -309,6 +309,18 @@ endpoints_to_update get_view_natural_endpoint(
bool use_tablets_basic_rack_aware_view_pairing,
replica::cf_stats& cf_stats);
/// Verify that the provided keyspace is eligible for storing materialized views.
///
/// Result:
/// * If the keyspace is eligible, no effect.
/// * If the keyspace is not eligible, an exception is thrown. Its type is not specified,
/// and the user of this function cannot make any assumption about it. The carried exception
/// message will be worded in a way that can be directly passed on to the end user.
///
/// Preconditions:
/// * The provided `keyspace_name` must correspond to an existing keyspace.
void validate_view_keyspace(const data_dictionary::database&, std::string_view keyspace_name);
}
}

@@ -29,6 +29,10 @@
#include "db/view/view_building_task_mutation_builder.hh"
#include "utils/assert.hh"
#include "idl/view.dist.hh"
#include "utils/error_injection.hh"
#include "utils/log.hh"
using namespace std::chrono_literals;
static logging::logger vbc_logger("view_building_coordinator");
@@ -102,6 +106,8 @@ future<> view_building_coordinator::run() {
_vb_sm.event.broadcast();
});
auto finished_tasks_gc_fiber = finished_task_gc_fiber();
while (!_as.abort_requested()) {
co_await utils::get_local_injector().inject("view_building_coordinator_pause_main_loop", utils::wait_for_message(std::chrono::minutes(2)));
if (utils::get_local_injector().enter("view_building_coordinator_skip_main_loop")) {
@@ -119,12 +125,7 @@ future<> view_building_coordinator::run() {
continue;
}
auto started_new_work = co_await work_on_view_building(std::move(*guard_opt));
if (started_new_work) {
// If any tasks were started, do another iteration, so the coordinator can attach itself to the tasks (via RPC)
vbc_logger.debug("view building coordinator started new tasks, do next iteration without waiting for event");
continue;
}
co_await work_on_view_building(std::move(*guard_opt));
co_await await_event();
} catch (...) {
handle_coordinator_error(std::current_exception());
@@ -140,6 +141,66 @@ future<> view_building_coordinator::run() {
}
}
}
co_await std::move(finished_tasks_gc_fiber);
}
future<> view_building_coordinator::finished_task_gc_fiber() {
static auto task_gc_interval = 200ms;
while (!_as.abort_requested()) {
try {
co_await clean_finished_tasks();
co_await sleep_abortable(task_gc_interval, _as);
} catch (abort_requested_exception&) {
vbc_logger.debug("view_building_coordinator::finished_task_gc_fiber got abort_requested_exception");
} catch (service::group0_concurrent_modification&) {
vbc_logger.info("view_building_coordinator::finished_task_gc_fiber got group0_concurrent_modification");
} catch (raft::request_aborted&) {
vbc_logger.debug("view_building_coordinator::finished_task_gc_fiber got raft::request_aborted");
} catch (service::term_changed_error&) {
vbc_logger.debug("view_building_coordinator::finished_task_gc_fiber notices term change {} -> {}", _term, _raft.get_current_term());
} catch (raft::commit_status_unknown&) {
vbc_logger.warn("view_building_coordinator::finished_task_gc_fiber got raft::commit_status_unknown");
} catch (...) {
vbc_logger.error("view_building_coordinator::finished_task_gc_fiber got error: {}", std::current_exception());
}
}
}
future<> view_building_coordinator::clean_finished_tasks() {
// Avoid acquiring a group0 operation if there are no tasks.
if (_finished_tasks.empty()) {
co_return;
}
auto guard = co_await start_operation();
auto lock = co_await get_unique_lock(_mutex);
if (!_vb_sm.building_state.currently_processed_base_table || std::ranges::all_of(_finished_tasks, [] (auto& e) { return e.second.empty(); })) {
co_return;
}
view_building_task_mutation_builder builder(guard.write_timestamp());
for (auto& [replica, tasks]: _finished_tasks) {
for (auto& task_id: tasks) {
// The task might be aborted in the meantime. In this case we cannot remove it because we need it to create a new task.
//
// TODO: When we're aborting a view building task (for instance due to tablet migration),
// we can look if we already finished it (check if it's in `_finished_tasks`).
// If yes, we can just remove it instead of aborting it.
auto task_opt = _vb_sm.building_state.get_task(*_vb_sm.building_state.currently_processed_base_table, replica, task_id);
if (task_opt && !task_opt->get().aborted) {
builder.del_task(task_id);
vbc_logger.debug("Removing finished task with ID: {}", task_id);
}
}
}
co_await commit_mutations(std::move(guard), {builder.build()}, "remove finished view building tasks");
for (auto& [_, tasks_set]: _finished_tasks) {
tasks_set.clear();
}
}
future<std::optional<service::group0_guard>> view_building_coordinator::update_state(service::group0_guard guard) {
@@ -299,18 +360,16 @@ future<> view_building_coordinator::update_views_statuses(const service::group0_
}
}
future<bool> view_building_coordinator::work_on_view_building(service::group0_guard guard) {
future<> view_building_coordinator::work_on_view_building(service::group0_guard guard) {
if (!_vb_sm.building_state.currently_processed_base_table) {
vbc_logger.debug("No base table is selected, nothing to do.");
co_return false;
co_return;
}
utils::chunked_vector<mutation> muts;
std::unordered_set<locator::tablet_replica> _remote_work_keys_to_erase;
// Acquire unique lock of `_finished_tasks` to ensure each replica has its own entry in it
// and to select tasks for them.
auto lock = co_await get_unique_lock(_mutex);
for (auto& replica: get_replicas_with_tasks()) {
// Check whether the coordinator already waits for the remote work on the replica to be finished.
// If so: check if the work is done and remove the shared_future, skip this replica otherwise.
bool skip_work_on_this_replica = false;
if (_remote_work.contains(replica)) {
if (!_remote_work[replica].available()) {
vbc_logger.debug("Replica {} is still doing work", replica);
@@ -318,51 +377,25 @@ future<bool> view_building_coordinator::work_on_view_building(service::group0_gu
}
auto remote_results_opt = co_await _remote_work[replica].get_future();
if (remote_results_opt) {
auto results_muts = co_await update_state_after_work_is_done(guard, replica, std::move(*remote_results_opt));
muts.insert(muts.end(), std::make_move_iterator(results_muts.begin()), std::make_move_iterator(results_muts.end()));
// If the replica successfully finished its work, we need to commit mutations generated above before selecting next task
skip_work_on_this_replica = !results_muts.empty();
}
// If there were no mutations for this replica, we can just remove the entry from `_remote_work` map
// and start new work in the same iteration.
// Otherwise, the entry needs to be removed after the mutations are committed successfully.
if (skip_work_on_this_replica) {
_remote_work_keys_to_erase.insert(replica);
} else {
_remote_work.erase(replica);
}
_remote_work.erase(replica);
}
if (!_gossiper.is_alive(replica.host)) {
const bool ignore_gossiper = utils::get_local_injector().enter("view_building_coordinator_ignore_gossiper");
if (!_gossiper.is_alive(replica.host) && !ignore_gossiper) {
vbc_logger.debug("Replica {} is dead", replica);
continue;
}
if (skip_work_on_this_replica) {
continue;
if (!_finished_tasks.contains(replica)) {
_finished_tasks.insert({replica, {}});
}
if (auto already_started_ids = _vb_sm.building_state.get_started_tasks(*_vb_sm.building_state.currently_processed_base_table, replica); !already_started_ids.empty()) {
// If the replica has any task in `STARTED` state, attach the coordinator to the work.
attach_to_started_tasks(replica, std::move(already_started_ids));
} else if (auto todo_ids = select_tasks_for_replica(replica); !todo_ids.empty()) {
// If the replica has no started tasks and there are tasks to do, mark them as started.
// The coordinator will attach itself to the work in next iteration.
auto new_mutations = co_await start_tasks(guard, std::move(todo_ids));
muts.insert(muts.end(), std::make_move_iterator(new_mutations.begin()), std::make_move_iterator(new_mutations.end()));
if (auto todo_ids = select_tasks_for_replica(replica); !todo_ids.empty()) {
start_remote_worker(replica, std::move(todo_ids));
} else {
vbc_logger.debug("Nothing to do for replica {}", replica);
}
}
if (!muts.empty()) {
co_await commit_mutations(std::move(guard), std::move(muts), "start view building tasks");
for (auto& key: _remote_work_keys_to_erase) {
_remote_work.erase(key);
}
co_return true;
}
co_return false;
}
std::set<locator::tablet_replica> view_building_coordinator::get_replicas_with_tasks() {
@@ -385,7 +418,7 @@ std::vector<utils::UUID> view_building_coordinator::select_tasks_for_replica(loc
// Select only building tasks and return their IDs
auto filter_building_tasks = [] (const std::vector<view_building_task>& tasks) -> std::vector<utils::UUID> {
return tasks | std::views::filter([] (const view_building_task& t) {
return t.type == view_building_task::task_type::build_range && t.state == view_building_task::task_state::idle;
return t.type == view_building_task::task_type::build_range && !t.aborted;
}) | std::views::transform([] (const view_building_task& t) {
return t.id;
}) | std::ranges::to<std::vector>();
@@ -399,7 +432,29 @@ std::vector<utils::UUID> view_building_coordinator::select_tasks_for_replica(loc
}
auto& tablet_map = _db.get_token_metadata().tablets().get_tablet_map(*_vb_sm.building_state.currently_processed_base_table);
for (auto& [token, tasks]: _vb_sm.building_state.collect_tasks_by_last_token(*_vb_sm.building_state.currently_processed_base_table, replica)) {
auto tasks_by_last_token = _vb_sm.building_state.collect_tasks_by_last_token(*_vb_sm.building_state.currently_processed_base_table, replica);
// Remove completed tasks in `_finished_tasks` from `tasks_by_last_token`
auto it = tasks_by_last_token.begin();
while (it != tasks_by_last_token.end()) {
auto task_it = it->second.begin();
while (task_it != it->second.end()) {
if (_finished_tasks.at(replica).contains(task_it->id)) {
task_it = it->second.erase(task_it);
} else {
++task_it;
}
}
// Remove the entry from `tasks_by_last_token` if its vector is empty
if (it->second.empty()) {
it = tasks_by_last_token.erase(it);
} else {
++it;
}
}
for (auto& [token, tasks]: tasks_by_last_token) {
auto tid = tablet_map.get_tablet_id(token);
if (tablet_map.get_tablet_transition_info(tid)) {
vbc_logger.debug("Tablet {} on replica {} is in transition.", tid, replica);
@@ -411,7 +466,7 @@ std::vector<utils::UUID> view_building_coordinator::select_tasks_for_replica(loc
return building_tasks;
} else {
return tasks | std::views::filter([] (const view_building_task& t) {
return t.state == view_building_task::task_state::idle;
return !t.aborted;
}) | std::views::transform([] (const view_building_task& t) {
return t.id;
}) | std::ranges::to<std::vector>();
@@ -421,71 +476,41 @@ std::vector<utils::UUID> view_building_coordinator::select_tasks_for_replica(loc
return {};
}
future<utils::chunked_vector<mutation>> view_building_coordinator::start_tasks(const service::group0_guard& guard, std::vector<utils::UUID> tasks) {
vbc_logger.info("Starting tasks {}", tasks);
utils::chunked_vector<mutation> muts;
for (auto& t: tasks) {
auto mut = co_await _sys_ks.make_update_view_building_task_state_mutation(guard.write_timestamp(), t, view_building_task::task_state::started);
muts.push_back(std::move(mut));
}
co_return muts;
}
void view_building_coordinator::attach_to_started_tasks(const locator::tablet_replica& replica, std::vector<utils::UUID> tasks) {
void view_building_coordinator::start_remote_worker(const locator::tablet_replica& replica, std::vector<utils::UUID> tasks) {
vbc_logger.debug("Attaching to started tasks {} on replica {}", tasks, replica);
shared_future<std::optional<remote_work_results>> work = work_on_tasks(replica, std::move(tasks));
shared_future<std::optional<std::vector<utils::UUID>>> work = work_on_tasks(replica, std::move(tasks));
_remote_work.insert({replica, std::move(work)});
}
future<std::optional<view_building_coordinator::remote_work_results>> view_building_coordinator::work_on_tasks(locator::tablet_replica replica, std::vector<utils::UUID> tasks) {
std::vector<view_task_result> remote_results;
future<std::optional<std::vector<utils::UUID>>> view_building_coordinator::work_on_tasks(locator::tablet_replica replica, std::vector<utils::UUID> tasks) {
constexpr auto backoff_duration = std::chrono::seconds(1);
static thread_local logger::rate_limit rate_limit{backoff_duration};
std::vector<utils::UUID> remote_results;
bool rpc_failed = false;
try {
remote_results = co_await ser::view_rpc_verbs::send_work_on_view_building_tasks(&_messaging, replica.host, _as, tasks);
remote_results = co_await ser::view_rpc_verbs::send_work_on_view_building_tasks(&_messaging, replica.host, _as, _term, replica.shard, tasks);
} catch (...) {
vbc_logger.warn("Work on tasks {} on replica {}, failed with error: {}", tasks, replica, std::current_exception());
vbc_logger.log(log_level::warn, rate_limit, "Work on tasks {} on replica {}, failed with error: {}",
tasks, replica, std::current_exception());
rpc_failed = true;
}
if (rpc_failed) {
co_await seastar::sleep(backoff_duration);
_vb_sm.event.broadcast();
co_return std::nullopt;
}
if (tasks.size() != remote_results.size()) {
on_internal_error(vbc_logger, fmt::format("Number of tasks ({}) and results ({}) do not match for replica {}", tasks.size(), remote_results.size(), replica));
}
// In `view_building_coordinator::work_on_view_building()` we made sure that
// each replica has its own entry in `_finished_tasks`, so now we can just take a shared lock
// and insert its finished tasks into this replica's bucket, as there is at most one instance of this method for each replica.
auto lock = co_await get_shared_lock(_mutex);
_finished_tasks.at(replica).insert_range(remote_results);
remote_work_results results;
for (size_t i = 0; i < tasks.size(); ++i) {
results.push_back({tasks[i], remote_results[i]});
}
_vb_sm.event.broadcast();
co_return results;
}
// Mark finished tasks as done (remove them from the table).
// Retry failed tasks if possible (if the failed tasks weren't aborted).
future<utils::chunked_vector<mutation>> view_building_coordinator::update_state_after_work_is_done(const service::group0_guard& guard, const locator::tablet_replica& replica, view_building_coordinator::remote_work_results results) {
vbc_logger.debug("Got results from replica {}: {}", replica, results);
utils::chunked_vector<mutation> muts;
for (auto& result: results) {
vbc_logger.info("Task {} was finished with result: {}", result.first, result.second);
if (!_vb_sm.building_state.currently_processed_base_table) {
continue;
}
// A task can be aborted by deleting it or by setting its state to `ABORTED`.
// If the task was aborted by changing the state,
// we shouldn't remove it here because it might be needed
// to generate updates after the tablet operation (migration/resize)
// is finished.
auto task_opt = _vb_sm.building_state.get_task(*_vb_sm.building_state.currently_processed_base_table, replica, result.first);
if (task_opt && task_opt->get().state != view_building_task::task_state::aborted) {
// Otherwise, the task was completed successfully and we can remove it.
auto delete_mut = co_await _sys_ks.make_remove_view_building_task_mutation(guard.write_timestamp(), result.first);
muts.push_back(std::move(delete_mut));
}
}
co_return muts;
co_return remote_results;
}
future<> view_building_coordinator::stop() {
@@ -515,7 +540,7 @@ void view_building_coordinator::generate_tablet_migration_updates(utils::chunked
auto create_task_copy_on_pending_replica = [&] (const view_building_task& task) {
auto new_id = builder.new_id();
builder.set_type(new_id, task.type)
.set_state(new_id, view_building_task::task_state::idle)
.set_aborted(new_id, false)
.set_base_id(new_id, task.base_id)
.set_last_token(new_id, task.last_token)
.set_replica(new_id, *trinfo.pending_replica);
@@ -583,7 +608,7 @@ void view_building_coordinator::generate_tablet_resize_updates(utils::chunked_ve
auto create_task_copy = [&] (const view_building_task& task, dht::token last_token) -> utils::UUID {
auto new_id = builder.new_id();
builder.set_type(new_id, task.type)
.set_state(new_id, view_building_task::task_state::idle)
.set_aborted(new_id, false)
.set_base_id(new_id, task.base_id)
.set_last_token(new_id, last_token)
.set_replica(new_id, task.replica);
@@ -652,7 +677,7 @@ void view_building_coordinator::abort_tasks(utils::chunked_vector<canonical_muta
auto abort_task_map = [&] (const task_map& task_map) {
for (auto& [id, _]: task_map) {
vbc_logger.debug("Aborting task {}", id);
builder.set_state(id, view_building_task::task_state::aborted);
builder.set_aborted(id, true);
}
};
@@ -682,7 +707,7 @@ void abort_view_building_tasks(const view_building_state_machine& vb_sm,
for (auto& [id, task]: task_map) {
if (task.last_token == last_token) {
vbc_logger.debug("Aborting task {}", id);
builder.set_state(id, view_building_task::task_state::aborted);
builder.set_aborted(id, true);
}
}
};
@@ -698,10 +723,10 @@ void abort_view_building_tasks(const view_building_state_machine& vb_sm,
static void rollback_task_map(view_building_task_mutation_builder& builder, const task_map& task_map) {
for (auto& [id, task]: task_map) {
if (task.state == view_building_task::task_state::aborted) {
if (task.aborted) {
auto new_id = builder.new_id();
builder.set_type(new_id, task.type)
.set_state(new_id, view_building_task::task_state::idle)
.set_aborted(new_id, false)
.set_base_id(new_id, task.base_id)
.set_last_token(new_id, task.last_token)
.set_replica(new_id, task.replica);


@@ -54,9 +54,9 @@ class view_building_coordinator : public service::endpoint_lifecycle_subscriber
const raft::term_t _term;
abort_source& _as;
using remote_work_results = std::vector<std::pair<utils::UUID, db::view::view_task_result>>;
std::unordered_map<locator::tablet_replica, shared_future<std::optional<remote_work_results>>> _remote_work;
std::unordered_map<locator::tablet_replica, shared_future<std::optional<std::vector<utils::UUID>>>> _remote_work;
shared_mutex _mutex; // guards `_finished_tasks` field
std::unordered_map<locator::tablet_replica, std::unordered_set<utils::UUID>> _finished_tasks;
public:
view_building_coordinator(replica::database& db, raft::server& raft, service::raft_group0& group0,
@@ -86,9 +86,11 @@ private:
future<> commit_mutations(service::group0_guard guard, utils::chunked_vector<mutation> mutations, std::string_view description);
void handle_coordinator_error(std::exception_ptr eptr);
future<> finished_task_gc_fiber();
future<> clean_finished_tasks();
future<std::optional<service::group0_guard>> update_state(service::group0_guard guard);
// Returns whether any new tasks were started
future<bool> work_on_view_building(service::group0_guard guard);
future<> work_on_view_building(service::group0_guard guard);
future<> mark_view_build_status_started(const service::group0_guard& guard, table_id view_id, utils::chunked_vector<mutation>& out);
future<> mark_all_remaining_view_build_statuses_started(const service::group0_guard& guard, table_id base_id, utils::chunked_vector<mutation>& out);
@@ -97,10 +99,8 @@ private:
std::set<locator::tablet_replica> get_replicas_with_tasks();
std::vector<utils::UUID> select_tasks_for_replica(locator::tablet_replica replica);
future<utils::chunked_vector<mutation>> start_tasks(const service::group0_guard& guard, std::vector<utils::UUID> tasks);
void attach_to_started_tasks(const locator::tablet_replica& replica, std::vector<utils::UUID> tasks);
future<std::optional<remote_work_results>> work_on_tasks(locator::tablet_replica replica, std::vector<utils::UUID> tasks);
future<utils::chunked_vector<mutation>> update_state_after_work_is_done(const service::group0_guard& guard, const locator::tablet_replica& replica, remote_work_results results);
void start_remote_worker(const locator::tablet_replica& replica, std::vector<utils::UUID> tasks);
future<std::optional<std::vector<utils::UUID>>> work_on_tasks(locator::tablet_replica replica, std::vector<utils::UUID> tasks);
};
void abort_view_building_tasks(const db::view::view_building_state_machine& vb_sm,

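The header now pairs `_finished_tasks` with a `shared_mutex`. The coordinator takes the lock exclusively when it may add a per-replica entry, and each remote-work fiber takes it shared when it only fills its own entry. The sketch below illustrates that locking discipline under loud assumptions: it uses `std::shared_mutex` and threads in place of Seastar's cooperative `shared_mutex`, and `finished_tasks_registry` with all of its member names is hypothetical.

```cpp
#include <cstddef>
#include <mutex>
#include <shared_mutex>
#include <stdexcept>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins: a string names the replica, an int names the task.
class finished_tasks_registry {
    std::shared_mutex _mutex; // guards the shape of `_buckets`, not each bucket's contents
    std::unordered_map<std::string, std::unordered_set<int>> _buckets;
public:
    // Exclusive lock: may add a key, so no other accessor may run concurrently.
    void ensure_bucket(const std::string& replica) {
        std::unique_lock lock(_mutex);
        _buckets.try_emplace(replica);
    }

    // Shared lock: never adds or removes keys. This is safe only while each
    // replica has at most one concurrent writer, which the caller must guarantee.
    // Throws std::out_of_range if ensure_bucket() was not called for `replica`.
    void record_finished(const std::string& replica, const std::vector<int>& ids) {
        std::shared_lock lock(_mutex);
        _buckets.at(replica).insert(ids.begin(), ids.end());
    }

    // Exclusive lock: empties every bucket and returns how many IDs were dropped.
    std::size_t drain() {
        std::unique_lock lock(_mutex);
        std::size_t dropped = 0;
        for (auto& entry : _buckets) {
            dropped += entry.second.size();
            entry.second.clear();
        }
        return dropped;
    }

    // Shared lock: number of finished IDs currently recorded for `replica`.
    std::size_t pending(const std::string& replica) {
        std::shared_lock lock(_mutex);
        auto it = _buckets.find(replica);
        return it == _buckets.end() ? 0 : it->second.size();
    }
};
```

The shared lock protects the map's structure only. Two writers targeting the same bucket would still race, which is why the one-writer-per-replica precondition matters.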

@@ -13,10 +13,10 @@ namespace db {
namespace view {
view_building_task::view_building_task(utils::UUID id, task_type type, task_state state, table_id base_id, std::optional<table_id> view_id, locator::tablet_replica replica, dht::token last_token)
view_building_task::view_building_task(utils::UUID id, task_type type, bool aborted, table_id base_id, std::optional<table_id> view_id, locator::tablet_replica replica, dht::token last_token)
: id(id)
, type(type)
, state(state)
, aborted(aborted)
, base_id(base_id)
, view_id(view_id)
, replica(replica)
@@ -49,30 +49,6 @@ seastar::sstring task_type_to_sstring(view_building_task::task_type type) {
}
}
view_building_task::task_state task_state_from_string(std::string_view str) {
if (str == "IDLE") {
return view_building_task::task_state::idle;
}
if (str == "STARTED") {
return view_building_task::task_state::started;
}
if (str == "ABORTED") {
return view_building_task::task_state::aborted;
}
throw std::runtime_error(fmt::format("Unknown view building task state: {}", str));
}
seastar::sstring task_state_to_sstring(view_building_task::task_state state) {
switch (state) {
case view_building_task::task_state::idle:
return "IDLE";
case view_building_task::task_state::started:
return "STARTED";
case view_building_task::task_state::aborted:
return "ABORTED";
}
}
std::optional<std::reference_wrapper<const view_building_task>> view_building_state::get_task(table_id base_id, locator::tablet_replica replica, utils::UUID id) const {
if (!tasks_state.contains(base_id) || !tasks_state.at(base_id).contains(replica)) {
return {};
@@ -151,46 +127,6 @@ std::map<dht::token, std::vector<view_building_task>> view_building_state::colle
return tasks;
}
// Returns all tasks for `_vb_sm.building_state.currently_processed_base_table` and `replica` with `STARTED` state.
std::vector<utils::UUID> view_building_state::get_started_tasks(table_id base_table_id, locator::tablet_replica replica) const {
if (!tasks_state.contains(base_table_id) || !tasks_state.at(base_table_id).contains(replica)) {
// No tasks for this replica
return {};
}
std::vector<view_building_task> tasks;
auto& replica_tasks = tasks_state.at(base_table_id).at(replica);
for (auto& [_, view_tasks]: replica_tasks.view_tasks) {
for (auto& [_, task]: view_tasks) {
if (task.state == view_building_task::task_state::started) {
tasks.push_back(task);
}
}
}
for (auto& [_, task]: replica_tasks.staging_tasks) {
if (task.state == view_building_task::task_state::started) {
tasks.push_back(task);
}
}
// All collected tasks should have the same: type, base_id and last_token,
// so they can be executed in the same view_building_worker::batch.
#ifdef SEASTAR_DEBUG
if (!tasks.empty()) {
auto& task = tasks.front();
for (auto& t: tasks) {
SCYLLA_ASSERT(task.type == t.type);
SCYLLA_ASSERT(task.base_id == t.base_id);
SCYLLA_ASSERT(task.last_token == t.last_token);
}
}
#endif
return tasks | std::views::transform([] (const view_building_task& t) {
return t.id;
}) | std::ranges::to<std::vector>();
}
}
}


@@ -39,28 +39,17 @@ struct view_building_task {
process_staging,
};
// When a task is created, it starts with `IDLE` state.
// Then, the view building coordinator will decide to do the task and it will
// set the state to `STARTED`.
// When a task is finished the entry is removed.
//
// If a task is in progress when a tablet operation (migration/resize) starts,
// the task's state is set to `ABORTED`.
enum class task_state {
idle,
started,
aborted,
};
utils::UUID id;
task_type type;
task_state state;
bool aborted;
table_id base_id;
std::optional<table_id> view_id; // nullopt when task_type is `process_staging`
locator::tablet_replica replica;
dht::token last_token;
view_building_task(utils::UUID id, task_type type, task_state state,
view_building_task(utils::UUID id, task_type type, bool aborted,
table_id base_id, std::optional<table_id> view_id,
locator::tablet_replica replica, dht::token last_token);
};
@@ -92,7 +81,6 @@ struct view_building_state {
std::vector<std::reference_wrapper<const view_building_task>> get_tasks_for_host(table_id base_id, locator::host_id host) const;
std::map<dht::token, std::vector<view_building_task>> collect_tasks_by_last_token(table_id base_table_id) const;
std::map<dht::token, std::vector<view_building_task>> collect_tasks_by_last_token(table_id base_table_id, const locator::tablet_replica& replica) const;
std::vector<utils::UUID> get_started_tasks(table_id base_table_id, locator::tablet_replica replica) const;
};
// Represents global state of tablet-based views.
@@ -113,18 +101,8 @@ struct view_building_state_machine {
condition_variable event;
};
struct view_task_result {
enum class command_status: uint8_t {
success = 0,
abort = 1,
};
db::view::view_task_result::command_status status;
};
view_building_task::task_type task_type_from_string(std::string_view str);
seastar::sstring task_type_to_sstring(view_building_task::task_type type);
view_building_task::task_state task_state_from_string(std::string_view str);
seastar::sstring task_state_to_sstring(view_building_task::task_state state);
} // namespace view_building
@@ -136,17 +114,11 @@ template <> struct fmt::formatter<db::view::view_building_task::task_type> : fmt
}
};
template <> struct fmt::formatter<db::view::view_building_task::task_state> : fmt::formatter<string_view> {
auto format(db::view::view_building_task::task_state state, fmt::format_context& ctx) const {
return fmt::format_to(ctx.out(), "{}", db::view::task_state_to_sstring(state));
}
};
template <> struct fmt::formatter<db::view::view_building_task> : fmt::formatter<string_view> {
auto format(db::view::view_building_task task, fmt::format_context& ctx) const {
auto view_id = task.view_id ? fmt::to_string(*task.view_id) : "nullopt";
return fmt::format_to(ctx.out(), "view_building_task{{type: {}, state: {}, base_id: {}, view_id: {}, last_token: {}}}",
task.type, task.state, task.base_id, view_id, task.last_token);
return fmt::format_to(ctx.out(), "view_building_task{{type: {}, aborted: {}, base_id: {}, view_id: {}, last_token: {}}}",
task.type, task.aborted, task.base_id, view_id, task.last_token);
}
};
@@ -161,18 +133,3 @@ template <> struct fmt::formatter<db::view::replica_tasks> : fmt::formatter<stri
return fmt::format_to(ctx.out(), "{{view_tasks: {}, staging_tasks: {}}}", replica_tasks.view_tasks, replica_tasks.staging_tasks);
}
};
template <> struct fmt::formatter<db::view::view_task_result> : fmt::formatter<string_view> {
auto format(db::view::view_task_result result, fmt::format_context& ctx) const {
std::string_view res;
switch (result.status) {
case db::view::view_task_result::command_status::success:
res = "success";
break;
case db::view::view_task_result::command_status::abort:
res = "abort";
break;
}
return format_to(ctx.out(), "{}", res);
}
};

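With `task_state` replaced by a single `aborted` flag, choosing runnable work reduces to a type check plus `!aborted`. A minimal sketch of that selection follows. It assumes a simplified `task` struct and a hypothetical helper `runnable_build_range_ids`, and it uses a plain loop rather than the `std::views` pipeline from the change so that it does not depend on C++23 `std::ranges::to`.

```cpp
#include <cstdint>
#include <vector>

// Simplified, hypothetical versions of the task type and struct.
enum class task_type { build_range, process_staging };

struct task {
    std::uint64_t id; // stands in for utils::UUID
    task_type type;
    bool aborted;     // replaces the former idle/started/aborted state
};

// Returns the IDs of all build_range tasks that have not been aborted,
// preserving their order in `tasks`.
inline std::vector<std::uint64_t> runnable_build_range_ids(const std::vector<task>& tasks) {
    std::vector<std::uint64_t> ids;
    for (const auto& t : tasks) {
        if (t.type == task_type::build_range && !t.aborted) {
            ids.push_back(t.id);
        }
    }
    return ids;
}
```

The flag cannot tell an idle task from a started one. In this change that information appears to live in the coordinator's in-memory `_remote_work` and `_finished_tasks` maps instead of the persisted task row.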

@@ -25,8 +25,8 @@ view_building_task_mutation_builder& view_building_task_mutation_builder::set_ty
_m.set_clustered_cell(get_ck(id), "type", data_value(task_type_to_sstring(type)), _ts);
return *this;
}
view_building_task_mutation_builder& view_building_task_mutation_builder::set_state(utils::UUID id, db::view::view_building_task::task_state state) {
_m.set_clustered_cell(get_ck(id), "state", data_value(task_state_to_sstring(state)), _ts);
view_building_task_mutation_builder& view_building_task_mutation_builder::set_aborted(utils::UUID id, bool aborted) {
_m.set_clustered_cell(get_ck(id), "aborted", data_value(aborted), _ts);
return *this;
}
view_building_task_mutation_builder& view_building_task_mutation_builder::set_base_id(utils::UUID id, table_id base_id) {


@@ -32,7 +32,7 @@ public:
static utils::UUID new_id();
view_building_task_mutation_builder& set_type(utils::UUID id, db::view::view_building_task::task_type type);
view_building_task_mutation_builder& set_state(utils::UUID id, db::view::view_building_task::task_state state);
view_building_task_mutation_builder& set_aborted(utils::UUID id, bool aborted);
view_building_task_mutation_builder& set_base_id(utils::UUID id, table_id base_id);
view_building_task_mutation_builder& set_view_id(utils::UUID id, table_id view_id);
view_building_task_mutation_builder& set_last_token(utils::UUID id, dht::token last_token);


@@ -22,6 +22,7 @@
#include "replica/database.hh"
#include "service/storage_proxy.hh"
#include "service/raft/raft_group0_client.hh"
#include "service/raft/raft_group0.hh"
#include "schema/schema_fwd.hh"
#include "idl/view.dist.hh"
#include "sstables/sstables.hh"
@@ -114,11 +115,11 @@ static locator::tablet_id get_sstable_tablet_id(const locator::tablet_map& table
return tablet_id;
}
view_building_worker::view_building_worker(replica::database& db, db::system_keyspace& sys_ks, service::migration_notifier& mnotifier, service::raft_group0_client& group0_client, view_update_generator& vug, netw::messaging_service& ms, view_building_state_machine& vbsm)
view_building_worker::view_building_worker(replica::database& db, db::system_keyspace& sys_ks, service::migration_notifier& mnotifier, service::raft_group0& group0, view_update_generator& vug, netw::messaging_service& ms, view_building_state_machine& vbsm)
: _db(db)
, _sys_ks(sys_ks)
, _mnotifier(mnotifier)
, _group0_client(group0_client)
, _group0(group0)
, _vug(vug)
, _messaging(ms)
, _vb_state_machine(vbsm)
@@ -127,8 +128,9 @@ view_building_worker::view_building_worker(replica::database& db, db::system_key
init_messaging_service();
}
void view_building_worker::start_background_fibers() {
future<> view_building_worker::init() {
SCYLLA_ASSERT(this_shard_id() == 0);
co_await discover_existing_staging_sstables();
_staging_sstables_registrator = run_staging_sstables_registrator();
_view_building_state_observer = run_view_building_state_observer();
_mnotifier.register_listener(this);
@@ -144,6 +146,7 @@ future<> view_building_worker::drain() {
if (!_as.abort_requested()) {
_as.request_abort();
}
_state._mutex.broken();
_staging_sstables_mutex.broken();
_sstables_to_register_event.broken();
if (this_shard_id() == 0) {
@@ -153,8 +156,7 @@ future<> view_building_worker::drain() {
co_await std::move(state_observer);
co_await _mnotifier.unregister_listener(this);
}
co_await _state.clear_state();
_state.state_updated_cv.broken();
co_await _state.clear();
co_await uninit_messaging_service();
}
@@ -195,8 +197,6 @@ future<> view_building_worker::register_staging_sstable_tasks(std::vector<sstabl
}
future<> view_building_worker::run_staging_sstables_registrator() {
co_await discover_existing_staging_sstables();
while (!_as.abort_requested()) {
try {
auto lock = co_await get_units(_staging_sstables_mutex, 1, _as);
@@ -225,44 +225,42 @@ future<> view_building_worker::create_staging_sstable_tasks() {
utils::chunked_vector<canonical_mutation> cmuts;
auto guard = co_await _group0_client.start_operation(_as);
auto guard = co_await _group0.client().start_operation(_as);
auto my_host_id = _db.get_token_metadata().get_topology().my_host_id();
for (auto& [table_id, sst_infos]: _sstables_to_register) {
for (auto& sst_info: sst_infos) {
view_building_task task {
utils::UUID_gen::get_time_UUID(), view_building_task::task_type::process_staging, view_building_task::task_state::idle,
utils::UUID_gen::get_time_UUID(), view_building_task::task_type::process_staging, false,
table_id, ::table_id{}, {my_host_id, sst_info.shard}, sst_info.last_token
};
auto mut = co_await _group0_client.sys_ks().make_view_building_task_mutation(guard.write_timestamp(), task);
auto mut = co_await _group0.client().sys_ks().make_view_building_task_mutation(guard.write_timestamp(), task);
cmuts.emplace_back(std::move(mut));
}
}
vbw_logger.debug("Creating {} process_staging view_building_tasks", cmuts.size());
auto cmd = _group0_client.prepare_command(service::write_mutations{std::move(cmuts)}, guard, "create view building tasks");
co_await _group0_client.add_entry(std::move(cmd), std::move(guard), _as);
auto cmd = _group0.client().prepare_command(service::write_mutations{std::move(cmuts)}, guard, "create view building tasks");
co_await _group0.client().add_entry(std::move(cmd), std::move(guard), _as);
// Move staging sstables from `_sstables_to_register` (on shard0) to `_staging_sstables` on corresponding shards.
// First, reorganize `_sstables_to_register` for easier movement.
// This is done in a separate loop after committing the group0 command, because we need to move values from `_sstables_to_register`
// (`staging_sstable_task_info` is non-copyable because of `foreign_ptr` field).
std::unordered_map<shard_id, std::unordered_map<table_id, std::unordered_map<dht::token, std::vector<foreign_ptr<sstables::shared_sstable>>>>> new_sstables_per_shard;
std::unordered_map<shard_id, std::unordered_map<table_id, std::vector<foreign_ptr<sstables::shared_sstable>>>> new_sstables_per_shard;
for (auto& [table_id, sst_infos]: _sstables_to_register) {
for (auto& sst_info: sst_infos) {
new_sstables_per_shard[sst_info.shard][table_id][sst_info.last_token].push_back(std::move(sst_info.sst_foreign_ptr));
new_sstables_per_shard[sst_info.shard][table_id].push_back(std::move(sst_info.sst_foreign_ptr));
}
}
for (auto& [shard, sstables_per_table]: new_sstables_per_shard) {
co_await container().invoke_on(shard, [sstables_for_this_shard = std::move(sstables_per_table)] (view_building_worker& local_vbw) mutable {
for (auto& [tid, ssts_map]: sstables_for_this_shard) {
for (auto& [token, ssts]: ssts_map) {
auto unwrapped_ssts = ssts | std::views::as_rvalue | std::views::transform([] (auto&& fptr) {
return fptr.unwrap_on_owner_shard();
}) | std::ranges::to<std::vector>();
auto& tid_ssts = local_vbw._staging_sstables[tid][token];
tid_ssts.insert(tid_ssts.end(), std::make_move_iterator(unwrapped_ssts.begin()), std::make_move_iterator(unwrapped_ssts.end()));
}
for (auto& [tid, ssts]: sstables_for_this_shard) {
auto unwrapped_ssts = ssts | std::views::as_rvalue | std::views::transform([] (auto&& fptr) {
return fptr.unwrap_on_owner_shard();
}) | std::ranges::to<std::vector>();
auto& tid_ssts = local_vbw._staging_sstables[tid];
tid_ssts.insert(tid_ssts.end(), std::make_move_iterator(unwrapped_ssts.begin()), std::make_move_iterator(unwrapped_ssts.end()));
}
});
}
@@ -310,7 +308,10 @@ std::unordered_map<table_id, std::vector<view_building_worker::staging_sstable_t
return;
}
auto& tablet_map = _db.get_token_metadata().tablets().get_tablet_map(table_id);
// scylladb/scylladb#26403: Make sure to access the tablets map via the effective replication map of the table object.
// The token metadata object pointed to by the database (`_db.get_token_metadata()`) may not contain
// the tablets map of the currently processed table yet. After #24414 is fixed, this should not matter anymore.
auto& tablet_map = table->get_effective_replication_map()->get_token_metadata().tablets().get_tablet_map(table_id);
auto sstables = table->get_sstables();
for (auto sstable: *sstables) {
if (!sstable->requires_view_building()) {
@@ -326,7 +327,7 @@ std::unordered_map<table_id, std::vector<view_building_worker::staging_sstable_t
// or maybe it can be registered to view_update_generator directly.
tasks_to_create[table_id].emplace_back(table_id, shard, last_token, make_foreign(std::move(sstable)));
} else {
_staging_sstables[table_id][last_token].push_back(std::move(sstable));
_staging_sstables[table_id].push_back(std::move(sstable));
}
}
});
@@ -342,10 +343,10 @@ future<> view_building_worker::run_view_building_state_observer() {
bool sleep = false;
try {
vbw_logger.trace("view_building_state_observer() iteration");
auto read_apply_mutex_holder = co_await _group0_client.hold_read_apply_mutex(_as);
auto read_apply_mutex_holder = co_await _group0.client().hold_read_apply_mutex(_as);
co_await update_built_views();
co_await update_building_state();
co_await check_for_aborted_tasks();
_as.check();
read_apply_mutex_holder.return_all();
@@ -376,7 +377,7 @@ future<> view_building_worker::update_built_views() {
auto schema = _db.find_schema(table_id);
return std::make_pair(schema->ks_name(), schema->cf_name());
};
auto& sys_ks = _group0_client.sys_ks();
auto& sys_ks = _group0.client().sys_ks();
std::set<std::pair<sstring, sstring>> built_views;
for (auto& [id, statuses]: _vb_state_machine.views_state.status_map) {
@@ -405,22 +406,35 @@ future<> view_building_worker::update_built_views() {
}
}
future<> view_building_worker::update_building_state() {
co_await _state.update(*this);
co_await _state.finish_completed_tasks();
_state.state_updated_cv.broadcast();
}
// Must be executed on shard0
future<> view_building_worker::check_for_aborted_tasks() {
return container().invoke_on_all([building_state = _vb_state_machine.building_state] (view_building_worker& vbw) -> future<> {
auto lock = co_await get_units(vbw._state._mutex, 1, vbw._as);
co_await vbw._state.update_processing_base_table(vbw._db, building_state, vbw._as);
if (!vbw._state._batch) {
co_return;
}
bool view_building_worker::is_shard_free(shard_id shard) {
return !std::ranges::any_of(_state.tasks_map, [&shard] (auto& task_entry) {
return task_entry.second->replica.shard == shard && task_entry.second->state == view_building_worker::batch_state::in_progress;
auto my_host_id = vbw._db.get_token_metadata().get_topology().my_host_id();
auto my_replica = locator::tablet_replica{my_host_id, this_shard_id()};
auto tasks_map = vbw._state._batch->tasks; // Potentially, we'll remove elements from the map, so we need a copy to iterate over it
for (auto& [id, t]: tasks_map) {
auto task_opt = building_state.get_task(t.base_id, my_replica, id);
if (!task_opt || task_opt->get().aborted) {
co_await vbw._state._batch->abort_task(id);
}
}
if (vbw._state._batch->tasks.empty()) {
co_await vbw._state.clean_up_after_batch();
}
});
}
void view_building_worker::init_messaging_service() {
ser::view_rpc_verbs::register_work_on_view_building_tasks(&_messaging, [this] (std::vector<utils::UUID> ids) -> future<std::vector<view_task_result>> {
return container().invoke_on(0, [ids = std::move(ids)] (view_building_worker& vbw) mutable -> future<std::vector<view_task_result>> {
return vbw.work_on_tasks(std::move(ids));
ser::view_rpc_verbs::register_work_on_view_building_tasks(&_messaging, [this] (raft::term_t term, shard_id shard, std::vector<utils::UUID> ids) -> future<std::vector<utils::UUID>> {
return container().invoke_on(shard, [term, ids = std::move(ids)] (auto& vbw) mutable -> future<std::vector<utils::UUID>> {
return vbw.work_on_tasks(term, std::move(ids));
});
});
}
@@ -429,235 +443,53 @@ future<> view_building_worker::uninit_messaging_service() {
return ser::view_rpc_verbs::unregister(&_messaging);
}
future<std::vector<view_task_result>> view_building_worker::work_on_tasks(std::vector<utils::UUID> ids) {
vbw_logger.debug("Got request for results of tasks: {}", ids);
auto guard = co_await _group0_client.start_operation(_as, service::raft_timeout{});
auto processing_base_table = _state.processing_base_table;
auto are_tasks_finished = [&] () {
return std::ranges::all_of(ids, [this] (const utils::UUID& id) {
return _state.finished_tasks.contains(id) || _state.aborted_tasks.contains(id);
});
};
auto get_results = [&] () -> std::vector<view_task_result> {
std::vector<view_task_result> results;
for (const auto& id: ids) {
if (_state.finished_tasks.contains(id)) {
results.emplace_back(view_task_result::command_status::success);
} else if (_state.aborted_tasks.contains(id)) {
results.emplace_back(view_task_result::command_status::abort);
} else {
// This means that the task was aborted. Throw an error,
// so the coordinator will refresh its state and retry without aborted IDs.
throw std::runtime_error(fmt::format("No status for task {}", id));
}
}
return results;
};
if (are_tasks_finished()) {
// If the batch is already finished, we can return the results immediately.
vbw_logger.debug("Batch with tasks {} is already finished, returning results", ids);
co_return get_results();
}
// All of the tasks should be executed in the same batch
// (their statuses are set to started in the same group0 operation).
// If any ID is not present in the `tasks_map`, it means that it was aborted and we should fail this RPC call,
// so the coordinator can retry without aborted IDs.
// That's why we can identify the batch by random (.front()) ID from the `ids` vector.
auto id = ids.front();
while (!_state.tasks_map.contains(id) && processing_base_table == _state.processing_base_table) {
vbw_logger.warn("Batch with task {} is not found in tasks map, waiting until worker updates its state", id);
service::release_guard(std::move(guard));
co_await _state.state_updated_cv.wait();
guard = co_await _group0_client.start_operation(_as, service::raft_timeout{});
}
if (processing_base_table != _state.processing_base_table) {
// If the processing base table was changed, we should fail this RPC call because the tasks were aborted.
throw std::runtime_error(fmt::format("Processing base table was changed to {} ", _state.processing_base_table));
}
// Validate that any of the IDs wasn't aborted.
for (const auto& tid: ids) {
if (!_state.tasks_map[id]->tasks.contains(tid)) {
vbw_logger.warn("Task {} is not found in the batch", tid);
throw std::runtime_error(fmt::format("Task {} is not found in the batch", tid));
}
}
if (_state.tasks_map[id]->state == view_building_worker::batch_state::idle) {
vbw_logger.debug("Starting batch with tasks {}", _state.tasks_map[id]->tasks);
if (!is_shard_free(_state.tasks_map[id]->replica.shard)) {
throw std::runtime_error(fmt::format("Tried to start view building tasks ({}) on shard {} but the shard is busy", _state.tasks_map[id]->tasks, _state.tasks_map[id]->replica.shard, _state.tasks_map[id]->tasks));
}
_state.tasks_map[id]->start();
}
service::release_guard(std::move(guard));
while (!_as.abort_requested()) {
auto read_apply_mutex_holder = co_await _group0_client.hold_read_apply_mutex(_as);
if (are_tasks_finished()) {
co_return get_results();
}
// Check if the batch is still alive
if (!_state.tasks_map.contains(id)) {
throw std::runtime_error(fmt::format("Batch with task {} is not found in tasks map anymore.", id));
}
read_apply_mutex_holder.return_all();
co_await _state.tasks_map[id]->batch_done_cv.wait();
}
throw std::runtime_error("View building worker was aborted");
}
// Validates if the task can be executed in a batch on the same shard.
static bool validate_can_be_one_batch(const view_building_task& t1, const view_building_task& t2) {
return t1.type == t2.type && t1.base_id == t2.base_id && t1.replica == t2.replica && t1.last_token == t2.last_token;
}
static std::unordered_set<table_id> get_ids_of_all_views(replica::database& db, table_id table_id) {
return db.find_column_family(table_id).views() | std::views::transform([] (view_ptr vptr) {
return vptr->id();
}) | std::ranges::to<std::unordered_set>();;
}
future<> view_building_worker::local_state::flush_table(view_building_worker& vbw, table_id table_id) {
// `table_id` should point to currently processing base table but
// `view_building_worker::local_state::processing_base_table` may not be set to it yet,
// so we need to pass it directly
co_await vbw.container().invoke_on_all([table_id] (view_building_worker& local_vbw) -> future<> {
auto base_cf = local_vbw._db.find_column_family(table_id).shared_from_this();
co_await when_all(base_cf->await_pending_writes(), base_cf->await_pending_streams());
co_await flush_base(base_cf, local_vbw._as);
});
flushed_views = get_ids_of_all_views(vbw._db, table_id);
}
future<> view_building_worker::local_state::update(view_building_worker& vbw) {
const auto& vb_state = vbw._vb_state_machine.building_state;
// Check if the base table to process was changed.
// If so, we clear the state, aborting tasks for previous base table and starting new ones for the new base table.
if (processing_base_table != vb_state.currently_processed_base_table) {
co_await clear_state();
if (vb_state.currently_processed_base_table) {
// When we start to process new base table, we need to flush its current data, so we can build the view.
co_await flush_table(vbw, *vb_state.currently_processed_base_table);
}
processing_base_table = vb_state.currently_processed_base_table;
vbw_logger.info("Processing base table was changed to: {}", processing_base_table);
}
if (!processing_base_table) {
vbw_logger.debug("No base table is selected to be processed.");
co_return;
}
std::vector<table_id> new_views;
auto all_view_ids = get_ids_of_all_views(vbw._db, *processing_base_table);
std::ranges::set_difference(all_view_ids, flushed_views, std::back_inserter(new_views));
if (!new_views.empty()) {
// Flush the base table again if any new view was created, so the view building tasks will see up-to-date sstables.
// Otherwise, we may lose mutations created after previous flush but before the new view was created.
co_await flush_table(vbw, *processing_base_table);
}
auto erm = vbw._db.find_column_family(*processing_base_table).get_effective_replication_map();
auto my_host_id = erm->get_topology().my_host_id();
auto current_tasks_for_this_host = vb_state.get_tasks_for_host(*processing_base_table, my_host_id);
// scan view building state, collect alive and new (in STARTED state but not started by this worker) tasks
std::unordered_map<shard_id, std::vector<view_building_task>> new_tasks;
std::unordered_set<utils::UUID> alive_tasks; // save information about alive tasks to cleanup done/aborted ones
for (auto& task_ref: current_tasks_for_this_host) {
auto& task = task_ref.get();
auto id = task.id;
if (task.state != view_building_task::task_state::aborted) {
alive_tasks.insert(id);
}
if (tasks_map.contains(id) || finished_tasks.contains(id)) {
continue;
}
else if (task.state == view_building_task::task_state::started) {
auto shard = task.replica.shard;
if (new_tasks.contains(shard) && !validate_can_be_one_batch(new_tasks[shard].front(), task)) {
// Currently we allow only one batch per shard at a time
on_internal_error(vbw_logger, fmt::format("Got not-compatible tasks for the same shard. Task: {}, other: {}", new_tasks[shard].front(), task));
}
new_tasks[shard].push_back(task);
}
co_await coroutine::maybe_yield();
}
auto tasks_map_copy = tasks_map;
// Clear aborted tasks from tasks_map
for (auto it = tasks_map_copy.begin(); it != tasks_map_copy.end();) {
if (!alive_tasks.contains(it->first)) {
vbw_logger.debug("Aborting task {}", it->first);
aborted_tasks.insert(it->first);
co_await it->second->abort_task(it->first);
it = tasks_map_copy.erase(it);
} else {
++it;
}
}
// Create batches for new tasks
for (const auto& [shard, shard_tasks]: new_tasks) {
auto tasks = shard_tasks | std::views::transform([] (const view_building_task& t) {
return std::make_pair(t.id, t);
}) | std::ranges::to<std::unordered_map>();
auto batch = seastar::make_shared<view_building_worker::batch>(vbw.container(), tasks, shard_tasks.front().base_id, shard_tasks.front().replica);
for (auto& [id, _]: tasks) {
tasks_map_copy.insert({id, batch});
}
co_await coroutine::maybe_yield();
}
tasks_map = std::move(tasks_map_copy);
}
future<> view_building_worker::local_state::finish_completed_tasks() {
for (auto it = tasks_map.begin(); it != tasks_map.end();) {
if (it->second->state == view_building_worker::batch_state::idle) {
++it;
} else if (it->second->state == view_building_worker::batch_state::in_progress) {
vbw_logger.debug("Task {} is still in progress", it->first);
++it;
} else {
co_await it->second->work.get_future();
finished_tasks.insert(it->first);
vbw_logger.info("Task {} was completed", it->first);
it->second->batch_done_cv.broadcast();
it = tasks_map.erase(it);
// If `state::processing_base_table` is different from `view_building_state::currently_processed_base_table`,
// clear the state, then flush and save the new base table
future<> view_building_worker::state::update_processing_base_table(replica::database& db, const view_building_state& building_state, abort_source& as) {
if (processing_base_table != building_state.currently_processed_base_table) {
co_await clear();
if (building_state.currently_processed_base_table) {
co_await flush_base_table(db, *building_state.currently_processed_base_table, as);
}
processing_base_table = building_state.currently_processed_base_table;
}
}
future<> view_building_worker::local_state::clear_state() {
for (auto& [_, batch]: tasks_map) {
co_await batch->abort();
// If `_batch` ptr points to valid object, co_await its `work` future, save completed tasks and delete the object
future<> view_building_worker::state::clean_up_after_batch() {
if (_batch) {
co_await std::move(_batch->work);
for (auto& [id, _]: _batch->tasks) {
completed_tasks.insert(id);
}
_batch = nullptr;
}
}
// Flush the base table, set it as the currently processed base table and save which views exist at the time of the flush
future<> view_building_worker::state::flush_base_table(replica::database& db, table_id base_table_id, abort_source& as) {
auto cf = db.find_column_family(base_table_id).shared_from_this();
co_await when_all(cf->await_pending_writes(), cf->await_pending_streams());
co_await flush_base(cf, as);
processing_base_table = base_table_id;
flushed_views = get_ids_of_all_views(db, base_table_id);
}
future<> view_building_worker::state::clear() {
if (_batch) {
_batch->as.request_abort();
co_await std::move(_batch->work);
_batch = nullptr;
}
processing_base_table.reset();
completed_tasks.clear();
flushed_views.clear();
tasks_map.clear();
finished_tasks.clear();
aborted_tasks.clear();
state_updated_cv.broadcast();
vbw_logger.debug("View building worker state was cleared.");
}
view_building_worker::batch::batch(sharded<view_building_worker>& vbw, std::unordered_map<utils::UUID, view_building_task> tasks, table_id base_id, locator::tablet_replica replica)
@@ -667,16 +499,12 @@ view_building_worker::batch::batch(sharded<view_building_worker>& vbw, std::unor
, _vbw(vbw) {}
void view_building_worker::batch::start() {
if (this_shard_id() != 0) {
on_internal_error(vbw_logger, "view_building_worker::batch should be started on shard0");
if (this_shard_id() != replica.shard) {
on_internal_error(vbw_logger, "view_building_worker::batch should be started on replica shard");
}
state = batch_state::in_progress;
work = smp::submit_to(replica.shard, [this] () -> future<> {
return do_work();
}).finally([this] () {
state = batch_state::finished;
_vbw.local()._vb_state_machine.event.broadcast();
work = do_work().finally([this] {
promise.set_value();
});
}
@@ -691,10 +519,6 @@ future<> view_building_worker::batch::abort() {
co_await smp::submit_to(replica.shard, [this] () {
as.request_abort();
});
if (work.valid()) {
co_await work.get_future();
}
}
future<> view_building_worker::batch::do_work() {
@@ -837,15 +661,174 @@ future<> view_building_worker::do_build_range(table_id base_id, std::vector<tabl
}
future<> view_building_worker::do_process_staging(table_id table_id, dht::token last_token) {
if (_staging_sstables[table_id][last_token].empty()) {
if (_staging_sstables[table_id].empty()) {
co_return;
}
auto table = _db.get_tables_metadata().get_table(table_id).shared_from_this();
auto sstables = std::exchange(_staging_sstables[table_id][last_token], {});
co_await _vug.process_staging_sstables(std::move(table), std::move(sstables));
auto& tablet_map = table->get_effective_replication_map()->get_token_metadata().tablets().get_tablet_map(table_id);
auto tid = tablet_map.get_tablet_id(last_token);
auto tablet_range = tablet_map.get_token_range(tid);
// Select sstables belonging to the tablet (identified by `last_token`)
std::vector<sstables::shared_sstable> sstables_to_process;
for (auto& sst: _staging_sstables[table_id]) {
auto sst_last_token = sst->get_last_decorated_key().token();
if (tablet_range.contains(sst_last_token, dht::token_comparator())) {
sstables_to_process.push_back(sst);
}
}
co_await _vug.process_staging_sstables(std::move(table), sstables_to_process);
try {
// Remove processed sstables from `_staging_sstables` map
auto lock = co_await get_units(_staging_sstables_mutex, 1, _as);
std::unordered_set<sstables::shared_sstable> sstables_to_remove(sstables_to_process.begin(), sstables_to_process.end());
auto [first, last] = std::ranges::remove_if(_staging_sstables[table_id], [&] (auto& sst) {
return sstables_to_remove.contains(sst);
});
_staging_sstables[table_id].erase(first, last);
} catch (semaphore_aborted&) {
vbw_logger.warn("Semaphore was aborted while waiting to remove processed sstables for table {}", table_id);
}
}
void view_building_worker::load_sstables(table_id table_id, std::vector<sstables::shared_sstable> ssts) {
std::ranges::copy_if(std::move(ssts), std::back_inserter(_staging_sstables[table_id]), [] (auto& sst) {
return sst->state() == sstables::sstable_state::staging;
});
}
void view_building_worker::cleanup_staging_sstables(locator::effective_replication_map_ptr erm, table_id table_id, locator::tablet_id tid) {
auto& tablet_map = erm->get_token_metadata().tablets().get_tablet_map(table_id);
auto tablet_range = tablet_map.get_token_range(tid);
auto [first, last] = std::ranges::remove_if(_staging_sstables[table_id], [&] (auto& sst) {
auto sst_last_token = sst->get_last_decorated_key().token();
return tablet_range.contains(sst_last_token, dht::token_comparator());
});
_staging_sstables[table_id].erase(first, last);
}
future<view_building_state> view_building_worker::get_latest_view_building_state(raft::term_t term) {
return smp::submit_to(0, [&sharded_vbw = container(), term] () -> future<view_building_state> {
auto& vbw = sharded_vbw.local();
// auto guard = vbw._group0.client().start_operation(vbw._as);
auto& raft_server = vbw._group0.group0_server();
auto group0_holder = vbw._group0.hold_group0_gate();
co_await raft_server.read_barrier(&vbw._as);
if (raft_server.get_current_term() != term) {
throw std::runtime_error(fmt::format("Invalid raft term. Got {} but current term is {}", term, raft_server.get_current_term()));
}
co_return vbw._vb_state_machine.building_state;
});
}
future<std::vector<utils::UUID>> view_building_worker::work_on_tasks(raft::term_t term, std::vector<utils::UUID> ids) {
auto collect_completed_tasks = [&] {
std::vector<utils::UUID> completed;
for (auto& id: ids) {
if (_state.completed_tasks.contains(id)) {
completed.push_back(id);
}
}
return completed;
};
auto lock = co_await get_units(_state._mutex, 1, _as);
// First, check whether there is a batch that has finished but was not cleaned up yet.
if (_state._batch && _state._batch->promise.available()) {
co_await _state.clean_up_after_batch();
}
// Check if tasks were already completed.
// If only part of the tasks were finished, return the subset and don't execute the remaining tasks.
std::vector<utils::UUID> completed = collect_completed_tasks();
if (!completed.empty()) {
co_return completed;
}
lock.return_all();
auto building_state = co_await get_latest_view_building_state(term);
lock = co_await get_units(_state._mutex, 1, _as);
co_await _state.update_processing_base_table(_db, building_state, _as);
// If there is no running batch, create it.
if (!_state._batch) {
if (!_state.processing_base_table) {
throw std::runtime_error("view_building_worker::state::processing_base_table needs to be set to work on view building");
}
auto my_host_id = _db.get_token_metadata().get_topology().my_host_id();
auto my_replica = locator::tablet_replica{my_host_id, this_shard_id()};
std::unordered_map<utils::UUID, view_building_task> tasks;
for (auto& id: ids) {
auto task_opt = building_state.get_task(*_state.processing_base_table, my_replica, id);
if (!task_opt) {
throw std::runtime_error(fmt::format("Task {} was not found for base table {} on replica {}", id, *building_state.currently_processed_base_table, my_replica));
}
tasks.insert({id, *task_opt});
}
#ifdef SEASTAR_DEBUG
auto& some_task = tasks.begin()->second;
for (auto& [_, t]: tasks) {
SCYLLA_ASSERT(t.base_id == some_task.base_id);
SCYLLA_ASSERT(t.last_token == some_task.last_token);
SCYLLA_ASSERT(t.replica == some_task.replica);
SCYLLA_ASSERT(t.type == some_task.type);
SCYLLA_ASSERT(t.replica.shard == this_shard_id());
}
#endif
// If any view was added after we did the initial flush, we need to do it again
if (std::ranges::any_of(tasks | std::views::values, [&] (const view_building_task& t) {
return t.view_id && !_state.flushed_views.contains(*t.view_id);
})) {
co_await _state.flush_base_table(_db, *_state.processing_base_table, _as);
}
// Create and start the batch
_state._batch = std::make_unique<batch>(container(), std::move(tasks), *building_state.currently_processed_base_table, my_replica);
_state._batch->start();
}
if (std::ranges::all_of(ids, [&] (auto& id) { return !_state._batch->tasks.contains(id); })) {
throw std::runtime_error(fmt::format(
"None of the tasks requested to work on is executed in current view building batch. Batch executes: {}, the RPC requested: {}",
_state._batch->tasks | std::views::keys, ids));
}
auto batch_future = _state._batch->promise.get_shared_future();
lock.return_all();
co_await std::move(batch_future);
lock = co_await get_units(_state._mutex, 1, _as);
co_await _state.clean_up_after_batch();
co_return collect_completed_tasks();
}
}
}


@@ -14,7 +14,9 @@
#include <seastar/core/shared_future.hh>
#include <unordered_map>
#include <unordered_set>
#include "locator/abstract_replication_strategy.hh"
#include "locator/tablets.hh"
#include "raft/raft.hh"
#include "seastar/core/gate.hh"
#include "db/view/view_building_state.hh"
#include "sstables/shared_sstable.hh"
@@ -30,7 +32,7 @@ class messaging_service;
}
namespace service {
class raft_group0_client;
class raft_group0;
}
namespace db {
@@ -64,27 +66,16 @@ class view_building_worker : public seastar::peering_sharded_service<view_buildi
*
* When `work` future is finished, it means all tasks in `tasks_ids` are done.
*
* The batch lives on shard 0 exclusively.
* When the batch starts to execute its tasks, it firstly copies all necessary data
* to the designated shard, then the work is done on the local copy of the data only.
* The batch lives exclusively on the shard where it executes its work.
*/
enum class batch_state {
idle,
in_progress,
finished,
};
class batch {
public:
batch_state state = batch_state::idle;
table_id base_id;
locator::tablet_replica replica;
std::unordered_map<utils::UUID, view_building_task> tasks;
shared_future<> work;
condition_variable batch_done_cv;
// The abort has to be used only on `replica.shard`
shared_promise<> promise;
future<> work = make_ready_future();
abort_source as;
batch(sharded<view_building_worker>& vbw, std::unordered_map<utils::UUID, view_building_task> tasks, table_id base_id, locator::tablet_replica replica);
@@ -100,34 +91,18 @@ class view_building_worker : public seastar::peering_sharded_service<view_buildi
friend class batch;
struct local_state {
struct state {
std::optional<table_id> processing_base_table = std::nullopt;
// Stores ids of views for which the flush was done.
// When a new view is created, we need to flush the base table again,
// as data might be inserted.
std::unordered_set<utils::UUID> completed_tasks;
std::unique_ptr<batch> _batch = nullptr;
std::unordered_set<table_id> flushed_views;
std::unordered_map<utils::UUID, shared_ptr<batch>> tasks_map;
std::unordered_set<utils::UUID> finished_tasks;
std::unordered_set<utils::UUID> aborted_tasks;
condition_variable state_updated_cv;
// Clears completed/aborted tasks and creates batches (without starting them) for started tasks.
// Returns a map of tasks per shard to execute.
future<> update(view_building_worker& vbw);
future<> finish_completed_tasks();
// The state can be aborted if, for example, a view is dropped, then all its tasks
// are aborted and the coordinator may choose new base table to process.
// This method aborts all batches as we stop to processing the current base table.
future<> clear_state();
// Flush table with `table_id` on all shards.
// This method should be used only on currently processing base table and
// it updates `flushed_views` field.
future<> flush_table(view_building_worker& vbw, table_id table_id);
semaphore _mutex = semaphore(1);
// All of the methods below should be executed while holding `_mutex` unit!
future<> update_processing_base_table(replica::database& db, const view_building_state& building_state, abort_source& as);
future<> flush_base_table(replica::database& db, table_id base_table_id, abort_source& as);
future<> clean_up_after_batch();
future<> clear();
};
// Wrapper which represents information needed to create
@@ -145,28 +120,28 @@ private:
replica::database& _db;
db::system_keyspace& _sys_ks;
service::migration_notifier& _mnotifier;
service::raft_group0_client& _group0_client;
service::raft_group0& _group0;
view_update_generator& _vug;
netw::messaging_service& _messaging;
view_building_state_machine& _vb_state_machine;
abort_source _as;
named_gate _gate;
local_state _state;
state _state;
std::unordered_set<table_id> _views_in_progress;
future<> _view_building_state_observer = make_ready_future<>();
condition_variable _sstables_to_register_event;
semaphore _staging_sstables_mutex = semaphore(1);
std::unordered_map<table_id, std::vector<staging_sstable_task_info>> _sstables_to_register;
std::unordered_map<table_id, std::unordered_map<dht::token, std::vector<sstables::shared_sstable>>> _staging_sstables;
std::unordered_map<table_id, std::vector<sstables::shared_sstable>> _staging_sstables;
future<> _staging_sstables_registrator = make_ready_future<>();
public:
view_building_worker(replica::database& db, db::system_keyspace& sys_ks, service::migration_notifier& mnotifier,
service::raft_group0_client& group0_client, view_update_generator& vug, netw::messaging_service& ms,
service::raft_group0& group0, view_update_generator& vug, netw::messaging_service& ms,
view_building_state_machine& vbsm);
void start_background_fibers();
future<> init();
future<> register_staging_sstable_tasks(std::vector<sstables::shared_sstable> ssts, table_id table_id);
@@ -177,11 +152,17 @@ public:
virtual void on_update_view(const sstring& ks_name, const sstring& view_name, bool columns_changed) override {};
virtual void on_drop_view(const sstring& ks_name, const sstring& view_name) override;
// Used ONLY to load staging sstables migrated during intra-node tablet migration.
void load_sstables(table_id table_id, std::vector<sstables::shared_sstable> ssts);
// Used in cleanup/cleanup-target tablet transition stage
void cleanup_staging_sstables(locator::effective_replication_map_ptr erm, table_id table_id, locator::tablet_id tid);
private:
future<view_building_state> get_latest_view_building_state(raft::term_t term);
future<> check_for_aborted_tasks();
future<> run_view_building_state_observer();
future<> update_built_views();
future<> update_building_state();
bool is_shard_free(shard_id shard);
dht::token_range get_tablet_token_range(table_id table_id, dht::token last_token);
future<> do_build_range(table_id base_id, std::vector<table_id> views_ids, dht::token last_token, abort_source& as);
@@ -195,7 +176,7 @@ private:
void init_messaging_service();
future<> uninit_messaging_service();
future<std::vector<view_task_result>> work_on_tasks(std::vector<utils::UUID> ids);
future<std::vector<utils::UUID>> work_on_tasks(raft::term_t term, std::vector<utils::UUID> ids);
};
}


@@ -102,13 +102,13 @@ view_update_generator::view_update_generator(replica::database& db, sharded<serv
, _early_abort_subscription(as.subscribe([this] () noexcept { do_abort(); }))
{
setup_metrics();
discover_staging_sstables();
_db.plug_view_update_generator(*this);
}
view_update_generator::~view_update_generator() {}
future<> view_update_generator::start() {
discover_staging_sstables();
_started = seastar::async([this]() mutable {
auto drop_sstable_references = defer([&] () noexcept {
// Clear sstable references so sstables_manager::stop() doesn't hang.


@@ -605,8 +605,8 @@ public:
}
static schema_ptr build_schema() {
auto id = generate_legacy_id(system_keyspace::NAME, "versions");
return schema_builder(system_keyspace::NAME, "versions", std::make_optional(id))
auto id = generate_legacy_id(system_keyspace::NAME, system_keyspace::VERSIONS);
return schema_builder(system_keyspace::NAME, system_keyspace::VERSIONS, std::make_optional(id))
.with_column("key", utf8_type, column_kind::partition_key)
.with_column("version", utf8_type)
.with_column("build_mode", utf8_type)
@@ -1206,8 +1206,8 @@ public:
private:
static schema_ptr build_schema() {
auto id = generate_legacy_id(system_keyspace::NAME, "cdc_timestamps");
return schema_builder(system_keyspace::NAME, "cdc_timestamps", std::make_optional(id))
auto id = generate_legacy_id(system_keyspace::NAME, system_keyspace::CDC_TIMESTAMPS);
return schema_builder(system_keyspace::NAME, system_keyspace::CDC_TIMESTAMPS, std::make_optional(id))
.with_column("keyspace_name", utf8_type, column_kind::partition_key)
.with_column("table_name", utf8_type, column_kind::partition_key)
.with_column("timestamp", reversed_type_impl::get_instance(timestamp_type), column_kind::clustering_key)
@@ -1278,7 +1278,7 @@ public:
static_assert(int(cdc::stream_state::current) < int(cdc::stream_state::closed));
static_assert(int(cdc::stream_state::closed) < int(cdc::stream_state::opened));
co_await _ss.query_cdc_streams(table, [&] (db_clock::time_point ts, const std::vector<cdc::stream_id>& current, cdc::cdc_stream_diff diff) -> future<> {
co_await _ss.query_cdc_streams(table, [&] (db_clock::time_point ts, const utils::chunked_vector<cdc::stream_id>& current, cdc::cdc_stream_diff diff) -> future<> {
co_await emit_stream_set(ts, cdc::stream_state::current, current);
co_await emit_stream_set(ts, cdc::stream_state::closed, diff.closed_streams);
co_await emit_stream_set(ts, cdc::stream_state::opened, diff.opened_streams);
@@ -1289,8 +1289,8 @@ public:
}
private:
static schema_ptr build_schema() {
auto id = generate_legacy_id(system_keyspace::NAME, "cdc_streams");
return schema_builder(system_keyspace::NAME, "cdc_streams", std::make_optional(id))
auto id = generate_legacy_id(system_keyspace::NAME, system_keyspace::CDC_STREAMS);
return schema_builder(system_keyspace::NAME, system_keyspace::CDC_STREAMS, std::make_optional(id))
.with_column("keyspace_name", utf8_type, column_kind::partition_key)
.with_column("table_name", utf8_type, column_kind::partition_key)
.with_column("timestamp", timestamp_type, column_kind::clustering_key)


@@ -204,7 +204,7 @@ ring_position_range_sharder::next(const schema& s) {
return ring_position_range_and_shard{std::move(_range), shard};
}
ring_position_range_vector_sharder::ring_position_range_vector_sharder(const sharder& sharder, dht::partition_range_vector ranges)
ring_position_range_vector_sharder::ring_position_range_vector_sharder(const sharder& sharder, utils::chunked_vector<dht::partition_range> ranges)
: _ranges(std::move(ranges))
, _sharder(sharder)
, _current_range(_ranges.begin()) {


@@ -11,6 +11,7 @@
#include "dht/ring_position.hh"
#include "dht/token-sharding.hh"
#include "utils/interval.hh"
#include "utils/chunked_vector.hh"
#include <vector>
@@ -89,7 +90,7 @@ struct ring_position_range_and_shard_and_element : ring_position_range_and_shard
//
// During migration uses a view on shard routing for reads.
class ring_position_range_vector_sharder {
using vec_type = dht::partition_range_vector;
using vec_type = utils::chunked_vector<dht::partition_range>;
vec_type _ranges;
const sharder& _sharder;
vec_type::iterator _current_range;
@@ -104,7 +105,7 @@ public:
// Initializes the `ring_position_range_vector_sharder` with the ranges to be processed.
// Input ranges should be non-overlapping (although nothing bad will happen if they do
// overlap).
ring_position_range_vector_sharder(const sharder& sharder, dht::partition_range_vector ranges);
ring_position_range_vector_sharder(const sharder& sharder, utils::chunked_vector<dht::partition_range> ranges);
// Fetches the next range-shard mapping. When the input range is exhausted, std::nullopt is
// returned. Within an input range, results are contiguous and non-overlapping (but since input
// ranges usually are discontiguous, overall the results are not contiguous). Together, the results


@@ -30,6 +30,31 @@ enum class token_kind {
after_all_keys,
};
// Represents a token for partition keys.
// Has a disengaged state, which sorts before all engaged states.
struct raw_token {
int64_t value;
/// Constructs a disengaged token.
raw_token() : value(std::numeric_limits<int64_t>::min()) {}
/// Constructs an engaged token.
/// The token must be of token_kind::key kind.
explicit raw_token(const token&);
explicit raw_token(int64_t v) : value(v) {};
std::strong_ordering operator<=>(const raw_token& o) const noexcept = default;
std::strong_ordering operator<=>(const token& o) const noexcept;
/// Returns true iff engaged.
explicit operator bool() const noexcept {
return value != std::numeric_limits<int64_t>::min();
}
};
using raw_token_opt = seastar::optimized_optional<raw_token>;
class token {
// INT64_MIN is not a legal token, but a special value used to represent
// infinity in token intervals.
@@ -52,6 +77,10 @@ public:
constexpr explicit token(int64_t d) noexcept : token(kind::key, normalize(d)) {}
token(raw_token raw) noexcept
: token(raw ? kind::key : kind::before_all_keys, raw.value)
{ }
// This constructor seems redundant with the bytes_view constructor, but
// it's necessary for IDL, which passes a deserialized_bytes_proxy here.
// (deserialized_bytes_proxy is convertible to bytes&&, but not bytes_view.)
@@ -223,6 +252,29 @@ public:
}
};
inline
raw_token::raw_token(const token& t)
: value(t.raw())
{
#ifdef DEBUG
assert(t._kind == token::kind::key);
#endif
}
inline
std::strong_ordering raw_token::operator<=>(const token& o) const noexcept {
switch (o._kind) {
case token::kind::after_all_keys:
return std::strong_ordering::less;
case token::kind::before_all_keys:
// before_all_keys has a raw value set to the same raw value as a disengaged raw_token, and sorts before all keys.
// So we can order them by just comparing raw values.
[[fallthrough]];
case token::kind::key:
return value <=> o._data;
}
}
inline constexpr std::strong_ordering tri_compare_raw(const int64_t l1, const int64_t l2) noexcept {
if (l1 == l2) {
return std::strong_ordering::equal;
@@ -329,6 +381,17 @@ struct fmt::formatter<dht::token> : fmt::formatter<string_view> {
}
};
template <>
struct fmt::formatter<dht::raw_token> : fmt::formatter<string_view> {
template <typename FormatContext>
auto format(const dht::raw_token& t, FormatContext& ctx) const {
if (!t) {
return fmt::format_to(ctx.out(), "null");
}
return fmt::format_to(ctx.out(), "{}", t.value);
}
};
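Note on the hunks above: the sentinel convention behind `raw_token` can be modelled in a few lines. This is only a minimal Python sketch of the idea, not the C++ implementation; the class name `RawToken` and the constant `INT64_MIN` are illustrative names that do not appear in the diff. It assumes, as the comments in the diff state, that the disengaged state is stored as the minimum 64-bit value and therefore sorts before every engaged token.

```python
from dataclasses import dataclass

# Smallest signed 64-bit value, used here as the "disengaged" sentinel (assumption
# mirroring std::numeric_limits<int64_t>::min() in the diff).
INT64_MIN = -(2 ** 63)


@dataclass(frozen=True, order=True)
class RawToken:
    """Illustrative model: one integer, with INT64_MIN meaning 'no token'."""
    value: int = INT64_MIN

    def __bool__(self) -> bool:
        # Engaged iff the stored value is not the sentinel.
        return self.value != INT64_MIN

    def __str__(self) -> str:
        # Same output convention as the fmt::formatter in the diff.
        return "null" if not self else str(self.value)


tokens = sorted([RawToken(7), RawToken(), RawToken(-3)])
```

Because ordering is a plain comparison of the stored integer, the disengaged value needs no special case when sorting.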
namespace std {
template<>

View File

@@ -131,6 +131,28 @@ def configure_iotune_open_fd_limit(shards_count):
logging.error(f"Required FDs count: {precalculated_fds_count}, default limit: {fd_limits}!")
sys.exit(1)
def force_random_request_size_of_4k():
"""
It is a known bug that on i4i, i7i, i8g, i8ge instances, the disk controller reports the wrong
physical sector size as 512 bytes, but the actual physical sector size is 4096 bytes. This function
helps us work around that issue until AWS manages to get a fix for it. It returns 4096 if it
detects it's running on one of the affected instance types, otherwise it returns None and IOTune
will use the physical sector size reported by the disk.
"""
path="/sys/devices/virtual/dmi/id/product_name"
try:
with open(path, "r") as f:
instance_type = f.read().strip()
except FileNotFoundError:
logging.warning(f"Couldn't find {path}. Falling back to IOTune using the physical sector size reported by disk.")
return
prefixes = ["i7i", "i4i", "i8g", "i8ge"]
if any(instance_type.startswith(p) for p in prefixes):
return 4096
def run_iotune():
if "SCYLLA_CONF" in os.environ:
conf_dir = os.environ["SCYLLA_CONF"]
@@ -173,6 +195,8 @@ def run_iotune():
configure_iotune_open_fd_limit(cpudata.nr_shards())
if (reqsize := force_random_request_size_of_4k()):
iotune_args += ["--random-write-io-buffer-size", f"{reqsize}"]
try:
subprocess.check_call([bindir() + "/iotune",
"--format", "envfile",
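Note on the hunks above: the prefix check can be exercised without reading `/sys` by passing the product name in as a string. The following is a minimal sketch under that assumption; `forced_request_size` and `extra_iotune_args` are hypothetical helper names, while the prefixes and the `--random-write-io-buffer-size` flag are taken from the diff.

```python
from typing import List, Optional

# Prefixes copied from the diff. "i8ge" is already matched by "i8g",
# so listing it separately is harmless but redundant.
AFFECTED_PREFIXES = ("i4i", "i7i", "i8g", "i8ge")


def forced_request_size(product_name: str) -> Optional[int]:
    """Return 4096 for an affected instance type, otherwise None (hypothetical helper)."""
    if product_name.strip().startswith(AFFECTED_PREFIXES):
        return 4096
    return None


def extra_iotune_args(product_name: str) -> List[str]:
    """Build the extra iotune arguments the diff appends when a size is forced."""
    reqsize = forced_request_size(product_name)
    if reqsize:
        return ["--random-write-io-buffer-size", str(reqsize)]
    return []
```

On any other instance type the helper returns an empty list, which matches the fallback described in the docstring: IOTune keeps the sector size reported by the disk.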

View File

@@ -17,6 +17,7 @@ import stat
import logging
import pyudev
import psutil
import platform
from pathlib import Path
from scylla_util import *
from subprocess import run, SubprocessError
@@ -102,6 +103,21 @@ def is_selinux_enabled():
return True
return False
def is_kernel_version_at_least(major, minor):
"""Check if the Linux kernel version is at least major.minor"""
try:
kernel_version = platform.release()
# Extract major.minor from version string like "5.15.0-56-generic"
version_parts = kernel_version.split('.')
if len(version_parts) >= 2:
kernel_major = int(version_parts[0])
kernel_minor = int(version_parts[1])
return (kernel_major, kernel_minor) >= (major, minor)
except (ValueError, IndexError):
# If we can't parse the version, assume older kernel for safety
pass
return False
if __name__ == '__main__':
if os.getuid() > 0:
print('Requires root permission.')
@@ -231,8 +247,17 @@ if __name__ == '__main__':
# see https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/tree/mkfs/xfs_mkfs.c .
# and it also cannot be smaller than the sector size.
block_size = max(1024, sector_size)
run('udevadm settle', shell=True, check=True)
run(f'mkfs.xfs -b size={block_size} {fsdev} -K -m rmapbt=0 -m reflink=0', shell=True, check=True)
# On Linux 5.12+, sub-block overwrites are supported well, so keep the default block
# size, which will play better with the SSD.
if is_kernel_version_at_least(5, 12):
block_size_opt = ""
else:
block_size_opt = f"-b size={block_size}"
run(f'mkfs.xfs {block_size_opt} {fsdev} -K -m rmapbt=0 -m reflink=0', shell=True, check=True)
run('udevadm settle', shell=True, check=True)
if is_debian_variant():
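Note on the hunks above: the version check is easier to reason about when the release string is a parameter instead of coming from `platform.release()`. The sketch below follows the same parsing rule as the diff; `kernel_at_least` and `mkfs_block_size_opt` are hypothetical names introduced only for illustration. One consequence of the rule is that a release string whose second component is not a plain integer is treated as an older kernel.

```python
def kernel_at_least(release: str, major: int, minor: int) -> bool:
    """Same parsing rule as the diff, with the release string passed in explicitly."""
    parts = release.split('.')
    try:
        if len(parts) >= 2:
            return (int(parts[0]), int(parts[1])) >= (major, minor)
    except ValueError:
        # Unparseable component: fall through and assume an older kernel.
        pass
    return False


def mkfs_block_size_opt(release: str, block_size: int) -> str:
    """Return the mkfs.xfs block-size option the diff would choose for this kernel."""
    if kernel_at_least(release, 5, 12):
        return ""  # keep the mkfs.xfs default block size
    return f"-b size={block_size}"
```

For example, a release string such as `4.18.0-477.el8.x86_64` keeps the explicit `-b size=...` option, while `5.15.0-56-generic` drops it.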

View File

@@ -1 +1 @@
SCYLLA_NODE_EXPORTER_ARGS="--collector.interrupts --no-collector.hwmon --no-collector.bcache --no-collector.btrfs --no-collector.fibrechannel --no-collector.infiniband --no-collector.ipvs --no-collector.nfs --no-collector.nfsd --no-collector.powersupplyclass --no-collector.rapl --no-collector.tapestats --no-collector.thermal_zone --no-collector.udp_queues --no-collector.zfs"
SCYLLA_NODE_EXPORTER_ARGS="--collector.interrupts --collector.ethtool.metrics-include='(bw_in_allowance_exceeded|bw_out_allowance_exceeded|conntrack_allowance_exceeded|conntrack_allowance_available|linklocal_allowance_exceeded)' --collector.ethtool --no-collector.hwmon --no-collector.bcache --no-collector.btrfs --no-collector.fibrechannel --no-collector.infiniband --no-collector.ipvs --no-collector.nfs --no-collector.nfsd --no-collector.powersupplyclass --no-collector.rapl --no-collector.tapestats --no-collector.thermal_zone --no-collector.udp_queues --no-collector.zfs"

View File

@@ -31,4 +31,5 @@ def parse():
parser.add_argument('--replace-address-first-boot', default=None, dest='replaceAddressFirstBoot', help="[[deprecated]] IP address of a dead node to replace.")
parser.add_argument('--dc', default=None, dest='dc', help="The datacenter name for this node, for use with the snitch GossipingPropertyFileSnitch.")
parser.add_argument('--rack', default=None, dest='rack', help="The rack name for this node, for use with the snitch GossipingPropertyFileSnitch.")
parser.add_argument('--blocked-reactor-notify-ms', default='25', dest='blocked_reactor_notify_ms', help="Set the blocked reactor notification timeout in milliseconds. Defaults to 25.")
return parser.parse_known_args()

View File

@@ -97,7 +97,9 @@ bcp LICENSE-ScyllaDB-Source-Available.md /licenses/
run microdnf clean all
run microdnf --setopt=tsflags=nodocs -y update
run microdnf --setopt=tsflags=nodocs -y install hostname kmod procps-ng python3 python3-pip
run microdnf --setopt=tsflags=nodocs -y install hostname kmod procps-ng python3 python3-pip cpio
# Extract only systemctl binary from systemd package to avoid installing the whole systemd in the container.
run bash -rc "microdnf download systemd && rpm2cpio systemd-*.rpm | cpio -idmv ./usr/bin/systemctl && rm -rf systemd-*.rpm"
run curl -L --output /etc/yum.repos.d/scylla.repo ${repo_file_url}
run pip3 install --no-cache-dir --prefix /usr supervisor
run bash -ec "echo LANG=C.UTF-8 > /etc/locale.conf"
@@ -106,6 +108,8 @@ run bash -ec "cat /scylla_bashrc >> /etc/bash.bashrc"
run mkdir -p /var/log/scylla
run chown -R scylla:scylla /var/lib/scylla
run sed -i -e 's/^SCYLLA_ARGS=".*"$/SCYLLA_ARGS="--log-to-syslog 0 --log-to-stdout 1 --network-stack posix"/' /etc/sysconfig/scylla-server
# Cleanup packages not needed in the final image and clean package manager cache to reduce image size.
run bash -rc "microdnf remove -y cpio && microdnf clean all"
run mkdir -p /opt/scylladb/supervisor
run touch /opt/scylladb/SCYLLA-CONTAINER-FILE

View File

@@ -46,6 +46,7 @@ class ScyllaSetup:
self._extra_args = extra_arguments
self._dc = arguments.dc
self._rack = arguments.rack
self._blocked_reactor_notify_ms = arguments.blocked_reactor_notify_ms
def _run(self, *args, **kwargs):
logging.info('running: {}'.format(args))
@@ -205,7 +206,7 @@ class ScyllaSetup:
elif self._replaceAddressFirstBoot is not None:
args += ["--replace-address-first-boot %s" % self._replaceAddressFirstBoot]
args += ["--blocked-reactor-notify-ms 999999999"]
args += ["--blocked-reactor-notify-ms %s" % self._blocked_reactor_notify_ms]
with open("/etc/scylla.d/docker.conf", "w") as cqlshrc:
cqlshrc.write("SCYLLA_DOCKER_ARGS=\"%s\"\n" % (" ".join(args) + " " + " ".join(self._extra_args)))
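Note on the hunks above: taken together, the two Docker changes replace the hard-coded `999999999` with a command-line value that defaults to `25`. A minimal sketch of that flow follows. It assumes only the one option shown in the diff and leaves out the extra arguments that the real script appends; `build_parser` and `docker_args_line` are hypothetical names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Parser with only the option added in the diff."""
    parser = argparse.ArgumentParser()
    parser.add_argument('--blocked-reactor-notify-ms', default='25',
                        dest='blocked_reactor_notify_ms')
    return parser


def docker_args_line(argv):
    """Build a SCYLLA_DOCKER_ARGS line from argv (extra arguments omitted)."""
    # parse_known_args tolerates options this parser does not know about.
    known, _unknown = build_parser().parse_known_args(argv)
    args = ["--blocked-reactor-notify-ms %s" % known.blocked_reactor_notify_ms]
    return 'SCYLLA_DOCKER_ARGS="%s"' % " ".join(args)
```

With no arguments the line carries the new default of 25; passing `--blocked-reactor-notify-ms 100` overrides it.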

View File

@@ -1,16 +1,25 @@
{
"Linux Distributions": {
"Ubuntu": ["22.04", "24.04"],
"Debian": ["11"],
"Debian": ["11", "12"],
"Rocky / CentOS / RHEL": ["8", "9", "10"],
"Amazon Linux": ["2023"]
},
"ScyllaDB Versions": [
{
"version": "ScyllaDB 2025.4",
"supported_OS": {
"Ubuntu": ["22.04", "24.04"],
"Debian": ["11", "12"],
"Rocky / CentOS / RHEL": ["8", "9", "10"],
"Amazon Linux": ["2023"]
}
},
{
"version": "ScyllaDB 2025.3",
"supported_OS": {
"Ubuntu": ["22.04", "24.04"],
"Debian": ["11"],
"Debian": ["11", "12"],
"Rocky / CentOS / RHEL": ["8", "9", "10"],
"Amazon Linux": ["2023"]
}
@@ -19,7 +28,7 @@
"version": "ScyllaDB 2025.2",
"supported_OS": {
"Ubuntu": ["22.04", "24.04"],
"Debian": ["11"],
"Debian": ["11", "12"],
"Rocky / CentOS / RHEL": ["8", "9"],
"Amazon Linux": ["2023"]
}
@@ -28,7 +37,7 @@
"version": "ScyllaDB 2025.1",
"supported_OS": {
"Ubuntu": ["22.04", "24.04"],
"Debian": ["11"],
"Debian": ["11", "12"],
"Rocky / CentOS / RHEL": ["8", "9"],
"Amazon Linux": ["2023"]
}

View File

@@ -1,6 +1,18 @@
### a dictionary of redirections
#old path: new path
# Move the driver information to another project
/stable/using-scylla/drivers/index.html: https://docs.scylladb.com/stable/drivers/index.html
/stable/using-scylla/drivers/dynamo-drivers/index.html: https://docs.scylladb.com/stable/drivers/dynamo-drivers.html
/stable/using-scylla/drivers/cql-drivers/index.html: https://docs.scylladb.com/stable/drivers/cql-drivers.html
/stable/using-scylla/drivers/cql-drivers/scylla-python-driver.html: https://docs.scylladb.com/stable/drivers/cql-drivers.html
/stable/using-scylla/drivers/cql-drivers/scylla-java-driver.html: https://docs.scylladb.com/stable/drivers/cql-drivers.html
/stable/using-scylla/drivers/cql-drivers/scylla-go-driver.html: https://docs.scylladb.com/stable/drivers/cql-drivers.html
/stable/using-scylla/drivers/cql-drivers/scylla-gocqlx-driver.html: https://docs.scylladb.com/stable/drivers/cql-drivers.html
/stable/using-scylla/drivers/cql-drivers/scylla-cpp-driver.html: https://docs.scylladb.com/stable/drivers/cql-drivers.html
/stable/using-scylla/drivers/cql-drivers/scylla-rust-driver.html: https://docs.scylladb.com/stable/drivers/cql-drivers.html
# Redirect 2025.1 upgrade guides that are not on master but were indexed by Google (404 reported)
/master/upgrade/upgrade-guides/upgrade-guide-from-2024.x-to-2025.1/upgrade-guide-from-2024.x-to-2025.1.html: https://docs.scylladb.com/manual/stable/upgrade/index.html
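Note on the hunk above: each entry maps an old path to a new URL, separated by a colon and a space. The following minimal sketch reads such a file into a dictionary, assuming one `old: new` pair per line and `#` for comments, as in the lines shown. `parse_redirects` is a hypothetical helper and is not part of the documentation tooling.

```python
def parse_redirects(text: str) -> dict:
    """Parse 'old path: new path' lines, skipping blank lines and # comments."""
    redirects = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first ": " only; the "://" inside the URL has no space after the colon.
        old, sep, new = line.partition(": ")
        if sep:
            redirects[old] = new
    return redirects


SAMPLE = """\
### a dictionary of redirections
#old path: new path
/stable/using-scylla/drivers/index.html: https://docs.scylladb.com/stable/drivers/index.html
/stable/using-scylla/drivers/dynamo-drivers/index.html: https://docs.scylladb.com/stable/drivers/dynamo-drivers.html
"""

redirects = parse_redirects(SAMPLE)
```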

View File

@@ -134,10 +134,6 @@ want modify a non-top-level attribute directly (e.g., a.b[3].c) need RMW:
Alternator implements such requests by reading the entire top-level
attribute a, modifying only a.b[3].c, and then writing back a.
Currently, Alternator doesn't use Tablets. That's because Alternator relies
on LWT (lightweight transactions), and LWT is not supported in keyspaces
with Tablets enabled.
```{eval-rst}
.. toctree::
:maxdepth: 2

View File

@@ -109,6 +109,32 @@ to do what, configure the following in ScyllaDB's configuration:
alternator_enforce_authorization: true
```
Note: switching `alternator_enforce_authorization` from `false` to `true`
before the client application has the proper secret keys and permission
tables set up will cause the application's requests to immediately fail.
Therefore, we recommend that you begin by keeping `alternator_enforce_authorization`
set to `false` and setting `alternator_warn_authorization` to `true`.
This setting will continue to allow all requests without failing on
authentication or authorization errors - but will _count_ would-be
authentication and authorization failures in the two metrics:
* `scylla_alternator_authentication_failures`
* `scylla_alternator_authorization_failures`
`alternator_warn_authorization=true` also generates a WARN-level log message
on each authentication or authorization failure. Each of these log messages
includes the string `alternator_enforce_authorization=true`, and information
that can help pinpoint the source of the error - such as the username
involved in the attempt, and the address of the client sending the request.
When you see that neither metric is increasing (or, alternatively, that no
more log messages appear), you can be sure that the application is properly
set up and can finally set `alternator_enforce_authorization` to `true`.
You can leave `alternator_warn_authorization` set or unset, depending on
whether or not you want to see log messages when requests fail on
authentication/authorization (in any case, the metric counts these failures,
and the client will also get the error).
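The "metrics are not increasing" check described above can be sketched as a comparison of two scrapes taken some time apart. This is a rough illustration only. It assumes the metrics are available in the Prometheus text format with no trailing timestamps, and the `shard` label in the usage example is made up. The helper names are hypothetical; only the two metric names come from the text above.

```python
WATCHED = (
    "scylla_alternator_authentication_failures",
    "scylla_alternator_authorization_failures",
)


def failure_totals(metrics_text: str) -> dict:
    """Sum the watched counters over all label sets in one scrape."""
    totals = {name: 0.0 for name in WATCHED}
    for line in metrics_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The metric name ends at the first "{" or at the first whitespace.
        name = line.split("{", 1)[0].split()[0]
        if name in totals:
            # Assumption: the sample value is the last whitespace-separated field.
            totals[name] += float(line.rsplit(None, 1)[1])
    return totals


def safe_to_enforce(before: str, after: str) -> bool:
    """True only if neither counter changed between the two scrapes."""
    first, second = failure_totals(before), failure_totals(after)
    return all(second[name] == first[name] for name in WATCHED)
```

Requiring the totals to be equal, instead of merely non-increasing, is a deliberately conservative choice: a counter that dropped (for example after a restart) also yields `False`, so the comparison should be repeated with fresh scrapes.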
Alternator implements the same [signature protocol](https://docs.aws.amazon.com/general/latest/gr/signature-version-4.html)
as DynamoDB and the rest of AWS. Clients use, as usual, an access key ID and
a secret access key to prove their identity and the authenticity of their

Some files were not shown because too many files have changed in this diff