Commit Graph

49369 Commits

Author SHA1 Message Date
Nadav Har'El
b4e3d4ac2f alternator: nicer error message for integer overflow in list index
In the DynamoDB API, when "a" is a list attribute, a[999] returns the
1000th element. But if the list isn't that long (e.g., it only has 5
elements), a[999] returns nothing - it's not an error.

But it turns out that when the index is so long that it can't even be
parsed as an integer, e.g., 99999999999999, DynamoDB does report an
error:

    Invalid ProjectionExpression: List index is not within the
    allowable range; index: [99999999999999]

Before this patch, Alternator also returned an error in this case,
with the right type (ValidationException), but with a strange low-level
error text:

    Failed parsing ProjectionExpression 'a[99999999999999]':
    std::out_of_range (stoi)

The problem was that the code (in alternator/expressions.g) ran stoi()
without converting its std::out_of_range exception to a better user-facing
message. We do this in this patch, and the error message now looks like:

    Failed parsing ProjectionExpression 'a[99999999999999]':
    list index out of integer range

This patch also includes a test reproducing this error, which passes
on DynamDB and on Alternator it fails before this patch and passes with
the patch.

Fixes #25947

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25951
2025-09-15 08:43:00 +03:00
Nadav Har'El
208d3986a7 alternator: add explanation of internal tags
Alternator needs to store a few pieces of information for each table
that it can't store in the existing CQL schema. We decided to store
this information in hidden tags - tags named with the prefix "system:" -
and we already have four of those: Provisioned RCU and WCU, table
creation time, and TTL's expiration-time attribute.

This patch moves the definition of all four tags to one place in
executor.cc, adds a short comment about the content of each tag,
and adds a longer comment explaining why we have these hidden tags
at all.

It is expected that more hidden tags will follow - e.g., to solve
issue #5320. So we expect more tags to be added later in the same
place in the code.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25980
2025-09-15 08:41:39 +03:00
Emil Maskovsky
99db980899 gossiper: eliminate duplicate code in do_shadow_round
Remove a redundant code block inadvertently introduced in commit 4b3d160f34.
While the duplicate did not affect functionality, its presence could
cause confusion and maintenance issues.

This change does not alter behavior and is purely a cleanup.

Fixes: scylladb/scylladb#25999

Backport: The issue exists in all 2025 branches, so it should be
backported accordingly.

Closes scylladb/scylladb#26001
2025-09-15 08:35:04 +03:00
Jenkins Promoter
c63b335819 Update pgo profiles - aarch64 2025-09-15 05:17:07 +03:00
Jenkins Promoter
e97a0c8b42 Update pgo profiles - x86_64 2025-09-14 21:23:37 -04:00
Aleksandra Martyniuk
75b772adfb db: optimize cache invalidation following repair/streaming
Currently, if a new sstable is created during repair/streaming,
we invalidate its whole	token range in cache. If the sstable
is sparse, we unnecessarily clear too much data.

Modify cache invalidation, so that only the partitions present
in the sstable are cleared.

To check whether a partition is present in the sstable, we use bloom
filters. Bloom filters may return false positives and show that
an sstable contains a partition, even though it does not. Due to that
we may invalidate a bit more than we need to, but the cache will be
in valid state.

An issue arises when we do not invalidate two consecutive partitions
that are continuous. The sstable may contain a token that falls
between these partitions, breaking the continuity. To check that, we
would need to scan sstable index. However, such a change would
noticeably complicate the invalidation, both performance and code.
In this change, sstable index reader isn't used. Instead, the continuity
flag is unset for all scanned partitions. This comes at a cost of
heavier reads, as we will need to verify continuity when reading more
than one partition from cache.

Fixes: https://github.com/scylladb/scylladb/issues/9136.

Closes scylladb/scylladb#25996
2025-09-14 19:48:14 +03:00
Lakshmi Narayanan Sreethar
1d1e572962 sstables: skip bloom filter rebuilds with minimal savings
If a bloom filter was built with a bad partition estimate, it is rebuilt
right before the sstable is sealed. The rebuild is already skipped if
the current bitset size results in a false-positive rate within 75%–125%
of the configured value.

This patch adds additional conditions to prevent rebuilds when the
savings are minimal. It also skips rebuilding for garbage collected
sstables, since they will be dropped soon anyway.

Also updated and added more test cases to cover these new criteria for
bloom filter rebuilds.

Fixes #25464
Fixes #25468

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#25968
2025-09-14 18:19:50 +03:00
Nadav Har'El
5307d1b9a8 Merge 'vector_index: add version to index options' from Dawid Pawlik
Since creating the vector index does not lead to creation of a view table [#24438] (whose version info had been logged in `system_schema.scylla_tables`) we lacked the information about the version of the index.

The solution we arrived at is to add the version as a field in options column of `system_schema.indexes`.
It requires few changes and seems unintruitive for existing infrastructure.

This patch implements the solution described above.

Refs: VECTOR-142

Closes scylladb/scylladb#25614

* github.com:scylladb/scylladb:
  cqlpy/test_vector_index: add vector index version test
  vector_index, index_prop_defs: add version to index options
  create_index_statement: rename `validator` to `custom_index_factory`
  custom index: rename `custom_index_option_name`
  vector_index: rename `supported_options` to `vector_index_options`
2025-09-14 15:35:53 +03:00
Radosław Cybulski
30306c3375 Remove const & from tags_extension constructor
`tags_extension` constructor unnecesarily takes `std::map` by const ref,
forcing a copy. This patch removes const ref for performance reasons.

Closes scylladb/scylladb#25977
2025-09-14 13:32:21 +03:00
Ernest Zaslavsky
44d34663bc treewide: seastar module update and fix broken rest client
start using `write_body` in `rest/client` to properly set headers due to changes applied to seastar's http client

Seastar module update
```
b6be384e Merge 'http: generalize Content-Type setting' from Nadav Har'El
74472298 http: generalize request's Content-Type setting
9fd5a1cc http: generalize reply's Content-Type setting
a2665f38 memory: Remove deprecated enable_abort_on_allocation_failure()
d2a5a8a9 resource.cc: Remove some dead code
7ad9f424 http: Add support of multiple key repetitions for the request
a636baca task: Move task::get_backtrace() definition in its class
a0101efa Fixed "doxygen" spelling in error message
db969482 Merge 'http/reply: introduce set_cookie()' from Botond Dénes
5357b434 http/reply: introduce set_cookie()
1ddcf05f http/reply: make write_reply*() public
4b782d73 http/connection: start_response(): fix indentation
720feca0 http/reply: encapsulate reply writing in write_reply()
3e19917d Merge 'exceptions: log thrown and propagated exception with distinct log levels' from Botond Dénes
db9aea93 Merge 'Correctly wrap up abandoned yielding directory lister' from Pavel Emelyanov
dbb2bf3f test: Add test for input_stream::read_exactly()
a5308ec9 file/directory_lister: Correctly wrap up fallback generator
4f0811f4 file/directory_lister: Convert on-stack queue to shared pointer
59801da7 tests: Add directory lister early drop cases
33233032 http/reply: s/write_reply_to_connection/write_reply/
69b93620 http/reply: write_reply_{to_connection,headers}(): pass output stream
56e9bda7 test: Convert directory_test into seastar test
96782358 Merge 'Improve io_tester's seqwrite and append workloads' from Pavel Emelyanov
8b46e3d4 SEASTAR_ASSERT: assert to stderr and flush stream
3370e22a tutorial.md: use current_exception_as_future()
e977453a Add fixture support for seastar::testing
3e70d7f7 io_tester: Do not set append_is_unlikely unconditionally
2a4ae7b4 io_tester: Count file size overflows
5e678bb5 io_tester: Tuneup size overflow check
d5dad8ce io_tester: Move position management code to io_class_data
5586a056 io_tester: Rename seqwrite -> overwrite
92df2fb2 io_tester: Relax return value of create_and_fill_file()
03d9500d io_tester: Dont fill file for APPEND
d6844a7b io_tester: Indentation fix after previous patch
fb9e0088 io_tester: Coroutinize create_and_fill_file()
2f802f57 exceptions: log thrown and propagated exception with distinct log levels
4971fa70 util: move log-level into own header
39448fc1 Merge 'Fix and tune http::request setup by client' from Pavel Emelyanov
52d0c4fb iostream: Move output_stream::write(scattered_message) lower
7a52f734 Merge 'read_first_line: Missing pragma and licence' from Ernest Zaslavsky
d0881b7e read_first_line: Add missing license boilerplate
988a0e99 read_first_line:: Add missing `#pragma once`
42675266 http: Make client::make_request accept const request&
c7709fb5 http: Make request making API return exceptional future not throw
b68ed89b http: Move request content length header setup
1d96dac6 http: Move request version configuration
072e86f6 http: Setup request once
```

Closes scylladb/scylladb#25915
2025-09-13 17:14:28 +03:00
Asias He
9bca90be0d repair: Fix repair_row_level_stop verb idl
The version keyword is missed for the optional mark_as_repaired
parameter. This causes the new node to expect more data to come:

INFO  2025-09-01 19:23:05,332 [shard 0:strm] rpc - client
127.0.7.6:50116 msg_id 8:  caught exception while processing a message:
std::out_of_range (deserialization buffer underflow)

When the sender is an old node in a mixed cluster, the data will never
come. To fix, add the missing version keyword.

Our idl-compiler.py should have caught the typo since the keyword
was missing in the [[]] tag.

Fixes #25666

Closes scylladb/scylladb#25782
2025-09-12 15:58:19 +03:00
Avi Kivity
ef7babda3d Merge 'test: deflake test_restart_leaving_replica_during_cleanup' from Patryk Jędrzejczak
The test started hitting #21779 recently. We deflake it in this commit
by disabling the tablet load balancing before dropping the keyspace at
the end of the test.

We still have to understand why the test started hitting #21779, so we
keep #25938 open.

Refs #25938

The test was flaky only on master, so no backport needed.

Closes scylladb/scylladb#25975

* github.com:scylladb/scylladb:
  test: enable load balancing on a single node in test_restart_leaving_replica_during_cleanup
  test: deflake test_restart_leaving_replica_during_cleanup
2025-09-12 15:58:19 +03:00
Sayanta Banerjee
6092520631 Small grammatical changes
Closes scylladb/scylladb#24667
2025-09-12 15:58:19 +03:00
Radosław Cybulski
436150eb52 treewide: fix spelling errors
Fix spelling errors reported by copilot on github.
Remove single use namespace alias.

Closes scylladb/scylladb#25960
2025-09-12 15:58:19 +03:00
Patryk Jędrzejczak
aaab71c14e test: enable load balancing on a single node in test_restart_leaving_replica_during_cleanup
Doing it on more than one node is redundant.
2025-09-11 13:19:56 +02:00
Patryk Jędrzejczak
4c9efc08d8 test: deflake test_restart_leaving_replica_during_cleanup
The test started hitting #21779 recently. We deflake it in this commit
by disabling the tablet load balancing before dropping the keyspace at
the end of the test.

We still have to understand why the test started hitting #21779, so we
keep #25938 open.

Refs #25938
2025-09-11 13:19:51 +02:00
Patryk Jędrzejczak
eae12c1717 test: cluster: add a test for restarts with no group 0 quorum
We don't have such a test, and we could add a group 0 quorum requirement
on the restart path by mistake.

A new test, no backport.

Closes scylladb/scylladb#25623
2025-09-11 08:56:34 +03:00
Raphael S. Carvalho
b607b1c284 compaction: Fix stop of sstable cleanup
The interface suggests the whole sstable cleanup is aborted with
'nodetool stop CLEANUP', but it is currently stopping only the
ongoing cleanup task, and the compaction manager will retry the
task since the error is not propagated all the way back to the
caller. With raft topology, the coordinator should retry it though
since cleanup became mandatory with automatic cleanup. So it's
only fixing the usage where cleanup is issued manually.

The stop exception is only propagated to the caller of cleanup.
When stopping tasks during shutdown, the exception is swallowed
and the error only returned to the caller.

Fixes #20823.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#24996
2025-09-11 08:55:10 +03:00
Yaron Kaikov
902d139c80 tools: toolchain: dbuild: add setuptools_scm as dependency
this package was added as a dependnancy to `cqlsh` in 216d8b0658

Fixes: https://github.com/scylladb/scylladb/issues/25613

[Yaron: regenerate frozen toolchain with optimized clang from
	https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-aarch64.tar.gz
	https://devpkg.scylladb.com/clang/clang-20.1.8-Fedora-42-x86_64.tar.gz
]

Closes scylladb/scylladb#25932
2025-09-11 08:51:28 +03:00
Cezar Moise
20ba8d4e8c test: skip flaky test test_one_big_mutation_corrupted_on_startup
The test is flaky since it tries to corrupt the commitlog
in a non-deterministic way that sometimes allows
the tested mutation to escape and be replayed anyhow.

refs: #25627

Closes scylladb/scylladb#25950
2025-09-11 08:39:24 +03:00
Avi Kivity
c91b326d5a Merge 'transport: replace throwing protocol_exception with returns' from Dario Mirovic
Replace throwing `protocol_exception` with returning it as a result or an exceptional future in the transport server module. The goal is to improve performance.

Most of the `protocol_exception` throws were made from `fragmented_temporary_buffer` module, by passing `exception_thrower()` to its `read*` methods. `fragmented_temporary_buffer` is changed so that it now accepts an exception creator, not exception thrower. `fragmented_temporary_buffer_concepts::ExceptionCreator` concept replaced `fragmented_temporary_buffer_concepts::ExceptionThrower` and all methods that have been throwing now return failed result of type `utils::result_with_eptr`. This change is then propagated to the callers.

The scope of this patch is `protocol_exception`, so commitlog just calls `.value()` method on the result. If the result failed, that will throw the exception from the result, as defined by `utils::result_with_eptr_throw_policy`. This means that the behavior of commitlog module stays the same.

transport server module handles results gracefully. All the caller functions that return non-future value `T` now return `utils::result_with_eptr<T>`. When the caller is a function that returns a future, and it receives failed result, `make_exception_future(std::move(failed_result).value())` is returned. The rest of the callstack up to the transport server `handle_error` function is already working without throwing, and that's how zero throws is achieved.

cql3 module changes do the same as transport server module.

Benchmark that is not yet merged has commit `67fbe35833e2d23a8e9c2dcb5e04580231d8ec96`, [GitHub diff view](https://github.com/scylladb/scylladb/compare/master...nuivall:scylladb:perf_cql_raw). It uses either read or write query.

Command line used:
```
./build/release/scylla perf-cql-raw --workdir ~/tmp/scylladir --smp 1 --developer-mode 1 --workload write --duration 300 --concurrency 1000 --username cassandra --password cassandra 2>/dev/null
```
The only thing changed across runs is `--workload write`/`--workload read`.

Built and run on `release` target.

<details>

```
throughput:
        mean=   36946.04 standard-deviation=1831.28
        median= 37515.49 median-absolute-deviation=1544.52
        maximum=39748.41 minimum=28443.36
instructions_per_op:
        mean=   108105.70 standard-deviation=965.19
        median= 108052.56 median-absolute-deviation=53.47
        maximum=124735.92 minimum=107899.00
cpu_cycles_per_op:
        mean=   70065.73 standard-deviation=2328.50
        median= 69755.89 median-absolute-deviation=1250.85
        maximum=92631.48 minimum=66479.36

⏱  real=5:11.08  user=2:00.20  sys=2:25.55  cpu=85%
```

```
throughput:
        mean=   40718.30 standard-deviation=2237.16
        median= 41194.39 median-absolute-deviation=1723.72
        maximum=43974.56 minimum=34738.16
instructions_per_op:
        mean=   117083.62 standard-deviation=40.74
        median= 117087.54 median-absolute-deviation=31.95
        maximum=117215.34 minimum=116874.30
cpu_cycles_per_op:
        mean=   58777.43 standard-deviation=1225.70
        median= 58724.65 median-absolute-deviation=776.03
        maximum=64740.54 minimum=55922.58

⏱  real=5:12.37  user=27.461  sys=3:54.53  cpu=83%
```

```
throughput:
        mean=   37107.91 standard-deviation=1698.58
        median= 37185.53 median-absolute-deviation=1300.99
        maximum=40459.85 minimum=29224.83
instructions_per_op:
        mean=   108345.12 standard-deviation=931.33
        median= 108289.82 median-absolute-deviation=55.97
        maximum=124394.65 minimum=108188.37
cpu_cycles_per_op:
        mean=   70333.79 standard-deviation=2247.71
        median= 69985.47 median-absolute-deviation=1212.65
        maximum=92219.10 minimum=65881.72

⏱  real=5:10.98  user=2:40.01  sys=1:45.84  cpu=85%
```

```
throughput:
        mean=   38353.12 standard-deviation=1806.46
        median= 38971.17 median-absolute-deviation=1365.79
        maximum=41143.64 minimum=32967.57
instructions_per_op:
        mean=   117270.60 standard-deviation=35.50
        median= 117268.07 median-absolute-deviation=16.81
        maximum=117475.89 minimum=117073.74
cpu_cycles_per_op:
        mean=   57256.00 standard-deviation=1039.17
        median= 57341.93 median-absolute-deviation=634.50
        maximum=61993.62 minimum=54670.77

⏱  real=5:12.82  user=4:10.79  sys=11.530  cpu=83%
```

This shows ~240 instructions per op increase for reads and ~180 instructions per op increase for writes.

Tests have been run multiple times, with almost identical results. Each run lasted 300 seconds. Number of operations executed is roughly 38k per second * 300 seconds = 11.4m ops.

Update:

I have repeated the benchmark with clean state - reboot computer, put in performance mode, rebuild, closed other apps that might affect CPU and disk usage.

run count: 5 times before and 5 times after the patch
duration: 300 seconds

Average write throughput median before patch: 41155.99
Average write throughput median after patch: 42193.22

Median absolute deviation is also lower now, with values in range 350-550, while the previous runs' values were in range 750-1350.

</details>

Built and run on `release` target.

<details>

./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false --bypass-cache 2>/dev/null

```
throughput:
        mean=   14910.90 standard-deviation=477.72
        median= 14956.73 median-absolute-deviation=294.16
        maximum=16061.18 minimum=13198.68
instructions_per_op:
        mean=   659591.63 standard-deviation=495.85
        median= 659595.46 median-absolute-deviation=324.91
        maximum=661184.94 minimum=658001.49
cpu_cycles_per_op:
        mean=   213301.49 standard-deviation=2724.27
        median= 212768.64 median-absolute-deviation=1403.85
        maximum=225837.15 minimum=208110.12

⏱  real=5:19.26  user=5:00.22  sys=15.827  cpu=98%
```

./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false 2>/dev/null

```
throughput:
        mean=   93345.45 standard-deviation=4499.00
        median= 93915.52 median-absolute-deviation=2764.41
        maximum=104343.64 minimum=79816.66
instructions_per_op:
        mean=   65556.11 standard-deviation=97.42
        median= 65545.11 median-absolute-deviation=71.51
        maximum=65806.75 minimum=65346.25
cpu_cycles_per_op:
        mean=   34160.75 standard-deviation=803.02
        median= 33927.16 median-absolute-deviation=453.08
        maximum=39285.19 minimum=32547.13

⏱  real=5:03.23  user=4:29.46  sys=29.255  cpu=98%
```

./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache true 2>/dev/null

```
throughput:
        mean=   206982.18 standard-deviation=15894.64
        median= 208893.79 median-absolute-deviation=9923.41
        maximum=232630.14 minimum=127393.34
instructions_per_op:
        mean=   35983.27 standard-deviation=6.12
        median= 35982.75 median-absolute-deviation=3.75
        maximum=36008.24 minimum=35952.14
cpu_cycles_per_op:
        mean=   17374.87 standard-deviation=985.06
        median= 17140.81 median-absolute-deviation=368.86
        maximum=26125.38 minimum=16421.99

⏱  real=5:01.23  user=4:57.88  sys=0.124  cpu=98%
```

./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false --bypass-cache 2>/dev/null

```
throughput:
        mean=   16198.26 standard-deviation=902.41
        median= 16094.02 median-absolute-deviation=588.58
        maximum=17890.10 minimum=13458.74
instructions_per_op:
        mean=   659752.73 standard-deviation=488.08
        median= 659789.16 median-absolute-deviation=334.35
        maximum=660881.69 minimum=658460.82
cpu_cycles_per_op:
        mean=   216070.70 standard-deviation=3491.26
        median= 215320.37 median-absolute-deviation=1678.06
        maximum=232396.48 minimum=209839.86

⏱  real=5:17.33  user=4:55.87  sys=18.425  cpu=99%
```

./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache false 2>/dev/null

```
throughput:
        mean=   97067.79 standard-deviation=2637.79
        median= 97058.93 median-absolute-deviation=1477.30
        maximum=106338.97 minimum=87457.60
instructions_per_op:
        mean=   65695.66 standard-deviation=58.43
        median= 65695.93 median-absolute-deviation=37.67
        maximum=65947.76 minimum=65547.05
cpu_cycles_per_op:
        mean=   34300.20 standard-deviation=704.66
        median= 34143.92 median-absolute-deviation=321.72
        maximum=38203.68 minimum=33427.46

⏱  real=5:03.22  user=4:31.56  sys=29.164  cpu=99%
```

./build/release/scylla perf-simple-query --smp 1 --duration 300 --concurrency 1000 --enable-cache true 2>/dev/null

```
throughput:
        mean=   223495.91 standard-deviation=6134.95
        median= 224825.90 median-absolute-deviation=3302.09
        maximum=234859.90 minimum=193209.69
instructions_per_op:
        mean=   35981.41 standard-deviation=3.16
        median= 35981.13 median-absolute-deviation=2.12
        maximum=35991.46 minimum=35972.55
cpu_cycles_per_op:
        mean=   17482.26 standard-deviation=281.82
        median= 17424.08 median-absolute-deviation=143.91
        maximum=19120.68 minimum=16937.43

⏱  real=5:01.23  user=4:58.54  sys=0.136  cpu=99%
```

</details>

Fixes: #24567

This PR is a continuation of #24738 [transport: remove throwing protocol_exception on connection start](https://github.com/scylladb/scylladb/pull/24738). This PR does not solve a burning issue, but is rather an improvement in the same direction. As it is just an enhancement, it should not be backported.

Closes scylladb/scylladb#25408

* github.com:scylladb/scylladb:
  test/cqlpy: add protocol exception tests
  test/cqlpy: `test_protocol_exceptions.py` refactor message frame building
  test/cqlpy: `test_protocol_exceptions.py` refactor duplicate code
  transport: replace `make_frame` throw with return result
  cql3: remove throwing `protocol_exception`
  transport: replace throw in validate_utf8 with result_with_exception_ptr return
  transport: replace throwing protocol_exception with returns
  utils: add result_with_exception_ptr
  test/cqlpy: add unknown compression algorithm test case
2025-09-10 21:54:15 +03:00
Yaron Kaikov
0a025d121f packaging: Add adduser as dependnacy
As `adduser` command is being used by `/var/lib/dpkg/info/scylla-server.postinst` and similar during rpm post-install.

Fixes: https://github.com/scylladb/scylladb/issues/23722

Closes scylladb/scylladb#25928
2025-09-10 21:51:25 +03:00
Avi Kivity
fc64333040 Merge 'sstables/trie: add BTI index readers and writers' from Michał Chojnowski
This is yet another part in the BTI index project.

Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25506/
Next part: plugging the BTI index readers and writers into sstable readers and writers.

The new code added in this PR isn't used outside of tests yet, but it's posted as a separate PR for reviewability.

This series implements, on top of the key translation logic, and abstract trie writing and traversal logic, a writer and a reader of sstable index files (which map primary keys to positions in Data.db), as described in f16fb6765b/src/java/org/apache/cassandra/io/sstable/format/bti/BtiFormat.md.

Caveats:
1. I think the added test has reasonable coverage, but that depends on running it multiple times. (Though it shouldn't need more than a few runs to catch any bug it covers). It's somewhat awkward as a test meant for running in CI, it's better as something you run many times after a relevant change.
2. These readers and writers are intended to be compatible with Cassandra, but I did *NOT* do any compatibility testing. The writers and readers added here have only been tested against each other, not against Cassandra's readers and writers.
3. This didn't undergo any proper benchmarking and optimization work. I was doing some measurements in the past, but everything was rewritten so much since then that the my old measurements are effectively invalidated. Frankly I have no idea what the performance of all this branchy-branchy logic is now.

No backports needed, new functionality.

Closes scylladb/scylladb#25626

* github.com:scylladb/scylladb:
  test/manual: add bti_cassandra_compatibility_test
  test/lib/random_schema: add some constraints for generated uuid and time/date values
  test/lib/random_utils: add a variant of get_bytes which takes an `engine&`
  test/boost: add bti_index_test
  sstables/writer: add an accessor for the current write position in Data.db
  sstables/trie: introduce bti_index_reader
  sstables/trie: add bti_partition_index_writer.cc
  sstables/trie: add bti_row_index_writer.cc
  utils/bit_cast: add a new overload of write_unaligned()
  sstables/trie: add trie_writer::add_partial()
  sstables/consumer: add read_56()
  sstables/trie: make bti_node_reader::page_ptr copy-constructible
  sstables: extract abstract_index_reader from index_reader.hh to its own header
  sstables/trie: add an accessor to the file_writer under bti_node_sink
  sstables/types: make `deletion_time::operator tombstone()` const
  sstables/types: add sstables::deletion_time::make_live()
  sstables/trie: fix a special case in max_offset_from_child
  sstables/trie: handle `partition_region`s other than `clustered` in BTI position encoding
  sstables/trie: rewrite lcb_mismatch to handle fragment invalidation
  test/boost/bti_key_translation_test: fix a compilation error hidden behind `if constexpr`
2025-09-10 21:48:52 +03:00
Pavel Emelyanov
88a01308e7 api: Move /storage_service/keyspaces handler to database module
The handler uses database service, not storage_service, and should
belong to the corresponding API module from column_family.cc

Once moved, the handler can use captured sharded<database> reference and
forget about http_context::db.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#25834
2025-09-10 17:01:11 +02:00
Nadav Har'El
ce4592d8fc Merge 'test: cluster: deflake consistency checks after decommission' from Patryk Jędrzejczak
In the Raft-based topology, a decommissioning node is removed from group
0 after the decommission request is considered finished (and the token
ring is updated). Therefore, `check_token_ring_and_group0_consistency`
called just after decommission might fail when the decommissioned node
is still in group 0 (as a non-voter). We deflake all tests that call
`check_token_ring_and_group0_consistency` after decommission in this PR.

Fixes #25809

This PR improves CI stability and changes only tests, so it should be
backported to all supported branches.

Closes scylladb/scylladb#25927

* github.com:scylladb/scylladb:
  test: cluster: deflake consistency checks after decommission
  test: cluster: util: handle group 0 changes after token ring changes in wait_for_token_ring_and_group0_consistency
2025-09-10 17:57:02 +03:00
Dawid Pawlik
1ce76a6ca2 cqlpy/test_vector_index: add vector index version test
Test if the index version is the same as the base table version before
the index was created.
Test if recreating the index with the same parameters changes the version.
Test if altering the base table does not change the version.
Test if the user cannot specify the index version option by themself.
2025-09-10 15:19:36 +02:00
Dawid Pawlik
909a51e524 vector_index, index_prop_defs: add version to index options
Since creating the vector index does not lead to creation
of a view table [#24438] (whose version info had been logged in
`system_schema.scylla_tables`) we lack the information about
the version of the index.

The mentioned version is used to recognize the quick-drop-create
index with the same parameters that needs to be rebuild.

The case is mainly experienced while testing, benchmarking
or experimenting with Vector Search.
Nevertheless it is important to have it considered, as it is really
weird having seen that DROP and CREATE commands did not change
anything.
Although being nice "optimization" to use the same old index,
the rebuild feels more natural for the get-to-know-VS-users.
Should not change anything in a real production environment.

The solution we arrived at is to add the version as a field in
options column of `system_schema.indexes`.
The version of vector index is a base table's schema version
on which the index was created.
The table's schema version changes everytime a table is changed
meaning that CREATE INDEX or DROP INDEX statement also change it.
Every index has a different index version, so it allows to identify
them easily.

This patch implements the solution described above.
2025-09-10 15:16:54 +02:00
Michał Chojnowski
47c2d09c22 test/manual: add bti_cassandra_compatibility_test
Adds a heavy test which tests compatibility of BTI index files
between Cassandra and Scylla.
It's composed from a C++ part, used to read and write BTI files
with Scylla's readers and writers,
and a Python part which uses a Cassandra node and the C++
executable to make them read and write each other's files.

The stages of the test are:
1. Use the C++ part to generate a random BIG sstable, and matching
   BTI index files.
2. Import the BIG files into Cassandra, let it generate its own
   BTI index files.
3. Read both Scylla's BTI and Cassandra's BTI index files using
   the C++ part. Check that they return the right positions
   and tombstones for each partition and row.
4. Sneakily swap Cassandra's BTI files for Scylla's BTI files,
   and query Cassandra (via CQL) for each row. Check that each query returns
   the right result.

Not much can be inferred about the index via CQL queries,
so the check we are doing on Cassandra is relatively weak.
But in conjunction with the checks done on the Scylla part,
it's probably good enough.

The test is weird enough, and with heavy-enough dependencies
(it uses a podman container to run the Cassandra) that
ith has been put in test/manual.

To run the test, build
`build/$build_mode/test/manual/bti_cassandra_compatibility_test_g`,
and run `python test/manual/bti_cassandra_compatibility_test.py`.

Note: there's a lot of things that could go wrong in this test.
(E.g. file permission issues or port mapping issues due to the container
usage, incompatibilities between the Python driver and the random CQL
values generated by generate_random_mutations, etc).
I hope it works everywhere, but I only tested it on my machine,
running it inside the dbuild container.
2025-09-10 13:04:42 +02:00
Botond Dénes
514f59d157 tools/scylla-sstable: write: move to UUID generation
We are moving away from integer generations, so stop using them.
Also drop the --generation command-line parameter, UUID generations
don't have be provided by the caller, because random UUIDs will not
collide with each other. To help the caller still know what generation
the output sstable has (previously they provided it via --generation),
print the generation to stdout.

Closes scylladb/scylladb#25166
2025-09-10 13:47:26 +03:00
Nadav Har'El
5e7251cd40 secondary index: fix xfailing test to pass on Cassandra
We have an xfailing test test_secondary_index.py::test_limit_partition
which reproduces a Scylla bug in LIMIT when scanning a secondary index
(Refs #22158). The point of such a reproducer is to demonstrate the bug
by passing on Cassandra but failing on Scylla - yet this specific test
doesn't pass on Cassandra because it expects the wrong 3 out of 4
results to be returned:

The test begins with LIMIT 1 and sees the first result is (2,1), so we
expect when we increase the LIMIT to 3 to see more results from the same
partition (2) - and yet the test mistakenly expected the next results
to come from partition 1, which is not a reasonable expectation, and
doesn't happen in Cassandra (I checked both Cassandra 5 and 4).

After this patch, the test passes on Cassandra (I tried 4 and 5), and
continues to fail on Scylla - which returns 4 rows despite the LIMIT 3.

Note that it is debatable whether this test should insist at all
on which 3 items are returned by "LIMIT 3" - In Cassandra the ordering of
a SELECT with a secondary index is not well defined (see discussion in
Refs #23392). So an alternative implementation of this test would be
to just check that LIMIT 3 returns 3 items without insisting which:

    # In Cassandra the ordering of a SELECT with a secondary index is not
    # defined (see discussion in #23392), so we don't know which three
    # results to expect - just that it must be a 3-item subset.
    rows = list(rs)
    assert len(rows) == 3
    assert set(rows).issubset({(1,1), (1,2), (2,1), (2,2)})

However, as of yet, I did not modify this test to do this. I still
believe there is value in secondary index scans having the same order
as a scan without a secondary index has - and not an undefined order,
and if both Scylla and Cassandra implement that in practice, it's
useful for tests to validate this so we'll know if this guarantee is
ever broken.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#25676
2025-09-10 08:48:52 +03:00
Wojciech Mitros
1f9be235b8 mv: delete previously undetected ghost rows in PRUNE MATERIALIZED VIEW statement
The PRUNE MATERIALIZED VIEW statement is supposed to remove ghost rows from the
view. Ghost rows are rows in the view with no corresponding row in the base table.
Before this patch, only rows whose primary key columns of the base table had
different values than any of the base rows were treated as ghost rows by the PRUNE
statement. However, view rows which have a column in their primary key that's not
in the base primary can also be ghost rows if this column has a different value
than the base row with the same values of remaining primary key columns. That's
because these rows won't be deleted unless we change value of this column in the
base table to this specific value.
In this patch we add a check for this column in the PRUNE MATERIALIZED VIEW logic.
If this column isn't the same in the base table and the view, these rows are also
deleted.

Fixes https://github.com/scylladb/scylladb/issues/25655

Closes scylladb/scylladb#25720
2025-09-10 07:35:00 +02:00
Patryk Jędrzejczak
520cc0eeaa Merge 'test: fix race condition in test_long_join' from Emil Maskovsky
The test could trigger gossiper API calls before the API was properly registered, causing intermittent 404 errors. Previously the test waited for the "init - starting gossiper" log, but this appears before API registration completes.

Add explicit wait for gossiper API registration to ensure the endpoint is available before making requests, eliminating test flakiness.

Fixes: scylladb/scylladb#25582

No backport needed: Issue only observed in master so far.

Closes scylladb/scylladb#25583

* https://github.com/scylladb/scylladb:
  test: improve async execution in test_long_join
  test: fix race condition in test_long_join
2025-09-09 19:12:59 +02:00
Patryk Jędrzejczak
bb9fb7848a test: cluster: deflake consistency checks after decommission
In the Raft-based topology, a decommissioning node is removed from group
0 after the decommission request is considered finished (and the token
ring is updated). Therefore, `check_token_ring_and_group0_consistency`
called just after decommission might fail when the decommissioned node
is still in group 0 (as a non-voter). We deflake all tests that call
`check_token_ring_and_group0_consistency` after decommission in this
commit.

Fixes #25809
2025-09-09 19:01:12 +02:00
Patryk Jędrzejczak
e41fc841cd test: cluster: util: handle group 0 changes after token ring changes in wait_for_token_ring_and_group0_consistency
In the Raft-based topology, a decommissioning node is removed from group
0 after the decommission request is considered finished (and the token
ring is updated). `wait_for_token_ring_and_group0_consistency` doesn't
handle such a case; it only handles cases where the token ring is
updated later. We fix this in this commit.

We rely on the new implementation of
`wait_for_token_ring_and_group0_consistency` in the following commit to
fix flakiness of some tests.

We also update the obsolete docstring in this commit.
2025-09-09 19:01:09 +02:00
Avi Kivity
5237a20993 Merge 'replica: Fix split compaction when tablet boundaries change' from Raphael Raph Carvalho
Consider the following:
1) balancer emits split decision
2) split compaction starts
3) split decision is revoked
4) emits merge decision
5) completes merge, before compaction in step 2 finishes

After last step, split compaction initiated in step 2 can fail because it works with the global tablet map, rather than the map when the compaction started. With the global state changing under its feet, on merge, the mutation splitting writer will think it's going backwards since sibling tablets are merged.

This problem was also seen when running load-and-stream, where split initiated by the sstable writer failed, split completed, and the unsplit sstable is left in the table dir, causing problems in the restart.

To fix this, let's make split compaction always work with the state when it started, not a global state.

Fixes #24153.

All 2025.* versions are vulnerable, so fix must be backported to them.

Closes scylladb/scylladb#25690

* github.com:scylladb/scylladb:
  replica: Fix split compaction when tablet boundaries change
  replica: Futurize split_compaction_options()
2025-09-09 17:05:32 +03:00
Dawid Mędrek
789a4a1ce7 test/perf: Adjust tablet_load_balancing.cc to RF-rack-validity
We modify the logic to make sure that all of the keyspaces that the test
creates are RF-rack-valid. For that, we distribute the nodes across two
DCs and as many racks as the provided replication factor.

That may have an effect on the load balancing logic, but since this is
a performance test and since tablet load balancing is still taking place,
it should be acceptable.

This commit also finishes work in adjusting perf tests to pass with
the `rf_rack_valid_keyspaces` configuration option enabled. The remaining
tests either don't attempt to create keyspaces or they already create
RF-rack-valid keyspaces.

We don't need to explicitly enable the configuration option. It's already
enabled by default by `cql_test_config`. The reason why we haven't run into
any issue because of that is that performance tests are not part of our CI.

Fixes scylladb/scylladb#25127

Closes scylladb/scylladb#25728
2025-09-09 12:46:46 +03:00
Botond Dénes
a89d0a747b Merge 'test.py: add different levels of verbosity for output' from Andrei Chekun
Add another level of verbosity: quiet.
Before this it was used as a default one, but it provides not enough information.
These changes should be coupled with pytest-sugar plugin to have an intended information for each level.
Invoke the pytest as a module, instead of a separate process, to get access to the terminal to be able to it interactively.

Framework change only, so backporting in to 2025.3

Fixes: #25403

Closes scylladb/scylladb#25698

* github.com:scylladb/scylladb:
  test.py: add additional level of verbosity for output
  test.py: start pytest as a module instead of subprocess
2025-09-09 11:49:51 +03:00
Asias He
cb7db47ae1 repair: Add incremental_mode option for tablet repair
This patch introduces a new `incremental_mode` parameter to the tablet
repair REST API, providing more fine-grained control over the
incremental repair process.

Previously, incremental repair was on and could not be turned off. This
change allows users to select from three distinct modes:

- `regular`: This is the default mode. It performs a standard
  incremental repair, processing only unrepaired sstables and skipping
  those that are already repaired. The repair state (`repaired_at`,
  `sstables_repaired_at`) is updated.

- `full`: This mode forces the repair to process all sstables, including
  those that have been previously repaired. This is useful when a full
  data validation is needed without disabling the incremental repair
  feature. The repair state is updated.

- `disabled`: This mode completely disables the incremental repair logic
  for the current repair operation. It behaves like a classic
  (pre-incremental) repair, and it does not update any incremental
  repair state (`repaired_at` in sstables or `sstables_repaired_at` in
  the system.tablets table).

The implementation includes:

- Adding the `incremental_mode` parameter to the
  `/storage_service/repair/tablet` API endpoint.
- Updating the internal repair logic to handle the different modes.
- Adding a new test case to verify the behavior of each mode.
- Updating the API documentation and developer documentation.

Fixes #25605

Closes scylladb/scylladb#25693
2025-09-09 06:50:21 +03:00
Avi Kivity
c4ed7dd814 Merge 'gossiper: fix issues in processing gossip status during the startup and when messages are delayed to avoid empty host ids' from Emil Maskovsky
Populate the local state during gossiper initialization in start_gossiping, preventing an empty state from being added to _endpoint_state_map and returned in get_endpoint_states responses, that was causing an 'empty host id issue' on the other nodes during nodes restart.

Check for a race condition in do_apply_state_locally In do_apply_state_locally, a race condition can occur if a task is suspended at a preemption point while the node entry is not locked.
During this time, the host may be removed from _endpoint_state_map. When the task resumes, this can lead to inserting an entry with an empty host ID into the map, causing various errors, including a node crash.

This change adds a check after locking the map entry: if a gossip ACK update does not contain a host ID, we verify that an entry with that host ID still exists in the gossiper’s _endpoint_state_map.

Fixes https://github.com/scylladb/scylladb/issues/25831
Fixes https://github.com/scylladb/scylladb/issues/25803
Fixes https://github.com/scylladb/scylladb/issues/25702
Fixes https://github.com/scylladb/scylladb/issues/25621

Ref https://github.com/scylladb/scylla-enterprise/issues/5613

Backport: The issue affects all current releases(2025.x), therefore this PR needs to be backported to all 2025.1-2025.3.

Closes scylladb/scylladb#25849

* github.com:scylladb/scylladb:
  gossiper: fix empty initial local node state
  gossiper: add test for a race condition in start_gossiping
  gossiper: check for a race condition in `do_apply_state_locally`
  test/gossiper: add reproducible test for race condition during node decommission
2025-09-08 20:51:01 +03:00
Andrei Chekun
ea4cd431c9 test.py: add pytest-sugar plugin to the dependencies
This plugin allows having better terminal output with progress bar for
the tests.

Closes scylladb/scylladb#25845

[avi: regenerate frozen toolchain]

Closes scylladb/scylladb#25860
2025-09-08 20:50:02 +03:00
Radosław Cybulski
6d150e2d0c Fix oversized allocation in paxos under pressure
When cpu pressured, `_locks` structure in paxos might grow and cause
oversized allocations and performance drops. We reserve memory ahead of
time.

Fixes #25559

Closes scylladb/scylladb#25874
2025-09-08 20:49:00 +03:00
Yaron Kaikov
d57741edc2 build_docker.sh: enable debug symboles installation
Adding the latest scylla.repo location to our docker container, this
will allow installation scylla-debuginfo package in case it's needed

Fixes: https://github.com/scylladb/scylladb/issues/24271

Closes scylladb/scylladb#25646
2025-09-08 18:39:27 +03:00
Emil Maskovsky
f8c297ca27 test: improve async execution in test_long_join
Replace list comprehensions with asyncio.gather() to await the injection
API calls in fully concurrent manner.
2025-09-08 17:14:37 +02:00
Emil Maskovsky
a86bd06f08 test: fix race condition in test_long_join
The test could trigger gossiper API calls before the API was properly
registered, causing intermittent 404 errors. Previously the test waited
for the "init - starting gossiper" log, but this appears before API
registration completes.

Add explicit wait for gossiper API registration to ensure the endpoint
is available before making requests, eliminating test flakiness.

Fixes: scylladb/scylladb#25582
2025-09-08 17:14:37 +02:00
Pavel Emelyanov
34d1648d21 main: Properly handle zero allocation warning threshold
The --help text says about --large-memory-allocation-warning-threshold:

"Warn about memory allocations above this size; set to zero to disable."

That's half-true: setting the value to zero spams logs with warnings of
allocation of any size, as seastar treats zero threshold literaly.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#25850
2025-09-08 12:41:19 +02:00
Asias He
451e1ec659 streaming: Fix use after move in the tablet_stream_files_handler
The files object is moved before the log when stream finishes. We've
logged the files when the stream starts. Skip it in the end of
streaming.

Fixes #25830

Closes scylladb/scylladb#25835
2025-09-08 11:59:52 +02:00
Sergey Zolotukhin
b34d543f30 gossiper: fix empty initial local node state
This change removes the addition of an empty state to `_endpoint_state_map`.
Instead, a new state is created locally and then published via replicate,
avoiding the issue of an empty state existing in `_endpoint_state_map`
before the preemption point. Since this resolves the issue tested in
`test_gossiper_empty_self_id_on_shadow_round`, the `xfail` mark has been removed.

Fixes: scylladb/scylladb#25831
2025-09-08 11:38:31 +02:00
Sergey Zolotukhin
775642ea23 gossiper: add test for a race condition in start_gossiping
This change adds a test for a race condition in `start_gossiping` that
can lead to an empty self state sent in `gossip_get_endpoint_states_response`.

Test for scylladb/scylladb#25831
2025-09-08 11:38:30 +02:00
Sergey Zolotukhin
f08df7c9d7 gossiper: check for a race condition in do_apply_state_locally
In do_apply_state_locally, a race condition can occur if a task is
suspended at a preemption point while the node entry is not locked.
During this time, the host may be removed from _endpoint_state_map.
When the task resumes, this can lead to inserting an entry with an
empty host ID into the map, causing various errors, including a node
crash.

This change
1. adds a check after locking the map entry: if a gossip ACK update
   does not contain a host ID, we verify that an entry with that host ID
   still exists in the gossiper’s _endpoint_state_map.
2. Removes xfail from the test_gossiper_race test since the issue is now
   fixed.
3. Adds exception handling in `do_shadow_round` to skip responses from
   nodes that sent an empty host ID.

This re-applies the commit 13392a40d4 that
was reverted in 46aa59fe49, after fixing
the issues that caused the CI to fail.

Fixes: scylladb/scylladb#25702
Fixes: scylladb/scylladb#25621

Ref: scylladb/scylla-enterprise#5613
2025-09-08 11:38:30 +02:00
Emil Maskovsky
28e0f42a83 test/gossiper: add reproducible test for race condition during node decommission
This change introduces a targeted test that simulates the gossiper race
condition observed during node decommissioning. The test delays gossip
state application and host ID lookup to reliably reproduce the scenario
where `gossiper::get_host_id()` is called on a removed endpoint,
potentially triggering an abort in `apply_new_states`.

There is a specific error injection added to widen the race window, in
order to increase the likelihood of hitting the race condition. The
error injection is designed to delay the application of gossip state
updates, for the specific node that is being decommissioned. This should
then result in the server abort in the gossiper.

This re-applies the commit 5dac4b38fb that
was reverted in dc44fca67c, but modified
to relax the check from "on_internal_error" to a just warning log. The
more strict can be re-introduced later once we are sure that all
remaining problems are resolved and it will not break the CI.

Refs: scylladb/scylladb#25621
Fixes: scylladb/scylladb#25721
2025-09-08 11:38:30 +02:00