Commit Graph

10219 Commits

Author SHA1 Message Date
copilot-swe-agent[bot]
32bc7e3a1c Add creation_time field to task status and stats API
Co-authored-by: tgrabiec <283695+tgrabiec@users.noreply.github.com>
2025-12-08 13:26:09 +00:00
Andrei Chekun
5e83311305 test.py: switch to ThreadPoolExecutor
With python 3.14, the Process fails due to pickling issue with nodes objects.
This will eliminate this issue, so we can bump up the python version.

Closes scylladb/scylladb#27456
2025-12-07 17:37:25 +02:00
Avi Kivity
47efbdffbc Merge 'cache, mvcc: Preempt cache update when applying range tombstone from memtable' from Tomasz Grabiec
Range tombstones are represented as entry attributes, which applies to
the interval between entries. So if a range tombstone covers many
rows, to apply it we have to update all covered entries.  In some
workloads that could be many entries, even the whole cache.  Before
the patch, we did this update without preemption, which can cause
reactor stalls in such workloads.

This scenario is already covered by mvcc_tests,
e.g. test_apply_to_incomplete_respects_continuity. And I verified that
the new preemption point is hit in the test.

perf-row-cache-update results show no significant stalls anymore (max
2ms scheduling delay, instead of previous 1.5 s):

    Generated 1124195 rows
    Memtable fill took 4179.457520 [ms], {count: 8295, 99%: 0.654949 [ms], max: 32.817176 [ms]}
    Draining...
    took 0.000616 [ms]
    cache: 2506/2948 [MB], memtable: 781/1024 [MB], alloc/comp: 1051/662 [MB] (amp: 0.630)
    update: 2874.157471 [ms], preemption: {count: 26650, 99%: 1.131752 [ms], max: 2.068762 [ms]}, cache: 3027/3973 [MB], alloc/comp: 3951/2424 [MB] (amp: 0.614), pr/me/dr 1124195/0/0

Fixes #23479
Fixes #2578

Closes scylladb/scylladb#27469

* github.com:scylladb/scylladb:
  cache, mvcc: Preempt cache update when applying range tombstone from memtable
  partition_snapshot_row_cursor: Clarify non-obvious semantic difference of range_tombstone()
  perf-row-cache-update: Add scenario with large tombstone covering many rows
2025-12-07 11:54:15 +02:00
Avi Kivity
d811eeb4ca Merge 'Make direct failure detector verb handler more efficient' from Gleb Natapov
We saw that in large clusters direct failure detector may cause large task queues to be accumulated. The series address this issue and also moves the code into the correct scheduling group.

Fixes https://github.com/scylladb/scylladb/issues/27142

Backport to all version where 60f1053087 was backported to since it should improve performance in large clusters.

Closes scylladb/scylladb#27387

* github.com:scylladb/scylladb:
  direct_failure_detector: run direct failure detector in the gossiper scheduling group
  raft: drop invoke_on from the pinger verb handler
  direct_failure_detector: pass timeout to direct_fd_ping verb
2025-12-07 11:40:26 +02:00
Tomasz Grabiec
721434054b perf-row-cache-update: Add scenario with large tombstone covering many rows
Fills memtable with rows and a tombstone which deletes all rows which
are already in cache.

Similar to raft log workload, but more extreme.

With -c1 -m4G, observed really bad performance:

update: 1711.976196 [ms], preemption: {count: 22603, 99%: 0.943127 [ms], max: 1494.571776 [ms]}, cache: 2148/2906 [MB], alloc/comp: 1334/869 [MB] (amp: 0.651), pr/me/dr 1062186/0/1062187
cache: 2148/2906 [MB], memtable: 738/1024 [MB], alloc/comp: 993/0 [MB] (amp: 0.000)

Which means that max reactor stall during cache update was 1.5 [s]
0.7 GB memtables. 2.1 GB in cache.
2025-12-06 01:03:09 +01:00
Nadav Har'El
350cbd1d66 alternator: fix typo of BatchWriteItem in comments
The DynamoDB API's "BatchWriteItem" operation is spelled like this, in
singular. Some comments incorrectly referred to as BatchWriteItems - in
plural. This patch fixes those mistakes.

There are no functional changes here or changes to user-facing documents -
these mistakes were only in code comments.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27446
2025-12-05 15:08:58 +02:00
Botond Dénes
866c96f536 Merge 'Add digests for all sstable components in scylla metadata' from Taras Veretilnyk
This pull request adds support for calculation and storing CRC32 digests for all SSTable components.
This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure
and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream.
All important SSTable components (Index, Partitions, Rows, Summary, Filter, CompressionInfo, and TOC) are covered.
Several test cases where introduced to verify expected behaviour.

Backport is not required, it is a new feature

Fixes #20100

Closes scylladb/scylladb#27287

* github.com:scylladb/scylladb:
  sstable_test: add verification testcases of SSTable components digests persistance
  sstables: store digest of all sstable components in scylla metadata
  sstables: Add TemporaryScylla metadata component type
  sstables: Extract file writer closing logic into separate methods
  sstables: Add components_digests to scylla metadata components
  sstables: Implement CRC32 digest-only writer
2025-12-05 11:36:50 +02:00
Botond Dénes
367633270a Merge 'EAR: handle IPV6 hosts in KMIP and use shared (improved) http parser in AWS/Azure' from Calle Wilund
Fixes #27367
Fixes #27362
Fixes #27366

Makes http URL parser handle IPv6.
Makes KMIP host setup handle IPv6 hosts + use system trust if no truststore set
Moves Azure/KMS code to use shared http URL parser to avoid same regex everywhere.

Closes scylladb/scylladb#27368

* github.com:scylladb/scylladb:
  ear::kms/ear::azure: Use utils::http URL parsing
  ear::kmip_host: Handle ipv6 hosts + use system trust when not specified
  utils::http: Handle ipv6 numeric host part in URL:s
2025-12-05 10:43:07 +02:00
Asias He
e97a504775 repair: Allow min max range to be updated for repair history
It is observed that:

repair - repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: Failed to update
system.repair_history table of node d27de212-6f32-4649ad76-a9ef1165fdcb:
seastar::rpc::remote_verb_error (repair[667d4a59-63fb-4ca6-8feb-98da49946d8b]: range (minimum
token,maximum token) is not in the format of (start, end])

This is because repair checks the end of the range to be repaired needs
to be inclusive. When small_table_optimization is enabled for regular
repair, a (minimum token,maximum token) will be used.

To fix, we can relax the check of (start, end] for the min max range.

Fixes #27220

Closes scylladb/scylladb#27357
2025-12-05 10:41:25 +02:00
Piotr Dulikowski
bb6e41f97a index: allow vector indexes without rf_rack_valid_keyspces
The rf_rack_valid_keyspaces option needs to be turned on in order to
allow creating materialized views in tablet keyspaces with numeric RF
per DC. This is also necessary for secondary indexes because they use
materialized views underneath. However, this option is _not_ necessary
for vector store indexes because those use the external vector store
service for querying the list of keys to fetch from the main table, they
do not create a materialized view. The rf_rack_valid_keyspaces was, by
accident, required for vector indexes, too.

Remove the restriction for vector store indexes as it is completely
unnecessary.

Fixes: SCYLLADB-81

Closes scylladb/scylladb#27447
2025-12-05 09:26:26 +02:00
Taras Veretilnyk
0c8730ba05 sstable_test: add verification testcases of SSTable components digests persistance
Adds a generic test helper that writes a random SSTable, reloads it, and
verifies that the persisted CRC32 digest for each component matches the
digest computed from disk. Those covers all checksummed components test cases.
2025-12-04 21:09:01 +01:00
Patryk Jędrzejczak
f4c3d5c1b7 Merge 'fix test_coordinator_queue_management flakiness' from Gleb Natapov
After 39cec4ae45 node join may fail with either "request canceled" notification or (very rarely) because it was banned. Depend on timing. The series fixes the test to check for both possibilities.

Fixes #27320

No need to backport since the flakiness is in the mater only.

Closes scylladb/scylladb#27408

* https://github.com/scylladb/scylladb:
  test: fix test_coordinator_queue_management flakiness
  test/pylib: allow expected_error in server_start to contain regular expression
2025-12-04 16:08:02 +01:00
Tomasz Grabiec
e54abde3e8 Merge 'main: delay setup of storage_service REST API' from Andrzej Jackowski
The storage_service REST API uses `group0` internally. Before this
patch, it was possible to send an HTTP request before `group0` was
initialized, which resulted in a segmentation fault. Therefore,
this patch delays the setup of the storage_service REST API.

Additionally, `test_rest_api_on_startup` is added to reproduce the problem.

Fixes: https://github.com/scylladb/scylladb/issues/27130

No backport. It's a crash fix but possible only if a request is sent in a very specific phase of a node start.

Closes scylladb/scylladb#27410

* github.com:scylladb/scylladb:
  test: add test_rest_api_on_startup
  main: delay setup of storage_service REST API
2025-12-04 14:56:49 +01:00
Calle Wilund
4e289e8e6a utils::http: Handle ipv6 numeric host part in URL:s
Fixes #27366

A URL with numeric host part formats special in case of ipv6,
to avoid confusion with port part.
The parser should handle this.

I.e.
http://[2001:db8:4006:812::200e]:8080

v2:
* Include scheme agnostic parse + case insensitive scheme matching
2025-12-04 11:38:41 +00:00
Botond Dénes
9d2f7c3f52 Merge 'mv: allow setting concurrency in PRUNE MATERIALIZED VIEW' from Wojciech Mitros
The PRUNE MATERALIZED VIEW statement is performed as follows:
1. Perform a range scan of the view table from the view replicas based
on the ranges specified in the statement.
2. While reading the paged scan above, for each view row perform a read
from all base replicas at the corresponding primary key. If a discrepancy
is detected, delete the row in the view table.

When reading multiple rows, this is very slow because for each view row
we need to performe a single row query on multiple replicas.
In this patch we add an option to speed this up by performing many of the
single base row reads concurrently, at the concurrency specified in the
USING CONCURRENCY clause.

Aside from the unit test, I checked manually on a 3-node cluster with 10M rows, using vnodes. There were actually no ghost rows in the test, but we still had to iterate over all view rows and read the corresponding base rows. And actual ghost rows, if there are any, should be a tiny fraction of all rows. I compared concurrencies 1,2,10,100 and the results were:
* Pruning with concurrency 1 took total 1416 seconds
* Pruning with concurrency 2 took total 731 seconds
* Pruning with concurrency 10 took total 234 seconds
* Pruning with concurrency 100 took total 171 seconds
So after a concurrency of 10 or so we're hitting diminishing returns (at least in this setup). At that point we may be no longer bottlenecked by the reads, but by CPU on the shard that's handling the PRUNE

Fixes https://github.com/scylladb/scylladb/issues/27070

Closes scylladb/scylladb#27097

* github.com:scylladb/scylladb:
  mv: allow setting concurrency in PRUNE MATERIALIZED VIEW
  cql: add CONCURRENCY to the USING clause
2025-12-04 11:47:41 +02:00
Gleb Natapov
86dde50c0d direct_failure_detector: run direct failure detector in the gossiper scheduling group
When direct failure detector was introduces the idea was that it will
run on the same connection raft group0 verbs are running, but in
60f1053087 raft verbs were moved to run on the gossiper connection
while DIRECT_FD_PING was left where it was. This patch move it to
gossiper connection as well and fix the pinger code to run in gossiper
scheduling group.
2025-12-04 11:35:43 +02:00
Gleb Natapov
f00e00fde0 test: fix test_coordinator_queue_management flakiness
After 39cec4ae45 node join may fail with either "request canceled"
notification or (very rarely) because it was banned. Depend on timing.
The patch fixes the test to check for both possibilities.
2025-12-04 11:06:20 +02:00
Gleb Natapov
b0727d3f2a test/pylib: allow expected_error in server_start to contain regular expression
Currently expected_error parameter to server_start can only work with
exact matches. Change it to support regular expressions.
2025-12-04 11:06:20 +02:00
Tomasz Grabiec
1d42770936 Merge 'topology_coordinator: Add barrier to cleanup_target' from Łukasz Paszkowski
Consider the following scenario:
1. A table has RF=3 and writes use CL=QUORUM
2. One node is down
3. There is a pending tablet migration from the unavailable node
   that is reverted

During the revert, there can be a time window where the pending replica
being cleaned up still accepts writes. This leads to write failures,
as only two nodes (out of four) are able to acknowledge writes.

This patch fixes the issue by adding a barrier to the cleanup_target
tablet transition state, ensuring that the coordinator switches back to
the previous replica set before cleanup is triggered.

Fixes https://github.com/scylladb/scylladb/issues/26512

It's a pre existing issue. Backport is required to all recent 2025.x versions.

Closes scylladb/scylladb#27413

* github.com:scylladb/scylladb:
  topology_coordinator: Fix the indentation for the cleanup_target case
  topology_coordinator: Add barrier to cleanup_target
  test_node_failure_during_tablet_migration: Increase RF from 2 to 3
2025-12-03 23:57:45 +01:00
Łukasz Paszkowski
67f1c6d36c topology_coordinator: Add barrier to cleanup_target
Consider the following scenario:
1. A table has RF=3 and writes use CL=QUORUM
2. One node is down
3. There is a pending tablet migration from the unavailable node
   that is reverted

During the revert, there can be a time window where the pending replica
being cleaned up still accepts writes. This leads to write failures,
as only two nodes (out of four) are able to acknowledge writes.

This patch fixes the issue by adding a barrier to the cleanup_target
tablet transition state, ensuring that the coordinator switches back to
the previous replica set before cleanup is triggered.

Fixes https://github.com/scylladb/scylladb/issues/26512
2025-12-03 16:19:17 +01:00
Łukasz Paszkowski
669286b1d6 test_node_failure_during_tablet_migration: Increase RF from 2 to 3
The patch prepares the test for additional write workload to be
executed in parallel with node failures. With the original RF=2,
QUORUM is also 2, which causes writes to fail during node outage.

To address it, the third rack with a single node is added and the
replication factor is increased to 3.
2025-12-03 16:00:19 +01:00
Botond Dénes
b9199e8b24 Merge 'auth: use auth cache on login path' from Marcin Maliszkiewicz
Scylla currently has bad resiliency to connection storms. Nodes are easy to overload or impact their latency by unbound concurrency in making new connections on the client side. This can easily happen in bigger deployments where there are thousands of client instances, e.g. pods.

To improve resiliency we are introducing unified auth specialized cache to the system. This patch series is stage 1, where cache is used only on login path.

Dependency diagram:
```
|Authentication Layer|
            |
            v
+--------------------------------+
|          Auth Cache            |
+--------------------------------+
        ^                      |
        |                      |
        |                      v
|Raft Write Logic | | CQL Read Layer|
```

Cache invalidation is based on raft and the cache contains full content of related tables.

Ldap role manager may benefit partially as can_logic function is common  and will be cached,
but it still needs to query roles from external source.

Performance results:

For single shard connection/disconnection scenario insns/conn decreased by *5%*,
allocs/conn decreased by *23%*, tasks/conn decreased by *20%*. Results for 20 shards are very similar.

Raw data before:
```
≡ ◦ ⤖ rm -rf /tmp/scylla-data && build/release/scylla perf-cql-raw --workdir /tmp/scylla-data --smp 1 --developer-mode 1 --username cassandra --password cassandra --connection-per-request true 2> /dev/null
Running test with config: {workload=read, partitions=10000, concurrency=100, duration=5, ops_per_shard=0, auth, connection_per_request}
Pre-populated 10000 partitions
1128.55 tps (599.2 allocs/op,   0.0 logallocs/op, 145.2 tasks/op, 2586610 insns/op, 1350912 cycles/op,        0 errors)
1157.41 tps (601.3 allocs/op,   0.0 logallocs/op, 145.2 tasks/op, 2589046 insns/op, 1356691 cycles/op,        0 errors)
1167.42 tps (603.3 allocs/op,   0.0 logallocs/op, 145.2 tasks/op, 2603234 insns/op, 1360607 cycles/op,        0 errors)
1159.63 tps (605.9 allocs/op,   0.0 logallocs/op, 145.3 tasks/op, 2609977 insns/op, 1363935 cycles/op,        0 errors)
1165.12 tps (608.8 allocs/op,   0.0 logallocs/op, 145.2 tasks/op, 2625804 insns/op, 1365736 cycles/op,        0 errors)
throughput:
	mean=   1155.63 standard-deviation=15.66
	median= 1159.63 median-absolute-deviation=9.49
	maximum=1167.42 minimum=1128.55
instructions_per_op:
	mean=   2602934.31 standard-deviation=16063.01
	median= 2603234.19 median-absolute-deviation=13887.96
	maximum=2625804.05 minimum=2586609.82
cpu_cycles_per_op:
	mean=   1359576.30 standard-deviation=5945.69
	median= 1360607.05 median-absolute-deviation=4358.94
	maximum=1365736.42 minimum=1350912.10
```

Raw data after:
```
≡ ◦ ⤖ rm -rf /tmp/scylla-data && build/release/scylla perf-cql-raw --workdir /tmp/scylla-data --smp 1 --developer-mode 1 --username cassandra --password cassandra --connection-per-request true --duration 10 2> /dev/null
Running test with config: {workload=read, partitions=10000, concurrency=100, duration=10, ops_per_shard=0, auth, connection_per_request}
Pre-populated 10000 partitions
1132.09 tps (457.5 allocs/op,   0.0 logallocs/op, 115.1 tasks/op, 2432485 insns/op, 1270655 cycles/op,        0 errors)
1157.70 tps (458.4 allocs/op,   0.0 logallocs/op, 115.1 tasks/op, 2447779 insns/op, 1283768 cycles/op,        0 errors)
1162.86 tps (459.0 allocs/op,   0.0 logallocs/op, 115.1 tasks/op, 2463225 insns/op, 1291782 cycles/op,        0 errors)
1153.15 tps (460.2 allocs/op,   0.0 logallocs/op, 115.2 tasks/op, 2469230 insns/op, 1296381 cycles/op,        0 errors)
1142.09 tps (460.6 allocs/op,   0.0 logallocs/op, 115.1 tasks/op, 2478900 insns/op, 1299342 cycles/op,        0 errors)
1124.89 tps (462.5 allocs/op,   0.0 logallocs/op, 115.2 tasks/op, 2470962 insns/op, 1305026 cycles/op,        0 errors)
1156.75 tps (464.4 allocs/op,   0.0 logallocs/op, 115.1 tasks/op, 2493823 insns/op, 1305136 cycles/op,        0 errors)
1152.16 tps (466.3 allocs/op,   0.0 logallocs/op, 115.2 tasks/op, 2497246 insns/op, 1309816 cycles/op,        0 errors)
1154.77 tps (469.8 allocs/op,   0.0 logallocs/op, 115.5 tasks/op, 2571954 insns/op, 1345341 cycles/op,        0 errors)
1152.22 tps (472.4 allocs/op,   0.0 logallocs/op, 115.3 tasks/op, 2551954 insns/op, 1334202 cycles/op,        0 errors)
throughput:
	mean=   1148.87 standard-deviation=12.08
	median= 1153.15 median-absolute-deviation=7.88
	maximum=1162.86 minimum=1124.89
instructions_per_op:
	mean=   2487755.88 standard-deviation=43838.23
	median= 2478900.02 median-absolute-deviation=24531.06
	maximum=2571954.26 minimum=2432485.38
cpu_cycles_per_op:
	mean=   1304144.76 standard-deviation=22129.55
	median= 1305025.71 median-absolute-deviation=12363.25
	maximum=1345341.16 minimum=1270655.17
```

Fixes https://github.com/scylladb/scylladb/issues/18891
Backport: no, it's a new feature

Closes scylladb/scylladb#26841

* github.com:scylladb/scylladb:
  auth: use auth cache on login path
  auth: corutinize standard_role_manager::can_login
  main: auth: add auth cache dependency to auth service
  raft: update auth cache when data changes
  auth: storage_service: reload auth cache on v1 to v2 auth migration
  raft: reload auth cache on snapshot application
  service: add auth cache getter to storage service
  main: start auth cache service
  auth: add unified cache implementation
  auth: move table names to common.hh
2025-12-03 16:45:01 +02:00
Andrzej Jackowski
1ff7f5941b test: add test_rest_api_on_startup
This test verifies that REST API requests are handled properly
when a server is started or restarted. It is used to verify
the fix for scylladb/scylladb#27130, where a server failed with a
segmentation fault when `storage_service/raft_topology/reload` was
called too early.

Refs: scylladb/scylladb#27130
2025-12-03 15:35:59 +01:00
Pavel Emelyanov
6ae72ed134 test: Reuse S3 fixtures facilities in cqlpy/test_tools.py
Creating endpoint conf can be made with the s3_server method
Getting boto3 resource from s3_server itself is also possible

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27380
2025-12-03 16:32:54 +02:00
Michael Litvak
9213a163cb test: fix test flakiness in test_colocated_tables_gc_mode
The test executes a LWT query in order to create a paxos state table and
verify the table properties. However, after executing the LWT query, the
table may not exist on all nodes but only on a quorum of nodes, thus
checking the properties of the table may fail if the table doesn't exist
on the queried node.

To fix that, execute a group0 read barrier to ensure the table is
created on all nodes.

Fixes scylladb/scylladb#27398

Closes scylladb/scylladb#27401
2025-12-03 12:12:24 +01:00
Nadav Har'El
7dc04b033c test/cluster: fix missing racks in xfailing Alternator test
Since Alternator is now using tablets by default, it's no longer possible
to create an Alternator table on a 3-node cluster with a single rack -
you need to have 3 racks to support RF=3.

Most of the multi-node Alternator tests in test/cluster/test_alternator.py
were already fixed to use a 3-rack cluster, but one test was missed
because it was marked "xfail" so its new failure to create the table was
missed. This patch adds the missing 3-rack setup, so the xfailing test
returns to failing on the real bug - not on the table creation.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#27382
2025-12-03 10:54:11 +03:00
Piotr Dulikowski
654ac9099b db/view/view_building_coordinator: skip work if no view is built
Even though that `view_building_coordinator::work_on_view_building` has
an `if` at the very beginning which checks whether the currently
processed base table is set, it only prints a message and continues
executing the rest of the function regardless of the result of the
check. However, some of the logic in the function assumes that the
currently processed base table field is set and tries to access the
value of the field. This can lead to the view building coordinator
accessing a disengaged optional, which is undefined behavior.

Fix the function by adding the clearly missing `co_await` to the check.
A regression test is added which checks that the view building state
observer - a different fiber which used to print a weird message due to
erroneus view building coordinator behavior - does not print a warning.

Fixes: scylladb/scylladb#27363

Closes scylladb/scylladb#27373
2025-12-03 09:44:28 +02:00
Gleb Natapov
82f80478b8 direct_failure_detector: pass timeout to direct_fd_ping verb
Currently direct_fd_ping runs without timeout, but the verb is not
waited forever, the wait is canceled after a timeout, this timeout
simply is not passed to the rpc. It may create a situation where the
rpc callback can runs on a destination but it is no longer waited on.
Change the code to pass timeout to rpc as well and return earlier from
the rpc handler if the timeout is reached by the time the callback is
called. This is backwards compatible since timeout is passed as
optional.
2025-12-02 14:55:20 +02:00
Botond Dénes
357f91de52 Revert "Merge 'db/config: enable ms sstable format by default' from Michał Chojnowski"
This reverts commit b0643f8959, reversing
changes made to e8b0f8faa9.

The change forgot to update
sstables_manager::get_highest_supported_format(), which results in
/system/highest_supported_sstable_version still returning me, confusing
and breaking tests.

Fixes: scylladb/scylla-dtest#6435

Closes scylladb/scylladb#27379
2025-12-02 14:38:56 +02:00
Pawel Pery
b5c85d08bb unittest: fix vector_store_client_test_dns_refresh_aborted hangs
The root cause for the hanging test is a concurrency deadlock.
`vector_store_client` runs dns refresh time and it is waiting for the condition
variable.After aborting dns request the test signals the condition variable.
Stopping the vector_store_client takes time enough to trigger the next dns
refresh - and this time the condition variable won't be signalled - so
vector_store_client will wait forever for finish dns refresh fiber.

The commit fixes the problem by waiting for the condition variable only once.

Fixes: #27237
Fixes: VECTOR-370

Closes scylladb/scylladb#27239
2025-12-02 12:22:44 +01:00
Piotr Dulikowski
3aaab5d5a3 Merge 'vector_search: Fix high availability during timeouts' from Karol Nowacki
This PR introduces two key improvements to the robustness and resource management of vector search:

Proper Abort on CQL Timeout: Previously, when a CQL query involving a vector search timed out
, the underlying ANN query to the vector store was not aborted and would continue to run. This has been fixed by ensuring the abort source is correctly signaled, terminating the ANN request when its parent CQL query expires and preventing unnecessary resource consumption.

Faster Failure Detection: The connection and keep-alive timeouts for vector store nodes were excessively long (2 and 11 minutes, respectively), causing significant delays in detecting and recovering from unreachable nodes. These timeouts are now aligned with the request_timeout_in_ms setting, allowing for much faster failure detection and improving high availability by failing over from unresponsive nodes more quickly.

Fixes: SCYLLADB-76

This issue affects the 2025.4 branch, where similar HA recovery delays have been observed.

Closes scylladb/scylladb#27377

* github.com:scylladb/scylladb:
  vector_search: Fix ANN query abort on CQL timeout
  vector_search: Reduce connection and keep-alive timeouts
2025-12-02 11:14:48 +01:00
Karol Nowacki
086c6992f5 vector_search: Fix ANN query abort on CQL timeout
When a CQL vector search request timed out, the underlying ANN query was
not aborted and continued to run. This happened because the abort source
was not being signaled upon request expiration.
This commit ensures the ANN query is aborted when the CQL request times out
preventing unnecessary resource consumption.
2025-12-02 01:17:01 +01:00
Karol Nowacki
b6afacfc1e vector_search: Reduce connection and keep-alive timeouts
The connection timeout was 2 minutes and the keep-alive
timeout was 11 minutes. If a vector store node became unreachable, these
long timeouts caused significant delays before the system could recover,
negatively impacting high availability.

This change aligns both timeouts with the `request_timeout`
configuration, which defaults to 10 seconds. This allows for much
faster failure detection and recovery, ensuring that unresponsive nodes
are failed over from more quickly.
2025-12-02 01:17:01 +01:00
Avi Kivity
ce2a403f18 Merge 'alternator: implement gzip-compressed requests' from Nadav Har'El
In this series we implement Alternator's support for gzip-compressed
requests, i.e., requests with the "Content-Encoding: gzip" header,
other uncompressed header, and a gzip-compressed body.

The server needs to verify the signature of the *compressed* content,
and then uncompress the body before running the request.

We only support gzip compression because this is what DynamoDB supports.
But in the future we can easily add support for other compression
algorithms like lz4 or zstd.

This series Refs #5041 but doesn't "Fixes" it because it only implements
compressed requests (Content-Encoding), *not* compressed responses
(Accept-Encoding).

In addition to the code changes, the series also contains tests for this
feature that make sure it behaves like DynamoDB.

Note that while we will have now support in our server for compressed
requests, just like DynamoDB does, the clients (AWS SDKs) will probably
NOT make use of it because they do not enable request compression by
default. For example, see the tests for some hoops one needs to jump
through in boto3 (the Python SDK) to send compressed requests. However,
we are hoping that in the future Alternator's modified clients will
use compressed requests and enjoy this feature.

Closes scylladb/scylladb#27080

* github.com:scylladb/scylladb:
  test/alternator: enable, and add, tests for gzip'ed requests
  alternator: implement gzip-compressed requests
2025-11-30 13:27:46 +02:00
Avi Kivity
d4be9a058c Update seastar submodule
seastar::compat::source_location (which should not have been used
outside Seastar) is replaced with std::source_location to avoid
deprecation warnings. The relevant header, which was removed, is no
longer included.

* seastar 8c3fba7a...b5c76d6b (3):
  > testing: There can be only one memory_data_sink
  > util: Use std::source_location directly
  > Merge 'net: support proxy protocol v2' from Avi Kivity
    apps: httpd: add --load-balancing-algorithm
    apps: httpd: add /shard endpoint
    test: socket_test: add proxy protocol v2 test suite
    test: socket_test: test load balancer with proxy protocol
    net: posix_connected_socket: specialize for proxied connections
    net: posix_server_socket_impl: implement proxy protocol in server sockets
    net: posix_server_socket_impl: adjust indentation
    net: posix_server_socket_impl: avoid immediately-invoked lambda
    net: conntrack: complete handle nested class special member functions
    net: posix_server_socket_impl: coroutinize accept()

Closes scylladb/scylladb#27316
2025-11-30 12:38:47 +02:00
Piotr Dulikowski
44c605e59c Merge 'Fix the types of change events in Alternator Streams' from Piotr Wieczorek
This patch increases the compatibility with DynamoDB Streams by integrating the DynamoDB's event type rules (described in https://github.com/scylladb/scylladb/issues/6918) into Alternator. The main changes are:
- introduce a new flag `alternator_streams_strict_compatibility`, meant as a guard of performance-intensive operations that increase the compatibility with DynamoDB Streams. If enabled, Alternator always performs a RBW before a data-modifying operation, and propagates its result to CDC. Then, the old item is compared to the new one, to determine the mutation type (INSERT vs MODIFY). This option is a no-op for tables with disabled Alternator Streams,
- reduce splitting of simple Alternator mutations,
- correctly distinguish event types described in #6918, except for item deletes. Deleting a missing item with DeleteItem, BatchWriteItem, or a missing field with UpdateItem still emit REMOVEs.

To summarize, the emitted events of the data manipulation operations should be as follows:
- DeleteItem/BatchWriteItem.DeleteItem of existing item: REMOVE (OK)
- DeleteItem of nonexistent item: nothing (OK)
- BatchWriteItem.DeleteItem of nonexistent item: nothing (OK)
- PutItem/UpdateItem/BatchWriteItem.PutItem of existing and not equal item: MODIFY (OK)
- PutItem/UpdateItem/BatchWriteItem.PutItem of existing and equal item: nothing (OK)
- PutItem/UpdateItem/BatchWriteItem.PutItem of nonexistent item: INSERT (OK)

No backport is necessary.

Refs https://github.com/scylladb/scylladb/pull/26149
Refs https://github.com/scylladb/scylladb/pull/26396
Refs https://github.com/scylladb/scylladb/issues/26382
Fixes https://github.com/scylladb/scylladb/issues/6918

Closes scylladb/scylladb#26121

* github.com:scylladb/scylladb:
  test/alternator: Enable the tests failing because of #6918
  alternator, cdc: Don't emit events for no-op removes
  alternator, cdc: Don't emit an event for equal items
  alternator/streams, cdc: Differentiate item replace and item update in CDC
  alternator: Change the return type of rmw_operation_return
  config: Add alternator_streams_strict_compatibility flag
  cdc: Don't split a row marker away from row cells
2025-11-30 07:20:22 +01:00
Asias He
da5cc13e97 repair: Fix deadlock when topology coordinator steps down in the middle
Consider this:

1) n1 is the topology coordinator
2) n1 schedules and executes a tablet repair with session id s1 for a
tablet on n3 an n4.
3) n3 and n4 take and store the in _rs._repair_compaction_locks[s1]
4) n1 steps down before it executes
locator::tablet_transition_stage::end_repair
5) n2 becomes the new topology coordinator
6) n2 runs locator::tablet_transition_stage::repair again
7) n3 and n4 try to take the lock again and hangs since the lock is
already taken.

To avoid the deadlock, we can throw in step 7 so that n2 will
proceed to end_repair stage and release the lock. After that, the
scheduler could schedule the tablet repair request again.

Fixes #26346

Closes scylladb/scylladb#27163
2025-11-28 15:14:39 +01:00
Radosław Cybulski
b54a9f4613 Fix use-after-free in encode_paging_state in Alternator
Fix unlikely use-after-free in `encode_paging_state`. The function
incorrectly assumes that current position to encode will always have
data for all clustering columns the schema defines. It's possible to
encounter current position having less than all columns specified, for
eample in case of range tombstone. Those don't happen in Alternator
tables as DynamoDB doesn't allow range deletions and clustering key
might be of size at most 1. Alternator api can be used to read
scylla system tables and those do have range tombstones with more
than single clustering column.

The fix is to stop trying to encode columns, that don't have the value -
they are not needed anyway, as there's no possible position with those
values (range tombstone made sure of that).

Fixes #27001
Fixes #27125

Closes scylladb/scylladb#26960
2025-11-28 16:51:15 +03:00
Pavel Emelyanov
d35ce81ff1 Merge 'test: wait for read_barrier in wait_until_driver_service_level_created' from Andrzej Jackowski
Previously, `wait_until_driver_service_level_created` only waited for
the `driver` service level to appear in the output of
`LIST ALL SERVICE_LEVELS`. However, the fact that one node lists
`sl:driver` does not necessarily mean that all other nodes can see
it yet. This caused sporadic test failures, especially in DEBUG builds.

To prevent these failures, this change adds an extra wait for
a `raft/read_barrier` after the `driver` service level first appears.
This ensures the service level is globally visible across the cluster.

Fixes: https://github.com/scylladb/scylladb/issues/27019

Na backport - test fix for `sl:driver` tests, and this that is only available on `master`

Closes scylladb/scylladb#27076

* github.com:scylladb/scylladb:
  test: wait for read_barrier in wait_until_driver_service_level_created
  test: use ManagerClient in wait_until_driver_service_level_created
2025-11-28 16:47:29 +03:00
Dawid Mędrek
b76af2d07f cql3: Improve errors when manipulating default service level
Before this commit, any attempt to create, alter, attach, or drop
the default service level would result in a syntax error whose
error message was unclear:

```
cqlsh> attach service level default to cassandra;
SyntaxException: line 1:21 no viable alternative at input 'default'
```

The error stems from the grammar not being able to parse `default`
as a correct service level name. To fix that, we cover it manually.
This way, the grammar accepts it and we can process it in Scylla.

The reason why we'd like to cover the default service level is that
it's an actual service level that the user should reference. Getting
a syntax error is not what should happen. Hence this fix.

We validate the input and if the given role is really the default
service level, we reject the query and provide an informative error
message.

Two validation tests are provided.

Fixes scylladb/scylladb#26699

Closes scylladb/scylladb#27162
2025-11-28 15:32:37 +03:00
Calle Wilund
59c87025d1 commitlog::read_log_file: Check for eof position on all data reads
Fixes #24346

When reading, we check for each entry and each chunk, if advancing there
will hit EOF of the segment. However, IFF the last chunk being read has
the last entry _exactly_ matching the chunk size, and the chunk ending
at _exactly_ segment size (preset size, typically 32Mb), we did not check
the position, and instead complained about not being able to read.

This has literally _never_ happened in actual commitlog (that was replayed
at least), but has apparently happened more and more in hints replay.

Fix is simple, just check the file position against size when advancing
said position, i.e. when reading (skipping already does).

v2:

* Added unit test

Closes scylladb/scylladb#27236
2025-11-28 15:26:46 +03:00
Michael Litvak
97b7c03709 tablet: scheduler: Do not emit conflicting migration in merge colocation
The tablet scheduler should not emit conflicting migrations for the same
tablet. This was addressed initially in scylladb/scylladb#26038 but the
check is missing in the merge colocation plan, so add it there as well.

Without this check, the merge colocation plan could generate a
conflicting migration for a tablet that is already scheduled for
migration, as the test demonstrates.

This can cause correctness problems, because if the load balancer
generates two migrations for a single tablet, both will be written as
mutations, and the resulting mutation could contain mixed cells from
both migrations.

Fixes scylladb/scylladb#27304

Closes scylladb/scylladb#27312
2025-11-28 11:17:12 +01:00
Pavel Emelyanov
54edb44b20 code: Stop using seastar::compat::source_location
And switch to std::source_location.
Upcoming seastar update will deprecate its compatibility layer.

The patch is

  for f in $(git grep -l 'seastar::compat::source_location'); do
    sed -e 's/seastar::compat::source_location/std::source_location/g' -i $f;
  done

and removal of few header includes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27309
2025-11-27 19:10:11 +02:00
Andrzej Jackowski
e366030a92 treewide: seastar module update
The reason for this seastar update is to have the fixed handling
of the `integer` type in `seastar-json2code` because it's needed
for further development of ScyllaDB REST API.

The following changes were introduced to ScyllaDB code to ensure it
compiles with the updated seastar:
 - Remove `seastar/util/modules.hh` includes as the file was removed
   from seastar
 - Modified `metrics::impl::labels_type` construction in
   `test/boost/group0_test.cc` because now it requires `escaped_string`

* seastar 340e14a7...8c3fba7a (32):
  > Merge 'Remove net::packet usage from dns.cc' from Pavel Emelyanov
    dns: Optimize packet sending for newer c-ares versions
    dns: Replace net::packet with vector<temporary_buffer>
    dns: Remove unused local variable
    dns: Remove pointless for () loop wrapping
    dns: Introduce do_sendv_tcp() method
    dns: Introduce do_send_udp() method
  > test: Add http rules test of matching order
  > Merge 'Generalize packet_data_source into memory_data_source' from Pavel Emelyanov
    memcached: Patch test to use memory_data_source
    memcached: Use memory_data_source in server
    rpc: Use memory_data_sink without constructing net::packet
    util: Generalize packet_data_source into memory_data_source
  > tests: coroutines: restore "explicit this" tests
  > reactor: remove blocking of SIGILL
  > Merge 'Update compilers in GH actions scripts' from Pavel Emelyanov
    github: Use gcc-14
    github: Use clang-20
  > Merge 'Reinforce DNS reverse resolution test ' from Pavel Emelyanov
    test: Make test_resolve() try several addresses
    test: Coroutinize test_resolve() helper
  > modules: make module support standards-compliant
  > Merge 'Fix incorrect union access in dns resolver' from Pavel Emelyanov
    dns: Squash two if blocks together
    dns: Do not check tcp entry for udp type
  > coroutine: Fix compilation of execute_involving_handle_destruction_in_await_suspend
  > promise: Document that promise is resolved at most once
  > coroutine: exception: workaround broken destroy coroutine handle in await_suspend
  > socket: Return unspecified socket_address for unconnected socket
  > smp: Fix exception safety of invoke_on_... internal copying
  > Merge 'Improve loads evaluation by reactor' from Pavel Emelyanov
    reactor: Keep loads timer on reactor
    reactor: Update loads evaluation loop
  > Merge 'scripts: add 'integer' type to seastar-json2code' from Andrzej Jackowski
    test: extend tests/unit/api.json to use 'integer' type
    scripts: add 'integer' type to seastar-json2code
  > Merge 'Sanitize tls::session::do_put(_one)? overloads' from Pavel Emelyanov
    tls: Rename do_put_one(temporary_buffer) into do_put()
    tls: Fix indentation after previous patch
    tls: Move semaphore grab into iterating do_put()
  > net: tcp: change unsent queue from packets to temporary_buffer:s
  > timer: Enable highres timer based on next timeout value
  > rpc: Add a new constructor in closed_error to accept string argument
  > memcache: Implement own data sink for responses
  > Merge 'file: recursive_remove_directory: general cleanup' from Avi Kivity
    file: do_recursive_remove_directory(): move object when popping from queue
    file: do_recursive_remove_directory(): adjust indentation
    file: do_recursive_remove_directory(): coroutinize
    file: do_recursive_remove_directory(): simplify conditional
    file: do_recursive_remove_directory(): remove wrong const
    file: do_recursive_remove_directory(): clean up work_entry
  > tests: Move thread_context_switch_test into perf/
  > test: Add unit test for append_challenged_posix_file
  > Merge 'Prometheus metrics handler optimization' from Travis Downs
    prometheus: optimize metrics aggregation
    prometheus: move and test aggregate_by helper
    prometheus: various optimizations
    metrics: introduce escaped_string for label values
    metric:value: implement + in terms of +=
    tests: add prometheus text format acceptance tests
    extract memory_data_sink.hh
    metrics_perf: enhance metrics bench
  > demos: Simplify udp_zero_copy_demo's way of preparing the packet
  > metrics: Remove deprecated make_...-ers
  > Merge 'Make slab_test be BOOST kind' from Pavel Emelyanov
    test: Use BOOST_REQUIRE checkers
    test: Replace some SEASTAR_ASSERT-s with static_assert-s
    test: Convert slab test into boost kind
  > Merge 'Coroutinize lister_test' from Pavel Emelyanov
    test: Fix indentation after previuous patch
    test: Coroutinize lister_test lister::report() method
    test: Coroutinize lister_test main code
  > file: recursive_remove_directory(): use a list instead of a deque
  > Merge 'Stop using packets in tls data_sink and session' from Pavel Emelyanov
    tls: Stop using net::packet in session::put()
    tls: Fix indentation after previous patch
    tls: Split session::do_put()
    tls: Mark some session methods private

Closes scylladb/scylladb#27240
2025-11-27 12:34:22 +02:00
Nadav Har'El
32afcdbaf0 test/alternator: enable, and add, tests for gzip'ed requests
After in the previous patch we implemented support in Alternator for
gzip-compressed requests ("Content-Encoding: gzip"), here we enable
an existing xfail-ing test for this feature, and also add more tests
for more cases:

  * A test for longer compressed requests, or a short compressed
    request which expands to a longer request. Since the decompression
    uses small buffers, this test reaches additional code paths.

  * Check for various cases of a malformed gzip'ed request, and also
    an attempt to use an unsupported Content-Encoding. DynamoDB
    returns error 500 for both cases, so we want to test that we
    do to - and not silently ignore such errors.

  * Check that two concatenated gzip'ed streams is a valid request,
    and check that garbage at the end of the gzip - or a missing
    character at the end of the gzip - is recognized as an error.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-11-27 09:42:47 +02:00
Wojciech Mitros
323e5cd171 mv: allow setting concurrency in PRUNE MATERIALIZED VIEW
The PRUNE MATERALIZED VIEW statement is performed as follows:
1. Perform a range scan of the view table from the view replicas based
on the ranges specified in the statement.
2. While reading the paged scan above, for each view row perform a read
from all base replicas at the corresponding primary key. If a discrepancy
is detected, delete the row in the view table.

When reading multiple rows, this is very slow because for each view row
we need to performe a single row query on multiple replicas.
In this patch we add an option to speed this up by performing many of the
single base row reads concurrently, at the concurrency specified in the
USING CONCURRENCY clause.

Fixes https://github.com/scylladb/scylladb/issues/27070
2025-11-27 00:02:28 +01:00
Patryk Jędrzejczak
cc273e867d Merge 'fix notification about expiring erm held for to long' from Gleb Natapov
Commit 6e4803a750 broke notification about expired erms held for too long since it resets the tracker without calling its destructor (where notification is triggered). Fix the assign operator to call the destructor like it should.

Fixes https://github.com/scylladb/scylladb/issues/27141

Closes scylladb/scylladb#27140

* https://github.com/scylladb/scylladb:
  test: test that expired erm that held for too long triggers notification
  token_metadata: fix notification about expiring erm held for to long
2025-11-26 12:59:00 +01:00
Marcin Maliszkiewicz
b29c42adce main: auth: add auth cache dependency to auth service
In the following commit we'll switch some authorizer
and role manager code to use the cache so we're preparing
the dependency.
2025-11-26 12:01:31 +01:00
Marcin Maliszkiewicz
2cf1ca43b5 service: add auth cache getter to storage service
Prepare for use in a subsequent commit in group0_state_machine,
where the auth cache will be integrated. This follows the same
pattern as updates to the service-level cache, view-building
state, and CDC streams.
2025-11-26 12:00:50 +01:00
Marcin Maliszkiewicz
642f468c59 main: start auth cache service
The service is not yet used anywhere,
we first build scaffolding.
2025-11-26 12:00:50 +01:00