Commit Graph

4378 Commits

Author SHA1 Message Date
copilot-swe-agent[bot]
77ee7f3417 Revert "Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy"
This reverts commit 8192f45e84.

The merge exposed a bug where truncate (via drop) fails and causes Raft
errors, leading to schema inconsistencies across nodes. This results in
test_table_drop_with_auto_snapshot failures with 'Keyspace test does not exist'
errors.

The specific problematic change was in commit 19b6207f which modified
truncate_table_on_all_shards to set use_sstable_identifier = true. This
causes exceptions during truncate that are not properly handled, leading
to Raft applier fiber stopping and nodes losing schema synchronization.
2025-12-12 03:55:13 +00:00
Avi Kivity
24264e24bb Revert "repair: Add tablet repair progress report support"
This reverts commit faad0167d7. It causes
a regression in

test_two_tablets_concurrent_repair_and_migration_repair_writer_level

in debug mode (with ~5%-10% probability).

Fixes #27510.

Closes scylladb/scylladb#27560
2025-12-11 12:18:11 +02:00
Nadav Har'El
95e303faf3 Merge 'Refactor get_view_natural_endpoint' from Wojciech Mitros
With the introduction of rack-lists and the reliance of materialized views on them, the `get_view_natural_endpoint` function can be greatly simplified. When using tablets, instead of doing any index-matching, we can now pair base tables with views only in the same rack.
In this series we remove no longer needed code and reorganize the needed code for better clarity.
After the changes, the `get_view_natural_endpoint` function goes down from 245 lines to 85 lines, while the whole pairing-related text goes down from 346 lines to 239 lines.

Fixes https://github.com/scylladb/scylladb/issues/26313

Closes scylladb/scylladb#27383

* github.com:scylladb/scylladb:
  mv: replace the simple/complex rack-aware pairing with exact rack matching
  mv: split out vnode pairing code from get_view_natural_endpoint
  mv: unify self-pairing and rack-aware pairing into one bool
  mv: remove the workaround for left nodes when sending view updates
2025-12-09 13:19:13 +02:00
Pavel Emelyanov
fb32e1c7ee Merge 'streaming: tablet_sstable_streamer::stream refactoring' from Ernest Zaslavsky
Refactor the way we decide the sstable belong to a tablet, fully or partially to simplify the flow and make it more readable. Also extract the logic and make it testable, add tests to cover changes

The change is purely aesthetic, no need to backport

Closes scylladb/scylladb#27101

* github.com:scylladb/scylladb:
  streaming: remove unnecessary lambda creating sstable token range
  streaming: simplify get_sstables_for_tablets logic
  streaming: switch to range-based for loop
  streaming: drop sstable skip microoptimization in tablet loop
  streaming: replace reverse iterators with reverse view in sstables scan
  streaming: return from get_sstables_for_tablets earlier
  streaming: add get_sstables_by_tablet_range tests
  test,sstables: add helper to set sstable first and last keys
  streaming: refactor get_sstables_for_tablets to make it accessible
  streaming: refactor get_sstables_for_tablets to make it testable
  streaming: refactor tablet_sstable_streamer::stream by extracting SST filtering logic
2025-12-09 10:53:57 +03:00
Michał Jadwiszczak
51843195f7 test/boost/view_build_test: increase number of retires
Default number of retires in `eventually()` in `test_builder_with_concurrent_drop`
sometimes is not enough to observe changes in system tables on aarch64
builds.

This patch increases the number of retries to 30.

Fixes scylladb/scylladb#27370

Closes scylladb/scylladb#27493
2025-12-08 23:14:01 +02:00
Asias He
faad0167d7 repair: Add tablet repair progress report support
This patch adds tablet repair progress report support so that the user
could use the /task_manager/task_status API to query the progress.

In order to support this, a new system table is introduced to record the
user request related info, i.e, start of the request and end of the
request.

The progress is accurate when tablet split or merge happens in the
middle of the request, since the tokens of the tablet are recorded when
the request is started and when repair of each tablet is finished. The
original tablet repair is considered as finished when the finished
ranges cover the original tablet token ranges.

After this patch, the /task_manager/task_status API will report correct
progress_total and progress_completed.

Fixes #22564
Fixes #26896

Closes scylladb/scylladb#26924
2025-12-08 13:35:19 +02:00
Ernest Zaslavsky
6fd5160947 streaming: add get_sstables_by_tablet_range tests
Add a comprehensive test suite that exercises various combinations of
SSTable containment within tablet ranges. These cases cover boundary
conditions, partial overlaps, and full containment to validate all
recent changes made to `get_sstables_by_tablet_range`.
2025-12-08 12:30:23 +02:00
Pavel Emelyanov
8192f45e84 Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy
This change adds a new option to the REST api and correspondingly, to scylla nodetool: use_sstable_identifier.
When set, we use the sstable identifier, if available, to name each sstable in the snapshots directory
and the manifest.json file, rather than using the sstable generation.

This can be used by the user (e.g. Scylla Manager) for global deduplication with tablets, where an sstable
may be migrated across shards or across nodes, and in this case, its generation may change, but its
sstable identifier remains sstable.

Currently, Scylla manager uses the sstable generation to detect sstables that are already backed up to
object storage and exist in previous backed up snapshots.
Historically, the sstable generation was guaranteed to be unique only per table per node,
so the dedup code currently checks for deduplication in the node scope.

However, with tablet migration, sstables are renamed when migrated to a different shard,
i.e. their generation changes, and they may be renamed when migrated to another node,
but even if they are not, the dedup logic still assumes uniqueness only within a node.

To address both cases, we keep the sstable_id stable throughout the sstable life cycle (since 3a12ad96c7).
Given the globally unique sstable identifier, scylla manager can now detect duplicate sstables
in a wider scope.  This can be cluster-wide, but we practically need only rack-wide deduplication
or dc-wide, as tablets are migrated across racks only in rare occasions (like when converting from a
numerical replication factor to a rack list containing a subset of the available racks in a datacenter).

Fixes #27181

* New feature, no backport required

Closes scylladb/scylladb#27184

* github.com:scylladb/scylladb:
  database: truncate_table_on_all_shards: set use_sstable_identifier to true
  nodetool: snapshot: add --use-sstable-identifier option
  api: storage_service: take_snapshot: add use_sstable_identifier option
  test: database_test: add snapshot_use_sstable_identifier_works
  test: database_test: snapshot_works: add validate_manifest
  sstable: write_scylla_metadata: add random_sstable_identifier error injection
  table: snapshot_on_all_shards: take snapshot_options
  sstable: add get_format getter
  sstable: snapshot: add use_sstable_identifier option
  db: snapshot_ctl: snapshot_options: add use_sstable_identifier options
  db: snapshot_ctl: move skip_flush to struct snapshot_options
2025-12-08 12:56:12 +03:00
Botond Dénes
866c96f536 Merge 'Add digests for all sstable components in scylla metadata' from Taras Veretilnyk
This pull request adds support for calculation and storing CRC32 digests for all SSTable components.
This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure
and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream.
All important SSTable components (Index, Partitions, Rows, Summary, Filter, CompressionInfo, and TOC) are covered.
Several test cases where introduced to verify expected behaviour.

Backport is not required, it is a new feature

Fixes #20100

Closes scylladb/scylladb#27287

* github.com:scylladb/scylladb:
  sstable_test: add verification testcases of SSTable components digests persistance
  sstables: store digest of all sstable components in scylla metadata
  sstables: Add TemporaryScylla metadata component type
  sstables: Extract file writer closing logic into separate methods
  sstables: Add components_digests to scylla metadata components
  sstables: Implement CRC32 digest-only writer
2025-12-05 11:36:50 +02:00
Botond Dénes
367633270a Merge 'EAR: handle IPV6 hosts in KMIP and use shared (improved) http parser in AWS/Azure' from Calle Wilund
Fixes #27367
Fixes #27362
Fixes #27366

Makes http URL parser handle IPv6.
Makes KMIP host setup handle IPv6 hosts + use system trust if no truststore set
Moves Azure/KMS code to use shared http URL parser to avoid same regex everywhere.

Closes scylladb/scylladb#27368

* github.com:scylladb/scylladb:
  ear::kms/ear::azure: Use utils::http URL parsing
  ear::kmip_host: Handle ipv6 hosts + use system trust when not specified
  utils::http: Handle ipv6 numeric host part in URL:s
2025-12-05 10:43:07 +02:00
Taras Veretilnyk
0c8730ba05 sstable_test: add verification testcases of SSTable components digests persistance
Adds a generic test helper that writes a random SSTable, reloads it, and
verifies that the persisted CRC32 digest for each component matches the
digest computed from disk. Those covers all checksummed components test cases.
2025-12-04 21:09:01 +01:00
Calle Wilund
4e289e8e6a utils::http: Handle ipv6 numeric host part in URL:s
Fixes #27366

A URL with numeric host part formats special in case of ipv6,
to avoid confusion with port part.
The parser should handle this.

I.e.
http://[2001:db8:4006:812::200e]:8080

v2:
* Include scheme agnostic parse + case insensitive scheme matching
2025-12-04 11:38:41 +00:00
Benny Halevy
07b92a1ee8 test: database_test: add snapshot_use_sstable_identifier_works
Test that taking a snapshot with the use_sstable_identifier
option (and injecting `random_sstable_identifier`) produces
different file names in the snapshot than the original
sstable names and validate te manifest.json file respectively.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 11:57:38 +02:00
Benny Halevy
7504d10d9e test: database_test: snapshot_works: add validate_manifest
Validate the manifest.json format by loading it using rjson::parse
and then validate its contents to ensure it lists exactly the
SSTables present in the snapshot directory.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 11:55:50 +02:00
Botond Dénes
9d2f7c3f52 Merge 'mv: allow setting concurrency in PRUNE MATERIALIZED VIEW' from Wojciech Mitros
The PRUNE MATERALIZED VIEW statement is performed as follows:
1. Perform a range scan of the view table from the view replicas based
on the ranges specified in the statement.
2. While reading the paged scan above, for each view row perform a read
from all base replicas at the corresponding primary key. If a discrepancy
is detected, delete the row in the view table.

When reading multiple rows, this is very slow because for each view row
we need to performe a single row query on multiple replicas.
In this patch we add an option to speed this up by performing many of the
single base row reads concurrently, at the concurrency specified in the
USING CONCURRENCY clause.

Aside from the unit test, I checked manually on a 3-node cluster with 10M rows, using vnodes. There were actually no ghost rows in the test, but we still had to iterate over all view rows and read the corresponding base rows. And actual ghost rows, if there are any, should be a tiny fraction of all rows. I compared concurrencies 1,2,10,100 and the results were:
* Pruning with concurrency 1 took total 1416 seconds
* Pruning with concurrency 2 took total 731 seconds
* Pruning with concurrency 10 took total 234 seconds
* Pruning with concurrency 100 took total 171 seconds
So after a concurrency of 10 or so we're hitting diminishing returns (at least in this setup). At that point we may be no longer bottlenecked by the reads, but by CPU on the shard that's handling the PRUNE

Fixes https://github.com/scylladb/scylladb/issues/27070

Closes scylladb/scylladb#27097

* github.com:scylladb/scylladb:
  mv: allow setting concurrency in PRUNE MATERIALIZED VIEW
  cql: add CONCURRENCY to the USING clause
2025-12-04 11:47:41 +02:00
Benny Halevy
c18133b6cb db: snapshot_ctl: move skip_flush to struct snapshot_options
Prepare for adding another option: use_sstable_identifer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 09:46:35 +02:00
Botond Dénes
b9199e8b24 Merge 'auth: use auth cache on login path' from Marcin Maliszkiewicz
Scylla currently has bad resiliency to connection storms. Nodes are easy to overload or impact their latency by unbound concurrency in making new connections on the client side. This can easily happen in bigger deployments where there are thousands of client instances, e.g. pods.

To improve resiliency we are introducing unified auth specialized cache to the system. This patch series is stage 1, where cache is used only on login path.

Dependency diagram:
```
|Authentication Layer|
            |
            v
+--------------------------------+
|          Auth Cache            |
+--------------------------------+
        ^                      |
        |                      |
        |                      v
|Raft Write Logic | | CQL Read Layer|
```

Cache invalidation is based on raft and the cache contains full content of related tables.

Ldap role manager may benefit partially as can_logic function is common  and will be cached,
but it still needs to query roles from external source.

Performance results:

For single shard connection/disconnection scenario insns/conn decreased by *5%*,
allocs/conn decreased by *23%*, tasks/conn decreased by *20%*. Results for 20 shards are very similar.

Raw data before:
```
≡ ◦ ⤖ rm -rf /tmp/scylla-data && build/release/scylla perf-cql-raw --workdir /tmp/scylla-data --smp 1 --developer-mode 1 --username cassandra --password cassandra --connection-per-request true 2> /dev/null
Running test with config: {workload=read, partitions=10000, concurrency=100, duration=5, ops_per_shard=0, auth, connection_per_request}
Pre-populated 10000 partitions
1128.55 tps (599.2 allocs/op,   0.0 logallocs/op, 145.2 tasks/op, 2586610 insns/op, 1350912 cycles/op,        0 errors)
1157.41 tps (601.3 allocs/op,   0.0 logallocs/op, 145.2 tasks/op, 2589046 insns/op, 1356691 cycles/op,        0 errors)
1167.42 tps (603.3 allocs/op,   0.0 logallocs/op, 145.2 tasks/op, 2603234 insns/op, 1360607 cycles/op,        0 errors)
1159.63 tps (605.9 allocs/op,   0.0 logallocs/op, 145.3 tasks/op, 2609977 insns/op, 1363935 cycles/op,        0 errors)
1165.12 tps (608.8 allocs/op,   0.0 logallocs/op, 145.2 tasks/op, 2625804 insns/op, 1365736 cycles/op,        0 errors)
throughput:
	mean=   1155.63 standard-deviation=15.66
	median= 1159.63 median-absolute-deviation=9.49
	maximum=1167.42 minimum=1128.55
instructions_per_op:
	mean=   2602934.31 standard-deviation=16063.01
	median= 2603234.19 median-absolute-deviation=13887.96
	maximum=2625804.05 minimum=2586609.82
cpu_cycles_per_op:
	mean=   1359576.30 standard-deviation=5945.69
	median= 1360607.05 median-absolute-deviation=4358.94
	maximum=1365736.42 minimum=1350912.10
```

Raw data after:
```
≡ ◦ ⤖ rm -rf /tmp/scylla-data && build/release/scylla perf-cql-raw --workdir /tmp/scylla-data --smp 1 --developer-mode 1 --username cassandra --password cassandra --connection-per-request true --duration 10 2> /dev/null
Running test with config: {workload=read, partitions=10000, concurrency=100, duration=10, ops_per_shard=0, auth, connection_per_request}
Pre-populated 10000 partitions
1132.09 tps (457.5 allocs/op,   0.0 logallocs/op, 115.1 tasks/op, 2432485 insns/op, 1270655 cycles/op,        0 errors)
1157.70 tps (458.4 allocs/op,   0.0 logallocs/op, 115.1 tasks/op, 2447779 insns/op, 1283768 cycles/op,        0 errors)
1162.86 tps (459.0 allocs/op,   0.0 logallocs/op, 115.1 tasks/op, 2463225 insns/op, 1291782 cycles/op,        0 errors)
1153.15 tps (460.2 allocs/op,   0.0 logallocs/op, 115.2 tasks/op, 2469230 insns/op, 1296381 cycles/op,        0 errors)
1142.09 tps (460.6 allocs/op,   0.0 logallocs/op, 115.1 tasks/op, 2478900 insns/op, 1299342 cycles/op,        0 errors)
1124.89 tps (462.5 allocs/op,   0.0 logallocs/op, 115.2 tasks/op, 2470962 insns/op, 1305026 cycles/op,        0 errors)
1156.75 tps (464.4 allocs/op,   0.0 logallocs/op, 115.1 tasks/op, 2493823 insns/op, 1305136 cycles/op,        0 errors)
1152.16 tps (466.3 allocs/op,   0.0 logallocs/op, 115.2 tasks/op, 2497246 insns/op, 1309816 cycles/op,        0 errors)
1154.77 tps (469.8 allocs/op,   0.0 logallocs/op, 115.5 tasks/op, 2571954 insns/op, 1345341 cycles/op,        0 errors)
1152.22 tps (472.4 allocs/op,   0.0 logallocs/op, 115.3 tasks/op, 2551954 insns/op, 1334202 cycles/op,        0 errors)
throughput:
	mean=   1148.87 standard-deviation=12.08
	median= 1153.15 median-absolute-deviation=7.88
	maximum=1162.86 minimum=1124.89
instructions_per_op:
	mean=   2487755.88 standard-deviation=43838.23
	median= 2478900.02 median-absolute-deviation=24531.06
	maximum=2571954.26 minimum=2432485.38
cpu_cycles_per_op:
	mean=   1304144.76 standard-deviation=22129.55
	median= 1305025.71 median-absolute-deviation=12363.25
	maximum=1345341.16 minimum=1270655.17
```

Fixes https://github.com/scylladb/scylladb/issues/18891
Backport: no, it's a new feature

Closes scylladb/scylladb#26841

* github.com:scylladb/scylladb:
  auth: use auth cache on login path
  auth: corutinize standard_role_manager::can_login
  main: auth: add auth cache dependency to auth service
  raft: update auth cache when data changes
  auth: storage_service: reload auth cache on v1 to v2 auth migration
  raft: reload auth cache on snapshot application
  service: add auth cache getter to storage service
  main: start auth cache service
  auth: add unified cache implementation
  auth: move table names to common.hh
2025-12-03 16:45:01 +02:00
Wojciech Mitros
6221c58325 mv: replace the simple/complex rack-aware pairing with exact rack matching
When the initial version of rack-aware pairing was introduced, materialized
views with tablets were still experimental. Since then, we decided that
we'll only allow materialized views in clusters where the base table and
the view are replicated on the same racks, with one replica of each tablet
on each rack.
This allows us to remove almost all logic from our base-view pairing. The
only check for the paired view replica is now whether it's in the same
rack as the base replica sending the update.
In this patch we replace the simple and complex rack-aware pairing with
the simple check above.
Because of this, we have to remove a test case from network_topology_strategy_test
which was testing complex pairing. The tested topology is not supported
for views with tablets (or is unlikely to be supported, as it's a random test),
so there's no use keeping the test.
The test case for simple rack aware pairing was kept, but now we only test
the case where each rack has one replica, not multiple.
Additionally, we split finding of an unpaired replica to a separate function
and partially rewrite it without reusing the helper stuctures that were
present when calculating the simple and complex rack-aware pairing.
We only look for an unpaired replica if we couldn't find a paired replica
ourselves or if the number of view replicas didn't match the base replicas.
If an unpaired replica appears while these conditions pass, we won't send
an extra update, but that would be a new bug altogether, because we only
expect the unpaired replica to appear during RF changes, so when these
conditions aren't fulfilled.

Fixes https://github.com/scylladb/scylladb/issues/26313
2025-12-02 10:52:36 +01:00
Wojciech Mitros
c313b215e4 mv: unify self-pairing and rack-aware pairing into one bool
We always use "legacy self pairing" when not using tablets, and
the "rack aware pairing" has been enabled in every version where
views with tablets isn't experimental. So in practice, instead
of checking these variables we can just look at whether the
table uses tablets.
2025-12-02 03:32:32 +01:00
Avi Kivity
d4be9a058c Update seastar submodule
seastar::compat::source_location (which should not have been used
outside Seastar) is replaced with std::source_location to avoid
deprecation warnings. The relevant header, which was removed, is no
longer included.

* seastar 8c3fba7a...b5c76d6b (3):
  > testing: There can be only one memory_data_sink
  > util: Use std::source_location directly
  > Merge 'net: support proxy protocol v2' from Avi Kivity
    apps: httpd: add --load-balancing-algorithm
    apps: httpd: add /shard endpoint
    test: socket_test: add proxy protocol v2 test suite
    test: socket_test: test load balancer with proxy protocol
    net: posix_connected_socket: specialize for proxied connections
    net: posix_server_socket_impl: implement proxy protocol in server sockets
    net: posix_server_socket_impl: adjust indentation
    net: posix_server_socket_impl: avoid immediately-invoked lambda
    net: conntrack: complete handle nested class special member functions
    net: posix_server_socket_impl: coroutinize accept()

Closes scylladb/scylladb#27316
2025-11-30 12:38:47 +02:00
Calle Wilund
59c87025d1 commitlog::read_log_file: Check for eof position on all data reads
Fixes #24346

When reading, we check for each entry and each chunk, if advancing there
will hit EOF of the segment. However, IFF the last chunk being read has
the last entry _exactly_ matching the chunk size, and the chunk ending
at _exactly_ segment size (preset size, typically 32Mb), we did not check
the position, and instead complained about not being able to read.

This has literally _never_ happened in actual commitlog (that was replayed
at least), but has apparently happened more and more in hints replay.

Fix is simple, just check the file position against size when advancing
said position, i.e. when reading (skipping already does).

v2:

* Added unit test

Closes scylladb/scylladb#27236
2025-11-28 15:26:46 +03:00
Michael Litvak
97b7c03709 tablet: scheduler: Do not emit conflicting migration in merge colocation
The tablet scheduler should not emit conflicting migrations for the same
tablet. This was addressed initially in scylladb/scylladb#26038 but the
check is missing in the merge colocation plan, so add it there as well.

Without this check, the merge colocation plan could generate a
conflicting migration for a tablet that is already scheduled for
migration, as the test demonstrates.

This can cause correctness problems, because if the load balancer
generates two migrations for a single tablet, both will be written as
mutations, and the resulting mutation could contain mixed cells from
both migrations.

Fixes scylladb/scylladb#27304

Closes scylladb/scylladb#27312
2025-11-28 11:17:12 +01:00
Pavel Emelyanov
54edb44b20 code: Stop using seastar::compat::source_location
And switch to std::source_location.
Upcoming seastar update will deprecate its compatibility layer.

The patch is

  for f in $(git grep -l 'seastar::compat::source_location'); do
    sed -e 's/seastar::compat::source_location/std::source_location/g' -i $f;
  done

and removal of few header includes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27309
2025-11-27 19:10:11 +02:00
Andrzej Jackowski
e366030a92 treewide: seastar module update
The reason for this seastar update is to have the fixed handling
of the `integer` type in `seastar-json2code` because it's needed
for further development of ScyllaDB REST API.

The following changes were introduced to ScyllaDB code to ensure it
compiles with the updated seastar:
 - Remove `seastar/util/modules.hh` includes as the file was removed
   from seastar
 - Modified `metrics::impl::labels_type` construction in
   `test/boost/group0_test.cc` because now it requires `escaped_string`

* seastar 340e14a7...8c3fba7a (32):
  > Merge 'Remove net::packet usage from dns.cc' from Pavel Emelyanov
    dns: Optimize packet sending for newer c-ares versions
    dns: Replace net::packet with vector<temporary_buffer>
    dns: Remove unused local variable
    dns: Remove pointless for () loop wrapping
    dns: Introduce do_sendv_tcp() method
    dns: Introduce do_send_udp() method
  > test: Add http rules test of matching order
  > Merge 'Generalize packet_data_source into memory_data_source' from Pavel Emelyanov
    memcached: Patch test to use memory_data_source
    memcached: Use memory_data_source in server
    rpc: Use memory_data_sink without constructing net::packet
    util: Generalize packet_data_source into memory_data_source
  > tests: coroutines: restore "explicit this" tests
  > reactor: remove blocking of SIGILL
  > Merge 'Update compilers in GH actions scripts' from Pavel Emelyanov
    github: Use gcc-14
    github: Use clang-20
  > Merge 'Reinforce DNS reverse resolution test ' from Pavel Emelyanov
    test: Make test_resolve() try several addresses
    test: Coroutinize test_resolve() helper
  > modules: make module support standards-compliant
  > Merge 'Fix incorrect union access in dns resolver' from Pavel Emelyanov
    dns: Squash two if blocks together
    dns: Do not check tcp entry for udp type
  > coroutine: Fix compilation of execute_involving_handle_destruction_in_await_suspend
  > promise: Document that promise is resolved at most once
  > coroutine: exception: workaround broken destroy coroutine handle in await_suspend
  > socket: Return unspecified socket_address for unconnected socket
  > smp: Fix exception safety of invoke_on_... internal copying
  > Merge 'Improve loads evaluation by reactor' from Pavel Emelyanov
    reactor: Keep loads timer on reactor
    reactor: Update loads evaluation loop
  > Merge 'scripts: add 'integer' type to seastar-json2code' from Andrzej Jackowski
    test: extend tests/unit/api.json to use 'integer' type
    scripts: add 'integer' type to seastar-json2code
  > Merge 'Sanitize tls::session::do_put(_one)? overloads' from Pavel Emelyanov
    tls: Rename do_put_one(temporary_buffer) into do_put()
    tls: Fix indentation after previous patch
    tls: Move semaphore grab into iterating do_put()
  > net: tcp: change unsent queue from packets to temporary_buffer:s
  > timer: Enable highres timer based on next timeout value
  > rpc: Add a new constructor in closed_error to accept string argument
  > memcache: Implement own data sink for responses
  > Merge 'file: recursive_remove_directory: general cleanup' from Avi Kivity
    file: do_recursive_remove_directory(): move object when popping from queue
    file: do_recursive_remove_directory(): adjust indentation
    file: do_recursive_remove_directory(): coroutinize
    file: do_recursive_remove_directory(): simplify conditional
    file: do_recursive_remove_directory(): remove wrong const
    file: do_recursive_remove_directory(): clean up work_entry
  > tests: Move thread_context_switch_test into perf/
  > test: Add unit test for append_challenged_posix_file
  > Merge 'Prometheus metrics handler optimization' from Travis Downs
    prometheus: optimize metrics aggregation
    prometheus: move and test aggregate_by helper
    prometheus: various optimizations
    metrics: introduce escaped_string for label values
    metric:value: implement + in terms of +=
    tests: add prometheus text format acceptance tests
    extract memory_data_sink.hh
    metrics_perf: enhance metrics bench
  > demos: Simplify udp_zero_copy_demo's way of preparing the packet
  > metrics: Remove deprecated make_...-ers
  > Merge 'Make slab_test be BOOST kind' from Pavel Emelyanov
    test: Use BOOST_REQUIRE checkers
    test: Replace some SEASTAR_ASSERT-s with static_assert-s
    test: Convert slab test into boost kind
  > Merge 'Coroutinize lister_test' from Pavel Emelyanov
    test: Fix indentation after previuous patch
    test: Coroutinize lister_test lister::report() method
    test: Coroutinize lister_test main code
  > file: recursive_remove_directory(): use a list instead of a deque
  > Merge 'Stop using packets in tls data_sink and session' from Pavel Emelyanov
    tls: Stop using net::packet in session::put()
    tls: Fix indentation after previous patch
    tls: Split session::do_put()
    tls: Mark some session methods private

Closes scylladb/scylladb#27240
2025-11-27 12:34:22 +02:00
Wojciech Mitros
323e5cd171 mv: allow setting concurrency in PRUNE MATERIALIZED VIEW
The PRUNE MATERALIZED VIEW statement is performed as follows:
1. Perform a range scan of the view table from the view replicas based
on the ranges specified in the statement.
2. While reading the paged scan above, for each view row perform a read
from all base replicas at the corresponding primary key. If a discrepancy
is detected, delete the row in the view table.

When reading multiple rows, this is very slow because for each view row
we need to performe a single row query on multiple replicas.
In this patch we add an option to speed this up by performing many of the
single base row reads concurrently, at the concurrency specified in the
USING CONCURRENCY clause.

Fixes https://github.com/scylladb/scylladb/issues/27070
2025-11-27 00:02:28 +01:00
Patryk Jędrzejczak
cc273e867d Merge 'fix notification about expiring erm held for to long' from Gleb Natapov
Commit 6e4803a750 broke notification about expired erms held for too long since it resets the tracker without calling its destructor (where notification is triggered). Fix the assign operator to call the destructor like it should.

Fixes https://github.com/scylladb/scylladb/issues/27141

Closes scylladb/scylladb#27140

* https://github.com/scylladb/scylladb:
  test: test that expired erm that held for too long triggers notification
  token_metadata: fix notification about expiring erm held for to long
2025-11-26 12:59:00 +01:00
Marcin Maliszkiewicz
b29c42adce main: auth: add auth cache dependency to auth service
In the following commit we'll switch some authorizer
and role manager code to use the cache so we're preparing
the dependency.
2025-11-26 12:01:31 +01:00
Gleb Natapov
5dcdaa6f66 test: test that expired erm that held for too long triggers notification 2025-11-25 17:33:54 +02:00
Karol Nowacki
ca62effdd2 vector_search: Restrict vector index tests to tablets only
Vector indexes are going to be supported only for tablets (see VECTOR-322).
As a result, tests using vector indexes will be failing when run with vnodes.

This change ensures tests using vector indexes run exclusively with tablets.

Fixes: VECTOR-49

Closes scylladb/scylladb#26843
2025-11-25 09:26:16 +02:00
Botond Dénes
296d7b8595 Merge 'Enable digest+checksum verification for file based streaming' from Taras Veretilnyk
This patch enables integrity check in  'create_stream_sources()' by introducing a new 'sstable_data_stream_source_impl' class for handling the Data component of SSTables. The new implementation uses 'sstable::data_stream()' with 'integrity_check::yes' instead of the raw input_stream.

These additional checks require reading the digest and CRC components from disk, which may introduce some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data.
For compressed SSTables - where checksums are already embedded  - the cost comes from reading, calculating and verifying the diges.

New test cases were added to verify that the integrity checks work correctly, detecting both data and digest mismatches.

Backport is not required, since it is a new feature

Fixes #21776

Closes scylladb/scylladb#26702

* github.com:scylladb/scylladb:
  file_stream_test: add sstable file streaming integrity verification test cases
  streaming: prioritize sender-side errors in tablet_stream_files
  sstables: enable integrity check for data file streaming
  sstables: Add compressed raw streaming support
  sstables: Allow to read digest and checksum from user provided file instance
  sstables: add overload of data_stream() to accept custom file_input_stream_options
2025-11-24 06:37:27 +02:00
Aleksandra Martyniuk
76174d1f7a cql3: reject ALTER KEYSPACE if rf of datacenter with tablets is omitted
In ALTER KEYSPACE, when a datacenter name is omitted, its replication
factor is implicitly set to zero with vnodes, while with tablets,
it remains unchanged.

ALTER KEYSPACE should behave the same way for tablets as it does
for vnodes. However, this can be dangerous as we may mistakenly
drop the whole datacenter.

Reject ALTER KEYSPACE if it changes replication factor, but omits
a datacenter that currently contains tablet replicas.

Fixes: https://github.com/scylladb/scylladb/issues/25549.

Closes scylladb/scylladb#25731
2025-11-24 06:36:51 +02:00
Avi Kivity
85db7b1caf Merge 'address_map: Use more efficient and reliable replication method' from Tomasz Grabiec
Primary issue with the old method is that each update is a separate
cross-shard call, and all later updates queue behind it. If one of the
shards has high latency for such calls, the queue may accumulate and
system will appear unresponsive for mapping changes on non-zero shards.

This happened in the field when one of the shards was overloaded with
sstables and compaction work, which caused frequent stalls which
delayed polling for ~100ms. A queue of 3k address updates
accumulated, because we update mapping on each change of gossip
states. This made bootstrap impossible because nodes couldn't
learn about the IP mapping for the bootstrapping node and streaming
failed.

To protect against that, use a more efficient method of replication
which requires a single cross-shard call to replicate all prior
updates.

It is also more reliable, if replication fails transiently for some
reason, we don't give up and fail all later updates.

Fixes #26865

Closes scylladb/scylladb#26941

* github.com:scylladb/scylladb:
  address_map: Use barrier() to wait for replication
  address_map: Use more efficient and reliable replication method
  utils: Introduce helper for replicated data structures
2025-11-23 19:15:12 +02:00
Taras Veretilnyk
3003669c96 file_stream_test: add sstable file streaming integrity verification test cases
Add 'test_sstable_stream' to verify SSTable file streaming integrity check.
The new tests cover both compressed and uncompressed SSTables and includes:
- Checksum mismatch detection verification
- Digest mismatch detection verifivation
2025-11-21 12:52:35 +01:00
Calle Wilund
03408b185e utils::gcp::object_storage: Fix buffer alignment reordering trailing data
Fixes #26874

Due to certain people (me) not being able to tell forward from backward,
the data alignment to ensure partial uploads adhere to the 256k-align
rule would potentially _reorder_ trailing buffers generated iff the
source buffers input into the sink are small enough. Which, as a fun fact,
they are in backup upload.

Change the unit test to use raw sink IO and add two unit tests (of which
the smaller size provokes the bug) that checks the same 64k buf segmented
upload backup uses.

Closes scylladb/scylladb#26938
2025-11-21 09:36:13 +02:00
Gleb Natapov
ad3cf2c174 utils: fix get_random_time_UUID_from_micros to generate correct time uuid
According to the IETF spec uuid variant bits should be set to '10'. All
others are either invalid or reserved. The patch change the code to
follow the spec.

Closes scylladb/scylladb#27073
2025-11-20 10:27:29 +02:00
Botond Dénes
6ee0f1f3a7 Merge 'replica/table: add a metric for hypothetical total file size without compression' from Michał Chojnowski
This patch adds a metric for pre-compression size of sstable files.

This patch adds a per-table metric
`scylla_column_family_total_disk_space_before_compression`,
which measures the hypothetical total size of sstables on disk,
if Data.db was replaced with an uncompressed equivalent.

As for the implementation:
Before the patch, tables and sstable sets are already tracking their total physical file size.
Whenever sstables are added or removed, the size delta is propagated from the sstable up through sstable sets into table_stats.
To implement the new metric, we turn the size delta that is getting passed around from a one-dimensional to a two-dimensional value, which includes both the physical and the pre-compression size.

New functionality, no backport needed.

Closes scylladb/scylladb#26996

* github.com:scylladb/scylladb:
  replica/table: add a metric for hypothetical total file size without compression
  replica/table: keep track of total pre-compression file size
2025-11-20 09:10:38 +02:00
Tomasz Grabiec
f83c4ffc68 address_map: Use barrier() to wait for replication
More efficient than 100 pings.

There was one ping in test which was done "so this shard notices the
clock advance". It's not necessary, since obsering completed SMP
call implies that local shard sees the clock advancement done within in.
2025-11-19 15:21:02 +01:00
Tomasz Grabiec
4a85ea8eb2 address_map: Use more efficient and reliable replication method
Primary issue with the old method is that each update is a separate
cross-shard call, and all later updated queue behind it. If one of the
shards has high latency for such calls, the queue may accumulate and
system will appear unresponsive for mapping changes on non-zero shards.

This happened in the field when one of the shards was overloaded with
sstables and compaction work, which caused frequent stalls which
delayed polling for ~100ms. A queue of 3k address updates
accumulated. This made bootstrap impossible, since nodes couldn't
learn about the IP mapping for the bootstrapping node and streaming
failed.

To protect against that, use a more efficient method of replication
which requires a single cross-shard call to replicate all prior
updates.

It is also more reliable, if replication fails transiently for some
reason, we don't give up and fail all later updates.

Fixes #26865
Fixes #26835
2025-11-19 15:21:02 +01:00
Tomasz Grabiec
ed8d127457 utils: Introduce helper for replicated data structures
Key goals:
  - efficient (batching updates)
  - reliable (no lost updates)

Will be used in data structures maintained on one designed owning
shard and replicated to other shards.
2025-11-19 15:21:02 +01:00
Michał Chojnowski
d8e299dbb2 sstables/trie/trie_writer: free nodes after they are flushed
Somehow, the line of code responsible for freeing flushed nodes
in `trie_writer` is missing from the implementation.

This effectively means that `trie_writer` keeps the whole index in
memory until the index writer is closed, which for many dataset
is a guaranteed OOM.

Fix that, and add some test that catches this.

Fixes scylladb/scylladb#27082

Closes scylladb/scylladb#27083
2025-11-19 14:54:16 +02:00
Botond Dénes
8579e20bd1 Merge 'Enable digest+checksum verification for streaming/repair' from Taras Veretilnyk
This PR enables integrity check of both checksum and digest for repair/streaming.
In the past, streaming readers only verified the checksum of compressed SSTables.

This change extends the checks to include the digest and the checksum (CRC) for both compressed and uncompressed SSTables. These additional checks require reading the digest and CRC components from disk, which may cause some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data, while for compressed SSTables - where checksums are already verified inline - the only extra cost is reading and verifying the digest.If the reader range doesn't cover the full SSTable, the digest is not loaded and check is skipped.

To support testing of these changes, a new option was added to the random_mutation_generator that allows disabling compression.
Several new test cases were added to verify that the repair_reader correctly detects corruption. These tests corrupt digest or data component of an SSTable and confirm that the system throws the expected `malformed_sstable_exception`.

Backport is not required, it is an improvement

Refs #21776

Closes scylladb/scylladb#26444

* github.com:scylladb/scylladb:
  boost/repair_test: add repair reader integrity verification test cases
  test/lib: allow to disable compression in random_mutation_generator
  sstables: Skip checksum and digest reads for unlinked SSTables
  table: enable integrity checks for streaming reader
  table: Add integrity option to table::make_sstable_reader()
  sstables: Add integrity option to create_single_key_sstable_reader
2025-11-14 18:00:33 +02:00
Taras Veretilnyk
e7ceb13c3b boost/repair_test: add repair reader integrity verification test cases
Adds test cases to verify that repair_reader correctly detects SSTable(both comprossed and uncompressed) checksum mismatch.
Digest mismatch verification is not possible as repair readar may skip some sstable data, which automatically disables digest verification.

Each test corrupts the Data component on disk and ensures the reader throws a malformed_sstable_exception with the expected error message.
2025-11-13 14:08:33 +01:00
Piotr Dulikowski
2e5eb92f21 Merge 'cdc: use CDC schema that is compatible with the base schema' from Michael Litvak
When generating CDC log mutations for some base mutation, use a CDC schema that is compatible with the base schema.

The compatible CDC schema has for every base column a corresponding CDC column with the same name. If using a non-compatible schema, we may encounter a situation, especially during ALTER, that we have a mutation with a base column set with some value, but the CDC schema doesn't have a column by that name. This would cause the user request to fail with an error.

We add to the schema object a schema_ptr that for CDC-enabled tables points to the schema object of the CDC table that is compatible with the schema. It is set by the schema merge algorithm when creating the schema for a table that is created or altered. We use the fact that a base table and its CDC table are created and altered in the same group0 operation, and this way we can find and set the cdc schema for a base table.

When transporting the base schema as a frozen schema between shards, we transport with it the frozen cdc schema as well.

The patch starts with a series of refactoring commits that make extending the frozen schema easier and cleans up some duplication in the code about the frozen schema. We combine the two types `frozen_schema_with_base_info` and `view_schema_and_base_info` to a single type `extended_frozen_schema` that holds a frozen schema with additional data that is not part of the schema mutations but needs to be transported with it to unfreeze it - base_info, and the frozen cdc schema which is added in a later commit.

Fixes https://github.com/scylladb/scylladb/issues/26405

backport not needed - enhancement

Closes scylladb/scylladb#24960

* github.com:scylladb/scylladb:
  test: cdc: test cdc compatible schema
  cdc: use compatiable cdc schema
  db: schema_applier: create schema with pointer to CDC schema
  db: schema_applier: extract cdc tables
  schema: add pointer to CDC schema
  schema_registry: remove base_info from global_schema_ptr
  schema_registry: use extended_frozen_schema in schema load
  schema_registry: replace frozen_schema+base_info with extended_frozen_schema
  frozen_schema: extract info from schema_ptr in the constructor
  frozen_schema: rename frozen_schema_with_base_info to extended_frozen_schema
2025-11-13 10:11:54 +01:00
Michał Chojnowski
1cfce430f1 replica/table: keep track of total pre-compression file size
Every table and sstable set keeps track of the total file size
of contained sstables.

Due to a feature request, we also want to keep track of the hypothetical
file size if Data files were uncompressed, to add a metric that
shows the compression ratio of sstables.

We achieve this by replacing the relevant `uint_64 bytes_on_disk`
counters everywhere with a struct that contains both the actual
(post-compression) size and the hypothetical pre-compression size.

This patch isn't supposed to change any observable behavior.
In the next patch, we will use these changes to add a new metric.
2025-11-13 00:49:57 +01:00
Tomasz Grabiec
10b893dc27 Merge 'load_stats: fix bug in migrate_tablet_size()' from Ferenc Szili
`topology_cooridinator::migrate_tablet_size()` was introduced in 10f07fb95a. It has a bug where the has_tablet_size() lambda always returns false because of bad comparison of iterators after a table and tablet search:

```
if (auto table_i = tables.find(gid.table); table_i != tables.find(gid.table)) {
    if (auto size_i = table_i->second.find(trange); size_i != table_i->second.find(trange)) {
```

This change also fixes a problem where the `migrate_tablet_size()` would crash with a `std::out_of_range` if the pending node was not present in load_stats.

This change fixes these two problems and moves the functionality into a separate method of `load_stats`. It also adds tests for the new method.

A version containing this bug has not been released yet, so no backport is needed.

Closes scylladb/scylladb#26946

* github.com:scylladb/scylladb:
  load_stats: add test for migrate_tablet_size()
  load_stats: fix problem with tablet size migration
2025-11-12 23:48:37 +01:00
Ferenc Szili
fcbc239413 load_stats: add test for migrate_tablet_size()
This change adds tests which validate the functionality of
load_stats::migrate_tablet_size()
2025-11-11 14:28:31 +01:00
Benny Halevy
a290505239 utils: stall_free: add dispose_gently
dispose_gently consumes the object moved to it,
clearing it gently before it's destroyed.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>

Closes scylladb/scylladb#26356
2025-11-11 12:20:18 +02:00
Pavel Emelyanov
decf86b146 Merge 'Make AWS & Azure KMS boost testing use fixture + include Azure in pytests' from Calle Wilund
* Adds test fixture for AWS KMS
* Adds test fixture for Azure KMS
* Adds key provider proxy for Azure to pytests (ported dtests)
* Make test gather for boost tests handle suites
* Fix GCP test snafu

Fixes #26781
Fixes #26780
Fixes #26776
Fixes #26775

Closes scylladb/scylladb#26785

* github.com:scylladb/scylladb:
  gcp_object_storage_test: Re-enable parallelism.
  test::pylib: Add azure (mock) testing to EAR matrix
  test::boost::encryption_at_rest: Remove redundant azure test indent
  test::boost::encryption_at_rest: Move azure tests to use fixture
  test::lib: Add azure mock/real server fixture
  test::pylib::boost: Fix test gather to handle test suites
  utils::gcp::object_storage: Fix typo in semaphore init
  test::boost::encryption_at_rest_test: Remove redundant indent
  test::boost::test_encryption_at_rest: Move to AWS KMS fixture for kms test
  test::boost::test_encryption_at_rest: Reorder tests and helpers
  ent::encryption: Make text helper routines take std::string
  test::pylib::dockerized_service: Handle docker/podman bind error message
  test::lib::aws_kms_fixture: Add a fixture object to run mock AWS KMS
  test::lib::gcs_fixture: Only set port if running docker image + more retry
2025-11-10 14:35:05 +03:00
Botond Dénes
cdba3bebda Merge 'Generalize directory checks in database_test's snapshot test cases' from Pavel Emelyanov
Those test cases use lister::scan_dir() to validate the contents of snapshot directory of a table against this table's base directory. This PR generalizes the listing code making it shorter.

Also, the snapshot_skip_flush_works case is missing the check for "schema.cql" file. Nothing is wrong with it, but the test is more accurate if checking it.

Also, the snapshot_with_quarantine_works case tries to check if one set of names is sub-set of another using lengthy code. Using std::includes improves the test readability a lot.

Also, the PR replaces lister::scan_dir() with directory_lister. The former is going to be removed some day (see also #26586)

Improving existing working test, no backport is needed.

Closes scylladb/scylladb#26693

* github.com:scylladb/scylladb:
  database_test: Simplify snapshot_with_quarantine_works() test
  database_test: Improve snapshot_skip_flush_works test
  database_test: Simplify snapshot_works() tests
  database_test: Use collect_files() to remove files
  database_test: Use collectz_files() to count files in directory
  database_test: Introduce collect_files() helper
2025-11-07 16:04:02 +02:00
Michał Chojnowski
b82c2aec96 sstables/trie: fix an assertion violation in bti_partition_index_writer_impl::write_last_key
_last_key is a multi-fragment buffer.

Some prefix of _last_key (up to _last_key_mismatch) is
unneeded because it's already a part of the trie.
Some suffix of _last_key (after needed_prefix) is unneeded
because _last_key can be differentiated from its neighbors even without it.

The job of write_last_key() is to find the middle fragments,
(containing the range `[_last_key_mismatch, needed_prefix)`)
trim the first and last of the middle fragments appropriately,
and feed them to the trie writer.

But there's an error in the current logic,
in the case where `_last_key_mismatch` falls on a fragment boundary.
To describe it with an example, if the key is fragmented like
`aaa|bbb|ccc`, `_last_key_mismatch == 3`, and `needed_prefix == 7`,
then the intended output to the trie writer is `bbb|c`,
but the actual output is `|bbb|c`. (I.e. the first fragment is empty).

Technically the trie writer could handle empty fragments,
but it has an assertion against them, because they are a questionable thing.

Fix that.

We also extend bti_index_test so that it's able to hit the assert
violation (before the patch). The reason why it wasn't able to do that
before the patch is that the violation requires decorated keys to differ
on the _first_ byte of a partition key column, but the keys generated
by the test only differed on the last byte of the column.
(Because the test was using sequential integers to make the values more
human-readable during debugging). So we modify the key generation
to use random values that can differ on any position.

Fixes scylladb/scylladb#26819

Closes scylladb/scylladb#26839
2025-11-07 11:25:07 +02:00