Commit Graph

4398 Commits

Author SHA1 Message Date
Michael Litvak
3a06c32749 schema_registry: fix learning a schema with cdc schema
When learning a schema that has a linked cdc schema, we need to learn
also the cdc schema, and at the end the schema should point to the
learned cdc schema.

This is needed because the linked cdc schema is used for generating cdc
mutations, and when we process the mutations later it is assumed in some
places that the mutation's schema has a schema registry entry.

We fix a scenario where we could end up with a schema that points to a
cdc schema that doesn't have a schema registry entry. This could happen
for example if the schema is loaded before it is learned, so when we
learn it we see that it already has an entry. In that case, we need to
set the cdc schema to the learned cdc schema as well, because it could
have been loaded previously with a cdc schema that was not learned.

Fixes scylladb/scylladb#27610

Closes scylladb/scylladb#27704
2025-12-17 20:01:00 +02:00
Tomasz Grabiec
c077283352 Merge 'service: support conversion of tablet keyspaces to rack-list using ALTER KEYSPACE' from Aleksandra Martyniuk
If a keyspace has a numeric replication factor in a DC and rf < #racks,
then the replicas of tablets in this keyspace can be distributed among
all racks in the DC (different for each tablet). With rack list, we need all
tablet replicas to be placed on the same racks. Hence, the conversion
requires tablet co-location.

After this series, the conversion can be done using ALTER KEYSPACE
statement. The statement that does this conversion in any DC is not
allowed to change a rf in any DC. So, if we have dc1 and dc2 with 3 racks
each and a keyspace ks then with a single ALTER KEYSPACE we can do:
- {dc1 : 2} -> {dc1 : [r1, r2]};
- {dc1 : 2, dc2: 2} ->  {dc1 : [r1, r2], dc2: [r2,r3]};
- {dc1 : 2, dc2: 2} -> {dc1 : [r1, r2], dc2: 2}
- {dc1 : 2} -> {dc1 : 2, dc2 : [r1]}
But we cannot do:
- {dc1 : 2} -> {dc1 : [r1, r2, r3]};
- {dc1 : 1, dc2 : [r1, r2] → dc1: [r1], dc2: [r1].

In order to do the co-locations rf change request is paused. Tablet
load balancer examines the paused rf change requests and schedules
necessary tablet migrations. During the process of co-location, no other
cross-rack migration is allowed.

Load balancer checks whether any paused rf change request is
ready to be resumed. If so, it puts the request back to global topology
request queue.

While an rf change request for a keyspace is running, any other rf change
of this keyspace will fail.

Fixes: #26398.

New feature, no backport

Closes scylladb/scylladb#27279

* github.com:scylladb/scylladb:
  test: add est_rack_list_conversion_with_two_replicas_in_rack
  test: test creating tablet_rack_list_colocation_plan
  test: add test_numeric_rf_to_rack_list_conversion test
  tasks: service: add global_topology_request_virtual_task
  cql3: statements: allow altering from numeric rf to rack list
  service: topology_coordinator: pause keyspace_rf_change request
  service: implement make_rack_list_colocation_plan
  service: add tablet_rack_list_colocation_plan
  cql3: reject concurrent alter of the same keyspace
  test: check paused rf change requests persistence
  db: service: add paused_rf_change_requests to system.topology
  service: pass topology and system_keyspace to load_balancer ctor
  service: tablet_allocator: extract load updates
  service: tablet_allocator: extract ensure_node
  tasks, system_keyspace: Introduce get_topology_request_entry_opt()
  node_ops: Drop get_pending_ids()
  node_ops: Drop redundant get_status_helper()
2025-12-17 10:05:06 +01:00
Botond Dénes
f06db096bd test/boost/reader_concurrency_semaphore_test: un-flake memory limit engages test
This test was observed to fail multiple times recently in promotion,
because there were successful reads. The failure only reproduces on
arm64, it doesn't reproduce on x86.
The suspected reason is that the data set is too close to the edge,
where all reads fail due to too high memory consumption. Reduce the
number of sstables used by this test to 54 (from 64).

Fixes: #27248

Closes scylladb/scylladb#27650
2025-12-16 19:24:49 +03:00
Aleksandra Martyniuk
9d20f0a3d2 test: add est_rack_list_conversion_with_two_replicas_in_rack 2025-12-16 13:31:24 +01:00
Aleksandra Martyniuk
0476e8d272 test: test creating tablet_rack_list_colocation_plan 2025-12-16 13:31:24 +01:00
Aleksandra Martyniuk
1884e655d6 cql3: statements: allow altering from numeric rf to rack list
Allow altering from numeric replication factor to rack list. Ensure
that a single ALTER KEYSPACE statement doesn't try to both convert
to rack list and change rf.
2025-12-16 13:29:08 +01:00
Aleksandra Martyniuk
b3a0e4c2dc test: check paused rf change requests persistence 2025-12-16 13:25:38 +01:00
Aleksandra Martyniuk
d66a36058b service: pass topology and system_keyspace to load_balancer ctor
Pass a pointer to service::topology and db::system_keyspace to load
balancer. It will be used in the following patches to create
rack_list_colocation plan.
2025-12-16 13:25:38 +01:00
Botond Dénes
85f05fbe1b Revert "Merge 'Add digests for all sstable components in scylla metadata' from Taras Veretilnyk"
This reverts commit 866c96f536, reversing
changes made to 367633270a.

This change caused all longevities to fail, with a crash in parsing
scylla-metadata. The investigation is still ongoing, with no quick fix
in sight yet.

Fixes: #27496

Closes scylladb/scylladb#27518
2025-12-16 11:34:40 +02:00
Pavel Emelyanov
3f7ee3ce5d Merge 'batchlog: make replay (flush) faster' from Botond Dénes
The batchlog table contains an entry for each logged batch that is processed by the local node as coordinator. These entries are typically very short lived, they are inserted when the batch is processed and deleted immediately after the batch is successfully applied.
When a table has `tombstone_gc = {'mode': 'repair'}` enabled, every repair has to flush all hints and batchlogs, so that we can be certain that there is no live data in any of these, older than the last repair. Since batches can contain member queries from any number of tables, the whole batchlog has to be flushed, even if repair-mode tombstone-gc is enabled for a single table.

Flushing the batchlog table happens by doing a batchlog replay. This involves reading the entire content of this table, and attempting to replay+delete any live entries (that are old enough to be replayed).  Under normal operating circumstances, 99%+ of the content of the batchlog table is partition tombstones.  Because of this, scanning the content of this table has to process thousands to millions of tombstones. This was observed to require up to 20 minutes to finish, causing repairs to slow down to a crawl, as the batchlog-flush has to be repeated at the end of the repair of each token-range.

When trying to address this problem, the first idea was that we should expedite the garbage-collection of these accumulated tombstones. This experiment failed, see https://github.com/scylladb/scylladb/pull/23752. The commitlog proved to be an impossible to bypass barrier, preventing quick garbage-collection of tombstones. So long as a single commit-log segment is alive, holding content from the batchlog table, all tombstones written after are blocked from GC.
The second approach, represented by this PR, is to not rely in tombstone GC to reduce the tombstone amount. Instead restructure the table such that a single higher-order tombstone can be used to shadow and allow for the eviction of the myriads of individual batchlog entry tombstones. This is realized by reorganizing the batchlog table such that individual batches are rows, not partitions.
This new schema is introduced by the new `system.batchlog_v2` table, introduced by this PR:

    CREATE TABLE system.batchlog_v2 (
        version int,
        stage int,
        shard int,
        written_at timestamp,
        id uuid,
        data blob,
        PRIMARY KEY ((version, stage, shard), written_at, id));

The new schema organization has the following goals:
1) Make post-replay batchlog cleanup possible with a simple range-tombstone. This allows dropping the individual dead batchlog entries, as they are shadowed by a higher level tombstone. This enables dropping tombstones without tombstone GC.
2) To make the above possible, introduce the stage key component: batchlog entries that fail the first replay attempt, are moved to the failed_replay stage, so the initial stage can be cleaned up safely.
3) Spread out the data among Scylla shards, via the batchlog shard column.
4) Make batchlog entries ordered by the batchlog create time (id). This allows for selecting batchlogs to replay, without post-filtering of batchlogs that are too young to be replayed.

Fixes: https://github.com/scylladb/scylladb/issues/23358

This is an improvement, normally not a backport-candidate. We might override this and backport to allow wider use of `tombstone_gc: {'mode': 'repair'}`.

Closes scylladb/scylladb#26671

* github.com:scylladb/scylladb:
  db/config: change batchlog_replay_cleanup_after_replays default to 1
  test/boost/batchlog_manager_test: add test for batchlog cleanup
  replica/mutation_dump: always set position weight for clustering positions
  service/storage_proxy: s/batch_replay_throw/storage_proxy_fail_replay_batch/
  test/lib: introduce error_injection.hh
  utils/error_injection: add debug log to disable() and disable_all()
  test/lib/cql_test_env: forward config to batchlog
  test/lib/cql_test_env: add batch type to execute_batch()
  test/lib/cql_assertions: add with_size(predicate) overload
  test/lib/cql_assertions: add source location to fail messages
  test/lib/cql_assertions: columns_assertions: add assert_for_columns_of_each_row()
  test/lib/cql_assertions: rows_assertions::assert_for_columns_of_row(): add index bound check
  test/lib/cql_assertions: columns_assertions: add T* with_typed_column() overload
  db/batchlog_manager: config: s/write_timeout/reply_timeot/
  db,service: switch to system.batchlog_v2
  db/system_keyspace: introduce system.batchlog_v2
  service,db: extract generation of batchlog delete mutation
  service,db: extract get_batchlog_mutation_for() from storage-proxy
  db/batchlog_manager: only consider propagation delay with tombstone-gc=repair
  db/batchlog_manager: don't drop entire batch if one mutations' table was dropped
  data_dictionary: table: add get_truncation_time()
  db/batchlog_manager: batch(): replace map_reduce() with simple loop
  db/batchlog_manager: finish coroutinizing replay_all_failed_batches
  db/batchlog_manager: improve replayAllFailedBatches logs
2025-12-15 15:05:19 +03:00
Nadav Har'El
c06e63daed Merge 'auth: start using SHA 512 hashing originated from musl with added yielding' from Andrzej Jackowski
This patch series contains the following changes:
 - Incorporation of `crypt_sha512.c` from musl to out codebase
 - Conversion of `crypt_sha512.c` to C++ and coroutinization
 - Coroutinization of `auth::passwords::check`
 - Enabling use of `__crypt_sha512` orignated from `crypt_sha512.c` for
   computing SHA 512 passwords of length <=255
 - Addition of yielding in the aforementioned hashing implementation.

The alien thread was a solution for reactor stalls caused by indivisible
password‑hashing tasks (https://github.com/scylladb/scylladb/issues/24524).
However, because there is only one alien thread, overall hashing throughput was reduced
(see, e.g., https://github.com/scylladb/scylla-enterprise/issues/5711). To address this,
the alien‑thread solution is reverted, and a hashing implementation
with yielding is introduced in this patch series.

Before this patch series, ScyllaDB used SHA-512 hashing provided
by the `crypt_r` function, which in our case meant using the implementation
from the `libxcrypt` library. Adding yielding to this `libxcrypt`
implementation is problematic, both due to licensing (LGPL) and because the
implementation is split into many functions across multiple files. In
contrast, the SHA-512 implementation from `musl libc` has a more
permissive license and is concise, which makes it easier to incorporate
into the ScyllaDB codebase.

The performance of this solution was compared with the previous
implementation that used one alien thread and the implementation
after the alien thread was reverted. The results (median) of
`perf-cql-raw` with `--connection-per-request 1 --smp 10` parameters
are as follows:
 - Alien thread: 41.5 new connections/s per shard
 - Reverted alien thread: 244.1 new connections/s per shard
 - This commit (yielding in hashing): 198.4 new connections/s per shard

The roughly 20% performance deterioration compared to
the old implementation without the alien thread comes from the fact
that the new hashing algorithm implemented in `utils/crypt_sha512.cc`
performs an expensive self-verification and stack cleanup.

On the other hand, with smp=10 the current implementation achieves
roughly 5x higher throughput than the alien thread. In addition,
due to yielding added in this commit, the algorithm is expected
to provide similar protection from stalls as the alien thread did.
In a test that in parallel started a cassandra-stress workload and
created thousands of new connections using python-driver, the values of
`scylla_reactor_stalls_count` metric were as follows:
 - Alien thread: 109 stalls/shard total
 - Reverted alien thread: 13186 stalls/shard total
 - This commit (yielding in hashing): 149 stalls/shard total

Similarly, the `scylla_scheduler_time_spent_on_task_quota_violations_ms`
values were:
 - Alien thread: 1087 ms/shard total
 - Reverted alien thread: 72839 ms/shard total
 - This commit (yielding in hashing): 1623 ms/shard total

To summarize, yielding during hashing computations achieves similar
throughput to the old solution without the alien thread but also
prevents stalls similarly to the alien thread.

Fixes: scylladb/scylladb#26859
Refs: scylladb/scylla-enterprise#5711

No automatic backport. After this PR is completed, the alien thread should be rather reverted from older branches (2025.2-2025.4 because on 2025.1 it's already removed). Backporting of the other commits needs further discussion.

Closes scylladb/scylladb#26860

* github.com:scylladb/scylladb:
  test/boost: add too_long_password to auth_passwords_test
  test/boost: add same_hashes_as_crypt_r to auth_passwords_test
  auth: utils: add yielding to crypt_sha512
  auth: change return type of passwords::check to future
  auth: remove code duplication in verify_scheme
  test/boost: coroutinize auth_passwords_test
  utils: coroutinize crypt_sha512
  utils: make crypt_sha512.cc to compile
  utils: license: import crypt_sha512.c from musl to the project
  Revert "auth: move passwords::check call to alien thread"
2025-12-14 14:01:01 +02:00
copilot-swe-agent[bot]
77ee7f3417 Revert "Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy"
This reverts commit 8192f45e84.

The merge exposed a bug where truncate (via drop) fails and causes Raft
errors, leading to schema inconsistencies across nodes. This results in
test_table_drop_with_auto_snapshot failures with 'Keyspace test does not exist'
errors.

The specific problematic change was in commit 19b6207f which modified
truncate_table_on_all_shards to set use_sstable_identifier = true. This
causes exceptions during truncate that are not properly handled, leading
to Raft applier fiber stopping and nodes losing schema synchronization.
2025-12-12 03:55:13 +00:00
Avi Kivity
24264e24bb Revert "repair: Add tablet repair progress report support"
This reverts commit faad0167d7. It causes
a regression in

test_two_tablets_concurrent_repair_and_migration_repair_writer_level

in debug mode (with ~5%-10% probability).

Fixes #27510.

Closes scylladb/scylladb#27560
2025-12-11 12:18:11 +02:00
Andrzej Jackowski
11ad32c85e test/boost: add too_long_password to auth_passwords_test
The test documents the current behavior of hashing algorithms that
fail if the passphrase has 512 bytes or more.

Moreover, it documents the behavior of the current bcrypt
implementation that compares only the first 72 bytes of the password.
Although we don't typically use bcrypt for password hashing, it is
possible to insert such a hash using
`CREATE ROLE ... WITH HASHED PASSWORD ...`.

Refs: scylladb/scylladb#26842
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
4c8c9cd548 test/boost: add same_hashes_as_crypt_r to auth_passwords_test
The test verifies that the old and new implementation of SHA-512
hashing returns exactly the same values.

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
4ffdb0721f auth: change return type of passwords::check to future
Introduce a new `passwords::hash_with_salt_async` and change the return
type of `passwords::check` to `future<bool>`. This enables yielding
during password computations later in this patch series.

The old method, `hash_with_salt`, is marked as deprecated because
new code should use the new `hash_with_salt_async` function.
We are not removing `hash_with_salt` now to reduce the regression risk
of changing the hashing implementation—at least the methods that change
persistent hashes (CREATE, ALTER) will continue to use the old hashing
method. However, in the future, `hash_with_salt` should be entirely
removed.

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Andrzej Jackowski
11eca621b0 test/boost: coroutinize auth_passwords_test
This commit prepares `auth_passwords_test` for using coroutines,
because later in this patch series `auth::passwords::check` and other
similar functions will return Seastar futures.

Refs: scylladb/scylladb#26859
2025-12-10 15:36:18 +01:00
Nadav Har'El
95e303faf3 Merge 'Refactor get_view_natural_endpoint' from Wojciech Mitros
With the introduction of rack-lists and the reliance of materialized views on them, the `get_view_natural_endpoint` function can be greatly simplified. When using tablets, instead of doing any index-matching, we can now pair base tables with views only in the same rack.
In this series we remove no longer needed code and reorganize the needed code for better clarity.
After the changes, the `get_view_natural_endpoint` function goes down from 245 lines to 85 lines, while the whole pairing-related text goes down from 346 lines to 239 lines.

Fixes https://github.com/scylladb/scylladb/issues/26313

Closes scylladb/scylladb#27383

* github.com:scylladb/scylladb:
  mv: replace the simple/complex rack-aware pairing with exact rack matching
  mv: split out vnode pairing code from get_view_natural_endpoint
  mv: unify self-pairing and rack-aware pairing into one bool
  mv: remove the workaround for left nodes when sending view updates
2025-12-09 13:19:13 +02:00
Pavel Emelyanov
fb32e1c7ee Merge 'streaming: tablet_sstable_streamer::stream refactoring' from Ernest Zaslavsky
Refactor the way we decide the sstable belong to a tablet, fully or partially to simplify the flow and make it more readable. Also extract the logic and make it testable, add tests to cover changes

The change is purely aesthetic, no need to backport

Closes scylladb/scylladb#27101

* github.com:scylladb/scylladb:
  streaming: remove unnecessary lambda creating sstable token range
  streaming: simplify get_sstables_for_tablets logic
  streaming: switch to range-based for loop
  streaming: drop sstable skip microoptimization in tablet loop
  streaming: replace reverse iterators with reverse view in sstables scan
  streaming: return from get_sstables_for_tablets earlier
  streaming: add get_sstables_by_tablet_range tests
  test,sstables: add helper to set sstable first and last keys
  streaming: refactor get_sstables_for_tablets to make it accessible
  streaming: refactor get_sstables_for_tablets to make it testable
  streaming: refactor tablet_sstable_streamer::stream by extracting SST filtering logic
2025-12-09 10:53:57 +03:00
Michał Jadwiszczak
51843195f7 test/boost/view_build_test: increase number of retires
Default number of retires in `eventually()` in `test_builder_with_concurrent_drop`
sometimes is not enough to observe changes in system tables on aarch64
builds.

This patch increases the number of retries to 30.

Fixes scylladb/scylladb#27370

Closes scylladb/scylladb#27493
2025-12-08 23:14:01 +02:00
Asias He
faad0167d7 repair: Add tablet repair progress report support
This patch adds tablet repair progress report support so that the user
could use the /task_manager/task_status API to query the progress.

In order to support this, a new system table is introduced to record the
user request related info, i.e, start of the request and end of the
request.

The progress is accurate when tablet split or merge happens in the
middle of the request, since the tokens of the tablet are recorded when
the request is started and when repair of each tablet is finished. The
original tablet repair is considered as finished when the finished
ranges cover the original tablet token ranges.

After this patch, the /task_manager/task_status API will report correct
progress_total and progress_completed.

Fixes #22564
Fixes #26896

Closes scylladb/scylladb#26924
2025-12-08 13:35:19 +02:00
Ernest Zaslavsky
6fd5160947 streaming: add get_sstables_by_tablet_range tests
Add a comprehensive test suite that exercises various combinations of
SSTable containment within tablet ranges. These cases cover boundary
conditions, partial overlaps, and full containment to validate all
recent changes made to `get_sstables_by_tablet_range`.
2025-12-08 12:30:23 +02:00
Pavel Emelyanov
8192f45e84 Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy
This change adds a new option to the REST api and correspondingly, to scylla nodetool: use_sstable_identifier.
When set, we use the sstable identifier, if available, to name each sstable in the snapshots directory
and the manifest.json file, rather than using the sstable generation.

This can be used by the user (e.g. Scylla Manager) for global deduplication with tablets, where an sstable
may be migrated across shards or across nodes, and in this case, its generation may change, but its
sstable identifier remains sstable.

Currently, Scylla manager uses the sstable generation to detect sstables that are already backed up to
object storage and exist in previous backed up snapshots.
Historically, the sstable generation was guaranteed to be unique only per table per node,
so the dedup code currently checks for deduplication in the node scope.

However, with tablet migration, sstables are renamed when migrated to a different shard,
i.e. their generation changes, and they may be renamed when migrated to another node,
but even if they are not, the dedup logic still assumes uniqueness only within a node.

To address both cases, we keep the sstable_id stable throughout the sstable life cycle (since 3a12ad96c7).
Given the globally unique sstable identifier, scylla manager can now detect duplicate sstables
in a wider scope.  This can be cluster-wide, but we practically need only rack-wide deduplication
or dc-wide, as tablets are migrated across racks only in rare occasions (like when converting from a
numerical replication factor to a rack list containing a subset of the available racks in a datacenter).

Fixes #27181

* New feature, no backport required

Closes scylladb/scylladb#27184

* github.com:scylladb/scylladb:
  database: truncate_table_on_all_shards: set use_sstable_identifier to true
  nodetool: snapshot: add --use-sstable-identifier option
  api: storage_service: take_snapshot: add use_sstable_identifier option
  test: database_test: add snapshot_use_sstable_identifier_works
  test: database_test: snapshot_works: add validate_manifest
  sstable: write_scylla_metadata: add random_sstable_identifier error injection
  table: snapshot_on_all_shards: take snapshot_options
  sstable: add get_format getter
  sstable: snapshot: add use_sstable_identifier option
  db: snapshot_ctl: snapshot_options: add use_sstable_identifier options
  db: snapshot_ctl: move skip_flush to struct snapshot_options
2025-12-08 12:56:12 +03:00
Botond Dénes
866c96f536 Merge 'Add digests for all sstable components in scylla metadata' from Taras Veretilnyk
This pull request adds support for calculation and storing CRC32 digests for all SSTable components.
This change replaces plain file_writer with crc32_digest_file_writer for all SSTable components that should be checksummed. The resulting component digests are stored in the sstable structure
and later persisted to disk as part of the Scylla metadata component during writer::consume_end_of_stream.
All important SSTable components (Index, Partitions, Rows, Summary, Filter, CompressionInfo, and TOC) are covered.
Several test cases where introduced to verify expected behaviour.

Backport is not required, it is a new feature

Fixes #20100

Closes scylladb/scylladb#27287

* github.com:scylladb/scylladb:
  sstable_test: add verification testcases of SSTable components digests persistance
  sstables: store digest of all sstable components in scylla metadata
  sstables: Add TemporaryScylla metadata component type
  sstables: Extract file writer closing logic into separate methods
  sstables: Add components_digests to scylla metadata components
  sstables: Implement CRC32 digest-only writer
2025-12-05 11:36:50 +02:00
Botond Dénes
367633270a Merge 'EAR: handle IPV6 hosts in KMIP and use shared (improved) http parser in AWS/Azure' from Calle Wilund
Fixes #27367
Fixes #27362
Fixes #27366

Makes http URL parser handle IPv6.
Makes KMIP host setup handle IPv6 hosts + use system trust if no truststore set
Moves Azure/KMS code to use shared http URL parser to avoid same regex everywhere.

Closes scylladb/scylladb#27368

* github.com:scylladb/scylladb:
  ear::kms/ear::azure: Use utils::http URL parsing
  ear::kmip_host: Handle ipv6 hosts + use system trust when not specified
  utils::http: Handle ipv6 numeric host part in URL:s
2025-12-05 10:43:07 +02:00
Taras Veretilnyk
0c8730ba05 sstable_test: add verification testcases of SSTable components digests persistance
Adds a generic test helper that writes a random SSTable, reloads it, and
verifies that the persisted CRC32 digest for each component matches the
digest computed from disk. Those covers all checksummed components test cases.
2025-12-04 21:09:01 +01:00
Calle Wilund
4e289e8e6a utils::http: Handle ipv6 numeric host part in URL:s
Fixes #27366

A URL with numeric host part formats special in case of ipv6,
to avoid confusion with port part.
The parser should handle this.

I.e.
http://[2001:db8:4006:812::200e]:8080

v2:
* Include scheme agnostic parse + case insensitive scheme matching
2025-12-04 11:38:41 +00:00
Benny Halevy
07b92a1ee8 test: database_test: add snapshot_use_sstable_identifier_works
Test that taking a snapshot with the use_sstable_identifier
option (and injecting `random_sstable_identifier`) produces
different file names in the snapshot than the original
sstable names and validate te manifest.json file respectively.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 11:57:38 +02:00
Benny Halevy
7504d10d9e test: database_test: snapshot_works: add validate_manifest
Validate the manifest.json format by loading it using rjson::parse
and then validate its contents to ensure it lists exactly the
SSTables present in the snapshot directory.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 11:55:50 +02:00
Botond Dénes
9d2f7c3f52 Merge 'mv: allow setting concurrency in PRUNE MATERIALIZED VIEW' from Wojciech Mitros
The PRUNE MATERALIZED VIEW statement is performed as follows:
1. Perform a range scan of the view table from the view replicas based
on the ranges specified in the statement.
2. While reading the paged scan above, for each view row perform a read
from all base replicas at the corresponding primary key. If a discrepancy
is detected, delete the row in the view table.

When reading multiple rows, this is very slow because for each view row
we need to performe a single row query on multiple replicas.
In this patch we add an option to speed this up by performing many of the
single base row reads concurrently, at the concurrency specified in the
USING CONCURRENCY clause.

Aside from the unit test, I checked manually on a 3-node cluster with 10M rows, using vnodes. There were actually no ghost rows in the test, but we still had to iterate over all view rows and read the corresponding base rows. And actual ghost rows, if there are any, should be a tiny fraction of all rows. I compared concurrencies 1,2,10,100 and the results were:
* Pruning with concurrency 1 took total 1416 seconds
* Pruning with concurrency 2 took total 731 seconds
* Pruning with concurrency 10 took total 234 seconds
* Pruning with concurrency 100 took total 171 seconds
So after a concurrency of 10 or so we're hitting diminishing returns (at least in this setup). At that point we may be no longer bottlenecked by the reads, but by CPU on the shard that's handling the PRUNE

Fixes https://github.com/scylladb/scylladb/issues/27070

Closes scylladb/scylladb#27097

* github.com:scylladb/scylladb:
  mv: allow setting concurrency in PRUNE MATERIALIZED VIEW
  cql: add CONCURRENCY to the USING clause
2025-12-04 11:47:41 +02:00
Benny Halevy
c18133b6cb db: snapshot_ctl: move skip_flush to struct snapshot_options
Prepare for adding another option: use_sstable_identifer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 09:46:35 +02:00
Botond Dénes
b9199e8b24 Merge 'auth: use auth cache on login path' from Marcin Maliszkiewicz
Scylla currently has bad resiliency to connection storms. Nodes are easy to overload or impact their latency by unbound concurrency in making new connections on the client side. This can easily happen in bigger deployments where there are thousands of client instances, e.g. pods.

To improve resiliency we are introducing unified auth specialized cache to the system. This patch series is stage 1, where cache is used only on login path.

Dependency diagram:
```
|Authentication Layer|
            |
            v
+--------------------------------+
|          Auth Cache            |
+--------------------------------+
        ^                      |
        |                      |
        |                      v
|Raft Write Logic | | CQL Read Layer|
```

Cache invalidation is based on raft and the cache contains full content of related tables.

Ldap role manager may benefit partially as can_logic function is common  and will be cached,
but it still needs to query roles from external source.

Performance results:

For single shard connection/disconnection scenario insns/conn decreased by *5%*,
allocs/conn decreased by *23%*, tasks/conn decreased by *20%*. Results for 20 shards are very similar.

Raw data before:
```
≡ ◦ ⤖ rm -rf /tmp/scylla-data && build/release/scylla perf-cql-raw --workdir /tmp/scylla-data --smp 1 --developer-mode 1 --username cassandra --password cassandra --connection-per-request true 2> /dev/null
Running test with config: {workload=read, partitions=10000, concurrency=100, duration=5, ops_per_shard=0, auth, connection_per_request}
Pre-populated 10000 partitions
1128.55 tps (599.2 allocs/op,   0.0 logallocs/op, 145.2 tasks/op, 2586610 insns/op, 1350912 cycles/op,        0 errors)
1157.41 tps (601.3 allocs/op,   0.0 logallocs/op, 145.2 tasks/op, 2589046 insns/op, 1356691 cycles/op,        0 errors)
1167.42 tps (603.3 allocs/op,   0.0 logallocs/op, 145.2 tasks/op, 2603234 insns/op, 1360607 cycles/op,        0 errors)
1159.63 tps (605.9 allocs/op,   0.0 logallocs/op, 145.3 tasks/op, 2609977 insns/op, 1363935 cycles/op,        0 errors)
1165.12 tps (608.8 allocs/op,   0.0 logallocs/op, 145.2 tasks/op, 2625804 insns/op, 1365736 cycles/op,        0 errors)
throughput:
	mean=   1155.63 standard-deviation=15.66
	median= 1159.63 median-absolute-deviation=9.49
	maximum=1167.42 minimum=1128.55
instructions_per_op:
	mean=   2602934.31 standard-deviation=16063.01
	median= 2603234.19 median-absolute-deviation=13887.96
	maximum=2625804.05 minimum=2586609.82
cpu_cycles_per_op:
	mean=   1359576.30 standard-deviation=5945.69
	median= 1360607.05 median-absolute-deviation=4358.94
	maximum=1365736.42 minimum=1350912.10
```

Raw data after:
```
≡ ◦ ⤖ rm -rf /tmp/scylla-data && build/release/scylla perf-cql-raw --workdir /tmp/scylla-data --smp 1 --developer-mode 1 --username cassandra --password cassandra --connection-per-request true --duration 10 2> /dev/null
Running test with config: {workload=read, partitions=10000, concurrency=100, duration=10, ops_per_shard=0, auth, connection_per_request}
Pre-populated 10000 partitions
1132.09 tps (457.5 allocs/op,   0.0 logallocs/op, 115.1 tasks/op, 2432485 insns/op, 1270655 cycles/op,        0 errors)
1157.70 tps (458.4 allocs/op,   0.0 logallocs/op, 115.1 tasks/op, 2447779 insns/op, 1283768 cycles/op,        0 errors)
1162.86 tps (459.0 allocs/op,   0.0 logallocs/op, 115.1 tasks/op, 2463225 insns/op, 1291782 cycles/op,        0 errors)
1153.15 tps (460.2 allocs/op,   0.0 logallocs/op, 115.2 tasks/op, 2469230 insns/op, 1296381 cycles/op,        0 errors)
1142.09 tps (460.6 allocs/op,   0.0 logallocs/op, 115.1 tasks/op, 2478900 insns/op, 1299342 cycles/op,        0 errors)
1124.89 tps (462.5 allocs/op,   0.0 logallocs/op, 115.2 tasks/op, 2470962 insns/op, 1305026 cycles/op,        0 errors)
1156.75 tps (464.4 allocs/op,   0.0 logallocs/op, 115.1 tasks/op, 2493823 insns/op, 1305136 cycles/op,        0 errors)
1152.16 tps (466.3 allocs/op,   0.0 logallocs/op, 115.2 tasks/op, 2497246 insns/op, 1309816 cycles/op,        0 errors)
1154.77 tps (469.8 allocs/op,   0.0 logallocs/op, 115.5 tasks/op, 2571954 insns/op, 1345341 cycles/op,        0 errors)
1152.22 tps (472.4 allocs/op,   0.0 logallocs/op, 115.3 tasks/op, 2551954 insns/op, 1334202 cycles/op,        0 errors)
throughput:
	mean=   1148.87 standard-deviation=12.08
	median= 1153.15 median-absolute-deviation=7.88
	maximum=1162.86 minimum=1124.89
instructions_per_op:
	mean=   2487755.88 standard-deviation=43838.23
	median= 2478900.02 median-absolute-deviation=24531.06
	maximum=2571954.26 minimum=2432485.38
cpu_cycles_per_op:
	mean=   1304144.76 standard-deviation=22129.55
	median= 1305025.71 median-absolute-deviation=12363.25
	maximum=1345341.16 minimum=1270655.17
```

Fixes https://github.com/scylladb/scylladb/issues/18891
Backport: no, it's a new feature

Closes scylladb/scylladb#26841

* github.com:scylladb/scylladb:
  auth: use auth cache on login path
  auth: corutinize standard_role_manager::can_login
  main: auth: add auth cache dependency to auth service
  raft: update auth cache when data changes
  auth: storage_service: reload auth cache on v1 to v2 auth migration
  raft: reload auth cache on snapshot application
  service: add auth cache getter to storage service
  main: start auth cache service
  auth: add unified cache implementation
  auth: move table names to common.hh
2025-12-03 16:45:01 +02:00
Botond Dénes
8edd5b80ab test/boost/batchlog_manager_test: add test for batchlog cleanup
Add more tests covering different aspects of batchlog replay, cleanup,
replay timeout and finally v1 -> v2 migration.
2025-12-02 14:21:26 +02:00
Botond Dénes
d1b796bc43 test/lib/cql_test_env: add batch type to execute_batch()
Allow executing logged batches too. Also add a trace log to the method.
2025-12-02 14:21:26 +02:00
Botond Dénes
aa1d3f1170 test/lib/cql_assertions: columns_assertions: add T* with_typed_column() overload
To enable assertions on columns which are sometimes null.
One existing user of with_typed_column() needs adjustment, because the
previous version of with_typed_column() covered up silently for null
value, but after this patch this caused a failure.
2025-12-02 14:21:26 +02:00
Botond Dénes
846b656610 db,service: switch to system.batchlog_v2
New batchlogs are written to the batchlog_v2 table and replay also uses
the v2 table.
The content of system.batchlog is attempted to be migrated to
system.batchlog_v2 after each start of the batchlog_manager service.
The migration is retried on each replay if it fails. This is reduntant
but simple.

Batchlog cleanup now doesn't involve flushing memtables, the only
remaining user of replica/database.hh is gone, so the include is
dropped.
2025-12-02 14:21:26 +02:00
Botond Dénes
f54602daf0 service,db: extract get_batchlog_mutation_for() from storage-proxy
Don't build batchlog mutations in storage-proxy code. Move this code
into db/batchlog_manager.cc, exposed via db/batchlog.hh.
This serves multiple goals:
1) Concentrates low-level batchlog related logic in
   db/batchlog_manager.cc
2) Reduce current and future code duplication.
2) Make future changes to this logic easier.
2025-12-02 14:21:25 +02:00
Wojciech Mitros
6221c58325 mv: replace the simple/complex rack-aware pairing with exact rack matching
When the initial version of rack-aware pairing was introduced, materialized
views with tablets were still experimental. Since then, we decided that
we'll only allow materialized views in clusters where the base table and
the view are replicated on the same racks, with one replica of each tablet
on each rack.
This allows us to remove almost all logic from our base-view pairing. The
only check for the paired view replica is now whether it's in the same
rack as the base replica sending the update.
In this patch we replace the simple and complex rack-aware pairing with
the simple check above.
Because of this, we have to remove a test case from network_topology_strategy_test
which was testing complex pairing. The tested topology is not supported
for views with tablets (or is unlikely to be supported, as it's a random test),
so there's no use keeping the test.
The test case for simple rack aware pairing was kept, but now we only test
the case where each rack has one replica, not multiple.
Additionally, we split finding of an unpaired replica to a separate function
and partially rewrite it without reusing the helper stuctures that were
present when calculating the simple and complex rack-aware pairing.
We only look for an unpaired replica if we couldn't find a paired replica
ourselves or if the number of view replicas didn't match the base replicas.
If an unpaired replica appears while these conditions pass, we won't send
an extra update, but that would be a new bug altogether, because we only
expect the unpaired replica to appear during RF changes, so when these
conditions aren't fulfilled.

Fixes https://github.com/scylladb/scylladb/issues/26313
2025-12-02 10:52:36 +01:00
Wojciech Mitros
c313b215e4 mv: unify self-pairing and rack-aware pairing into one bool
We always use "legacy self pairing" when not using tablets, and
the "rack aware pairing" has been enabled in every version where
views with tablets isn't experimental. So in practice, instead
of checking these variables we can just look at whether the
table uses tablets.
2025-12-02 03:32:32 +01:00
Avi Kivity
d4be9a058c Update seastar submodule
seastar::compat::source_location (which should not have been used
outside Seastar) is replaced with std::source_location to avoid
deprecation warnings. The relevant header, which was removed, is no
longer included.

* seastar 8c3fba7a...b5c76d6b (3):
  > testing: There can be only one memory_data_sink
  > util: Use std::source_location directly
  > Merge 'net: support proxy protocol v2' from Avi Kivity
    apps: httpd: add --load-balancing-algorithm
    apps: httpd: add /shard endpoint
    test: socket_test: add proxy protocol v2 test suite
    test: socket_test: test load balancer with proxy protocol
    net: posix_connected_socket: specialize for proxied connections
    net: posix_server_socket_impl: implement proxy protocol in server sockets
    net: posix_server_socket_impl: adjust indentation
    net: posix_server_socket_impl: avoid immediately-invoked lambda
    net: conntrack: complete handle nested class special member functions
    net: posix_server_socket_impl: coroutinize accept()

Closes scylladb/scylladb#27316
2025-11-30 12:38:47 +02:00
Calle Wilund
59c87025d1 commitlog::read_log_file: Check for eof position on all data reads
Fixes #24346

When reading, we check for each entry and each chunk, if advancing there
will hit EOF of the segment. However, IFF the last chunk being read has
the last entry _exactly_ matching the chunk size, and the chunk ending
at _exactly_ segment size (preset size, typically 32Mb), we did not check
the position, and instead complained about not being able to read.

This has literally _never_ happened in actual commitlog (that was replayed
at least), but has apparently happened more and more in hints replay.

Fix is simple, just check the file position against size when advancing
said position, i.e. when reading (skipping already does).

v2:

* Added unit test

Closes scylladb/scylladb#27236
2025-11-28 15:26:46 +03:00
Michael Litvak
97b7c03709 tablet: scheduler: Do not emit conflicting migration in merge colocation
The tablet scheduler should not emit conflicting migrations for the same
tablet. This was addressed initially in scylladb/scylladb#26038 but the
check is missing in the merge colocation plan, so add it there as well.

Without this check, the merge colocation plan could generate a
conflicting migration for a tablet that is already scheduled for
migration, as the test demonstrates.

This can cause correctness problems, because if the load balancer
generates two migrations for a single tablet, both will be written as
mutations, and the resulting mutation could contain mixed cells from
both migrations.

Fixes scylladb/scylladb#27304

Closes scylladb/scylladb#27312
2025-11-28 11:17:12 +01:00
Pavel Emelyanov
54edb44b20 code: Stop using seastar::compat::source_location
And switch to std::source_location.
Upcoming seastar update will deprecate its compatibility layer.

The patch is

  for f in $(git grep -l 'seastar::compat::source_location'); do
    sed -e 's/seastar::compat::source_location/std::source_location/g' -i $f;
  done

and removal of few header includes.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#27309
2025-11-27 19:10:11 +02:00
Andrzej Jackowski
e366030a92 treewide: seastar module update
The reason for this seastar update is to have the fixed handling
of the `integer` type in `seastar-json2code` because it's needed
for further development of ScyllaDB REST API.

The following changes were introduced to ScyllaDB code to ensure it
compiles with the updated seastar:
 - Remove `seastar/util/modules.hh` includes as the file was removed
   from seastar
 - Modified `metrics::impl::labels_type` construction in
   `test/boost/group0_test.cc` because now it requires `escaped_string`

* seastar 340e14a7...8c3fba7a (32):
  > Merge 'Remove net::packet usage from dns.cc' from Pavel Emelyanov
    dns: Optimize packet sending for newer c-ares versions
    dns: Replace net::packet with vector<temporary_buffer>
    dns: Remove unused local variable
    dns: Remove pointless for () loop wrapping
    dns: Introduce do_sendv_tcp() method
    dns: Introduce do_send_udp() method
  > test: Add http rules test of matching order
  > Merge 'Generalize packet_data_source into memory_data_source' from Pavel Emelyanov
    memcached: Patch test to use memory_data_source
    memcached: Use memory_data_source in server
    rpc: Use memory_data_sink without constructing net::packet
    util: Generalize packet_data_source into memory_data_source
  > tests: coroutines: restore "explicit this" tests
  > reactor: remove blocking of SIGILL
  > Merge 'Update compilers in GH actions scripts' from Pavel Emelyanov
    github: Use gcc-14
    github: Use clang-20
  > Merge 'Reinforce DNS reverse resolution test ' from Pavel Emelyanov
    test: Make test_resolve() try several addresses
    test: Coroutinize test_resolve() helper
  > modules: make module support standards-compliant
  > Merge 'Fix incorrect union access in dns resolver' from Pavel Emelyanov
    dns: Squash two if blocks together
    dns: Do not check tcp entry for udp type
  > coroutine: Fix compilation of execute_involving_handle_destruction_in_await_suspend
  > promise: Document that promise is resolved at most once
  > coroutine: exception: workaround broken destroy coroutine handle in await_suspend
  > socket: Return unspecified socket_address for unconnected socket
  > smp: Fix exception safety of invoke_on_... internal copying
  > Merge 'Improve loads evaluation by reactor' from Pavel Emelyanov
    reactor: Keep loads timer on reactor
    reactor: Update loads evaluation loop
  > Merge 'scripts: add 'integer' type to seastar-json2code' from Andrzej Jackowski
    test: extend tests/unit/api.json to use 'integer' type
    scripts: add 'integer' type to seastar-json2code
  > Merge 'Sanitize tls::session::do_put(_one)? overloads' from Pavel Emelyanov
    tls: Rename do_put_one(temporary_buffer) into do_put()
    tls: Fix indentation after previous patch
    tls: Move semaphore grab into iterating do_put()
  > net: tcp: change unsent queue from packets to temporary_buffer:s
  > timer: Enable highres timer based on next timeout value
  > rpc: Add a new constructor in closed_error to accept string argument
  > memcache: Implement own data sink for responses
  > Merge 'file: recursive_remove_directory: general cleanup' from Avi Kivity
    file: do_recursive_remove_directory(): move object when popping from queue
    file: do_recursive_remove_directory(): adjust indentation
    file: do_recursive_remove_directory(): coroutinize
    file: do_recursive_remove_directory(): simplify conditional
    file: do_recursive_remove_directory(): remove wrong const
    file: do_recursive_remove_directory(): clean up work_entry
  > tests: Move thread_context_switch_test into perf/
  > test: Add unit test for append_challenged_posix_file
  > Merge 'Prometheus metrics handler optimization' from Travis Downs
    prometheus: optimize metrics aggregation
    prometheus: move and test aggregate_by helper
    prometheus: various optimizations
    metrics: introduce escaped_string for label values
    metric:value: implement + in terms of +=
    tests: add prometheus text format acceptance tests
    extract memory_data_sink.hh
    metrics_perf: enhance metrics bench
  > demos: Simplify udp_zero_copy_demo's way of preparing the packet
  > metrics: Remove deprecated make_...-ers
  > Merge 'Make slab_test be BOOST kind' from Pavel Emelyanov
    test: Use BOOST_REQUIRE checkers
    test: Replace some SEASTAR_ASSERT-s with static_assert-s
    test: Convert slab test into boost kind
  > Merge 'Coroutinize lister_test' from Pavel Emelyanov
    test: Fix indentation after previuous patch
    test: Coroutinize lister_test lister::report() method
    test: Coroutinize lister_test main code
  > file: recursive_remove_directory(): use a list instead of a deque
  > Merge 'Stop using packets in tls data_sink and session' from Pavel Emelyanov
    tls: Stop using net::packet in session::put()
    tls: Fix indentation after previous patch
    tls: Split session::do_put()
    tls: Mark some session methods private

Closes scylladb/scylladb#27240
2025-11-27 12:34:22 +02:00
Wojciech Mitros
323e5cd171 mv: allow setting concurrency in PRUNE MATERIALIZED VIEW
The PRUNE MATERALIZED VIEW statement is performed as follows:
1. Perform a range scan of the view table from the view replicas based
on the ranges specified in the statement.
2. While reading the paged scan above, for each view row perform a read
from all base replicas at the corresponding primary key. If a discrepancy
is detected, delete the row in the view table.

When reading multiple rows, this is very slow because for each view row
we need to performe a single row query on multiple replicas.
In this patch we add an option to speed this up by performing many of the
single base row reads concurrently, at the concurrency specified in the
USING CONCURRENCY clause.

Fixes https://github.com/scylladb/scylladb/issues/27070
2025-11-27 00:02:28 +01:00
Patryk Jędrzejczak
cc273e867d Merge 'fix notification about expiring erm held for to long' from Gleb Natapov
Commit 6e4803a750 broke notification about expired erms held for too long since it resets the tracker without calling its destructor (where notification is triggered). Fix the assign operator to call the destructor like it should.

Fixes https://github.com/scylladb/scylladb/issues/27141

Closes scylladb/scylladb#27140

* https://github.com/scylladb/scylladb:
  test: test that expired erm that held for too long triggers notification
  token_metadata: fix notification about expiring erm held for to long
2025-11-26 12:59:00 +01:00
Marcin Maliszkiewicz
b29c42adce main: auth: add auth cache dependency to auth service
In the following commit we'll switch some authorizer
and role manager code to use the cache so we're preparing
the dependency.
2025-11-26 12:01:31 +01:00
Gleb Natapov
5dcdaa6f66 test: test that expired erm that held for too long triggers notification 2025-11-25 17:33:54 +02:00
Karol Nowacki
ca62effdd2 vector_search: Restrict vector index tests to tablets only
Vector indexes are going to be supported only for tablets (see VECTOR-322).
As a result, tests using vector indexes will be failing when run with vnodes.

This change ensures tests using vector indexes run exclusively with tablets.

Fixes: VECTOR-49

Closes scylladb/scylladb#26843
2025-11-25 09:26:16 +02:00
Botond Dénes
296d7b8595 Merge 'Enable digest+checksum verification for file based streaming' from Taras Veretilnyk
This patch enables integrity check in  'create_stream_sources()' by introducing a new 'sstable_data_stream_source_impl' class for handling the Data component of SSTables. The new implementation uses 'sstable::data_stream()' with 'integrity_check::yes' instead of the raw input_stream.

These additional checks require reading the digest and CRC components from disk, which may introduce some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data.
For compressed SSTables - where checksums are already embedded  - the cost comes from reading, calculating and verifying the diges.

New test cases were added to verify that the integrity checks work correctly, detecting both data and digest mismatches.

Backport is not required, since it is a new feature

Fixes #21776

Closes scylladb/scylladb#26702

* github.com:scylladb/scylladb:
  file_stream_test: add sstable file streaming integrity verification test cases
  streaming: prioritize sender-side errors in tablet_stream_files
  sstables: enable integrity check for data file streaming
  sstables: Add compressed raw streaming support
  sstables: Allow to read digest and checksum from user provided file instance
  sstables: add overload of data_stream() to accept custom file_input_stream_options
2025-11-24 06:37:27 +02:00