Commit Graph

1721 Commits

Author SHA1 Message Date
Botond Dénes
8579e20bd1 Merge 'Enable digest+checksum verification for streaming/repair' from Taras Veretilnyk
This PR enables integrity check of both checksum and digest for repair/streaming.
In the past, streaming readers only verified the checksum of compressed SSTables.

This change extends the checks to include the digest and the checksum (CRC) for both compressed and uncompressed SSTables. These additional checks require reading the digest and CRC components from disk, which may cause some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data, while for compressed SSTables - where checksums are already verified inline - the only extra cost is reading and verifying the digest.If the reader range doesn't cover the full SSTable, the digest is not loaded and check is skipped.

To support testing of these changes, a new option was added to the random_mutation_generator that allows disabling compression.
Several new test cases were added to verify that the repair_reader correctly detects corruption. These tests corrupt digest or data component of an SSTable and confirm that the system throws the expected `malformed_sstable_exception`.

Backport is not required, it is an improvement

Refs #21776

Closes scylladb/scylladb#26444

* github.com:scylladb/scylladb:
  boost/repair_test: add repair reader integrity verification test cases
  test/lib: allow to disable compression in random_mutation_generator
  sstables: Skip checksum and digest reads for unlinked SSTables
  table: enable integrity checks for streaming reader
  table: Add integrity option to table::make_sstable_reader()
  sstables: Add integrity option to create_single_key_sstable_reader
2025-11-14 18:00:33 +02:00
Michael Litvak
de321218bc storage_proxy: apply counter mutation on all write shards
When applying a counter mutation, use apply_on_shards to apply the
mutation on all write shards, similarly to the way other mutations are
applied in the storage proxy. Previously the mutation was applied only
on the current shard which is the read shard.

This is needed to respect the write_both stages of intranode migration
where we need to apply the mutation on both the old and the new shards.
2025-11-03 16:03:29 +01:00
Michael Litvak
c7e7a9e120 storage_proxy: move counter update coordination to storage proxy
Refactor the counter update to split the functions and have them called
by the storage proxy to prepare for a later change.

Previously in mutate_counter the storage proxy calls the replica
function apply_counter_update that does a few things:
1. checks that the operation can be done: check timeout, disk utilization
2. acquire counter locks
3. do read-modify-write and transform the counter mutation
4. apply the mutation in the replica

In this commit we change it so that these functions are split and called
from the storage proxy, so that we have better control from the storage
proxy when we change it later to work across multiple shards. For
example, we will want to acquire locks on multiple shards, transform it
on one shard, and then apply the mutation on multiple shards.

After the change it works as follows in storage proxy:
1. acquire counter locks
2. call replica prepare to check the operation and transform the mutation
3. call replica apply to apply the transformed mutation
2025-11-03 15:59:46 +01:00
Michael Litvak
7cc6b0d960 replica/db: add counter update guard
Add a RAII guard for counter update that holds the counter locks and the
table operation, and extract the creation of the guard to a separate
function.

This prepares it for a later change where we will want to obtain the
guard externally from the storage proxy.
2025-11-03 08:43:11 +01:00
Michael Litvak
88fd9a34c4 replica/db: split counter update helper functions
Split do_apply_counter_update to a few smaller and simpler functions to
help prepare for a later change.
2025-11-03 08:43:11 +01:00
Lakshmi Narayanan Sreethar
3eb7193458 backlog_controller: compute backlog even when static shares are set
The compaction manager backlog is exposed via metrics, but if static
shares are set, the backlog is never calculated. As a result, there is
no way to determine the backlog and if the static shares need
adjustment. Fix that by calculating backlog even when static shares are
set.

Fixes #26287

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#26778
2025-10-31 18:18:36 +02:00
Tomasz Grabiec
1c0d847281 Merge 'load_balancer: load_stats reconcile after tablet migration and table resize' from Ferenc Szili
This change adds the ability to move tablets sizes in load_stats after a tablet migration or table resize (split/merge). This is needed because the size based load balancer needs to have tablet size data which is as accurate as possible, in order to work on fresh tablet size distribution and issue correct tablet migrations.

This is the second part of the size based load balancing changes:

- First part for tablet size collection via load_stats: #26035
- Second part reconcile load_stats: #26152
- The third part for load_sketch changes: #26153
- The fourth part which performs tablet load balancing based on tablet size: #26254

This is a new feature and backport is not needed.

Closes scylladb/scylladb#26152

* github.com:scylladb/scylladb:
  load_balancer: load_stats reconcile after tablet migration and table resize
  load_stats: change data structure which contains tablet sizes
2025-10-31 09:58:25 +01:00
Taras Veretilnyk
e62ebdb967 table: enable integrity checks for streaming reader
Previously, streaming readers only verified the checksum of compressed SSTables.
This patch extends checks to also include the digest and the uncompressed checksum (CRC).

These additional checks require reading the digest and CRC components from disk,
which may cause some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data,
while for compressed SSTables - where checksums are already verified inline - the only extra cost is reading and verifying the digest.
If the reader range doesn't cover the full SSTable, the digest check is skipped.
2025-10-28 19:27:35 +01:00
Taras Veretilnyk
06e1b47ec6 table: Add integrity option to table::make_sstable_reader() 2025-10-28 19:27:35 +01:00
Taras Veretilnyk
deb8e32e86 sstables: Add integrity option to create_single_key_sstable_reader
Added an sstables::integrity_check parameter to create_single_key_sstable_reader methods across its implementations.
This allows callers to enable SSTable integrity checks during single-key reads.
2025-10-28 19:27:35 +01:00
Avi Kivity
d81796cae3 Merge 'Limit concurrent view updates from all sources' from Wojciech Mitros
Before this patch, when a base table has many materialized views,
each write to this table can start up to 128 view updates in parallel.
With high client write concurrency, the actual concurrency of writes
executed on the node may grow unexpectedly, which can lead to higher
latency and higher memory usage compared to a sequential approach.
In this patch we add a per-shard, per-service-level semaphore which
limits the number of concurrent view updates processed on the shard
in this service level to a constant value. We take one unit from the
semaphore for each local view update write, and releasing it when it
finishes. The remote view updates do not take units from the semaphore
because they don't consume nearly as much processing power and they
are limited by another semaphore based on their memory usage.

Fixes https://github.com/scylladb/scylladb/issues/25341

Closes scylladb/scylladb#25456

* github.com:scylladb/scylladb:
  mv: limit concurrent view updates from all sources
  database: rename _view_update_concurrency_sem to _view_update_memory_sem
2025-10-28 11:13:24 +02:00
Wojciech Mitros
f07a86d16e mv: limit concurrent view updates from all sources
Before this patch, when a base table has many materialized views,
each write to this table can start up to 128 view updates in parallel.
With high client write concurrency, the actual concurrency of writes
executed on the node may grow unexpectedly, which can lead to higher
latency and higher memory usage compared to a sequential approach.
In this patch we add a per-shard, per-service-level semaphore which
limits the number of concurrent view updates processed on the shard
in this service level to a constant value. We take one unit from the
semaphore for each local view update write, and releasing it when it
finishes. The remote view updates do not take units from the semaphore
because they don't consume nearly as much processing power and they
are limited by another semaphore based on their memory usage.

The effect of this patch can also be observed when writing to a base
table with a large number of materialized views, like in the
materialized_views_test.py::TestMaterializedViews::test_many_mv_concurrent
dtest. In that test, if we perform a full scan in parallel to a write
workload with a concurrency of 100 to a table with 100 views, the scan
would sometimes timeout because it would effectively get 1/10000 of cpu.
With this patch, the cpu concurrency of view updates was limited to 128
(we ran both writes and scan in the same service level), and the scan
no longer timed out.

Fixes https://github.com/scylladb/scylladb/issues/25341
2025-10-27 18:55:41 +01:00
Patryk Jędrzejczak
e1c3f666c9 Merge 'vnode cleanup: add missing barriers and fix race conditions' from Petr Gusev
Problems addressed by this PR

* Missing barrier before cleanup: If a node was bootstrapped before cleanup, some request coordinators could still be in `write_both_read_new` and send stale requests to replicas being cleaned up.
* Sessions not drained before cleanup: We lacked protection against stale streaming or repair operations.
* `sstable_vnodes_cleanup_fiber()` calling `flush_all_tables()` under group0 lock: This caused SCT test failures (see [this comment](https://github.com/scylladb/scylladb/issues/25333#issuecomment-3298859046) for details).
* Issues with `storage_proxy::start_write()` used by `sstable_vnodes_cleanup_fiber`:
  * The result of `start_write()` was not held during `abstract_write_response_handler::apply_locally`, so coordinator-local writes were not properly awaited.
  * Synchronization was racy — `start_write()` was not atomic with the fence check, allowing stale writes to sneak in if `fence_version` changed in between.
  * It waited for all writes, including local tables and tablet-based tables, which is redundant because `sstable_vnodes_cleanup_fiber` does not apply to them.
  * It also waited for writes with versions greater than the current `fence_version`, which is unnecessary.

Fixes scylladb/scylladb#26150

backport: this PR fixes several issues with the vnodes cleanup procedure, but it doesn't seem they are critical enough to deserve backporting

Closes scylladb/scylladb#26315

* https://github.com/scylladb/scylladb:
  test_automatic_cleanup: add test_cleanup_waits_for_stale_writes
  test_fencing: fix due to new version increment
  test_automatic_cleanup: clean it up
  storage_proxy: wait for closing sessions in sstable cleanup fiber
  storage_proxy: rename await_pending_writes -> await_stale_pending_writes
  storage_proxy: use run_fenceable_write
  storage_proxy: abstract_write_response_handler: apply_locally: extract post fence check
  storage_proxy: introduce run_fenceable_write
  storage_proxy: move update_fence_version from shared_token_metadata
  storage_proxy: fix start_write() operation scope in apply_locally
  storage_proxy: move post fence check into handle_write
  storage_proxy: move fencing into mutate_counter_on_leader_and_replicate
  storage_proxy::handle_read: add fence check before get_schema
  storage_service: rebrand cleanup_fiber to vnodes_cleanup_fiber
  sstable_cleanup_fiber: use coroutine::parallel_for_each
  storage_service: sstable_cleanup_fiber: move flush_all_tables out of the group0 lock
  topology_coordinator: barrier before cleanup
  topology_coordinator: small start_cleanup refactoring
  global_token_metadata_barrier: add fenced flag
2025-10-27 12:35:13 +01:00
Avi Kivity
b843d8bc8b Merge 'scylla-sstable: add cql support to write operation' from Botond Dénes
In theory, scylla-sstable write is an awesome and flexible tool to generate sstables with arbitrary content. This is convenient for tests and could come clutch in a disaster scenario, where certain system table's content need to be manually re-created, system tables that are not writable directly via CQL.
In practice, in its current form this operation is so convoluted to use that even its own author shuns it. This is because the JSON specification of the sstable content is the same as that of the scylla-sstable dump-data: containing every single piece of information on the mutation content. Where this is an advantage for dump-data, allowing users to inspect the data in its entirety -- it is a huge disadvantage for write, because of all these details have to be filled in, down to the last timestamp, to generate an sstable. On top of that, the tool doesn't even support any of the more advanced data types, like collections, UDF and counters.
This PR proposes a new way of generating sstables: based on the success of scylla-sstable query, it introduces CQL support for scylla-sstable write. The content of the sstable can now be specified via standard INSERT, UPDATE and DELETE statements, which are applied to a memtable, then flushed into the sstable.
To avoid boundless memory consumption, the memtable is flushed every time it reaches 1MiB in size, consequently the command can generate multiple output sstables.

The new CQL input-format is made default, this is safe as nobody is using this command anyway. Hopefully this PR will change that.

Fixes: https://github.com/scylladb/scylladb/issues/26506

New feature, no backport.

Closes scylladb/scylladb#26515

* github.com:scylladb/scylladb:
  test/cqlpy/test_tools.py: add test for scylla-sstable write --input-format=cql
  replica/mutation_dump: add support for virtual tables
  tools/scylla-sstable: print_query_results_json(): handle empty value buffer
  tools/scylla-sstable: add cql support to write operation
  tools/scylla-sstable: write_operation(): fix indentation
  tools/scylla-sstable: write_operation(): prepare for a new input-format
  tools/scylla-sstable: generalize query_operation_validate_query()
  tools/scylla-sstable: move query_operation_validate_query()
  tools/scylla-sstable: extract schema transformation from query operation
  replica/table: add virtual write hook to the other apply() overload too
2025-10-24 23:32:40 +03:00
Avi Kivity
997b52440e Merge 'replica/mutation_dump: include empty/dead partitions in the scan results' from Botond Dénes
`select * from mutation_fragment()` queries don't return partitions which are completely empty or only contain tombstones which are all garbage collectible. This is because the underlying `mutation_dump` mechanism has a separate query to discover partitions for scans. This query is a regular mutation scan, which is subject to query compaction and garbage collection. Disable the query compaction for mutation queries executed on behalf of mutation fragment queries, so *all* data is visible in the result, even that which is fully garbage collectible.

Fixes scylladb/scylladb#23707.

Scans for mutation-fragment are very rare, so a backport is not necessary. We can backport on-demand.

Closes scylladb/scylladb#26227

* github.com:scylladb/scylladb:
  replica/mutation_dump: multi_range_partition_generator: disable garbage-collection
  replica: add tombstone_gc_enabled parameter to mutation query methods
  mutation/mutation_compactor: remove _can_gc member
  tombstone_gc: add tombstone_gc_state factory methods for gc_all and no_gc
2025-10-24 23:26:16 +03:00
Ferenc Szili
b4ca12b39a load_stats: change data structure which contains tablet sizes
This patch changes the tablet size map in load_stats. Previously, this
data structure was:

std::unordered_map<range_based_tablet_id, uint64_t> tablet_sizes;

and is changed into:

std::unordered_map<table_id, std::unordered_map<dht::token_range, uint64_t>> tablet_sizes;

This allows for improved performance of tablet tablet size reconciliation.
2025-10-24 14:37:00 +02:00
Wojciech Mitros
c0d0f8f85b database: rename _view_update_concurrency_sem to _view_update_memory_sem
In the following commit, we'll introduce a new semaphore for view updates
that limits their concurrency by view update count. To avoid confusion,
we rename the existing semaphore that tracks the memory used by concurrent
view updates and related objects accordingly.
2025-10-23 10:00:15 +02:00
Petr Gusev
22271b9fe7 test_automatic_cleanup: add test_cleanup_waits_for_stale_writes 2025-10-22 16:31:43 +02:00
Łukasz Paszkowski
7ec369b900 database: Log message after critical_disk_utilization mode is set
This is a follow-up of the previous fix: https://github.com/scylladb/scylladb/pull/26030

The test test_user_writes_rejection starts a 3-node cluster and
creates a large file on one of the nodes, to trigger the out-of-space
prevention mechanism, which should reject writes on that node.

It waits for the log message 'Setting critical disk utilization mode: true'
and then executes a write expecting the node to reject it.

Currently, the message is logged before the `_critical_disk_utilization`
variable is actually updated. This causes the test to fail sporadically
if it runs quickly enough.

The fix splits the logging into two steps:
1. "Asked to set critical disk utilization mode" - logged before any action
2) "Set critical disk utilization mode" - logged after `_critical_disk_utilization` has been updated

The tests are updated to wait for the second message.

Fixes https://github.com/scylladb/scylladb/issues/26004

Closes scylladb/scylladb#26392
2025-10-20 13:24:10 +03:00
Pavel Emelyanov
44ed3bbb7c Merge 'RFC: Initial GCP storage backend for scylla (sstables + backup)' from Calle Wilund
Integrates GCP object storage as a working storage backend for scylla sstables as well as backup storage.

Adds an abstraction layer (atm very heavily designed around the s3 client interface and usage) to allow the "storage" etc layers of sstable management to pick transparently between "s3" and "gs" providers.

This modifies the scylla config such that endpoints can optionally (through a "type" param) ref a GS backend.
Similarly with storage_options.

Also adds some IO wrapping primitives to make it more feasible to place some logic at a mid level of the implementation stack (such as making networked storage files, ranged reading etc).

Test s3 fixture is replaced (where appropriate) with an `object_storage` fixture that multiplexes the test across both backends.
Unit tests are duplicated and for the GS versions use a boost test fixture for GCS, default local fake.

Fixes #25359
Fixes #26453

Closes scylladb/scylladb#26186

* github.com:scylladb/scylladb:
  docs::dev::object_storage: Add some initial info on GS storage
  docs/dev: Add mention of (nested) docker usage in testing.md
  sstables::object_storage_client: Forward memory limit semaphore to GS instance
  utils::gcp::object_storage: Add optional memory limits to up/download
  sstables::object_storage_client: Add multi-upload support for GS
  utils::gcp::storage: Add merge objects operation
  test_backup/test_basic: Make tests multiplex both s3 and gs backends
  test::cluster::conftest: Add support for multiple object storage backends
  boost::gcs_storage_test: reindent
  boost::gcs_storage_test: Convert to use fixture
  tests::boost: Add GS object storage cases to mirror S3 ones
  tests::lib::gcs_fixture: Add a reusable test fixture for real/fake GS/GCS
  tests::lib::test_utils: Add overloads/helpers for reading and (temp) writing env
  sstables::object_storage_client: Add google storage implementation
  test_services: Allow testing with GS object storage parameters
  utils::gcp::gcp_credentials: Add option to create uninitialized credentials
  utils::gcp::object_storage: Make create_download_source return seekable_data_source
  utils::gcp::object_storage: Add defensive copies of string_view params
  utils::gcp::object_storage: Add missing retry backoff increate
  utils::gcp::object_storage: Add timestamp to object listing
  utils::gcp::object_storage: Add paging support to list_objects
  object_storage_client: Add object_name wrapper type
  utils::gcp::object_storage: Add optional abort_source
  utils::rest::client: Add abort_source support
  sstables: Use object_storage_client for remote storage
  sstables::object_storage_client: Add abstraction layer for OS cliens (s3 initial)
  s3::upload_progress: Promote to general util type
  storage_options: Abstract s3 to "object_storage" and add gs as option
  sstables::file_io_extension: Change "creator" callback to just data_source
  utils::io-wrappers: Add ranged data_source
  utils::io-wrappers: Add file wrapper type for seekable_source
  utils::seekable_source: Add a seekable IO source type
  object_storage_endpoint_param: Add gs storage as option
  config: break out object_storage_endpoint_param preparing for multi storage
2025-10-20 13:14:53 +03:00
Piotr Dulikowski
70b0cfb13e Merge 'test: cluster: Replica exceptions tests' from Dario Mirovic
This patch series introduces several tests that check number of exceptions that happens during various replica operations. The goal is to have a set of tests that can catch situations where number of exceptions per operation increases. It makes exception throw regressions easier to catch.

The tests cover apply counter update and apply functionalities in the database layer.

There are more paths that can be checked, like various semaphore wait timeouts located deeper in the code. This set of tests does not cover all code paths.

Fixes #18164

This is an improvement. No backport needed.

Closes scylladb/scylladb#25992

* github.com:scylladb/scylladb:
  test: cluster: test replica write timeout
  database: parameterize apply_counter_update_delay_5s injector value
  test: cluster: test replica exceptions - test rate limit exceptions
2025-10-20 10:03:31 +03:00
Dario Mirovic
1d93f342f9 test: cluster: test replica write timeout
This patch introduces test `test_replica_database_apply_timeout`.
It tests timeout on database write. The test uses error injection
that returns timeout error if the injection `database_apply_force_timeout`
is enabled.

Refs #18164
2025-10-17 11:52:11 +02:00
Dario Mirovic
ff88fe2d76 database: parameterize apply_counter_update_delay_5s injector value
Parameterize `apply_counter_update_delay_5s` injector value. Instead of
sleeping 5s when the injection is active, read parameter value that
specifies sleep duration. To reflect these changes, it is renamed to
`apply_counter_update_delay_ms` and the sleep duration is specified in
milliseconds.

Refs #18164
2025-10-17 11:52:10 +02:00
Tomasz Grabiec
c4a87453a2 Merge 'Add experimental feature flag for strongly consistent tables and extend kesypace creation syntax to allow specifying consistency mode.' from Gleb Natapov
The series adds an experimental flag for strongly consistent tables  and extends "CREATE KEYSPACE" ddl with `consistency` option that allows specifying the consistency mode for the keyspace.

Closes scylladb/scylladb#26116

* github.com:scylladb/scylladb:
  schema: Allow configuring consistency setting for a keyspace
  db: experimental consistent-tablets option
2025-10-16 21:48:06 +02:00
Tomasz Grabiec
e6c427953e Merge 'schema_applier: unify handling of token_metadata during schema change' from Marcin Maliszkiewicz
This patchset improves the atomicity and clarity of schema application in
the presence of token metadata updates during schema changes. The primary
focus is to ensure that changes to tablet metadata are applied atomically
as part of the schema commit phase, rather than being replicated to all
cores afterward, which previously violated atomicity guarantees.

Key changes:

- Introduced pending_token_metadata to unify handling of new and existing metadata.
- Split token metadata replication into prepare and commit steps.
- Abstracted schema dependencies in storage_service to support pending schema visibility.
- Applied tablet metadata updates atomically within schema commit phase.

Backport: no, it's a new feature
Fixes: https://github.com/scylladb/scylladb/issues/24414

Closes scylladb/scylladb#25302

* github.com:scylladb/scylladb:
  db: schema_applier: update tablet metadata atomically
  db: replica: move tables_metadata locking to commit
  storage_service: abstract schema dependecies during token metadata update
  storage_service: split replicate_to_all_cores to steps
  db: schema_applier: unify token_metadata loading
  replica: schema_applier: obtain copy of token_metadata at the beginning of schema merge
  service: fix dependencies during migration_manager startup
  db: schema_applier: move pending_token_metadata to locator
  db: always use _tablet_hint as condition for tablet metadata change
  db: refactor new_token_metadata into pending_token_metadata
  db: rename new_token_metadata to pending_token_metadata
  db: schema_applier: move types storage init to merge_types func
  db: schema_applier: make merge functions non-static members
  db: remove unused proxy from create_keyspace_metadata
2025-10-16 21:43:49 +02:00
Gleb Natapov
c255740989 schema: Allow configuring consistency setting for a keyspace
We want to add strongly consistent tables as an option. We will have
two kind of strongly consistent tables: globally consistent and locally
consistent. The former means that requests from all DCs will be globally
linearisable while the later - only requests to the same DCs will be
linearisable.  To allow configuring all the possibilities the patch
adds new parameter to a keyspace definition "consistency" that can be
configured to be `eventual`, `global` or `local`. Non eventual setting
is supported for tablets enabled keyspaces only. Since we want to start
with implementing local consistency configuring global consistency will
result in an error for now.
2025-10-16 13:34:49 +03:00
Marcin Maliszkiewicz
e5fffa158f db: replica: move tables_metadata locking to commit
This keeps the locking scope minimal, and since
unlocking is done in commit(), locking fits here as well.
2025-10-16 10:56:10 +02:00
Botond Dénes
5d70450917 replica/mutation_dump: multi_range_partition_generator: disable garbage-collection
Make use of the freshly introduced facility to disable
garbage-collection on a per-query basis for range scans. This is needed
so partitions that only contain garbage-collectible data are not missing
from the partition-list. When using SELECT * FROM MUTATION_FRAGMENTS(),
the user is expecting to see *all* data, even that which is dead and
garbage-collectible.

Include a test which reproduces the issue.
2025-10-16 10:40:28 +03:00
Botond Dénes
734a9934a6 replica: add tombstone_gc_enabled parameter to mutation query methods
Allow disabling tombstone gc on a per-query basis for mutation queries.
This is achieved by a bool flag passed to mutation query variants like
`query_mutations_on_all_shards()` and `database::mutation_query()`,
which is then propagated down to compaction_mutation_state.
The future user (in the next patch) is the SELECT * FROM
MUTATION_FRAGMENTS() statement which wants to see dead partitions
(and rows) when scanning a table. Currently, due to garbage collections,
said statement can miss partitions which only contain
garbage-collectible tombstones.
2025-10-16 10:38:47 +03:00
Botond Dénes
d0844abb5c mutation/mutation_compactor: compaction:stats: split partitions
Into total and live. Currently only live (those with live content) are
counted. Report live and total seprately, just like we do for rows. This
allows deducing the count of dead partitions as well, which is
particularly interesting for scans.

Closes scylladb/scylladb#26548
2025-10-14 19:08:47 +03:00
Marcin Maliszkiewicz
d67632bfe2 replica: schema_applier: obtain copy of token_metadata at the beginning of schema merge
This copy is now used during the whole duration of schema merge.
If it changes due to tablet_hint then it's replicated to all shards as before.
2025-10-14 10:56:36 +02:00
Marcin Maliszkiewicz
209563f478 db: remove unused proxy from create_keyspace_metadata 2025-10-14 10:56:25 +02:00
Botond Dénes
180bf647f7 replica/mutation_dump: add support for virtual tables
Not supported currently as such tables have no memtables, cache or
sstables, so any select * from mutation_fragments() query will return
empty result.
Detect virtual tables and add return their content with a distinct
'virtual-table' mutation_source designation.
2025-10-13 18:10:40 +03:00
Botond Dénes
970d4f0dcd replica/table: add virtual write hook to the other apply() overload too
Currently only one has it, which means virtual table can potentially
miss some writes.
2025-10-13 17:35:50 +03:00
Calle Wilund
5d4558df3b sstables: Use object_storage_client for remote storage
Replaces direct s3 interfaces with the abstraction layer, and open
for having multiple implentations/backends
2025-10-13 08:53:25 +00:00
Botond Dénes
24c6476f73 mutation/mutation_compactor: add tombstone_gc_state to query ctor
So tombstones can be purged correctly based on the tombstone gc mode.
Currently if repair-mode is used, tombstones are not purged at all,
which can lead to purged tombstone being re-replicated to replicas which
already purged them via read-repair.
This is not a correctness problem, tombstones are not included in data
query resutl or digest, these purgable tombstone are only a nuissance
for read repair, where they can create extra differences between
replicas. Note that for the read repair to trigger, some difference
other than in purgable tombstones has to exist, because as mentioned
above, these are not included in digets.

Fixes: scylladb/scylladb#24332

Closes scylladb/scylladb#26351
2025-10-12 17:48:15 +03:00
Dawid Mędrek
a9577e4d52 replica/database: Fix description of validate_tablet_views_indexes
The current description is not accurate: the function doesn't throw
an exception if there's an invalid materialized view. Instead, it
simply logs the keyspaces that violate the requirement.

Furthermore, the experimental feature `views-with-tablets` is no longer
necessary for considering a materialized view as valid. It was dropped
in scylladb/scylladb@b409e85c20. The
replacement for it is the cluster feature `VIEWS_WITH_TABLETS`.

Fixes scylladb/scylladb#26420

Closes scylladb/scylladb#26421
2025-10-07 17:39:43 +02:00
Piotr Dulikowski
380f243986 Merge ' Support replication factor rack list for tablet-based keyspaces' from Tomasz Grabiec
This change extends the CQL replication options syntax so the replication factor can be stated as a list of rack names.
For example: { 'mydatacenter': [ 'myrack1', 'myrack2', 'myrack4' ] }

Rack-list based RF can coexist with the old numerical RF, even in the same keyspace for different DCs.

Specifying the rack list also allows to add replicas on the specified racks (increasing the replication factor), or decommissioning certain racks from their replicas (by omitting them from the current datacenter rack-list). This will allow us to keep the keyspace rf-rack-valid, maintaining guarantees, while allowing adding/removing racks. In particular, this will allow us to add a new DC, which happens by incrementally increasing RF in that DC to cover existing racks.

Migration from numerical RF to rack-list is not supported yet. Migration from rack-list to numerical RF is not planned to be supported.

New feature, no backport required.

Co-authored with @bhalevy

Fixes https://github.com/scylladb/scylladb/issues/25269
Fixes https://github.com/scylladb/scylladb/issues/23525

Closes scylladb/scylladb#26358

* github.com:scylladb/scylladb:
  tablets: load_balancer: Recognize that tablets are confined to racks when computing desired tablet count
  locator: Make hasher for endpoint_dc_rack globally accessible
  test: tablets: Add test for replica allocation on rack list changes
  test: lib: topology_builder: generate unique rack names
  test: Add tests for rack list RF
  doc: Document rack-list replication factor
  topology_coordinator: Restore formatting
  topology_coordinator: Cancel keyspace alter on broader set of errors
  topology_coordinator: Make keyspace alter process options through as_ks_metadata_update()
  cql3: ks_prop_defs: Preserve old options
  cql3: ks_prop_defs: Introduce flattened()
  locator: Recognize rack list RF as valid in assert_rf_rack_valid_keyspace()
  tablet_allocator: Respect binding replicas to racks
  locator: network_topology_strategy: Respect rack list when reallocating tablets
  cql3: ks_prop_defs: Fail with more information when options are not in expected format
  locator, cql3: Support rack lists in replication options
  cql3: Fail early on vnode/tablet flavor alter
  cql3: Extract convert_property_map() out of Cql.g
  schema: Use definition from the header instead of open-coding it
  locator: Abstract obtaining the number of replicas from replication_strategy_config_option
  cql3, locator: Use type aliases for option maps
  locator: Add debug logging
  locator: Pass topology to replication strategy constructor
  abstract_replication_strategy, network_topology_strategy: add replication_factor_data class
2025-10-06 14:14:09 +02:00
Piotr Dulikowski
e7907b173a Merge 'db/view: Require rf_rack_valid_keyspaces when creating materialized view' from Dawid Mędrek
Materialized views are currently in the experimental phase and using them
in tablet-based keyspaces requires starting Scylla with an experimental feature,
`views-with-tablets`. Any attempts to create a materialized view or secondary
index when it's not enabled will fail with an appropriate error.

After considerable effort, we're drawing close to bringing views out of the
experimental phase, and the experimental feature will no longer be needed.
However, materialized views in tablet-based keyspaces will still be restricted,
and creating them will only be possible after enabling the configuration option
`rf_rack_valid_keyspaces`. That's what we do in this PR.

In this patch, we adjust existing tests in the tree to work with the new
restriction. That shouldn't have been necessary because we've already seemingly
adjusted all of them to work with the configuration option, but some tests hid
well. We fix that mistake now.

After that, we introduce the new restriction. What's more, when starting Scylla,
we verify that there is no materialized view that would violate the contract.
If there are some that do, we list them, notify the user, and refuse to start.

High-level implementation strategy:

1. Name the restrictions in form of a function.
2. Adjust existing tests.
3. Restrict materialized views by both the experimental feature
   and the configuration option. Add validation test.
4. Drop the requirement for the experimental feature. Adjust the added test
   and add a new one.
5. Update the user documentation.

Fixes scylladb/scylladb#23030

Backport: 2025.4, as we are aiming to support materialized views for tablets from that version.

Closes scylladb/scylladb#25802

* github.com:scylladb/scylladb:
  view: Stop requiring experimental feature
  db/view: Verify valid configuration for tablet-based views
  db/view: Require rf_rack_valid_keyspaces when creating view
  test/cluster/random_failures: Skip creating secondary indexes
  test/cluster/mv: Mark test_mv_rf_change as skipped
  test/cluster: Adjust MV tests to RF-rack-validity
  test/boost/schema_loader_test.cc: Explicitly enable rf_rack_valid_keyspaces
  db/view: Name requirement for views with tablets
2025-10-06 12:46:46 +02:00
Pavel Emelyanov
6ad8dc4a44 Merge 'root,replica: mv querier to replica/' from Botond Dénes
The querier object is a confusing one. Based on its name it should be in the query/ module and it is already in the query namespace. The query namespace is used for symbols which span the coordinator and replica, or that are mostly coordinator side. The querier is mainly in this namespace due to its similar name and because at the time it was introduced, namespace replica didn't exist yet. But this is a mistake which confuses people.
The querier is actually a completely replica-side logic, implementing the caching of the readers on the replica. Move it to the replica module and namespace to make this more clear.

Code cleanup, no backport.

Closes scylladb/scylladb#26280

* github.com:scylladb/scylladb:
  replica: move querier code to replica namespace
  root,replica: mv querier to replica/
2025-10-06 08:26:05 +03:00
Ferenc Szili
20aeed1607 load balancing: extend locator::load_stats to collect tablet sizes
This commit extend the TABLE_LOAD_STATS RPC with data about the tablet
replica sizes and effective disk capacity.
Effective disk capacity of a node is computed as a sum of the sizes of
all tablet replicas on a node and available disk space.

This is the first change in the size based load balancing series.

Closes scylladb/scylladb#26035
2025-10-03 13:37:22 +02:00
Tomasz Grabiec
91e51a5dd1 cql3, locator: Use type aliases for option maps
In preparation for changing their structure.

1) std::map<sstring, sstring> -> replication_strategy_config_options

  Parsed options. Values will become std::variant<sstring, rack_list>

2) std::map<sstring, sstring> -> property_definitions::map_type

  Flattened map of options, as stored system tables.
2025-10-01 16:06:51 +02:00
Benny Halevy
da6e2fdb1b locator: Pass topology to replication strategy constructor 2025-10-01 16:06:28 +02:00
Dawid Mędrek
b409e85c20 view: Stop requiring experimental feature
We modify the requirements for using materialized views in tablet-based
keyspaces. Before, it was necessary to enable the configuration option
`rf_rack_valid_keyspaces`, having the cluster feature `VIEWS_WITH_TABLETS`
enabled, and using the experimental feature `views-with-tablets`.
We drop the last requirement.

We adjust code to that change and provide a new validation test.
We also update the user documentation to reflect the changes.

Fixes scylladb/scylladb#23030
2025-10-01 09:01:53 +02:00
Dawid Mędrek
288be6c82d db/view: Verify valid configuration for tablet-based views
Creating a materialized view or a secondary index in a tablet-based
keyspace requires that the user enabled two options:

* experimental feature `views-with-tablets`,
* configuration option `rf_rack_vaid_keyspaces`.

Because the latter has only become a necessity recently (in this series),
it's possible that there are already existing materialized views that
violate it.

We add a new check at start-up that iterates over existing views and
makes sure that that is not the case. Otherwise, Scylla notifies the user
of the problem.
2025-10-01 09:01:53 +02:00
Benny Halevy
b17a36c071 tablets: read_tablet_mutations: use unfreeze_and_split_gently
Split the tablets mutations by number of rows, based on
`min_tablets_in_mutation` (currently calibrated to 1024),
similar to the splitting done in
`storage_service::merge_topology_snapshot`.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-09-30 17:15:41 +03:00
Benny Halevy
d21984d0cc tablets: tablet_map_to_mutations: maybe split tablets mutation
Split the generated tablets mutation if we run out of task
quota to prevent stalls, both when preparing the mutations
and later on when freezing/unfreezing them or converting
them to canonical_mutation and back.

Note that this will convert large mutation to long
vectors of mutations.  A followup change is considered
to convert std::vector:s of mutations to chunked_vector
to prevent large allocations.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-09-30 17:15:41 +03:00
Benny Halevy
aaddff5211 tablets: tablet_map_to_mutations: accept process_func
Prepare for generating several mutations for the
tablet_map by calling process_func for each generated mutation.

This allows the caller to directly freeze those mutations
one at a time into a vector of frozen mutations or simililarly
convert them into canonical mutations.

Next patch will split large tablet mutations to prevent stalls.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-09-30 17:15:38 +03:00
Avi Kivity
4d9271df98 Merge 'sstables: introduce sstable version ms' from Michał Chojnowski
This is yet another part in the BTI index project.

Overarching issue: https://github.com/scylladb/scylladb/issues/19191
Previous part: https://github.com/scylladb/scylladb/pull/25626
Next parts: make `ms` the default. Then, general tweaks and improvements. Later, potentially a full `da` format implementation.

This patch series introduces a new, Scylla-only sstable format version `ms`, which is like `me`, but with the index components (Summary.db and Index.db) replaced with BTI index components (Partitions.db and Rows.db), as they are in Cassandra 5.0's `da` format version.

(Eventually we want to just implement `da`, but there are several other changes (unrelated to the index files) between `me` and `da`. By adding this `ms` as an intermediate step we can adapt the new index formats without dragging all the other changes into the mix (and raising the risk of regressions, which is already high)).

The high-level structure of the PR is:
1. Introduce new component types — `Partitions` and `Rows`.
2. Teach `class sstable` to open them when they exist.
3. Teach the sstable writer how to write index data to them.
4. Teach `class sstable` and unit tests how to deal with sstables that have no `Index` or `Summary` (but have `Partitions` and `Rows` instead).
5. Introduce the new sstable version `ms`, specify that it has `Partitions` and `Rows` instead of `Index` and `Summary`.
6. Prepare unit tests for the appearance of `ms`.
7. Enable `ms` in unit tests.
8. Make `ms` enablable via db::config (with a silent fall back to `me` until the new `MS_SSTABLE_FORMAT` cluster feature is enabled).
9. Prepare integration tests for the appearance of `ms`.
10. Enable both `ms` and `me` in tests where we want both versions to be tested.

This series doesn't make `ms` the default yet, because that requires teaching Scylla Manager and a few dtests about the new format first. It can be enabled by setting `sstable_format: ms` in the config.

Per a review request, here is an example from `perf_fast_forward`, demonstrating some motivation for a new format. (Although not the main one. The main motivations are getting rid of restrictions on the RAM:disk ratio, and index read throughput for datasets with tiny partitions). The dataset was populated with `build/release/scylla perf-fast-forward --smp=1 --sstable-format=$VERSION --data-directory=data.$VERSION --column-index-size-in-kb=1 --populate --random-seed=0`.
This test involves a partition with 1000000 clustering rows (with 32-bit keys and 100-byte values) and ~500 index blocks, and queries a few particular rows from the partition. Since the branching factor for the BIG promoted index is 2 (it's a binary search), the lookup involves ~11.2 sequential page reads per row. The BTI format has a more reasonable branching factor, so it involves ~2.3 page reads per row.

`build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/me --run-tests=large-partition-select-few-rows`:
```
offset  stride  rows     iterations    avg aio    aio      (KiB)
500000  1       1                70       18.0     18        128
500001  1       1               647       19.0     19        132
0       1000000 1               748       15.0     15        116
0       500000  2               372       29.0     29        284
0       250000  4               227       56.0     56        504
0       125000  8               116      106.0    106        928
0       62500   16               67      195.0    195       1732
```
`build/release/scylla perf-fast-forward --smp=1 --data-directory=perf_fast_forward_data/ms --run-tests=large-partition-select-few-rows`:
```
offset  stride  rows     iterations    avg aio    aio      (KiB)
500000  1       1                51        5.1      5         20
500001  1       1                64        5.3      5         20
0       1000000 1               679        4.0      4         16
0       500000  2               492        8.0      8         88
0       250000  4               804       16.0     16        232
0       125000  8               409       31.0     31        516
0       62500   16               97       54.0     54       1056
```

Index file size comparison for the default `perf_fast_forward` tables with `--random-seed=0`:
Large partition table (dominated by intra-partition index): 2.4 MB with `me`, 732 kB with `ms`.
For the small partitions table (dominated by inter-partition index): 11 MB with `me`, 8.4 MB with `ms`.

External tests:
I ran SCT test `longevity-mv-si-4days-streaming-test` test on 6 nodes with 30 shards each for 8 hours. No anomalies were observed.

New functionality, no backport needed.

Closes scylladb/scylladb#26215

* github.com:scylladb/scylladb:
  test/boost/bloom_filter_test: add test_rebuild_from_temporary_hashes
  test/cluster: add test_bti_index.py
  test: prepare bypass_cache_test.py for `ms` sstables
  sstables/trie/bti_index_reader: add a failure injection in advance_lower_and_check_if_present
  test/cqlpy/test_sstable_validation.py: prepare the test for `ms` sstables
  tools/scylla-sstable: add `--sstable-version=?` to `scylla sstable write`
  db/config: expose "ms" format to the users via database config
  test: in Python tests, prepare some sstable filename regexes for `ms`
  sstables: add `ms` to `all_sstable_versions`
  test/boost/sstable_3_x_test: add `ms` sstables to multi-version tests
  test/lib/index_reader_assertions: skip some row index checks for BTI indexes
  test/boost/sstable_inexact_index_test: explicitly use a `me` sstable
  test/boost/sstable_datafile_test: skip test_broken_promoted_index_is_skipped for `ms` sstables
  test/resource: add `ms` sample sstable files for relevant tests
  test/boost/sstable_compaction_test: prepare for `ms` sstables.
  test/boost/index_reader_test: prepare for `ms` sstables
  test/boost/bloom_filter_tests: prepare for `ms` sstables
  test/boost/sstable_datafile_test: prepare for `ms` sstables
  test/boost/sstable_test: prepare for `ms` sstables.
  sstables: introduce `ms` sstable format version
  tools/scylla-sstable: default to "preferred" sstable version, not "highest"
  sstables/mx/reader: use the same hashed_key for the bloom filter and the index reader
  sstables/trie/bti_index_reader: allow the caller to passing a precalculated murmur hash
  sstables/trie/bti_partition_index_writer: in add(), get the key hash from the caller
  sstables/mx: make Index and Summary components optional
  sstables: open Partitions.db early when it's needed to populate key range for sharding metadata
  sstables: adapt sstable::set_first_and_last_keys to sstables without Summary
  sstables: implement an alternative way to rebuild bloom filters for sstables without Index
  utils/bloom_filter: add `add(const hashed_key&)`
  sstables: adapt estimated_keys_for_range to sstables without Summary
  sstables: make `sstable::estimated_keys_for_range` asynchronous
  sstables/sstable: compute get_estimated_key_count() from Statistics instead of Summary
  replica/database: add table::estimated_partitions_in_range()
  sstables/mx: implement sstable::has_partition_key using a regular read
  sstables: use BTI index for queries, when present and enabled
  sstables/mx/writer: populate BTI index files
  sstables: create and open BTI index files, when enabled
  sstables: introduce Partition and Rows component types
  sstables/mx/writer: make `_pi_write_m.partition_tombstone` a `sstables::deletion_time`
2025-09-30 09:40:02 +03:00
Piotr Dulikowski
4581c72430 Merge 'lwt: prohibit for tablet-based views and cdc logs' from Petr Gusev
`SELECT` commands with SERIAL consistency level are historically allowed for vnode-based views, even though they don't provide linearizability guarantees and in general don't make much sense. In this PR we prohibit LWTs for tablet-based views, but preserve old behavior for vnode-based views for compatibility. Similar logic is applied to CDC log tables.

We also add a general check that disallows colocating a table with another colocated table, since this is not needed for now.

Fixes https://github.com/scylladb/scylladb/issues/26258

backports: not needed (a new feature)

Closes scylladb/scylladb#26284

* github.com:scylladb/scylladb:
  cql_test_env.cc: log exception when callback throws
  lwt: prohibit for tablet-based views and cdc logs
  tablets: disallow chains of colocated tables
  database: get_base_table_for_tablet_colocation: extract table_id_by_name lambda
2025-09-30 07:15:16 +02:00