Commit Graph

585 Commits

Author SHA1 Message Date
copilot-swe-agent[bot]
77ee7f3417 Revert "Merge 'Add option to use sstable identifier in snapshot' from Benny Halevy"
This reverts commit 8192f45e84.

The merge exposed a bug where truncate (via drop) fails and causes Raft
errors, leading to schema inconsistencies across nodes. This results in
test_table_drop_with_auto_snapshot failures with 'Keyspace test does not exist'
errors.

The specific problematic change was in commit 19b6207f which modified
truncate_table_on_all_shards to set use_sstable_identifier = true. This
causes exceptions during truncate that are not properly handled, leading
to Raft applier fiber stopping and nodes losing schema synchronization.
2025-12-12 03:55:13 +00:00
Tomasz Grabiec
0e51a1f812 replica: Remove unnecessary noexcept
Can potentially lead to unnecessary abort.

compaction_groups() and for_each_compaction_group() can throw.

Co-authored-by: bhalevy <20910904+bhalevy@users.noreply.github.com>
2025-12-10 14:51:35 +01:00
Benny Halevy
9b3fbedc8c table: snapshot_on_all_shards: take snapshot_options
And pass the use_sstable_identifier down the stack
to the sstables layer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 11:55:50 +02:00
Benny Halevy
c18133b6cb db: snapshot_ctl: move skip_flush to struct snapshot_options
Prepare for adding another option: use_sstable_identifer.

Signed-off-by: Benny Halevy <bhalevy@scylladb.com>
2025-12-04 09:46:35 +02:00
Aleksandra Martyniuk
19a7d8e248 replica: database: change type of tables_metadata::_ks_cf_to_uuid
If there is a lot of tables, a node reports oversized allocation
in _ks_cf_to_uuid of type flat_hash_map.

Change the type to std::unordered_map to prevent oversized allocations.

Fixes: https://github.com/scylladb/scylladb/issues/26787.

Closes scylladb/scylladb#27165
2025-11-24 06:42:40 +02:00
Botond Dénes
6ee0f1f3a7 Merge 'replica/table: add a metric for hypothetical total file size without compression' from Michał Chojnowski
This patch adds a metric for pre-compression size of sstable files.

This patch adds a per-table metric
`scylla_column_family_total_disk_space_before_compression`,
which measures the hypothetical total size of sstables on disk,
if Data.db was replaced with an uncompressed equivalent.

As for the implementation:
Before the patch, tables and sstable sets are already tracking their total physical file size.
Whenever sstables are added or removed, the size delta is propagated from the sstable up through sstable sets into table_stats.
To implement the new metric, we turn the size delta that is getting passed around from a one-dimensional to a two-dimensional value, which includes both the physical and the pre-compression size.

New functionality, no backport needed.

Closes scylladb/scylladb#26996

* github.com:scylladb/scylladb:
  replica/table: add a metric for hypothetical total file size without compression
  replica/table: keep track of total pre-compression file size
2025-11-20 09:10:38 +02:00
Botond Dénes
8579e20bd1 Merge 'Enable digest+checksum verification for streaming/repair' from Taras Veretilnyk
This PR enables integrity check of both checksum and digest for repair/streaming.
In the past, streaming readers only verified the checksum of compressed SSTables.

This change extends the checks to include the digest and the checksum (CRC) for both compressed and uncompressed SSTables. These additional checks require reading the digest and CRC components from disk, which may cause some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data, while for compressed SSTables - where checksums are already verified inline - the only extra cost is reading and verifying the digest.If the reader range doesn't cover the full SSTable, the digest is not loaded and check is skipped.

To support testing of these changes, a new option was added to the random_mutation_generator that allows disabling compression.
Several new test cases were added to verify that the repair_reader correctly detects corruption. These tests corrupt digest or data component of an SSTable and confirm that the system throws the expected `malformed_sstable_exception`.

Backport is not required, it is an improvement

Refs #21776

Closes scylladb/scylladb#26444

* github.com:scylladb/scylladb:
  boost/repair_test: add repair reader integrity verification test cases
  test/lib: allow to disable compression in random_mutation_generator
  sstables: Skip checksum and digest reads for unlinked SSTables
  table: enable integrity checks for streaming reader
  table: Add integrity option to table::make_sstable_reader()
  sstables: Add integrity option to create_single_key_sstable_reader
2025-11-14 18:00:33 +02:00
Michał Chojnowski
1cfce430f1 replica/table: keep track of total pre-compression file size
Every table and sstable set keeps track of the total file size
of contained sstables.

Due to a feature request, we also want to keep track of the hypothetical
file size if Data files were uncompressed, to add a metric that
shows the compression ratio of sstables.

We achieve this by replacing the relevant `uint_64 bytes_on_disk`
counters everywhere with a struct that contains both the actual
(post-compression) size and the hypothetical pre-compression size.

This patch isn't supposed to change any observable behavior.
In the next patch, we will use these changes to add a new metric.
2025-11-13 00:49:57 +01:00
Michael Litvak
de321218bc storage_proxy: apply counter mutation on all write shards
When applying a counter mutation, use apply_on_shards to apply the
mutation on all write shards, similarly to the way other mutations are
applied in the storage proxy. Previously the mutation was applied only
on the current shard which is the read shard.

This is needed to respect the write_both stages of intranode migration
where we need to apply the mutation on both the old and the new shards.
2025-11-03 16:03:29 +01:00
Michael Litvak
c7e7a9e120 storage_proxy: move counter update coordination to storage proxy
Refactor the counter update to split the functions and have them called
by the storage proxy to prepare for a later change.

Previously in mutate_counter the storage proxy calls the replica
function apply_counter_update that does a few things:
1. checks that the operation can be done: check timeout, disk utilization
2. acquire counter locks
3. do read-modify-write and transform the counter mutation
4. apply the mutation in the replica

In this commit we change it so that these functions are split and called
from the storage proxy, so that we have better control from the storage
proxy when we change it later to work across multiple shards. For
example, we will want to acquire locks on multiple shards, transform it
on one shard, and then apply the mutation on multiple shards.

After the change it works as follows in storage proxy:
1. acquire counter locks
2. call replica prepare to check the operation and transform the mutation
3. call replica apply to apply the transformed mutation
2025-11-03 15:59:46 +01:00
Michael Litvak
7cc6b0d960 replica/db: add counter update guard
Add a RAII guard for counter update that holds the counter locks and the
table operation, and extract the creation of the guard to a separate
function.

This prepares it for a later change where we will want to obtain the
guard externally from the storage proxy.
2025-11-03 08:43:11 +01:00
Michael Litvak
88fd9a34c4 replica/db: split counter update helper functions
Split do_apply_counter_update to a few smaller and simpler functions to
help prepare for a later change.
2025-11-03 08:43:11 +01:00
Taras Veretilnyk
06e1b47ec6 table: Add integrity option to table::make_sstable_reader() 2025-10-28 19:27:35 +01:00
Avi Kivity
d81796cae3 Merge 'Limit concurrent view updates from all sources' from Wojciech Mitros
Before this patch, when a base table has many materialized views,
each write to this table can start up to 128 view updates in parallel.
With high client write concurrency, the actual concurrency of writes
executed on the node may grow unexpectedly, which can lead to higher
latency and higher memory usage compared to a sequential approach.
In this patch we add a per-shard, per-service-level semaphore which
limits the number of concurrent view updates processed on the shard
in this service level to a constant value. We take one unit from the
semaphore for each local view update write, and releasing it when it
finishes. The remote view updates do not take units from the semaphore
because they don't consume nearly as much processing power and they
are limited by another semaphore based on their memory usage.

Fixes https://github.com/scylladb/scylladb/issues/25341

Closes scylladb/scylladb#25456

* github.com:scylladb/scylladb:
  mv: limit concurrent view updates from all sources
  database: rename _view_update_concurrency_sem to _view_update_memory_sem
2025-10-28 11:13:24 +02:00
Wojciech Mitros
f07a86d16e mv: limit concurrent view updates from all sources
Before this patch, when a base table has many materialized views,
each write to this table can start up to 128 view updates in parallel.
With high client write concurrency, the actual concurrency of writes
executed on the node may grow unexpectedly, which can lead to higher
latency and higher memory usage compared to a sequential approach.
In this patch we add a per-shard, per-service-level semaphore which
limits the number of concurrent view updates processed on the shard
in this service level to a constant value. We take one unit from the
semaphore for each local view update write, and releasing it when it
finishes. The remote view updates do not take units from the semaphore
because they don't consume nearly as much processing power and they
are limited by another semaphore based on their memory usage.

The effect of this patch can also be observed when writing to a base
table with a large number of materialized views, like in the
materialized_views_test.py::TestMaterializedViews::test_many_mv_concurrent
dtest. In that test, if we perform a full scan in parallel to a write
workload with a concurrency of 100 to a table with 100 views, the scan
would sometimes timeout because it would effectively get 1/10000 of cpu.
With this patch, the cpu concurrency of view updates was limited to 128
(we ran both writes and scan in the same service level), and the scan
no longer timed out.

Fixes https://github.com/scylladb/scylladb/issues/25341
2025-10-27 18:55:41 +01:00
Avi Kivity
b843d8bc8b Merge 'scylla-sstable: add cql support to write operation' from Botond Dénes
In theory, scylla-sstable write is an awesome and flexible tool to generate sstables with arbitrary content. This is convenient for tests and could come clutch in a disaster scenario, where certain system table's content need to be manually re-created, system tables that are not writable directly via CQL.
In practice, in its current form this operation is so convoluted to use that even its own author shuns it. This is because the JSON specification of the sstable content is the same as that of the scylla-sstable dump-data: containing every single piece of information on the mutation content. Where this is an advantage for dump-data, allowing users to inspect the data in its entirety -- it is a huge disadvantage for write, because of all these details have to be filled in, down to the last timestamp, to generate an sstable. On top of that, the tool doesn't even support any of the more advanced data types, like collections, UDF and counters.
This PR proposes a new way of generating sstables: based on the success of scylla-sstable query, it introduces CQL support for scylla-sstable write. The content of the sstable can now be specified via standard INSERT, UPDATE and DELETE statements, which are applied to a memtable, then flushed into the sstable.
To avoid boundless memory consumption, the memtable is flushed every time it reaches 1MiB in size, consequently the command can generate multiple output sstables.

The new CQL input-format is made default, this is safe as nobody is using this command anyway. Hopefully this PR will change that.

Fixes: https://github.com/scylladb/scylladb/issues/26506

New feature, no backport.

Closes scylladb/scylladb#26515

* github.com:scylladb/scylladb:
  test/cqlpy/test_tools.py: add test for scylla-sstable write --input-format=cql
  replica/mutation_dump: add support for virtual tables
  tools/scylla-sstable: print_query_results_json(): handle empty value buffer
  tools/scylla-sstable: add cql support to write operation
  tools/scylla-sstable: write_operation(): fix indentation
  tools/scylla-sstable: write_operation(): prepare for a new input-format
  tools/scylla-sstable: generalize query_operation_validate_query()
  tools/scylla-sstable: move query_operation_validate_query()
  tools/scylla-sstable: extract schema transformation from query operation
  replica/table: add virtual write hook to the other apply() overload too
2025-10-24 23:32:40 +03:00
Avi Kivity
997b52440e Merge 'replica/mutation_dump: include empty/dead partitions in the scan results' from Botond Dénes
`select * from mutation_fragment()` queries don't return partitions which are completely empty or only contain tombstones which are all garbage collectible. This is because the underlying `mutation_dump` mechanism has a separate query to discover partitions for scans. This query is a regular mutation scan, which is subject to query compaction and garbage collection. Disable the query compaction for mutation queries executed on behalf of mutation fragment queries, so *all* data is visible in the result, even that which is fully garbage collectible.

Fixes scylladb/scylladb#23707.

Scans for mutation-fragment are very rare, so a backport is not necessary. We can backport on-demand.

Closes scylladb/scylladb#26227

* github.com:scylladb/scylladb:
  replica/mutation_dump: multi_range_partition_generator: disable garbage-collection
  replica: add tombstone_gc_enabled parameter to mutation query methods
  mutation/mutation_compactor: remove _can_gc member
  tombstone_gc: add tombstone_gc_state factory methods for gc_all and no_gc
2025-10-24 23:26:16 +03:00
Wojciech Mitros
c0d0f8f85b database: rename _view_update_concurrency_sem to _view_update_memory_sem
In the following commit, we'll introduce a new semaphore for view updates
that limits their concurrency by view update count. To avoid confusion,
we rename the existing semaphore that tracks the memory used by concurrent
view updates and related objects accordingly.
2025-10-23 10:00:15 +02:00
Marcin Maliszkiewicz
e5fffa158f db: replica: move tables_metadata locking to commit
This keeps the locking scope minimal, and since
unlocking is done in commit(), locking fits here as well.
2025-10-16 10:56:10 +02:00
Botond Dénes
734a9934a6 replica: add tombstone_gc_enabled parameter to mutation query methods
Allow disabling tombstone gc on a per-query basis for mutation queries.
This is achieved by a bool flag passed to mutation query variants like
`query_mutations_on_all_shards()` and `database::mutation_query()`,
which is then propagated down to compaction_mutation_state.
The future user (in the next patch) is the SELECT * FROM
MUTATION_FRAGMENTS() statement which wants to see dead partitions
(and rows) when scanning a table. Currently, due to garbage collections,
said statement can miss partitions which only contain
garbage-collectible tombstones.
2025-10-16 10:38:47 +03:00
Marcin Maliszkiewicz
d67632bfe2 replica: schema_applier: obtain copy of token_metadata at the beginning of schema merge
This copy is now used during the whole duration of schema merge.
If it changes due to tablet_hint then it's replicated to all shards as before.
2025-10-14 10:56:36 +02:00
Botond Dénes
180bf647f7 replica/mutation_dump: add support for virtual tables
Not supported currently as such tables have no memtables, cache or
sstables, so any select * from mutation_fragments() query will return
empty result.
Detect virtual tables and add return their content with a distinct
'virtual-table' mutation_source designation.
2025-10-13 18:10:40 +03:00
Dawid Mędrek
a9577e4d52 replica/database: Fix description of validate_tablet_views_indexes
The current description is not accurate: the function doesn't throw
an exception if there's an invalid materialized view. Instead, it
simply logs the keyspaces that violate the requirement.

Furthermore, the experimental feature `views-with-tablets` is no longer
necessary for considering a materialized view as valid. It was dropped
in scylladb/scylladb@b409e85c20. The
replacement for it is the cluster feature `VIEWS_WITH_TABLETS`.

Fixes scylladb/scylladb#26420

Closes scylladb/scylladb#26421
2025-10-07 17:39:43 +02:00
Piotr Dulikowski
380f243986 Merge ' Support replication factor rack list for tablet-based keyspaces' from Tomasz Grabiec
This change extends the CQL replication options syntax so the replication factor can be stated as a list of rack names.
For example: { 'mydatacenter': [ 'myrack1', 'myrack2', 'myrack4' ] }

Rack-list based RF can coexist with the old numerical RF, even in the same keyspace for different DCs.

Specifying the rack list also allows to add replicas on the specified racks (increasing the replication factor), or decommissioning certain racks from their replicas (by omitting them from the current datacenter rack-list). This will allow us to keep the keyspace rf-rack-valid, maintaining guarantees, while allowing adding/removing racks. In particular, this will allow us to add a new DC, which happens by incrementally increasing RF in that DC to cover existing racks.

Migration from numerical RF to rack-list is not supported yet. Migration from rack-list to numerical RF is not planned to be supported.

New feature, no backport required.

Co-authored with @bhalevy

Fixes https://github.com/scylladb/scylladb/issues/25269
Fixes https://github.com/scylladb/scylladb/issues/23525

Closes scylladb/scylladb#26358

* github.com:scylladb/scylladb:
  tablets: load_balancer: Recognize that tablets are confined to racks when computing desired tablet count
  locator: Make hasher for endpoint_dc_rack globally accessible
  test: tablets: Add test for replica allocation on rack list changes
  test: lib: topology_builder: generate unique rack names
  test: Add tests for rack list RF
  doc: Document rack-list replication factor
  topology_coordinator: Restore formatting
  topology_coordinator: Cancel keyspace alter on broader set of errors
  topology_coordinator: Make keyspace alter process options through as_ks_metadata_update()
  cql3: ks_prop_defs: Preserve old options
  cql3: ks_prop_defs: Introduce flattened()
  locator: Recognize rack list RF as valid in assert_rf_rack_valid_keyspace()
  tablet_allocator: Respect binding replicas to racks
  locator: network_topology_strategy: Respect rack list when reallocating tablets
  cql3: ks_prop_defs: Fail with more information when options are not in expected format
  locator, cql3: Support rack lists in replication options
  cql3: Fail early on vnode/tablet flavor alter
  cql3: Extract convert_property_map() out of Cql.g
  schema: Use definition from the header instead of open-coding it
  locator: Abstract obtaining the number of replicas from replication_strategy_config_option
  cql3, locator: Use type aliases for option maps
  locator: Add debug logging
  locator: Pass topology to replication strategy constructor
  abstract_replication_strategy, network_topology_strategy: add replication_factor_data class
2025-10-06 14:14:09 +02:00
Piotr Dulikowski
e7907b173a Merge 'db/view: Require rf_rack_valid_keyspaces when creating materialized view' from Dawid Mędrek
Materialized views are currently in the experimental phase and using them
in tablet-based keyspaces requires starting Scylla with an experimental feature,
`views-with-tablets`. Any attempts to create a materialized view or secondary
index when it's not enabled will fail with an appropriate error.

After considerable effort, we're drawing close to bringing views out of the
experimental phase, and the experimental feature will no longer be needed.
However, materialized views in tablet-based keyspaces will still be restricted,
and creating them will only be possible after enabling the configuration option
`rf_rack_valid_keyspaces`. That's what we do in this PR.

In this patch, we adjust existing tests in the tree to work with the new
restriction. That shouldn't have been necessary because we've already seemingly
adjusted all of them to work with the configuration option, but some tests hid
well. We fix that mistake now.

After that, we introduce the new restriction. What's more, when starting Scylla,
we verify that there is no materialized view that would violate the contract.
If there are some that do, we list them, notify the user, and refuse to start.

High-level implementation strategy:

1. Name the restrictions in form of a function.
2. Adjust existing tests.
3. Restrict materialized views by both the experimental feature
   and the configuration option. Add validation test.
4. Drop the requirement for the experimental feature. Adjust the added test
   and add a new one.
5. Update the user documentation.

Fixes scylladb/scylladb#23030

Backport: 2025.4, as we are aiming to support materialized views for tablets from that version.

Closes scylladb/scylladb#25802

* github.com:scylladb/scylladb:
  view: Stop requiring experimental feature
  db/view: Verify valid configuration for tablet-based views
  db/view: Require rf_rack_valid_keyspaces when creating view
  test/cluster/random_failures: Skip creating secondary indexes
  test/cluster/mv: Mark test_mv_rf_change as skipped
  test/cluster: Adjust MV tests to RF-rack-validity
  test/boost/schema_loader_test.cc: Explicitly enable rf_rack_valid_keyspaces
  db/view: Name requirement for views with tablets
2025-10-06 12:46:46 +02:00
Pavel Emelyanov
6ad8dc4a44 Merge 'root,replica: mv querier to replica/' from Botond Dénes
The querier object is a confusing one. Based on its name it should be in the query/ module and it is already in the query namespace. The query namespace is used for symbols which span the coordinator and replica, or that are mostly coordinator side. The querier is mainly in this namespace due to its similar name and because at the time it was introduced, namespace replica didn't exist yet. But this is a mistake which confuses people.
The querier is actually a completely replica-side logic, implementing the caching of the readers on the replica. Move it to the replica module and namespace to make this more clear.

Code cleanup, no backport.

Closes scylladb/scylladb#26280

* github.com:scylladb/scylladb:
  replica: move querier code to replica namespace
  root,replica: mv querier to replica/
2025-10-06 08:26:05 +03:00
Ferenc Szili
20aeed1607 load balancing: extend locator::load_stats to collect tablet sizes
This commit extend the TABLE_LOAD_STATS RPC with data about the tablet
replica sizes and effective disk capacity.
Effective disk capacity of a node is computed as a sum of the sizes of
all tablet replicas on a node and available disk space.

This is the first change in the size based load balancing series.

Closes scylladb/scylladb#26035
2025-10-03 13:37:22 +02:00
Benny Halevy
da6e2fdb1b locator: Pass topology to replication strategy constructor 2025-10-01 16:06:28 +02:00
Dawid Mędrek
288be6c82d db/view: Verify valid configuration for tablet-based views
Creating a materialized view or a secondary index in a tablet-based
keyspace requires that the user enabled two options:

* experimental feature `views-with-tablets`,
* configuration option `rf_rack_vaid_keyspaces`.

Because the latter has only become a necessity recently (in this series),
it's possible that there are already existing materialized views that
violate it.

We add a new check at start-up that iterates over existing views and
makes sure that that is not the case. Otherwise, Scylla notifies the user
of the problem.
2025-10-01 09:01:53 +02:00
Michał Chojnowski
68c33c0173 replica/database: add table::estimated_partitions_in_range()
Add a function which computes an estimated number of partitions
in the given token range. We will use this helper in a later patch
to replace a few places in the code which de facto do the same
thing "manually".
2025-09-29 13:01:21 +02:00
Botond Dénes
2b4a140610 replica: move querier code to replica namespace
The query namespace is used for symbols which span the coordinator and
replica, or that are mostly coordinator side. The querier is mainly in
this namespace due to its similar name, but this is a mistake which
confuses people. Now that the code was moved to replica/, also fix the
namespace to be namespace replica.
2025-09-29 06:44:52 +03:00
Botond Dénes
ee3d2f5b43 root,replica: mv querier to replica/
The querier object is a confusing one. Based on its name it should be in
the query/ module and it is already in the query namespace. But this is
actually a completely replica-side logic, implementing the caching of
the readers on the replica. Move it to the replica module to make this
more clear.
2025-09-29 06:33:53 +03:00
Pavel Emelyanov
f3c57f7dd0 table: Move for_all_partitions_slow() to test
It's now only used by a single test, so move it there and remove from
public table API.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-09-26 16:33:25 +03:00
Botond Dénes
1999d8e3d3 compaction: remove using namespace {compaction,sstables}
Some files in compaction/ have using namespace {compaction,sstables}
clauses, some even in headers. This is considered bad practice and
muddies the namespace use. Remove them.
2025-09-25 15:03:57 +03:00
Botond Dénes
86ed627fc4 compaction: move code to namespace compaction
The namespace usage in this directory is very inconsistent, with files
and classes scattered in:
* global namespace
* namespace compaction
* namespace sstables

With cases, where all three used in the same file. This code used to
live in sstables/ and some of it still retains namespace sstables as a
heritage of that time. The mismatch between the dir (future module) and
the namespace used is confusing, so finish the migration and move all
code in compaction/ to namespace compaction too.

This patch, although large, is mechanic and only the following kind of
changes are made:
* replace namespace sstable {} with namespace compaction {}
* add namespace compaction {}
* drop/add sstables::
* drop/add compaction::
* move around forward-declarations so they are in the correct namespace
  context

This refactoring revealed some awkward leftover coupling between
sstables and compaction, in sstables/sstable_set.cc, where the
make_sstable_set() methods of compaction strategies are implemented.
2025-09-25 15:03:56 +03:00
Wojciech Mitros
d9b8278178 mv: handle mismatched base/view replica count caused by RF change
During an ALTER KEYSPACE statement execution where a table with a view
is present, we need to perform tablet migrations for both tables.
These migrations are not synchronized, so at some point the base may
have a different number of non-pending replicas than the view. Because
of that, we can't pair them correctly. If there is more non-pending
base replicas than view replicas, we don't need to do anything because
the view replica that didn't finish migrating is a pending replica
and will get view updates from all base replicas. But if there is more
non-pending view replicas than base replicas, we may currently lose
view updates to the new view replica.

This patch adds a workaround for this scenario. If after one migration
we have too more non-pending view replicas than base replicas, we add
it to the pending replica list so that it gets an update anyway.

This patch will also take effect if the base and view replica counts
differ due to some other bug. To track that, a new metric is added
to count such occurrences.

This patch also includes a test for this exact scenario, which is enforced by an injection.

Fixes https://github.com/scylladb/scylladb/issues/21492
2025-09-22 12:50:16 +02:00
Michał Chojnowski
9e70df83ab db: get rid of sstables-format-selector
Our sstable format selection logic is weird, and hard to follow.

If I'm not misunderstanding, the pieces are:
1. There's the `sstable_format` config entry, which currently
   doesn't do anything, but in the past it used to disable
   cluster features for versions newer than the specified one.
2. There are deprecated and unused config entries for individual
   versions (`enable_sstables_mc_format`, `enable_sstables_md_format`,
   etc).
3. There is a cluster feature for each version:
   ME_SSTABLE_FORMAT, MD_SSTABLE_FORMAT, etc.
   (Currently all sstable version features have been grandfathered,
   and aren't checked by the code anymore).
4. There's an entry in `system.scylla_local` which contains the
   latest enabled sstable version. (Why? Isn't this directly derived
   from cluster features anyway)?
5. There's `sstable_manager::_format` which contains the
   sstable version to be used for new writes.
   This field is updated by `sstables_format_selector`
   based on cluster features and the `system.scylla_local` entry.

I don't see why those pieces are needed. Version selection has the
following constraints:
1. New sstables must be written with a format that supports existing
   data. For example, range tombstones with an infinite bound are only
   supported by sstables since version "mc". So if a range tombstone
   with an infinite bound exists somewhere in the dataset,
   the format chosen for new sstables has to be at least as new as "mc".
2. A new format might only be used after a corresponding cluster feature
   is enabled. (Otherwise new sstables might become unreadable if they
   are sent to another node, or if a node is downgraded).
3. The user should have a way to inhibit format ugprades if he wishes.

So far, constraint (1) has been fulfilled by never using formats older
than the newest format ever enabled on the node. (With an exception
for resharding and reshaping system tables).
Constraint (2) has been fulfilled by calling `sstable_manager::set_format`
only after the corresponsing cluster feature is enabled.
Constraint (3) has been fulfilled by the ability to inhibit cluster
features by setting `sstable_format` by some fixed value.

The main thing I don't like about this whole setup is that it doesn't
let me downgrade the preferred sstable format. After a format is
enabled, there is no way to go back to writing the old format again.
That is no good -- after I make some performance-sensitive changes
in a new format, it might turn out to be a pessimization for the
particular workload, and I want to be able to go back.

This patch aims to give a way to downgrade formats without violating
the constraints. What it does is:
1. The entry in `system.scylla_local` becomes obsolete.
   After the patch we no longer update or read it.
   As far as I understand, the purpose of this entry is to prevent
   unwanted format downgrades (which is something cluster features
   are designed for) and it's updated if and only if relevant
   cluster features are updated. So there's no reason to have it,
   we can just directly use cluster features.
2. `sstable_format_selector` gets deleted.
   Without the `system.scylla_local` around, it's just a glorified
   feature listener.
3. The format selection logic is moved into `sstable_manager`.
   It already sees the `db::config` and the `gms::feature_service`.
   For the foreseeable future, the knowledge of enabled cluster features
   and current config should be enough information to pick the right formats.
4. The `sstable_format` entry in `db::config` is no longer intended to
   inhibit cluster features. Instead, it is intended to select the
   format for new sstables, and it becomes live-updatable.
5. Instead of writing new sstables with "highest supported" format,
   (which used to be set by `sstables_format_selector`) we write
   them with the "preferred" format, which is determined by
   `sstable_manager` based on the combination of enabled features
   and the current value of `sstable_format`.

Closes scylladb/scylladb#26092

[avi: Pavel found the reason for the scylla_local entry -
      it predates stable storage for cluster features]
2025-09-19 16:17:56 +03:00
Pavel Emelyanov
a1ea553fe1 code: Replace distributed<> with sharded<>
The latter is recommended in seastar, and the former was left as
compatibility alias. Latest seastar explicitly marks it as deprecated so
once the submodule is updated, compilation logs will explode.

Most of the patch is generated with

    for f in $(git grep -l '\<distributed<[A-Za-z0-9:_]*>') ; do sed -e 's/\<distributed<\([A-Za-z0-9:_]*\)>/sharded<\1>/g' -i $f; done
    for f in $(git grep -l distributed.hh); do sed -e 's/distributed.hh/sharded.hh/' -i $f ; done

and a small manual change in test/perf/perf.hh

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26136
2025-09-19 12:22:51 +02:00
Ernest Zaslavsky
d624413ddd treewide: Move query related files to a new query directory
As requested in #22120, moved the files and fixed other includes and build system.

Moved files:
- query.cc
- query-request.hh
- query-result.hh
- query-result-reader.hh
- query-result-set.cc
- query-result-set.hh
- query-result-writer.hh
- query_id.hh
- query_result_merger.hh

Fixes: #22120

This is a cleanup, no need to backport

Closes scylladb/scylladb#25105
2025-09-16 23:40:47 +03:00
Botond Dénes
6116f9e11b Merge 'Compaction tasks progress' from Aleksandra Martyniuk
Determine the progress of compaction tasks that have
children.

The progress of a compaction task is calculated using the default
get_progress method. If the expected_total_workload method is
implemented, the default progress is computed as:
(sum of child task progresses) / (expected total workload)

If expected_total_workload is not defined, progress is estimated based
on children progresses. However, in this case, the total progress may
increase over time as the task executes.

All compaction tasks, except for reshape tasks, implement the
expected_children_number method. To compute expected_total_workload,
iterate over all SSTables covered by the task and sum their sizes. Note
that expected_total_workload is just an approximation and the real workload
may differ if SStables set for the keyspace/table/compaction group changes.

Reshape tasks are an exception, as their scope is determined during
execution. Hence, for these tasks expected_total_workload isn't defined
and their progress (both total and completed) is determined based
on currently created children.

Fixes: https://github.com/scylladb/scylladb/issues/8392.
Fixes: https://github.com/scylladb/scylladb/issues/6406.
Fixes: https://github.com/scylladb/scylladb/issues/7845.

New feature, no backport needed

Closes scylladb/scylladb#15158

* github.com:scylladb/scylladb:
  test: add compaction task progress test
  compaction: set progress unit for compaction tasks
  compaction: find expected workload for reshard tasks
  compaction: find expected workload for global cleanup compaction tasks
  compaction: find expected workload for global major compaction tasks
  compaction: find expected workload for keyspace compaction tasks
  compaction: find expected workload for shard compaction tasks
  compaction: find expected workload for table compaction tasks
  compaction: return empty progress when compaction_size isn't set
  compaction: update compaction_data::compaction_size at once
  tasks: do not check expected workload for done task
2025-09-03 13:23:42 +03:00
Łukasz Paszkowski
3d03b88719 database: Add critical_disk_utilization mode database can be moved to
When database operates in the critical disk utilization mode, all
mutation writes including inserts, updates, deletes, counter updates,
hints, read+repair, lwt writes) to user tables and other associated
with them tables like views, CDC log, audit are rejected, with a clear
error exception returned.

The mode is meant to be used with the disk space monitor in order
to prevent any user writes when node's disk utilization is too high.
2025-08-29 13:46:45 +02:00
Aleksandra Martyniuk
926753e8bf compaction: find expected workload for table compaction tasks
Add compaction_task_impl::get_table_task_workload that sums
the bytes in all sstables in the table.

This function is used to find the expected workload of the following
compaction types:
- major;
- cleanup;
- offstrategy;
- upgrade_sstables;
- scrub.
2025-08-28 10:41:22 +02:00
Dawid Mędrek
837d267cbf main: Log RF-rack-invalid keyspaces at startup
When the configuration option `rf_rack_valid_keyspaces` is enabled and there
is an RF-rack-invalid keyspace, starting a node fails. However, when the
configuration option is disabled, but there still is a keyspace that violates
the condition, we'd like Scylla to print a warning informing the user about
the fact. That's what happens in this commit.

We provide a validation test.
2025-08-21 19:35:33 +02:00
Avi Kivity
611918056a Merge 'repair: Add tablet incremental repair support' from Asias He
The central idea of incremental repair is to allow repair participants
to select and repair only a portion of the dataset to speed up the
repair process. All repair participants must utilize an identical
selection method to repair and synchronize the same selected dataset.
There are two primary selection methods: time-based and file-based. The
time-based method selects data within a specified time frame. It is
versatile but it is less efficient because it requires reading all of
the dataset and omitting data beyond the time frame. The file-based
method selects data from unrepaired SSTables and is more efficient
because it allows the entire SSTable to be omitted. This document patch
implements the file-based selection method.

Incremental repair will only be supported for tablet tables; it will not
be supported for vnode tables. On one hand, the legacy vnode is less
important to support. On the other hand, the incremental repair for
vnode is much harder to implement. With vnodes, a SSTalbe could contain
data for multiple vnode ranges. When a given vnode range is repaired,
only a portion of the SSTable is repaired. This complicates the
manipulation of SSTables significantly during both repair and
compaction. With tablets, an entire tablet is repaired so that a
sstable is either fully repaired or not repaired which is a huge
simplification.

This patch uses the repaired_at from sstables::statistics component to
mark a sstable as repaired. It uses a virtual clock as the repair
timestamp, i.e., using a monotonically increasing number for the
repaired_at field of a SSTable and sstables_repaired_at column in
system.tablets table. Notice that when a sstable is not repaired, the
repaired_at field will be set to the default value 0 by default. The
being_repaired in memory field of a SSTable is used to explicitly mark
that a SSTable is being selected. The following variables are used for
incremental repair:

The repaired_at on disk field of a SSTable is used.
   - A 64-bit number increases sequentially

The sstables_repaired_at is added to the system.tablets table.
   - repaired_at <= sstables_repaired_at means the sstable is repaired

The being_repaired in memory field of a SSTable is added.
   - A repair UUID tells which sstable has participated in the repair

Initial test results:

    1) Medium dataset results
    Node amount: 3
    Instance type: i4i.2xlarge
    Disk usage per node: ~500GB
    Cluster pre-populated with ~500GB of data before starting repairs job.
    Results for Repair Timings:
    The regular repair run took 210 mins.
    Incremental repair 1st run took 183 mins, 2nd and 3rd runs took around 48s
    The speedup is: 183 mins  / 48s = 228X

    2) Small dataset results
    Node amount: 3
    Instance type: i4i.2xlarge
    Disk usage per node: ~167GB
    Cluster pre-populated with ~167GB of data before starting the repairs job.
    Regular repair 1st run took 110s,  2nd and 3rd runs took 110s.
    Incremental repair 1st run took 110 seconds, 2nd and 3rd run took 1.5 seconds.
    The speedup is: 110s / 1.5s = 73X

    3) Large dataset results
    Node amount: 6
    Instance type: i4i.2xlarge, 3 racks
    50% of base load, 50% read/write
    Dataset == Sum of data on each node

    Dataset     Non-incremental repair (minutes)
    1.3 TiB     31:07
    3.5 TiB     25:10
    5.0 TiB     19:03
    6.3 TiB     31:42

    Dataset     Incremental repair (minutes)
    1.3 TiB     24:32
    3.0 TiB     13:06
    4.0 TiB     5:23
    4.8 TiB     7:14
    5.6 TiB     3:58
    6.3 TiB     7:33
    7.0 TiB     6:55

Fixes #22472

Closes scylladb/scylladb#24291

* github.com:scylladb/scylladb:
  replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair
  compaction: Move compaction_reenabler to compaction_reenabler.hh
  topology_coordinator: Make rpc::remote_verb_error to warning level
  repair: Add metrics for sstable bytes read and skipped from sstables
  test.py: Disable incremental for test_tombstone_gc_for_streaming_and_repair
  test.py: Add tests for tablet incremental repair
  repair: Add tablet incremental repair support
  compaction: Add tablet incremental repair support
  feature_service: Add TABLET_INCREMENTAL_REPAIR feature
  tablet_allocator: Add tablet_force_tablet_count_increase and decrease
  repair: Add incremental helpers
  sstable: Add being_repaired to sstable
  sstables: Add set_repaired_at to metadata_collector
  mutation_compactor: Introduce add operator to compaction_stats
  tablet: Add sstables_repaired_at to system.tablets table
  test: Fix drain api in task_manager_client.py
2025-08-19 13:13:22 +03:00
Asias He
082bc70a0a replica: Introduce get_compaction_reenablers_and_lock_holders_for_repair
It helps to hide the compaction_group_views from repair subsystem.
2025-08-18 11:01:22 +08:00
Asias He
be15972006 compaction: Move compaction_reenabler to compaction_reenabler.hh
So it can be used without bringing the whole
compaction/compaction_manager.hh.
2025-08-18 11:01:22 +08:00
Asias He
f9021777d8 compaction: Add tablet incremental repair support
This patch addes incremental_repair support in compaction.

- The sstables are split into repaired and unrepaired set.

- Repaired and unrepaired set compact sperately.

- The repaired_at from sstable and sstables_repaired_at from
  system.tablets table are used to decide if a sstable is repaired or
  not.

- Different compactions tasks, e.g., minor, major, scrub, split, are
  serialized with tablet repair.
2025-08-18 11:01:21 +08:00
Avi Kivity
66173c06a3 Merge 'Eradicate the ability to create new sstables with numerical sstable generation' from Benny Halevy
Remove support for generating numerical sstable generation for new sstables.
Loading such sstables is still supported but new sstables are always created with a uuid generation.
This is possible since:
* All live versions (since 5.4 / f014ccf369) now support uuid sstable generations.
* The `uuid_sstable_identifiers_enabled` config option (that is unused from version 2025.2 / 6da758d74c) controls only the use of uuid generations when creating new sstables. SSTables with uuid generations should still be properly loaded by older versions, even if `uuid_sstable_identifiers_enabled` is set to `false`.

Fixes #24248

* Enhancement, no backport needed

Closes scylladb/scylladb#24512

* github.com:scylladb/scylladb:
  streaming: stream_blob: use the table sstable_generation_generator
  replica: distributed_loader: process_upload_dir: use the table sstable_generation_generator
  sstables: sstable_generation_generator: stop tracking highest generation
  replica: table: get rid of update_sstables_known_generation
  sstables: sstable_directory: stop tracking highest_generation
  replica: distributed_loader: stop tracking highest_generation
  sstables: sstable_generation: get rid of uuid_identifiers bool class
  sstables_manager: drop uuid_sstable_identifiers
  feature_service: move UUID_SSTABLE_IDENTIFIERS to supported_feature_set
  test: cql_query_test: add test_sstable_load_mixed_generation_type
  test: sstable_datafile_test: move copy_directory helper to test/lib/test_utils
  test: database_test: move table_dir helper to test/lib/test_utils
2025-08-14 11:54:33 +03:00
Botond Dénes
9d00d7e08d replica/memtable_list: add tombstone_gc_state* member
To be passed down to the memtable.
2025-08-11 07:09:19 +03:00
Botond Dénes
1d3a3163a3 replica/memtable_list: s/min_live_timestamp()/get_max_purgeable()/
Also change to the return type to max_purgeable, instead of raw
timestamp. Prepares for further patching of this code.
2025-08-11 07:09:13 +03:00