Commit Graph

1743 Commits

Author SHA1 Message Date
Avi Kivity
9696ee64d0 database: fix overflow when computing data distribution over shards
We store the per-shard chunk count in a uint64_t vector
global_offset, and then convert the counts to offsets with
a prefix sum:

```c++
        // [1, 2, 3, 0] --> [0, 1, 3, 6]
        std::exclusive_scan(global_offset.begin(), global_offset.end(), global_offset.begin(), 0, std::plus());
```

However, std::exclusive_scan takes the accumulator type from the
initial value, 0, which is an int, instead of from the range being
iterated, which is of uint64_t.

As a result, the prefix sum is computed as a 32-bit integer value. If
it exceeds 0x8000'0000, it becomes negative. It is then extended to
64 bits and stored. The result is a huge 64-bit number. Later on
we try to find an sstable with this chunk and fail, crashing on
an assertion.

An example of the failure can be seen here: https://godbolt.org/z/6M8aEbo57

The fix is simple: the initial value is passed as uint64_t instead of int.

Fixes https://github.com/scylladb/scylladb/issues/27417

Closes scylladb/scylladb#27418
2025-12-04 14:10:53 +01:00
Botond Dénes
384bffb8da Merge 'compaction: limit the maximum shares allocated to a compaction scheduling class' from Raphael Raph Carvalho
This PR adds support for limiting the maximum shares allocated to a
compaction scheduling class by the compaction controller. It introduces
a new configuration parameter, compaction_max_shares, which, when set
to a non zero value, will cap the shares allocated to compaction jobs.
This PR also exposes the shares computed by the compaction controller
via metrics, for observability purposes.

Fixes https://github.com/scylladb/scylladb/issues/9431

Enhancement. No need to backport.

NOTE: Replaces PR https://github.com/scylladb/scylladb/pull/26696

Ran a test in which the backlog raised the need for max shares (normalized backlog above normalization_factor), and played with different values for new option compaction_max_shares to see it works (500, 1000, 2000, 250, 50)

Closes scylladb/scylladb#27024

* github.com:scylladb/scylladb:
  db/config: introduce new config parameter `compaction_max_shares`
  compaction_manager:config: introduce max_shares
  compaction_controller: add configurable maximum shares
  compaction_controller: introduce `set_max_shares()`
2025-11-26 06:51:30 +02:00
Lakshmi Narayanan Sreethar
853811be90 compaction_controller: introduce set_max_shares()
Add a method to dynamically adjust the maximum output of control points
in the compaction controller. This is required for supporting runtime
configuration of the maximum shares allocated to the compaction process
by the controller.

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>
2025-11-24 11:43:20 -03:00
Aleksandra Martyniuk
19a7d8e248 replica: database: change type of tables_metadata::_ks_cf_to_uuid
If there is a lot of tables, a node reports oversized allocation
in _ks_cf_to_uuid of type flat_hash_map.

Change the type to std::unordered_map to prevent oversized allocations.

Fixes: https://github.com/scylladb/scylladb/issues/26787.

Closes scylladb/scylladb#27165
2025-11-24 06:42:40 +02:00
Radosław Cybulski
d589e68642 Add precompiled headers to CMakeLists.txt
Add precompiled header support to CMakeLists.txt and configure.py -
it improves compilation time by approximately 10%.

New header `stdafx.hh` is added, don't include it manually -
the compiler will include it for you. The header contains includes from
external libraries used by Scylla - seastar, standard library,
linux headers and zlib.

The feature is enabled by default, use CMake option `Scylla_USE_PRECOMPILED_HEADER`
or configure.py --disable-precompiled-header to disable.

The feature should be disabled, when trying to check headers - otherwise
you might get false negatives on missing includes from seastar / abseil and so on.

Note: following configuration needs to be added to ccache.conf:

    sloppiness = pch_defines,time_macros,include_file_mtime,include_file_ctime

Closes scylladb/scylladb#26617
2025-11-21 12:27:41 +02:00
Botond Dénes
0cc5208f8e Merge 'Add sstables_manager::config' from Pavel Emelyanov
Currently sstables_manager keeps a reference on global db::config to configure itself. Most of other services use their own specific configs with much less data on-board for the same purposes (e.g. #24841, #19051 and #23705 did same for other services) This PR applies this approach to sstables_manager as well.

Mostly it moves various values from db::config onto newly introduced struct sstables_manager::config, but it also adds specific tracking of sstable_file_io_extensions and patches tools/scylla-sstable not to use sstables_manager as "proxy" object to get db::config from along its calls.

Shuffling components dependencies, no need to backport

Closes scylladb/scylladb#27021

* github.com:scylladb/scylladb:
  sstables_manager: Drop db::config from sstables_manager
  tools/sstable: Make shard_of_with_tablets use db::config argument
  tools/sstable: Add db::config& to all operations
  tools/sstable: Get endpoints from storage manager
  sstables_manager: Hold sstable IO extensions on it
  sstables: Manager helper to grab file io extensions
  sstables_manager: Move default format on config
  sstables_manager: Move enable_sstable_data_integrity_check on config
  sstables_manager: Move data_file_directories on config
  sstables_manager: Move components_memory_reclaim_threshold on config
  sstables_manager: Move column_index_auto_scale_threshold on config
  sstables_manager: Move column_index_size on config
  sstables_manager: Move sstable_summary_ratio on config
  sstables_manager: Move enable_sstable_key_validation on config
  sstables_manager: Move available_memory on config
  code: Introduce sstables_manager::config
  sstables: Patch get_local_directories() to work on vector of paths
  code: Rename sstables_manager::config() into db_config()
2025-11-21 10:21:41 +02:00
Raphael S. Carvalho
74ecedfb5c replica: Fail timed-out single-key read on cleaned up tablet replica
Consider the following:
1) single-key read starts, blocks on replica e.g. waiting for memory.
2) the same replica is migrated away
3) single-key read expires, coordinator abandons it, releases erm.
4) migration advances to cleanup stage, barrier doesn't wait on
   timed-out read
5) compaction group of the replica is deallocated on cleanup
6) that single-key resumes, but doesn't find sstable set (post cleanup)
7) with abort-on-internal-error turned on, node crashes

It's fine for abandoned (= timed out) reads to fail, since the
coordinator is gone.
For active reads (non timed out), the barrier will wait for them
since their coordinator holds erm.
This solution consists of failing reads which underlying tablet
replica has been cleaned up, by just converting internal error
to plain exception.

Fixes #26229.

Signed-off-by: Raphael S. Carvalho <raphaelsc@scylladb.com>

Closes scylladb/scylladb#27078
2025-11-20 11:44:03 +02:00
Botond Dénes
6ee0f1f3a7 Merge 'replica/table: add a metric for hypothetical total file size without compression' from Michał Chojnowski
This patch adds a metric for pre-compression size of sstable files.

This patch adds a per-table metric
`scylla_column_family_total_disk_space_before_compression`,
which measures the hypothetical total size of sstables on disk,
if Data.db was replaced with an uncompressed equivalent.

As for the implementation:
Before the patch, tables and sstable sets are already tracking their total physical file size.
Whenever sstables are added or removed, the size delta is propagated from the sstable up through sstable sets into table_stats.
To implement the new metric, we turn the size delta that is getting passed around from a one-dimensional to a two-dimensional value, which includes both the physical and the pre-compression size.

New functionality, no backport needed.

Closes scylladb/scylladb#26996

* github.com:scylladb/scylladb:
  replica/table: add a metric for hypothetical total file size without compression
  replica/table: keep track of total pre-compression file size
2025-11-20 09:10:38 +02:00
Pavel Emelyanov
9cb776dee8 sstables_manager: Drop db::config from sstables_manager
Now it has all it needs via its own specific config.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 19:31:50 +03:00
Pavel Emelyanov
675eb3be98 sstables_manager: Hold sstable IO extensions on it
Currently manager holds a reference on db::config and when sstables IO
extensions are needed it grabs them from this config. Since db::config
is going to be removed from sstables manager, it should either keep
track of all config extensions, or only those that it needs. This patch
makes the latter choice and keeps reference to sstable_file_io_ext. on
manager. The reference is passed as constructor argument, not via
manager config, but it's a random choice, no specific reason why not
putting it on config itself.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 19:31:50 +03:00
Pavel Emelyanov
9868341c73 sstables_manager: Move default format on config
It's explicitly `me` type by default, but places that can write sstables
override it with db::config value: replica::database, tests and scylla
sstable tool.

Live-updateable, so use updateable_value<> type.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 19:31:50 +03:00
Pavel Emelyanov
e6dee8aab5 sstables_manager: Move enable_sstable_data_integrity_check on config
Set its default value to the one from db/config.cc. Only
replica::database may want to re-configure it. Also not live-updateable.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 19:31:50 +03:00
Pavel Emelyanov
78ab31118e sstables_manager: Move data_file_directories on config
Make it a reference, so all the code that configures it is updated to
provide the target.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 19:31:50 +03:00
Pavel Emelyanov
cb1679d299 sstables_manager: Move components_memory_reclaim_threshold on config
Set its default value to the one from db/config.cc. Only the
replica::database and tests may want to re-configure it.

This one is live-updateable, so use updateable_value<> type.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 19:31:42 +03:00
Botond Dénes
8579e20bd1 Merge 'Enable digest+checksum verification for streaming/repair' from Taras Veretilnyk
This PR enables integrity check of both checksum and digest for repair/streaming.
In the past, streaming readers only verified the checksum of compressed SSTables.

This change extends the checks to include the digest and the checksum (CRC) for both compressed and uncompressed SSTables. These additional checks require reading the digest and CRC components from disk, which may cause some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data, while for compressed SSTables - where checksums are already verified inline - the only extra cost is reading and verifying the digest.If the reader range doesn't cover the full SSTable, the digest is not loaded and check is skipped.

To support testing of these changes, a new option was added to the random_mutation_generator that allows disabling compression.
Several new test cases were added to verify that the repair_reader correctly detects corruption. These tests corrupt digest or data component of an SSTable and confirm that the system throws the expected `malformed_sstable_exception`.

Backport is not required, it is an improvement

Refs #21776

Closes scylladb/scylladb#26444

* github.com:scylladb/scylladb:
  boost/repair_test: add repair reader integrity verification test cases
  test/lib: allow to disable compression in random_mutation_generator
  sstables: Skip checksum and digest reads for unlinked SSTables
  table: enable integrity checks for streaming reader
  table: Add integrity option to table::make_sstable_reader()
  sstables: Add integrity option to create_single_key_sstable_reader
2025-11-14 18:00:33 +02:00
Pavel Emelyanov
604e5b6727 sstables_manager: Move column_index_auto_scale_threshold on config
Set its default value to the one from db/config.cc. Only the
replica::database may want to re-configure it.

This one is live-updateable, so use updateable_value<> type.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 14:30:49 +03:00
Pavel Emelyanov
8f9f92728e sstables_manager: Move column_index_size on config
Set its default value to the one from db/config.cc. Only
replica::database may want to re-configure it. Also not live-updateable.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 14:30:28 +03:00
Pavel Emelyanov
88bb203c9c sstables_manager: Move sstable_summary_ratio on config
Set its default value to the one from db/config.cc. Only
replica::database may want to re-configure it. Also not live-updateable.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 14:29:34 +03:00
Pavel Emelyanov
1f6918be3f sstables_manager: Move enable_sstable_key_validation on config
Make it OFF by default and update only those callers, that may have it
ON -- the replica::database, tests and scylla-sstable tool.

Also not live-updateable, so plain bool.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 14:28:14 +03:00
Pavel Emelyanov
79d0f93693 sstables_manager: Move available_memory on config
Currently, this parameter is passed to sstables_manager as explicit
constructor argument.

Also, it's not live-updateable, so a plain size_t type for it.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 14:27:14 +03:00
Pavel Emelyanov
218916e7c2 code: Introduce sstables_manager::config
This is specific configuration for sstables_manager. All places that
construct sstables manager are updated to provide config to it. For now
the config is empty and exists alongside with db::config. Further
patches will populate the former config with data and the latter config
will be eventually removed.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-11-14 14:25:18 +03:00
Michał Chojnowski
346e0f64e2 replica/table: add a metric for hypothetical total file size without compression
This patch adds a per-table metric
`scylla_column_family_total_disk_space_before_compression`,
which measures the hypothetical total size of sstables on disk,
if Data.db was replaced with an uncompressed equivalent.
2025-11-13 11:28:19 +01:00
Michał Chojnowski
1cfce430f1 replica/table: keep track of total pre-compression file size
Every table and sstable set keeps track of the total file size
of contained sstables.

Due to a feature request, we also want to keep track of the hypothetical
file size if Data files were uncompressed, to add a metric that
shows the compression ratio of sstables.

We achieve this by replacing the relevant `uint_64 bytes_on_disk`
counters everywhere with a struct that contains both the actual
(post-compression) size and the hypothetical pre-compression size.

This patch isn't supposed to change any observable behavior.
In the next patch, we will use these changes to add a new metric.
2025-11-13 00:49:57 +01:00
Michael Litvak
de321218bc storage_proxy: apply counter mutation on all write shards
When applying a counter mutation, use apply_on_shards to apply the
mutation on all write shards, similarly to the way other mutations are
applied in the storage proxy. Previously the mutation was applied only
on the current shard which is the read shard.

This is needed to respect the write_both stages of intranode migration
where we need to apply the mutation on both the old and the new shards.
2025-11-03 16:03:29 +01:00
Michael Litvak
c7e7a9e120 storage_proxy: move counter update coordination to storage proxy
Refactor the counter update to split the functions and have them called
by the storage proxy to prepare for a later change.

Previously in mutate_counter the storage proxy calls the replica
function apply_counter_update that does a few things:
1. checks that the operation can be done: check timeout, disk utilization
2. acquire counter locks
3. do read-modify-write and transform the counter mutation
4. apply the mutation in the replica

In this commit we change it so that these functions are split and called
from the storage proxy, so that we have better control from the storage
proxy when we change it later to work across multiple shards. For
example, we will want to acquire locks on multiple shards, transform it
on one shard, and then apply the mutation on multiple shards.

After the change it works as follows in storage proxy:
1. acquire counter locks
2. call replica prepare to check the operation and transform the mutation
3. call replica apply to apply the transformed mutation
2025-11-03 15:59:46 +01:00
Michael Litvak
7cc6b0d960 replica/db: add counter update guard
Add a RAII guard for counter update that holds the counter locks and the
table operation, and extract the creation of the guard to a separate
function.

This prepares it for a later change where we will want to obtain the
guard externally from the storage proxy.
2025-11-03 08:43:11 +01:00
Michael Litvak
88fd9a34c4 replica/db: split counter update helper functions
Split do_apply_counter_update to a few smaller and simpler functions to
help prepare for a later change.
2025-11-03 08:43:11 +01:00
Lakshmi Narayanan Sreethar
3eb7193458 backlog_controller: compute backlog even when static shares are set
The compaction manager backlog is exposed via metrics, but if static
shares are set, the backlog is never calculated. As a result, there is
no way to determine the backlog and if the static shares need
adjustment. Fix that by calculating backlog even when static shares are
set.

Fixes #26287

Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>

Closes scylladb/scylladb#26778
2025-10-31 18:18:36 +02:00
Tomasz Grabiec
1c0d847281 Merge 'load_balancer: load_stats reconcile after tablet migration and table resize' from Ferenc Szili
This change adds the ability to move tablets sizes in load_stats after a tablet migration or table resize (split/merge). This is needed because the size based load balancer needs to have tablet size data which is as accurate as possible, in order to work on fresh tablet size distribution and issue correct tablet migrations.

This is the second part of the size based load balancing changes:

- First part for tablet size collection via load_stats: #26035
- Second part reconcile load_stats: #26152
- The third part for load_sketch changes: #26153
- The fourth part which performs tablet load balancing based on tablet size: #26254

This is a new feature and backport is not needed.

Closes scylladb/scylladb#26152

* github.com:scylladb/scylladb:
  load_balancer: load_stats reconcile after tablet migration and table resize
  load_stats: change data structure which contains tablet sizes
2025-10-31 09:58:25 +01:00
Taras Veretilnyk
e62ebdb967 table: enable integrity checks for streaming reader
Previously, streaming readers only verified the checksum of compressed SSTables.
This patch extends checks to also include the digest and the uncompressed checksum (CRC).

These additional checks require reading the digest and CRC components from disk,
which may cause some I/O overhead. For uncompressed SSTables, this involves loading and computing checksums and digest from the data,
while for compressed SSTables - where checksums are already verified inline - the only extra cost is reading and verifying the digest.
If the reader range doesn't cover the full SSTable, the digest check is skipped.
2025-10-28 19:27:35 +01:00
Taras Veretilnyk
06e1b47ec6 table: Add integrity option to table::make_sstable_reader() 2025-10-28 19:27:35 +01:00
Taras Veretilnyk
deb8e32e86 sstables: Add integrity option to create_single_key_sstable_reader
Added an sstables::integrity_check parameter to create_single_key_sstable_reader methods across its implementations.
This allows callers to enable SSTable integrity checks during single-key reads.
2025-10-28 19:27:35 +01:00
Avi Kivity
d81796cae3 Merge 'Limit concurrent view updates from all sources' from Wojciech Mitros
Before this patch, when a base table has many materialized views,
each write to this table can start up to 128 view updates in parallel.
With high client write concurrency, the actual concurrency of writes
executed on the node may grow unexpectedly, which can lead to higher
latency and higher memory usage compared to a sequential approach.
In this patch we add a per-shard, per-service-level semaphore which
limits the number of concurrent view updates processed on the shard
in this service level to a constant value. We take one unit from the
semaphore for each local view update write, and releasing it when it
finishes. The remote view updates do not take units from the semaphore
because they don't consume nearly as much processing power and they
are limited by another semaphore based on their memory usage.

Fixes https://github.com/scylladb/scylladb/issues/25341

Closes scylladb/scylladb#25456

* github.com:scylladb/scylladb:
  mv: limit concurrent view updates from all sources
  database: rename _view_update_concurrency_sem to _view_update_memory_sem
2025-10-28 11:13:24 +02:00
Wojciech Mitros
f07a86d16e mv: limit concurrent view updates from all sources
Before this patch, when a base table has many materialized views,
each write to this table can start up to 128 view updates in parallel.
With high client write concurrency, the actual concurrency of writes
executed on the node may grow unexpectedly, which can lead to higher
latency and higher memory usage compared to a sequential approach.
In this patch we add a per-shard, per-service-level semaphore which
limits the number of concurrent view updates processed on the shard
in this service level to a constant value. We take one unit from the
semaphore for each local view update write, and releasing it when it
finishes. The remote view updates do not take units from the semaphore
because they don't consume nearly as much processing power and they
are limited by another semaphore based on their memory usage.

The effect of this patch can also be observed when writing to a base
table with a large number of materialized views, like in the
materialized_views_test.py::TestMaterializedViews::test_many_mv_concurrent
dtest. In that test, if we perform a full scan in parallel to a write
workload with a concurrency of 100 to a table with 100 views, the scan
would sometimes timeout because it would effectively get 1/10000 of cpu.
With this patch, the cpu concurrency of view updates was limited to 128
(we ran both writes and scan in the same service level), and the scan
no longer timed out.

Fixes https://github.com/scylladb/scylladb/issues/25341
2025-10-27 18:55:41 +01:00
Patryk Jędrzejczak
e1c3f666c9 Merge 'vnode cleanup: add missing barriers and fix race conditions' from Petr Gusev
Problems addressed by this PR

* Missing barrier before cleanup: If a node was bootstrapped before cleanup, some request coordinators could still be in `write_both_read_new` and send stale requests to replicas being cleaned up.
* Sessions not drained before cleanup: We lacked protection against stale streaming or repair operations.
* `sstable_vnodes_cleanup_fiber()` calling `flush_all_tables()` under group0 lock: This caused SCT test failures (see [this comment](https://github.com/scylladb/scylladb/issues/25333#issuecomment-3298859046) for details).
* Issues with `storage_proxy::start_write()` used by `sstable_vnodes_cleanup_fiber`:
  * The result of `start_write()` was not held during `abstract_write_response_handler::apply_locally`, so coordinator-local writes were not properly awaited.
  * Synchronization was racy — `start_write()` was not atomic with the fence check, allowing stale writes to sneak in if `fence_version` changed in between.
  * It waited for all writes, including local tables and tablet-based tables, which is redundant because `sstable_vnodes_cleanup_fiber` does not apply to them.
  * It also waited for writes with versions greater than the current `fence_version`, which is unnecessary.

Fixes scylladb/scylladb#26150

backport: this PR fixes several issues with the vnodes cleanup procedure, but it doesn't seem they are critical enough to deserve backporting

Closes scylladb/scylladb#26315

* https://github.com/scylladb/scylladb:
  test_automatic_cleanup: add test_cleanup_waits_for_stale_writes
  test_fencing: fix due to new version increment
  test_automatic_cleanup: clean it up
  storage_proxy: wait for closing sessions in sstable cleanup fiber
  storage_proxy: rename await_pending_writes -> await_stale_pending_writes
  storage_proxy: use run_fenceable_write
  storage_proxy: abstract_write_response_handler: apply_locally: extract post fence check
  storage_proxy: introduce run_fenceable_write
  storage_proxy: move update_fence_version from shared_token_metadata
  storage_proxy: fix start_write() operation scope in apply_locally
  storage_proxy: move post fence check into handle_write
  storage_proxy: move fencing into mutate_counter_on_leader_and_replicate
  storage_proxy::handle_read: add fence check before get_schema
  storage_service: rebrand cleanup_fiber to vnodes_cleanup_fiber
  sstable_cleanup_fiber: use coroutine::parallel_for_each
  storage_service: sstable_cleanup_fiber: move flush_all_tables out of the group0 lock
  topology_coordinator: barrier before cleanup
  topology_coordinator: small start_cleanup refactoring
  global_token_metadata_barrier: add fenced flag
2025-10-27 12:35:13 +01:00
Avi Kivity
b843d8bc8b Merge 'scylla-sstable: add cql support to write operation' from Botond Dénes
In theory, scylla-sstable write is an awesome and flexible tool to generate sstables with arbitrary content. This is convenient for tests and could come clutch in a disaster scenario, where certain system table's content need to be manually re-created, system tables that are not writable directly via CQL.
In practice, in its current form this operation is so convoluted to use that even its own author shuns it. This is because the JSON specification of the sstable content is the same as that of the scylla-sstable dump-data: containing every single piece of information on the mutation content. Where this is an advantage for dump-data, allowing users to inspect the data in its entirety -- it is a huge disadvantage for write, because of all these details have to be filled in, down to the last timestamp, to generate an sstable. On top of that, the tool doesn't even support any of the more advanced data types, like collections, UDF and counters.
This PR proposes a new way of generating sstables: based on the success of scylla-sstable query, it introduces CQL support for scylla-sstable write. The content of the sstable can now be specified via standard INSERT, UPDATE and DELETE statements, which are applied to a memtable, then flushed into the sstable.
To avoid boundless memory consumption, the memtable is flushed every time it reaches 1MiB in size, consequently the command can generate multiple output sstables.

The new CQL input-format is made default, this is safe as nobody is using this command anyway. Hopefully this PR will change that.

Fixes: https://github.com/scylladb/scylladb/issues/26506

New feature, no backport.

Closes scylladb/scylladb#26515

* github.com:scylladb/scylladb:
  test/cqlpy/test_tools.py: add test for scylla-sstable write --input-format=cql
  replica/mutation_dump: add support for virtual tables
  tools/scylla-sstable: print_query_results_json(): handle empty value buffer
  tools/scylla-sstable: add cql support to write operation
  tools/scylla-sstable: write_operation(): fix indentation
  tools/scylla-sstable: write_operation(): prepare for a new input-format
  tools/scylla-sstable: generalize query_operation_validate_query()
  tools/scylla-sstable: move query_operation_validate_query()
  tools/scylla-sstable: extract schema transformation from query operation
  replica/table: add virtual write hook to the other apply() overload too
2025-10-24 23:32:40 +03:00
Avi Kivity
997b52440e Merge 'replica/mutation_dump: include empty/dead partitions in the scan results' from Botond Dénes
`select * from mutation_fragment()` queries don't return partitions which are completely empty or only contain tombstones which are all garbage collectible. This is because the underlying `mutation_dump` mechanism has a separate query to discover partitions for scans. This query is a regular mutation scan, which is subject to query compaction and garbage collection. Disable the query compaction for mutation queries executed on behalf of mutation fragment queries, so *all* data is visible in the result, even that which is fully garbage collectible.

Fixes scylladb/scylladb#23707.

Scans for mutation-fragment are very rare, so a backport is not necessary. We can backport on-demand.

Closes scylladb/scylladb#26227

* github.com:scylladb/scylladb:
  replica/mutation_dump: multi_range_partition_generator: disable garbage-collection
  replica: add tombstone_gc_enabled parameter to mutation query methods
  mutation/mutation_compactor: remove _can_gc member
  tombstone_gc: add tombstone_gc_state factory methods for gc_all and no_gc
2025-10-24 23:26:16 +03:00
Ferenc Szili
b4ca12b39a load_stats: change data structure which contains tablet sizes
This patch changes the tablet size map in load_stats. Previously, this
data structure was:

std::unordered_map<range_based_tablet_id, uint64_t> tablet_sizes;

and is changed into:

std::unordered_map<table_id, std::unordered_map<dht::token_range, uint64_t>> tablet_sizes;

This allows for improved performance of tablet tablet size reconciliation.
2025-10-24 14:37:00 +02:00
Wojciech Mitros
c0d0f8f85b database: rename _view_update_concurrency_sem to _view_update_memory_sem
In the following commit, we'll introduce a new semaphore for view updates
that limits their concurrency by view update count. To avoid confusion,
we rename the existing semaphore that tracks the memory used by concurrent
view updates and related objects accordingly.
2025-10-23 10:00:15 +02:00
Petr Gusev
22271b9fe7 test_automatic_cleanup: add test_cleanup_waits_for_stale_writes 2025-10-22 16:31:43 +02:00
Łukasz Paszkowski
7ec369b900 database: Log message after critical_disk_utilization mode is set
This is a follow-up of the previous fix: https://github.com/scylladb/scylladb/pull/26030

The test test_user_writes_rejection starts a 3-node cluster and
creates a large file on one of the nodes, to trigger the out-of-space
prevention mechanism, which should reject writes on that node.

It waits for the log message 'Setting critical disk utilization mode: true'
and then executes a write expecting the node to reject it.

Currently, the message is logged before the `_critical_disk_utilization`
variable is actually updated. This causes the test to fail sporadically
if it runs quickly enough.

The fix splits the logging into two steps:
1. "Asked to set critical disk utilization mode" - logged before any action
2) "Set critical disk utilization mode" - logged after `_critical_disk_utilization` has been updated

The tests are updated to wait for the second message.

Fixes https://github.com/scylladb/scylladb/issues/26004

Closes scylladb/scylladb#26392
2025-10-20 13:24:10 +03:00
Pavel Emelyanov
44ed3bbb7c Merge 'RFC: Initial GCP storage backend for scylla (sstables + backup)' from Calle Wilund
Integrates GCP object storage as a working storage backend for scylla sstables as well as backup storage.

Adds an abstraction layer (atm very heavily designed around the s3 client interface and usage) to allow the "storage" etc layers of sstable management to pick transparently between "s3" and "gs" providers.

This modifies the scylla config such that endpoints can optionally (through a "type" param) ref a GS backend.
Similarly with storage_options.

Also adds some IO wrapping primitives to make it more feasible to place some logic at a mid level of the implementation stack (such as making networked storage files, ranged reading etc).

Test s3 fixture is replaced (where appropriate) with an `object_storage` fixture that multiplexes the test across both backends.
Unit tests are duplicated and for the GS versions use a boost test fixture for GCS, default local fake.

Fixes #25359
Fixes #26453

Closes scylladb/scylladb#26186

* github.com:scylladb/scylladb:
  docs::dev::object_storage: Add some initial info on GS storage
  docs/dev: Add mention of (nested) docker usage in testing.md
  sstables::object_storage_client: Forward memory limit semaphore to GS instance
  utils::gcp::object_storage: Add optional memory limits to up/download
  sstables::object_storage_client: Add multi-upload support for GS
  utils::gcp::storage: Add merge objects operation
  test_backup/test_basic: Make tests multiplex both s3 and gs backends
  test::cluster::conftest: Add support for multiple object storage backends
  boost::gcs_storage_test: reindent
  boost::gcs_storage_test: Convert to use fixture
  tests::boost: Add GS object storage cases to mirror S3 ones
  tests::lib::gcs_fixture: Add a reusable test fixture for real/fake GS/GCS
  tests::lib::test_utils: Add overloads/helpers for reading and (temp) writing env
  sstables::object_storage_client: Add google storage implementation
  test_services: Allow testing with GS object storage parameters
  utils::gcp::gcp_credentials: Add option to create uninitialized credentials
  utils::gcp::object_storage: Make create_download_source return seekable_data_source
  utils::gcp::object_storage: Add defensive copies of string_view params
  utils::gcp::object_storage: Add missing retry backoff increate
  utils::gcp::object_storage: Add timestamp to object listing
  utils::gcp::object_storage: Add paging support to list_objects
  object_storage_client: Add object_name wrapper type
  utils::gcp::object_storage: Add optional abort_source
  utils::rest::client: Add abort_source support
  sstables: Use object_storage_client for remote storage
  sstables::object_storage_client: Add abstraction layer for OS cliens (s3 initial)
  s3::upload_progress: Promote to general util type
  storage_options: Abstract s3 to "object_storage" and add gs as option
  sstables::file_io_extension: Change "creator" callback to just data_source
  utils::io-wrappers: Add ranged data_source
  utils::io-wrappers: Add file wrapper type for seekable_source
  utils::seekable_source: Add a seekable IO source type
  object_storage_endpoint_param: Add gs storage as option
  config: break out object_storage_endpoint_param preparing for multi storage
2025-10-20 13:14:53 +03:00
Piotr Dulikowski
70b0cfb13e Merge 'test: cluster: Replica exceptions tests' from Dario Mirovic
This patch series introduces several tests that check number of exceptions that happens during various replica operations. The goal is to have a set of tests that can catch situations where number of exceptions per operation increases. It makes exception throw regressions easier to catch.

The tests cover apply counter update and apply functionalities in the database layer.

There are more paths that can be checked, like various semaphore wait timeouts located deeper in the code. This set of tests does not cover all code paths.

Fixes #18164

This is an improvement. No backport needed.

Closes scylladb/scylladb#25992

* github.com:scylladb/scylladb:
  test: cluster: test replica write timeout
  database: parameterize apply_counter_update_delay_5s injector value
  test: cluster: test replica exceptions - test rate limit exceptions
2025-10-20 10:03:31 +03:00
Dario Mirovic
1d93f342f9 test: cluster: test replica write timeout
This patch introduces test `test_replica_database_apply_timeout`.
It tests timeout on database write. The test uses error injection
that returns timeout error if the injection `database_apply_force_timeout`
is enabled.

Refs #18164
2025-10-17 11:52:11 +02:00
Dario Mirovic
ff88fe2d76 database: parameterize apply_counter_update_delay_5s injector value
Parameterize `apply_counter_update_delay_5s` injector value. Instead of
sleeping 5s when the injection is active, read parameter value that
specifies sleep duration. To reflect these changes, it is renamed to
`apply_counter_update_delay_ms` and the sleep duration is specified in
milliseconds.

Refs #18164
2025-10-17 11:52:10 +02:00
Tomasz Grabiec
c4a87453a2 Merge 'Add experimental feature flag for strongly consistent tables and extend kesypace creation syntax to allow specifying consistency mode.' from Gleb Natapov
The series adds an experimental flag for strongly consistent tables  and extends "CREATE KEYSPACE" ddl with `consistency` option that allows specifying the consistency mode for the keyspace.

Closes scylladb/scylladb#26116

* github.com:scylladb/scylladb:
  schema: Allow configuring consistency setting for a keyspace
  db: experimental consistent-tablets option
2025-10-16 21:48:06 +02:00
Tomasz Grabiec
e6c427953e Merge 'schema_applier: unify handling of token_metadata during schema change' from Marcin Maliszkiewicz
This patchset improves the atomicity and clarity of schema application in
the presence of token metadata updates during schema changes. The primary
focus is to ensure that changes to tablet metadata are applied atomically
as part of the schema commit phase, rather than being replicated to all
cores afterward, which previously violated atomicity guarantees.

Key changes:

- Introduced pending_token_metadata to unify handling of new and existing metadata.
- Split token metadata replication into prepare and commit steps.
- Abstracted schema dependencies in storage_service to support pending schema visibility.
- Applied tablet metadata updates atomically within schema commit phase.

Backport: no, it's a new feature
Fixes: https://github.com/scylladb/scylladb/issues/24414

Closes scylladb/scylladb#25302

* github.com:scylladb/scylladb:
  db: schema_applier: update tablet metadata atomically
  db: replica: move tables_metadata locking to commit
  storage_service: abstract schema dependecies during token metadata update
  storage_service: split replicate_to_all_cores to steps
  db: schema_applier: unify token_metadata loading
  replica: schema_applier: obtain copy of token_metadata at the beginning of schema merge
  service: fix dependencies during migration_manager startup
  db: schema_applier: move pending_token_metadata to locator
  db: always use _tablet_hint as condition for tablet metadata change
  db: refactor new_token_metadata into pending_token_metadata
  db: rename new_token_metadata to pending_token_metadata
  db: schema_applier: move types storage init to merge_types func
  db: schema_applier: make merge functions non-static members
  db: remove unused proxy from create_keyspace_metadata
2025-10-16 21:43:49 +02:00
Gleb Natapov
c255740989 schema: Allow configuring consistency setting for a keyspace
We want to add strongly consistent tables as an option. We will have
two kind of strongly consistent tables: globally consistent and locally
consistent. The former means that requests from all DCs will be globally
linearisable while the later - only requests to the same DCs will be
linearisable.  To allow configuring all the possibilities the patch
adds new parameter to a keyspace definition "consistency" that can be
configured to be `eventual`, `global` or `local`. Non eventual setting
is supported for tablets enabled keyspaces only. Since we want to start
with implementing local consistency configuring global consistency will
result in an error for now.
2025-10-16 13:34:49 +03:00
Marcin Maliszkiewicz
e5fffa158f db: replica: move tables_metadata locking to commit
This keeps the locking scope minimal, and since
unlocking is done in commit(), locking fits here as well.
2025-10-16 10:56:10 +02:00
Botond Dénes
5d70450917 replica/mutation_dump: multi_range_partition_generator: disable garbage-collection
Make use of the freshly introduced facility to disable
garbage-collection on a per-query basis for range scans. This is needed
so partitions that only contain garbage-collectible data are not missing
from the partition-list. When using SELECT * FROM MUTATION_FRAGMENTS(),
the user is expecting to see *all* data, even that which is dead and
garbage-collectible.

Include a test which reproduces the issue.
2025-10-16 10:40:28 +03:00