Commit Graph

4596 Commits

Author SHA1 Message Date
Botond Dénes
2ca66133a4 Revert "db/config: don't use RBNO for scaling"
This reverts commit 43738298be.

This commit causes instability in dtests. Several non-gating dtests
started failing, as well as some gating ones, see #27047.

Closes scylladb/scylladb#27067

Fixes #27047
2025-11-18 08:17:17 +02:00
Botond Dénes
514c1fc719 Merge 'db: batchlog_manager: update _last_replay only if all batches were re…' from Aleksandra Martyniuk
…played

Currently, if flushing hints falls within the repair cache timeout, then the flush_time is set to batchlog_manager::_last_replay. _last_replay is updated on each replay, even if some batches weren't replayed. Due to that, we risk the data resurrection.

Update _last_replay only if all batches were replayed.

Fixes: https://github.com/scylladb/scylladb/issues/24415.

Needs backport to all live versions.

Closes scylladb/scylladb#26793

* github.com:scylladb/scylladb:
  test: extend test_batchlog_replay_failure_during_repair
  db: batchlog_manager: update _last_replay only if all batches were replayed
2025-11-18 08:17:16 +02:00
Piotr Dulikowski
f0039381d2 Merge 'db/view/view_building_worker: support staging sstables intra-node migration and tablet merge' from Michał Jadwiszczak
This PR fixes staging stables handling by view building coordinator in case of intra-node tablet migration or tablet merge.

To support tablet merge, the worker stores the sstables grouped only be `table_id`, instead of `(table_id, last_token)` pair.
There shouldn't be that many staging sstables, so selecting relevant for each `process_staging` task is fine.
For the intra-node migration support, the patch adds methods to load migrated sstables on the destination shard and to cleanup them on source shard.

The patch should be backported to 2025.4

Fixes https://github.com/scylladb/scylladb/issues/26244

Closes scylladb/scylladb#26454

* github.com:scylladb/scylladb:
  service/storage_service: migrate staging sstables in view building worker during intra-node migration
  db/view/view_building_worker: support sstables intra-node migration
  db/view_building_worker: fix indent
  db/view/view_building_worker: don't organize staging sstables by last token
2025-11-17 08:53:19 +01:00
Aleksandra Martyniuk
e3dcb7e827 test: extend test_batchlog_replay_failure_during_repair
Modify test_batchlog_replay_failure_during_repair to also check
that there isn't data resurrection if flushing hints falls within
the repair cache timeout.
2025-11-14 14:18:07 +01:00
Patryk Jędrzejczak
1141342c4f Merge 'topology: refactor excluded nodes' from Petr Gusev
This PR refactors excluded nodes handling for tablets and topology. For tablets a dedicated variable `topology::excluded_tablet_nodes` is introduced, for topology operations a method get_excluded_nodes() is inlined into topology_coordinator and renamed to `get_excluded_nodes_for_topology_request`.

The PR improves codes readability and efficiency, no behavior changes.

backport: this is a refactoring/optimization, no need to backport

Closes scylladb/scylladb#26907

* https://github.com/scylladb/scylladb:
  topology_coordinator: drop unused exec_global_command overload
  topology_coordinator: rename get_excluded_nodes -> get_excluded_nodes_for_topology_request
  topology_state_machine: inline get_excluded_nodes
  messaging_service: simplify and optimize ban_host
  storage_service: topology_state_load: extract topology variable
  topology_coordinator: excluded_tablet_nodes -> ignored_nodes
  topology_state_machine: add excluded_tablet_nodes field
2025-11-14 11:52:00 +01:00
Botond Dénes
43738298be db/config: don't use RBNO for scaling
Remove bootstrap and decomission from allowed_repair_based_node_ops.
Using RBNO over streaming for these operations has no benefits, as they
are not exposed to the out-of-date replica problem that replace,
removenode and rebuild are.
On top of that, RBNO is known to have problems with empty user tables.
Using streaming for boostrap and decomission is safe and faster
than RBNO in all condition, especially when the table is small.

One test needs adjustment as it relies on RBNO being used for all node
ops.

Fixes: #24664

Closes scylladb/scylladb#26330
2025-11-14 13:03:50 +03:00
Piotr Dulikowski
43506e5f28 Merge 'db/view: Add backoff when RPC fails' from Dawid Mędrek
The view building coordinator manages the process by sending RPC
requests to all nodes in the cluster, instructing them what to do.
If processing that message fails, the coordinator decides if it
wants to retry it or (temporarily) abandon the work.

An example of the latter scenario could be if one of the target nodes
dies and any attempts to communicate with it would fail.

Unfortunately, the current approach to it is not perfect and may result
in a storm of warnings, effectively clogging the logs. As an example,
take a look at scylladb/scylladb#26686: the gossiper failed to mark
one of the dead nodes as DOWN fast enough, and it resulted in a warning storm.

To prevent situations like that, we implement a form of backoff.
If processing an RPC message fails, we postpone finishing the task for
a second. That should reduce the number of messages in the logs and avoid
retries that are likely to fail as well.

We provide a reproducer test.

Fixes scylladb/scylladb#26686

Backport: impact on the user. We should backport it to 2025.4.

Closes scylladb/scylladb#26729

* github.com:scylladb/scylladb:
  tet/cluster/mv: Clean up test_backoff_when_node_fails_task_rpc
  db/view/view_building_coordinator: Rate limit logging failed RPC
  db/view: Add backoff when RPC fails
2025-11-14 10:17:57 +01:00
Dawid Mędrek
acd9120181 db/view/view_building_coordinator: Rate limit logging failed RPC
The view building coordinator sends tasks in form of RPC messages
to other nodes in the cluster. If processing that RPC fails, the
coordinator logs the error.

However, since tasks are per replica (so per shard), it may happen
that we end up with a large number of similar messages, e.g. if the
target node has died, because every shard will fail to process its
RPC message. It might become even worse in the case of a network
partition.

To mitigate that, we rate limit the logging by 1 seconds.

We extend the test `test_backoff_when_node_fails_task_rpc` so that
it allows the view building coordinator to have multiple tablet
replica targets. If not for rate limiting the warning messages,
we should start getting more of them, potentially leading to
a test failure.
2025-11-13 17:57:23 +01:00
Dawid Mędrek
4a5b1ab40a db/view: Add backoff when RPC fails
The view building coordinator manages the process of view building
by sending RPC requests to all nodes in the cluster, instructing them
what to do. If processing that message fails, the coordinator decides
if it wants to retry it or (temporarily) abandon the work.

An example of the latter scenario could be if one of the target nodes
dies and any attempts to communicate with it would fail.

Unfortunately, the current approach to it is not perfect and may result
in a storm of warnings, effectively clogging the logs. As an example,
take a look at scylladb/scylladb#26686: the gossiper failed to mark
one of the dead nodes as DOWN fast enough, and it resulted in a warning storm.

To prevent situations like that, we implement a form of backoff.
If processing an RPC message fails, we postpone finishing the task for
a second. That should reduce the number of messages in the logs and avoid
retries that are likely to fail as well.

We provide a reproducer test: it fails before this commit and succeeds
with it.

Fixes scylladb/scylladb#26686
2025-11-13 17:55:41 +01:00
Aleksandra Martyniuk
4d0de1126f db: batchlog_manager: update _last_replay only if all batches were replayed
Currently, if flushing hints falls within the repair cache timeout,
then the flush_time is set to batchlog_manager::_last_replay.
_last_replay is updated on each replay, even if some batches weren't
replayed. Due to that, we risk the data resurrection.

Update _last_replay only if all batches were replayed.

Fixes: https://github.com/scylladb/scylladb/issues/24415.
2025-11-13 10:40:19 +01:00
Piotr Dulikowski
2e5eb92f21 Merge 'cdc: use CDC schema that is compatible with the base schema' from Michael Litvak
When generating CDC log mutations for some base mutation, use a CDC schema that is compatible with the base schema.

The compatible CDC schema has for every base column a corresponding CDC column with the same name. If using a non-compatible schema, we may encounter a situation, especially during ALTER, that we have a mutation with a base column set with some value, but the CDC schema doesn't have a column by that name. This would cause the user request to fail with an error.

We add to the schema object a schema_ptr that for CDC-enabled tables points to the schema object of the CDC table that is compatible with the schema. It is set by the schema merge algorithm when creating the schema for a table that is created or altered. We use the fact that a base table and its CDC table are created and altered in the same group0 operation, and this way we can find and set the cdc schema for a base table.

When transporting the base schema as a frozen schema between shards, we transport with it the frozen cdc schema as well.

The patch starts with a series of refactoring commits that make extending the frozen schema easier and cleans up some duplication in the code about the frozen schema. We combine the two types `frozen_schema_with_base_info` and `view_schema_and_base_info` to a single type `extended_frozen_schema` that holds a frozen schema with additional data that is not part of the schema mutations but needs to be transported with it to unfreeze it - base_info, and the frozen cdc schema which is added in a later commit.

Fixes https://github.com/scylladb/scylladb/issues/26405

backport not needed - enhancement

Closes scylladb/scylladb#24960

* github.com:scylladb/scylladb:
  test: cdc: test cdc compatible schema
  cdc: use compatiable cdc schema
  db: schema_applier: create schema with pointer to CDC schema
  db: schema_applier: extract cdc tables
  schema: add pointer to CDC schema
  schema_registry: remove base_info from global_schema_ptr
  schema_registry: use extended_frozen_schema in schema load
  schema_registry: replace frozen_schema+base_info with extended_frozen_schema
  frozen_schema: extract info from schema_ptr in the constructor
  frozen_schema: rename frozen_schema_with_base_info to extended_frozen_schema
2025-11-13 10:11:54 +01:00
Petr Gusev
82da83d0e5 topology_state_machine: add excluded_tablet_nodes field
The topology_coordinator::is_excluded() creates a temporary hash
map for each call. This is probably not a performance problem since
left_nodes_rs contains only those left nodes that are referenced
from tablet replicas, this happens temporarily while e.g. a replaced
node is being rebuilt. On the other hand, why not just have a
dedicated field in the topology_state_machine, then this code wouldn't
look suspicious.
2025-11-12 12:27:43 +01:00
Michał Jadwiszczak
4bc6361766 db/view/view_building_worker: support sstables intra-node migration
We need to be able to load sstables on the target shard during
intra-node tablet migration and to cleanup migrated sstables on the
source shard.
2025-11-10 10:36:32 +01:00
Michał Jadwiszczak
c99231c4c2 db/view_building_worker: fix indent 2025-11-10 09:02:16 +01:00
Michał Jadwiszczak
2e8c096930 db/view/view_building_worker: don't organize staging sstables by last token
There was a problem with staging sstables after tablet merge.
Let's say there were 2 tablets and tablet 1 (lower last token)
had an staging sstable. Then a tablet merge occured, so there is only
one tablet now (higher last token).
But entries in `_staging_sstables`, which are grouped by last token, are
never adjusted.

Since there shouldn't be thousands of sstables, we can just hold list of
sstables per table and filter necessary entries when doing
`process_staging` view building task.
2025-11-10 09:02:16 +01:00
Nadav Har'El
25439127c8 config: make tablets_mode_for_new_keyspaces live-updatable
We have a configuration option "tablets_mode_for_new_keyspaces" which
determines whether new keyspaces should use tablets or vnodes.

For some reason, this configuration parameter was never marked live-
updatable, so in this patch we add flag. No other changes are needed -
the existing code that uses this flag always uses it through the
up-to-date configuration.

In the previous patches we start to honor tablets_mode_for_new_keyspaces
also in Alternator CreateTable, and we wanted to test this but couldn't
do this in test/alternator because the option was not live-updatable.
Now that it will be, we'll be able to test this feature in
test/alternator.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-11-09 12:52:29 +02:00
Piotr Szymaniak
af00b59930 Fix incorrect hint for tablets_mode_for_new_keyspaces 2025-11-09 10:49:46 +02:00
Wojciech Mitros
0a22ac3c9e mv: don't mark the view as built if the reader produced no partitions
When we build a materialized view we read the entire base table from start to
end to generate all required view udpates. If a view is created while another view
is being built on the same base table, this is optimized - we start generating
view udpates for the new view from the base table rows that we're currently
reading, and we read the missed initial range again after the previous view
finishes building.
The view building progress is only updated after generating view updates for
some read partitions. However, there are scenarios where we'll generate no
view updates for the entire read range. If this was not handled we could
end up in an infinite view building loop like we did in https://github.com/scylladb/scylladb/issues/17293
To handle this, we mark the view as built if the reader generated no partitions.
However, this is not always the correct conclusion. Another scenario where
the reader won't encounter any partitions is when view building is interrupted,
and then we perform a reshard. In this scenario, we set the reader for all
shards to the last unbuilt token for an existing partition before the reshard.
However, this partition may not exist on a shard after reshard, and if there
are also no partitions with higher tokens, the reader will generate no partitions
even though it hasn't finished view building.
Additionally, we already have a check that prevents infinite view building loops
without taking the partitions generated by the reader into account. At the end
of stream, before looping back to the start, we advance current_key to the end
of the built range and check for built views in that range. This handles the case
where the entire range is empty - the conditions for a built view are:
1. the "next_token" is no greater than "first_token" (the view building process
looped back, so we've built all tokens above "first_token")
2. the "current_token" is no less than "first_token" (after looping back, we've
built all tokens below "first_token")

If the range is empty, we'll pass these conditions on an empty range after advancing
"current_key" to the end because:
1. after looping back, "next_token" will be set to `dht::minimum_token`
2. "current_key" will be set to `dht::ring_position::max()`

In this patch we remove the check for partitions generated by the reader. This fixes
the issue with resharding and it does not resurrect the issue with infinite view building
that the check was introduced for.

Fixes https://github.com/scylladb/scylladb/issues/26523

Closes scylladb/scylladb#26635
2025-11-05 17:02:32 +02:00
Pavel Emelyanov
59019bc9a9 Merge 'Alternator: allow warning on auth errors before enabling enforcement' from Nadav Har'El
An Alternator user was recently "bit" when switching `alternator_enforce_authorization` from "false" to "true": ְְְAfter the configuration change, all application requests suddenly failed because unbeknownst to the user, their application used incorrect secret keys.

This series introduces a solution for users who want to **safely** switch `alternator_enforce_authorization`  from "false" to "true": Before switching from "false" to "true", the user can temporarily switch a new option, `alternator_warn_authorization`, to true. In this "warn" mode, authentication and authorization errors are counted in metrics (`scylla_alternator_authentication_failures` and `scylla_alternator_authorization_failures`) and logged as WARNings, but the user's application continues to work. The user can use these metrics or log messages to learn of errors in their application's setup, fix them, and only do the switch of `alternator_enforce_authorization` when the metrics or log messages show there are no more errors.

The first patch is the implementation of the the feature - the new configuration option, the metrics and the log messages,  the second patch is a test for the new feature, and the third patch is documentation recommending how to use the warn mode and the associated metrics or log messages to safely switch `alternaor_enforce_authorization` from false to true.

Fixes #25308

This is a feature that users need, so it should probably be backported to live branches.

Closes scylladb/scylladb#25457

* github.com:scylladb/scylladb:
  docs/alternator: explain alternator_warn_authorization
  test/alternator: tests for new auth failure metrics and log messages
  alternator: add alternator_warn_authorization config
2025-11-05 10:45:17 +03:00
Avi Kivity
d458dd41c6 Merge 'Avoid input_/output_stream-s default initialization and move-assignment' from Pavel Emelyanov
Recent seastar update deprecated in/out streams usage pattern when a stream is default constructed early and them move-assigned with the proper one (see scylladb/seastar#3051). This PR fixes few places in Scylla that still use one.

Adopting newer seastar API, no need to backport

Closes scylladb/scylladb#26747

* github.com:scylladb/scylladb:
  commitlog: Remove unused work::r stream variable
  ec2_snitch: Fix indentation after previous patch
  ec2_snitch: Coroutinize the aws_api_call_once()
  sstable: Construct output_stream for data instantly
  test: Don't reuse on-stack input stream
2025-10-31 21:22:41 +02:00
Avi Kivity
adf9c426c2 Merge 'db/config: Change default SSTable compressor to LZ4WithDictsCompressor' from Nikos Dragazis
`sstable_compression_user_table_options` allows configuring a node-global SSTable compression algorithm for user tables via scylla.yaml. The current default is LZ4Compressor (inherited from Cassandra).

Make LZ4WithDictsCompressor the new default. Metrics from real datasets in the field have shown significant improvements in compression ratios.

If the dictionary compression feature is not enabled in the cluster (e.g., during an upgrade), fall back to the `LZ4Compressor`. Once the feature is enabled, flip the default back to the dictionary compressor using with a listener callback.

Fixes #26610.

Closes scylladb/scylladb#26697

* github.com:scylladb/scylladb:
  test/cluster: Add test for default SSTable compressor
  db/config: Change default SSTable compressor to LZ4WithDictsCompressor
  db/config: Deprecate sstable_compression_dictionaries_allow_in_ddl
  boost/cql_query_test: Get expected compressor from config
2025-10-31 21:15:18 +02:00
Michael Litvak
e7dbccd59e cdc: use chunked_vector instead of vector for stream ids
use utils::chunked_vector instead of std::vector to store cdc stream
sets for tablets.

a cdc stream set usually represents all streams for a specific table and
timestamp, and has a stream id per each tablet of the table. each stream
id is represented by 16 bytes. thus the vector could require quite large
contiguous allocations for a table that has many tablets. change it to
chunked_vector to avoid large contiguous allocations.

Fixes scylladb/scylladb#26791

Closes scylladb/scylladb#26792
2025-10-31 13:02:34 +01:00
Nikos Dragazis
2fc812a1b9 db/config: Change default SSTable compressor to LZ4WithDictsCompressor
`sstable_compression_user_table_options` allows configuring a
node-global SSTable compression algorithm for user tables via
scylla.yaml. The current default is `LZ4Compressor` (inherited from
Cassandra).

Make `LZ4WithDictsCompressor` the new default. Metrics from real datasets
in the field have shown significant improvements in compression ratios.

If the dictionary compression feature is not enabled in the cluster
(e.g., during an upgrade), fall back to the `LZ4Compressor`. Once the
feature is enabled, flip the default back to the dictionary compressor
using with a listener callback.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-10-30 15:53:49 +02:00
Nikos Dragazis
96e727d7b9 db/config: Deprecate sstable_compression_dictionaries_allow_in_ddl
The option is a knob that allows to reject dictionary-aware compressors
in the validation stage of CREATE/ALTER statements, and in the
validation of `sstable_compression_user_table_options`. It was
introduced in 7d26d3c7cb to allow the admins of Scylla Cloud to
selectively enable it in certain clusters. For more details, check:
https://github.com/scylladb/scylla-enterprise/issues/5435

As of this series, we want to start offering dictionary compression as
the default option in all clusters, i.e., treat it as a generally
available feature. This makes the knob redundant.

Additionally, making dictionary compression the default choice in
`sstable_compression_user_table_options` creates an awkward dependency
with the knob (disabling the knob should cause
`sstable_compression_user_table_options` to fall back to a non-dict
compressor as default). That may not be very clear to the end user.

For these reasons, mark the option as "Deprecated", remove all relevant
tests, and adjust the business logic as if dictionary compression is
always available.

Signed-off-by: Nikos Dragazis <nikolaos.dragazis@scylladb.com>
2025-10-29 20:13:08 +02:00
Piotr Dulikowski
aba922ea65 Merge 'cdc: improve cdc metadata loading' from Michael Litvak
when loading CDC streams metadata for tablets from the tables, read only
new entries from the history table instead of reading all entries. This
improves the CDC metadata reloading, making it more efficient and
predictable.

the CDC metadata is loaded as part of group0 reload whenever the
internal CDC tables are modified. on tablet split / merge, we create a
new CDC timestamp and streams by writing them to the cdc_streams_history
table by group0 operation, and when it's applied we reload the in-memory
CDC streams map by reading from the tables and constructing the updated map.

Previously, on every update, we would read the entire
cdc_streams_history entries for the changed table, constructing all its
streams and creating a new map from scratch.

We improve this now by reading only new entries from cdc_streams_history
and append them to the existing map. we can do this because we only
append new entries to cdc_streams_history with higher timestamp than all
previous entries.

This makes this reloading more efficient and predictable, because
previously we would read a number of entries that depends on the number
of tablets splits and merges, which increases over time and is
unbounded, whereas now we read only a single stream set on each update.

Fixes https://github.com/scylladb/scylladb/issues/26732

backport to 2025.4 where cdc with tablets is introduced

Closes scylladb/scylladb#26160

* github.com:scylladb/scylladb:
  test: cdc: extend cdc with tablets tests
  cdc: improve cdc metadata loading
2025-10-29 11:07:48 +01:00
Nadav Har'El
51186b2f2c alternator: add alternator_warn_authorization config
Before this patch, the configuration alternator_enforce_authorization
is a boolean: true means enforce authentication checks (i.e., each
request is signed by a valid user) and authorization checks (the user
who signed the request is allowed by RBAC to perform this request).

This patch adds a second boolean configuration option,
alternator_warn_authorization. When alternator_enforce_authorization
is false but alternator_warn_authorization is true, authentication and
authorization checks are performed as in enforce mode, but failures
are ignored and counted in two new metrics:

    scylla_alternator_authentication_failures
    scylla_alternator_authorization_failures

additionally,also each authentication or authorization error is logged as
a WARN-level log message. Some users prefer those log messages over
metrics, as the log messages contain additional information about the
failure that can be useful - such as the address of the misconfigured
client, or the username attempted in the request.

All combinations of the two configuration options are allowed:
 * If just "enforce" is true, auth failures cause a request failure.
   The failures are counted, but not logged.
 * If both "enforce" and "warn" are true, auth failures cause a request
   failure. The failures are both counted and logged.
 * If just "warn" is true, auth failures are ignored (the request
   is allowed to compelete) but are counted and logged.
 * If neither "enforce" nor "warn" are true, no authentication or
   authorization check are done at all. So we don't know about failures,
   so naturally we don't count them and don't log them.

This patch is fairly straightforward, doing mainly the following
things:

1. Add an alternator_warn_authorization config parameter.

2. Make sure alternator_enforce_authorization is live-updatable (we'll
   use this in a test in the next patch). It "almost" was, but a typo
   prevented the live update from working properly.

3. Add the two new metrics, and increment them in every type of
   authentication or authorization error.
   Some code that needs to increment these new metrics didn't have
   access to the "stats" object, so we had to pass it around more.

4. Add log messages when alternator_warn_authorization is true.

5. If alternator_enforce_authorization is false, allow the auth check
   to allow the request to proceed (after having counted and/or logged
   the auth error).

A separate patch will follow and add documentation suggesting to users
how to use the new "warn" options to safely switch between non-enforcing
to enforcing mode. Another patch will add tests for the new configuration
options, new metrics and new log messages.

Fixes #25308.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>
2025-10-29 11:16:26 +02:00
Pavel Emelyanov
e99c8eee08 commitlog: Remove unused work::r stream variable
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-10-28 19:46:29 +03:00
Botond Dénes
ac618a53f4 Merge 'db: repair: do not update repair_time if batchlog replay failed' from Aleksandra Martyniuk
Currently, batchlog replay is considered successful even if all batches fail
to be sent (they are replayed later). However, repair requires all batches
to be sent successfully. Currently, if batchlog isn't cleared, the repair never
learns and updates the repair_time. If GC mode is set to "repair", this means
that the tombstones written before the repair_time (minus propagation_delay)
can be GC'd while not all batches were replied.

Consider a scenario:
- Table t has a row with (pk=1, v=0);
- There is an entry in the batchlog that sets (pk=1, v=1) in table t;
- The row with pk=1 is deleted from table t;
- Table t is repaired:
    - batchlog reply fails;
    - repair_time is updated;
- propagation_delay seconds passes and the tombstone of pk=1 is GC'd;
- batchlog is replayed and (pk=1, v=1) inserted - data resurrection!

Do not update repair_time if sending any batch fails. The data is still repaired.
For tablet repair the repair runs, but at the end the exception is passed
to topology coordinator. Thanks to that the repair_time isn't updated.
The repair request isn't removed as well, due to which the repair will need
to rerun.

Apart from that, a batch is removed from the batchlog if its version is invalid
or unknown. The condition on which we consider a batch too fresh to replay
is updated to consider propagation_delay.

Fixes: https://github.com/scylladb/scylladb/issues/24415

Data resurrection fix; needs backport to all versions

Closes scylladb/scylladb#26319

* github.com:scylladb/scylladb:
  db: fix indentation
  test: add reproducer for data resurrection
  repair: fail tablet repair if any batch wasn't sent successfully
  db/batchlog_manager: fix making decision to skip batch replay
  db: repair: throw if replay fails
  db/batchlog_manager: delete batch with incorrect or unknown version
  db/batchlog_manager: coroutinize replay_all_failed_batches
2025-10-28 14:52:59 +02:00
Botond Dénes
f3cec5f11a Merge 'index: Set tombstone_gc when creating underlying view' from Dawid Mędrek
Before this commit, when the underlying materialized view was created,
it didn't have the property `tombstone_gc` set to any value. We fix the
bug in this PR.

Implementation strategy:

1. Move code responsible for producing the schema
   of a secondary index to the file that handles
   `CREATE INDEX`.
2. Set the property when creating the view.
3. Add reproducer tests.

Fixes scylladb/scylladb#26542

Backport: we can discuss it.

Closes scylladb/scylladb#26543

* github.com:scylladb/scylladb:
  index: Set tombstone_gc when creating secondary index
  index: Make `create_view_for_index` method of `create_index_statement`
  index: Move code for creating MV of secondary index to cql3
  db, cql3: Move creation of underlying MV for index
2025-10-28 14:42:42 +02:00
Radosław Cybulski
ea6b22f461 Add max trace size output configuration variable
In #24031 users complained, that trace message is truncated, namely it's
no longer json parsable and table name might not be part of the output.
This path enables users to configure maximum size of trace message.
In case user wanted `table` name, but didn't care about message size,
 #26634 will help.

- add configuration varable `alternator_max_users_query_size_in_trace_output`
   with default value of 4096 (4 times old default value).
- modify `truncated_content_view` function to use new configuration
  variable for truncation limit
- update `truncated_content_view` to consistently truncate at given
  size, previously trunctation would also happen when data arrived in
  more than one chunk
- update `truncated_content_view` to better handle truncated value
  (limit number of copies)
- fix `scylla_config_read` call - call to `query` for a configuration
  name that is not existing will return `Items` array empty
  (but present) - this would raise array access exception few lines
  below.
- add test

Refs #26634
Refs #24031

Closes scylladb/scylladb#26618
2025-10-28 13:29:15 +03:00
Avi Kivity
d81796cae3 Merge 'Limit concurrent view updates from all sources' from Wojciech Mitros
Before this patch, when a base table has many materialized views,
each write to this table can start up to 128 view updates in parallel.
With high client write concurrency, the actual concurrency of writes
executed on the node may grow unexpectedly, which can lead to higher
latency and higher memory usage compared to a sequential approach.
In this patch we add a per-shard, per-service-level semaphore which
limits the number of concurrent view updates processed on the shard
in this service level to a constant value. We take one unit from the
semaphore for each local view update write, and releasing it when it
finishes. The remote view updates do not take units from the semaphore
because they don't consume nearly as much processing power and they
are limited by another semaphore based on their memory usage.

Fixes https://github.com/scylladb/scylladb/issues/25341

Closes scylladb/scylladb#25456

* github.com:scylladb/scylladb:
  mv: limit concurrent view updates from all sources
  database: rename _view_update_concurrency_sem to _view_update_memory_sem
2025-10-28 11:13:24 +02:00
Michael Litvak
8743422241 cdc: improve cdc metadata loading
when loading CDC streams metadata for tablets from the tables, read only
new entries from the history table instead of reading all entries. This
improves the CDC metadata reloading, making it more efficient and
predictable.

the CDC metadata is loaded as part of group0 reload whenever the
internal CDC tables are modified. on tablet split / merge, we create a
new CDC timestamp and streams by writing them to the cdc_streams_history
table by group0 operation, and when it's applied we reload the in-memory
CDC streams map by reading from the tables and constructing the updated map.

Previously, on every update, we would read the entire
cdc_streams_history entries for the changed table, constructing all its
streams and creating a new map from scratch.

We improve this now by reading only new entries from cdc_streams_history
and append them to the existing map. we can do this because we only
append new entries to cdc_streams_history with higher timestamp than all
previous entries.

This makes this reloading more efficient and predictable, because
previously we would read a number of entries that depends on the number
of tablets splits and merges, which increases over time and is
unbounded, whereas now we read only a single stream set on each update.

Fixes scylladb/scylladb#26732
2025-10-28 08:54:09 +01:00
Wojciech Mitros
f07a86d16e mv: limit concurrent view updates from all sources
Before this patch, when a base table has many materialized views,
each write to this table can start up to 128 view updates in parallel.
With high client write concurrency, the actual concurrency of writes
executed on the node may grow unexpectedly, which can lead to higher
latency and higher memory usage compared to a sequential approach.
In this patch we add a per-shard, per-service-level semaphore which
limits the number of concurrent view updates processed on the shard
in this service level to a constant value. We take one unit from the
semaphore for each local view update write, and releasing it when it
finishes. The remote view updates do not take units from the semaphore
because they don't consume nearly as much processing power and they
are limited by another semaphore based on their memory usage.

The effect of this patch can also be observed when writing to a base
table with a large number of materialized views, like in the
materialized_views_test.py::TestMaterializedViews::test_many_mv_concurrent
dtest. In that test, if we perform a full scan in parallel to a write
workload with a concurrency of 100 to a table with 100 views, the scan
would sometimes timeout because it would effectively get 1/10000 of cpu.
With this patch, the cpu concurrency of view updates was limited to 128
(we ran both writes and scan in the same service level), and the scan
no longer timed out.

Fixes https://github.com/scylladb/scylladb/issues/25341
2025-10-27 18:55:41 +01:00
Aleksandra Martyniuk
6fc43f27d0 db: fix indentation 2025-10-23 10:39:43 +02:00
Aleksandra Martyniuk
e1b2180092 db/batchlog_manager: fix making decision to skip batch replay
Currently, we skip batch replay if less than batch_log_timeout passed
from the moment the batch was written. batch_log_timeout value can
be configured. If it is large, it won't be replayed for a long time.
If the tombstone will be GC'd before the batch is replayed, then we
risk the data resurrection.

To ensure safety we can skip only the batches that won't be GC'd.
In this patch we skip replay of the batches for which:
    now() < written_at + min(timeout + propagation_delay)

repair_time is set as a start of batchlog replay, so at the moment
of the check we will have:
    repair_time <= now()

So we know that:
    repair_time < written_at + propagation_delay

With this condition we are sure that GC won't happen.
2025-10-23 10:38:31 +02:00
Aleksandra Martyniuk
7f20b66eff db: repair: throw if replay fails
Return a flag determining whether all the batches were sent successfully in
batchlog_manager::replay_all_failed_batches (batches skipped due to being
too fresh are not counted). Throw in repair_flush_hints_batchlog_handler
if not all batches were replayed, to ensure that repair_time isn't updated.
2025-10-23 10:38:31 +02:00
Aleksandra Martyniuk
904183734f db/batchlog_manager: delete batch with incorrect or unknown version
batchlog_manager::replay_all_failed_batches skips batches that have
unknown or incorrect version. Next round will process these batches
again.

Such batches will probably be skipped everytime, so there is no point
in keeping them. Even if at some point the version becomes correct,
we should not replay the batch - it might be old and this may lead
to data resurrection.
2025-10-23 10:38:31 +02:00
Aleksandra Martyniuk
502b03dbc6 db/batchlog_manager: coroutinize replay_all_failed_batches 2025-10-23 10:38:31 +02:00
Wojciech Mitros
c0d0f8f85b database: rename _view_update_concurrency_sem to _view_update_memory_sem
In the following commit, we'll introduce a new semaphore for view updates
that limits their concurrency by view update count. To avoid confusion,
we rename the existing semaphore that tracks the memory used by concurrent
view updates and related objects accordingly.
2025-10-23 10:00:15 +02:00
Michael Litvak
6e2513c4d2 db: schema_applier: create schema with pointer to CDC schema
When creating a schema for a non-CDC table in the schema_applier, find
its CDC schema that we created previously in the same operation, if any,
and create the schema with a pointer to the CDC schema.

We use the fact that for a base table with CDC enabled, its CDC schema
is created or altered together in the same group0 operation.

Similarly, in schema_tables, when creating table schemas from the
schema tables, first create all schemas that don't have CDC enabled,
then create schemas that have CDC enabled by extending them with the
pointer to the CDC schema that we created before.

There are few additional cases where we create schemas that we need to
consider how to handle.

When loading a schema from schema tables in the schema_loader we decide
not to set the CDC schema, because this schema is mostly used for tools
and it's not used for generating CDC mutations.

When transporting a schema by RPC in the migration manager, we don't
transport its CDC schema, and we always set it to null. Because we use
raft we expect this shouldn't have any effect, because the schema is
synchronized through raft and not through the RPC.
2025-10-21 14:13:43 +02:00
Michael Litvak
4fe13c04a9 db: schema_applier: extract cdc tables
Previously in the schema applier we have two maps of schema_mutations,
for tables and for views. Now create another map for CDC tables by
extracting them from the non-views tables map.

We maintain the previous behavior by applying each operation that's done
on the tables map, to the CDC map as well.

Later we will want to handle CDC and non-CDC tables differently. We want
to be able to create all CDC schemas first, so when we create the
non-CDC tables we can create them with a pointer to their CDC schemas.
2025-10-21 14:13:43 +02:00
Michael Litvak
ac96e40f13 schema: add pointer to CDC schema
Add to the schema object a member that points to the CDC schema object
that is compatible with this schema, if any.

The compatible CDC schema is created and altered with its base schema in
the same group0 operation.

When generating CDC log mutations for some base mutation we want them to
be created using a compatible schema thas has a CDC column corresponding
to each base column. This change will allow us to find the right CDC
schema given a base mutation.

We also update the relevant structures in the schema registry that are
related to learning about schemas and transporting schemas across
shards or nodes.

When transporting a schema as frozen_schema, we need to transport the
frozen cdc schema as well, and set it again when unfreezing and
reconstructing the schema.

When adding a schema to the registry, we need to ensure its CDC schema
is added to the registry as well.

Currently we always set the CDC schema to nullptr and maintain the
previous behavior. We will change it in a later commit. Until then, we
mark all places where CDC schema is passed clearly so we don't forget
it.
2025-10-21 14:13:43 +02:00
Michael Litvak
154d5c40c8 frozen_schema: rename frozen_schema_with_base_info to extended_frozen_schema
This commit starts a series of refactoring commits of the frozen_schema
to reduce duplication and make it easier to extend.

Currently there are two essentially identical types,
frozen_schema_with_base_info and view_schema_and_base_info in the
schema_registry that hold a frozen_schema together with a base_info for
view schemas.

Their role is to pass around a frozen schema together with additional
info that is extracted from the schema and passed around with it when
transporting it across shards or nodes, and is needed for
reconstructing it, and it is not part of the schema mutations.

Our goal is to combine them to a single type that we will call
extended_frozen_schema.
2025-10-21 14:13:42 +02:00
Tomasz Grabiec
ba692d1805 schema_tables: Keep "replication" column backwards-compatible by expanding rack lists to numeric RF
In 380f243986 we added support for rack
lists in replication options. Drivers which are not prepared to parse
that (as of now, all of them), will not create metadata object for
that keyspace. This breaks, for example, the "copy to/from" cqlsh
command. Potentially other things too.

To fix that, keep the "replication" column in the old format, and
store numeric RF there, which corresponds to the number of
replicas. Accurate options in the new format are put in
"replication_v2".

We set replication_v2 in the schema only when it differs from the old
"replication" so that the new column is not set during upgrade,
otherwise downgrade would fail. Partition tombstone is added to ensure
that pre-alter replication_v2 value is deleted on alters which change
replication to a value which is the same as the post-alter
"replication" value.

Fixes #26415

Closes scylladb/scylladb#26429
2025-10-21 09:11:25 +03:00
Piotr Dulikowski
f76917956c view_building_worker: access tablet map through erm on sstable discovery
Currently, the data returned by `database::get_tables_metadata()` and
`database::get_token_metadata()` may not be consistent. Specifically,
the tables metadata may contain some tablet-based tables before their
tablet maps appear in the token metadata. This is going to be fixed
after issue scylladb/scylladb#24414 is closed, but for the time being
work around it by accessing the token metadata via
`table`->effective_replication_map() - that token metadata is guaranteed
to have the tablet map of the `table`.

Fixes: scylladb/scylladb#26403

Closes scylladb/scylladb#26588
2025-10-21 00:14:39 +02:00
Dawid Mędrek
20761b5f13 db, cql3: Move creation of underlying MV for index
The main goal of this patch is to give more control over the creation
of the underlying view on an index to `create_index_statement.cc`.
That goal is in line with how the other statements are executed:
the schema is built in the cql3 module and only the ready schema_ptr
is passed further. That should also make the code cleaner and easier
to understand.

There are a few important things to note here:

* A call to `service::prepare_new_view_announcement` appears out of nowhere.
  Aside from some validation checks and logging, that function does pretty
  much the same as the pre-existing code we remove:

  a. It creates Raft mutations based on the passed `view_ptr`.
  b. It creates Raft mutations responsible for view building tasks.
  c. It notifies about a new column family.

* We seemingly get rid of the code that creates view building tasks. That's not
  true: we still do that via `service::prepare_new_view_announcement`.

That should explain why the change doesn't remove any relevant logic.
On the other hand, it might be more difficult to explain why moving the
code is correct. I'll touch on it below.

Before that, it may also be important to highlight that this commit only
affects the logic responsible for creating an index. There should be no
effect on any other part of how Scylla behaves.

---

Proving the correctness of the solution would take quite a lot of space,
so I'll only summarize it. It relies on a few things:

1. Two schema changes cannot happen in one operation. We allow for more
   but only when those changes are dependent on each other and when
   the additional ones are internal for Scylla, e.g. creating an index
   leads to creating the underlying materialized view.
2. There are no entities or components that rely on indexes.
3. Each index is uniquely defined by the keyspace it belongs to
   and the name of the index.
4. There is a bijection between rows in `system_schema.indexes`
   and the currently existing indexes.
5. The name of an unnamed index depends on the name of the base table
   and the names of the indexed columns. The name of an unnamed index
   may have a number attached to it, but that number only depends on
   the state of the schema at the time of creation of the index, and
   it never changes later on. There are no other things the name of
   an unnamed index depends on.
6. Scylla doesn't allow for changing any column in the base table
   that has an index depending on it.

Based on that, we conclude that every existing index has exactly one
entry in `system_schema.indexes`, and the primary key of that entry
never changes.

The columns of `system_schema.indexes` that are not part of the primary
key are: `kind` and `options`. Both values are only decided at the time
of creation of an index, and currently there's no way to modify them.

That implies that there are only two events when an entry in the system
table can change: when creating an index and when dropping an index.

---

When we consider the previous place of the logic that this commit moves
to `cql3/statements/create_index_statement.cc`, it works like this:

1. We compare the sets of indexes defined on a specific table
   (in the form of a structure called `index_metadata`) before and
   after an operation.
2. We divide the entries into three sets: those present in both sets
   and those present in only one of them.
3. We handle each of those three sets separately.

The structure `index_metadata` is a reflection of entries in
`system_schema.indexes`. It stores one more parameter -- `local` --
but its value depends on the other values of an entry, so we can ignore
it in this reasoning.

Because an index cannot be modified -- it can only be created or dropped
-- there are at most two non-empty sets: the set of new indexes and the
set of dropped indexes. Those sets are only non-empty during an operation
like `CREATE INDEX`, `DROP INDEX`, `DROP TABLE (base table)`,
`DROP KEYSPACE`. Note that it's impossible to drop an index by dropping
the underlying materialized view -- Scylla doesn't allow for that.

However, the code in `migration_manager.cc` we call
(`prepare_column_family_update_announcement`) and the code that we call
in `schema_tables.cc` (`make_update_table_mutations`) is only triggered
by *updates* related to the base table. In the context of `DROP TABLE`
or `DROP KEYSPACE`, we'd call `prepare_column_family_drop_announcement`
instead. In other words, we're only concerned with `CREATE INDEX` and
`DROP INDEX`.

---

A conclusion from this reasoning is that we only need to consider those
two situations when talking about correctness of this change. The impact
of this commit is that we may have potentially reordered mutations in the
resulting vector that will be applied to the Raft log.

The only mutations we may have reordered are the mutations responsible for
creating the underlying view and the mutations responsible for updating
columns in the base table. It's clear then that this commit brings no change
at all: we only give `cql3/statements/create_index_statement.cc` more
control over creating the underlying view.

---

We leave a remnant of the code in `db/schema_tables.cc` responsible
for dropping an index along with its underlying view. It would require
changing a bit more of the logic, and we don't need it for the rest
of this sequence of changes.

Refs scylladb/scylladb#16454
2025-10-20 14:04:06 +02:00
Pavel Emelyanov
44ed3bbb7c Merge 'RFC: Initial GCP storage backend for scylla (sstables + backup)' from Calle Wilund
Integrates GCP object storage as a working storage backend for scylla sstables as well as backup storage.

Adds an abstraction layer (atm very heavily designed around the s3 client interface and usage) to allow the "storage" etc layers of sstable management to pick transparently between "s3" and "gs" providers.

This modifies the scylla config such that endpoints can optionally (through a "type" param) ref a GS backend.
Similarly with storage_options.

Also adds some IO wrapping primitives to make it more feasible to place some logic at a mid level of the implementation stack (such as making networked storage files, ranged reading etc).

Test s3 fixture is replaced (where appropriate) with an `object_storage` fixture that multiplexes the test across both backends.
Unit tests are duplicated and for the GS versions use a boost test fixture for GCS, default local fake.

Fixes #25359
Fixes #26453

Closes scylladb/scylladb#26186

* github.com:scylladb/scylladb:
  docs::dev::object_storage: Add some initial info on GS storage
  docs/dev: Add mention of (nested) docker usage in testing.md
  sstables::object_storage_client: Forward memory limit semaphore to GS instance
  utils::gcp::object_storage: Add optional memory limits to up/download
  sstables::object_storage_client: Add multi-upload support for GS
  utils::gcp::storage: Add merge objects operation
  test_backup/test_basic: Make tests multiplex both s3 and gs backends
  test::cluster::conftest: Add support for multiple object storage backends
  boost::gcs_storage_test: reindent
  boost::gcs_storage_test: Convert to use fixture
  tests::boost: Add GS object storage cases to mirror S3 ones
  tests::lib::gcs_fixture: Add a reusable test fixture for real/fake GS/GCS
  tests::lib::test_utils: Add overloads/helpers for reading and (temp) writing env
  sstables::object_storage_client: Add google storage implementation
  test_services: Allow testing with GS object storage parameters
  utils::gcp::gcp_credentials: Add option to create uninitialized credentials
  utils::gcp::object_storage: Make create_download_source return seekable_data_source
  utils::gcp::object_storage: Add defensive copies of string_view params
  utils::gcp::object_storage: Add missing retry backoff increate
  utils::gcp::object_storage: Add timestamp to object listing
  utils::gcp::object_storage: Add paging support to list_objects
  object_storage_client: Add object_name wrapper type
  utils::gcp::object_storage: Add optional abort_source
  utils::rest::client: Add abort_source support
  sstables: Use object_storage_client for remote storage
  sstables::object_storage_client: Add abstraction layer for OS cliens (s3 initial)
  s3::upload_progress: Promote to general util type
  storage_options: Abstract s3 to "object_storage" and add gs as option
  sstables::file_io_extension: Change "creator" callback to just data_source
  utils::io-wrappers: Add ranged data_source
  utils::io-wrappers: Add file wrapper type for seekable_source
  utils::seekable_source: Add a seekable IO source type
  object_storage_endpoint_param: Add gs storage as option
  config: break out object_storage_endpoint_param preparing for multi storage
2025-10-20 13:14:53 +03:00
Tomasz Grabiec
c4a87453a2 Merge 'Add experimental feature flag for strongly consistent tables and extend kesypace creation syntax to allow specifying consistency mode.' from Gleb Natapov
The series adds an experimental flag for strongly consistent tables  and extends "CREATE KEYSPACE" ddl with `consistency` option that allows specifying the consistency mode for the keyspace.

Closes scylladb/scylladb#26116

* github.com:scylladb/scylladb:
  schema: Allow configuring consistency setting for a keyspace
  db: experimental consistent-tablets option
2025-10-16 21:48:06 +02:00
Tomasz Grabiec
e6c427953e Merge 'schema_applier: unify handling of token_metadata during schema change' from Marcin Maliszkiewicz
This patchset improves the atomicity and clarity of schema application in
the presence of token metadata updates during schema changes. The primary
focus is to ensure that changes to tablet metadata are applied atomically
as part of the schema commit phase, rather than being replicated to all
cores afterward, which previously violated atomicity guarantees.

Key changes:

- Introduced pending_token_metadata to unify handling of new and existing metadata.
- Split token metadata replication into prepare and commit steps.
- Abstracted schema dependencies in storage_service to support pending schema visibility.
- Applied tablet metadata updates atomically within schema commit phase.

Backport: no, it's a new feature
Fixes: https://github.com/scylladb/scylladb/issues/24414

Closes scylladb/scylladb#25302

* github.com:scylladb/scylladb:
  db: schema_applier: update tablet metadata atomically
  db: replica: move tables_metadata locking to commit
  storage_service: abstract schema dependecies during token metadata update
  storage_service: split replicate_to_all_cores to steps
  db: schema_applier: unify token_metadata loading
  replica: schema_applier: obtain copy of token_metadata at the beginning of schema merge
  service: fix dependencies during migration_manager startup
  db: schema_applier: move pending_token_metadata to locator
  db: always use _tablet_hint as condition for tablet metadata change
  db: refactor new_token_metadata into pending_token_metadata
  db: rename new_token_metadata to pending_token_metadata
  db: schema_applier: move types storage init to merge_types func
  db: schema_applier: make merge functions non-static members
  db: remove unused proxy from create_keyspace_metadata
2025-10-16 21:43:49 +02:00
Gleb Natapov
c255740989 schema: Allow configuring consistency setting for a keyspace
We want to add strongly consistent tables as an option. We will have
two kind of strongly consistent tables: globally consistent and locally
consistent. The former means that requests from all DCs will be globally
linearisable while the later - only requests to the same DCs will be
linearisable.  To allow configuring all the possibilities the patch
adds new parameter to a keyspace definition "consistency" that can be
configured to be `eventual`, `global` or `local`. Non eventual setting
is supported for tablets enabled keyspaces only. Since we want to start
with implementing local consistency configuring global consistency will
result in an error for now.
2025-10-16 13:34:49 +03:00