Commit Graph

11801 Commits

Author SHA1 Message Date
Petr Gusev
b6bcd062de test_automatic_cleanup: fix comment 2025-10-28 17:55:20 +01:00
Dawid Mędrek
5e03b01107 test/cluster: Add test_simulate_upgrade_legacy_to_raft_listener_registration
We provide a reproducer test of the bug described in
scylladb/scylladb#18049. It should fail before the fix introduced in
scylladb/scylladb@7ea6e1ec0a, and it
should succeed after it.

Refs scylladb/scylladb#18049
Fixes scylladb/scylladb#18071

Closes scylladb/scylladb#26621
2025-10-28 17:32:15 +01:00
Pavel Emelyanov
37b9cccc1c test: Don't reuse on-stack input stream
The test consists of several snippets, each creating an input_stream for
some short operation and checking the result. Each snipped over-writes
the local `input_stream in` variable with the new one.

This change wraps each of those snippets into own code block in order to
have own new `input_stream in` variable in each.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
2025-10-28 19:25:07 +03:00
Yauheni Khatsianevich
99dc31e71a tests(lwt): new test for LWT testing during tablet resize
– Workload: N workers perform CAS updates
 UPDATE … SET s{i}=new WHERE pk=? IF (∀j≠i: s{j}>=guard_j) AND s{i}=prev
 at CL=LOCAL_QUORUM / SERIAL=LOCAL_SERIAL. Non-apply without timeout is treated
 as contention; “uncertainty” timeouts are resolved via LOCAL_SERIAL read.
- Enable balancing and increase min_tablet_count to force split,
 flush and lower min_tablet_count to merge.
- “Uncertainty” timeouts (write timeout due to uncertainty) are resolved via a
LOCAL_SERIAL read to determine whether the CAS actually applied.
- Invariants: after the run, for every pk and column s{i}, the stored value
equals the number of confirmed CAS by worker i (no lost or phantom updates)
despite ongoing tablet moves.

Closes scylladb/scylladb#26113
2025-10-28 16:48:57 +01:00
Michael Litvak
4cc0a80b79 test: cdc: extend cdc with tablets tests
extend and improve the tests of virtual tables for cdc with tablets.
split the existing virtual tables test to one test that validates the
virtual tables against the internal cdc tables, and triggering some
tablet splits in order to create entries in the cdc_streams_history
table, and add another test with basic validation of the virtual tables
when there are multiple cdc tables.
2025-10-28 15:06:21 +01:00
Pavel Emelyanov
948cefa5f9 test: Extend API consistency test with tokens_endpoint endpoint
Recently (#26231) there was added a test to check that several API
endpoints, that return tokens and corresponding replica nodes, are
consistent with tablet map. This patch adds one more API endpoint to the
validation -- the /storage_service/tokens_endpoint one.

The extention is pretty straightforward, but the new endpoint returns
back a single (primary) replica for a token, so the test check is
slightly modified to account for that.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26580
2025-10-28 15:18:09 +02:00
Dawid Mędrek
535d31b588 test/cluster/random_failures: Re-enable index events
We've enabled the configuration option `rf_rack_valid_keyspaces`,
so we can finally re-enable the events creating and dropping secondary
indexes.

Fixes scylladb/scylladb#26422
2025-10-28 14:17:14 +01:00
Dawid Mędrek
b4898e50bf test/cluster/random_failures: Enable rf_rack_valid_keyspaces
Now that the test has been adjusted to work with the configuration
option, we enable it.
2025-10-28 14:17:09 +01:00
Dawid Mędrek
59b2a41c49 test/cluster/random_failures: Adjust to RF-rack-validity
We adjust the test to work with the configuration option
`rf_rack_valid_keyspaces` enabled. For that, we ensure that there is
always at least one node in each of the three racks. This way, all
keyspaces we create and manipulate will remain RF-rack-valid since they
all use RF=3.

------------------------------------------------------------------------

To achieve that, we only need to adjust the following events:

1. `init_tablet_transfer`
   The event creates a new keyspace and table and manually migrates
   a tablet belonging to it. As long as we make sure the migration occurs
   within the same rack, there will be no problem.

   Since RF == #racks, each rack will have exactly one tablet replica,
   so we can migrate the tablet to an arbitrary node in the same rack.

   Note that there must exist a node that's not a replica. If there weren't
   such a node, the test wouldn't have worked before this commit because
   it's not possible to migrate a tablet from one node being its replica to
   another. In other words, we have a guarantee that there are at least 4 nodes
   in the cluster when we try to migrate a tablet replica.

   That said, we check it anyway. If there's no viable node to migrate the
   tablet replica to, we log that information and do nothing. That should be
   an acceptable solution.

2. `add_new_node`
   As long as we add a node to an existing rack, there's no way to
   violate the invariant imposed by the configuration option, so we pick
   a random rack out of the existing three and create a node in it.

3. `decommission_node`
   We need to ensure that the node we'll be trying to decommission is
   not the only one in its rack.

   Following pretty much the same reasoning as in `init_tablet_transfer`,
   we conclude there must be a rack with at least two nodes in it. Otherwise
   we'd end up having to migrate a tablet from one replica node to another,
   which is not possible.

   What's more, decommissioning a node is not possible if any node in
   the cluster is dead, so we can assume that `manager.running_servers`
   returns the whole cluster.

4. `remove_node`
   The same as `decommission_node`. Just note although the node we choose to
   remove must be first stopped, none other node can be dead, so the whole
   cluster must be returned by `manager.running_servers`.

------------------------------------------------------------------------

There's one more important thing to note. The test may sometimes trigger
a sequence of events where a new node is started, but, due to an error
injection, its initialization is not completed. Among other things, the
node may NOT have a host ID recognized by the rest of the nodes in the
cluster, and operations like tablet migration will fail if they target
it.

Thankfully, there seems to be a way to avoid problems stemming from
that. When a new node is added to the cluster, it should appear at the
end of the list returned by `manager.running_servers`. This most likely
stems from how dictionaries work in Python:

"Keys and values are iterated over in insertion order."
-- https://docs.python.org/3/library/stdtypes.html#dict-views

and the fact that we keep track of running servers using a dictionary.

Furthermore, we rely on the assumption that the test currently works
correctly.

Assume, to the contrary, that among the nodes taking part in the operations
listed above, there is at most one node per rack that has its host ID recognized
by the rest of the cluster. Note that only those nodes can store any tablets.
Let's refer to the set of those nodes as X.

Assume that we're dealing with tablet migration, decommissioning, or removing
a node. Since those operations involve tablet migration, at least one tablet
will need to be migrated from the node in question to another node in X.
However, since X consists of at most three nodes, and one of them is losing
its tablet, there is no viable target for the tablet, so the operation fails.

Using those assumptions, an auxiliary function, `select_viable_rack`,
was designed to carefully choose a correct rack, which we'll then pick nodes
from to perform the topological operations. It's simple: we just find the first
rack in the list that has at least two nodes in it. That should ensure that we
perform an operation that doesn't lead to any unforeseen disaster.

------------------------------------------------------------------------

Since the test effectively becomes more complex due to more care for keeping
the topology of the cluster valid, we extend the log messages to make them
more helpful when debugging a failure.
2025-10-28 14:15:57 +01:00
Nadav Har'El
87573197d4 test/alternator: reproducers for missing headers and request limit
This patch adds reproducing tests in test/alternator for issue #23438,
which is about missing checks for the length of headers and the URL
in Alternator requests. These should be limited, because Seastar's
HTTP server, which Scylla uses, reads them into memory so they can OOM
Scylla.

The tests demonstrate that DynamoDB enforces a 16 KB limit on the
headers and the URL of the request, but Scylla doesn't (a code
inspection suggests it does not in fact have any limit).

The two tests pass on DynamoDB and currently xfail on Alternator.

Refs #23438.

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23442
2025-10-28 15:12:25 +02:00
Pavel Emelyanov
d9bfbeda9a lister: Fix race between readdir and stat
Sometimes file::list_directory() returns entries without type set. In
thase case lister calls file_type() on the entry name to get it. In case
the call returns disengated type, the code assumes that some error
occurred and resolves into exception.

That's not correct. The file_type() method returns disengated type only
if the file being inspected is missing (i.e. on ENOENT errno). But this
can validly happen if a file is removed bettween readdir and stat. In
that case it's not "some error happened", but a enry should be just
skipped. In "some error happened", then file_type() would resolve into
exceptional future on its own.

Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>

Closes scylladb/scylladb#26595
2025-10-28 15:10:22 +02:00
Botond Dénes
ac618a53f4 Merge 'db: repair: do not update repair_time if batchlog replay failed' from Aleksandra Martyniuk
Currently, batchlog replay is considered successful even if all batches fail
to be sent (they are replayed later). However, repair requires all batches
to be sent successfully. Currently, if batchlog isn't cleared, the repair never
learns and updates the repair_time. If GC mode is set to "repair", this means
that the tombstones written before the repair_time (minus propagation_delay)
can be GC'd while not all batches were replied.

Consider a scenario:
- Table t has a row with (pk=1, v=0);
- There is an entry in the batchlog that sets (pk=1, v=1) in table t;
- The row with pk=1 is deleted from table t;
- Table t is repaired:
    - batchlog reply fails;
    - repair_time is updated;
- propagation_delay seconds passes and the tombstone of pk=1 is GC'd;
- batchlog is replayed and (pk=1, v=1) inserted - data resurrection!

Do not update repair_time if sending any batch fails. The data is still repaired.
For tablet repair the repair runs, but at the end the exception is passed
to topology coordinator. Thanks to that the repair_time isn't updated.
The repair request isn't removed as well, due to which the repair will need
to rerun.

Apart from that, a batch is removed from the batchlog if its version is invalid
or unknown. The condition on which we consider a batch too fresh to replay
is updated to consider propagation_delay.

Fixes: https://github.com/scylladb/scylladb/issues/24415

Data resurrection fix; needs backport to all versions

Closes scylladb/scylladb#26319

* github.com:scylladb/scylladb:
  db: fix indentation
  test: add reproducer for data resurrection
  repair: fail tablet repair if any batch wasn't sent successfully
  db/batchlog_manager: fix making decision to skip batch replay
  db: repair: throw if replay fails
  db/batchlog_manager: delete batch with incorrect or unknown version
  db/batchlog_manager: coroutinize replay_all_failed_batches
2025-10-28 14:52:59 +02:00
Botond Dénes
f3cec5f11a Merge 'index: Set tombstone_gc when creating underlying view' from Dawid Mędrek
Before this commit, when the underlying materialized view was created,
it didn't have the property `tombstone_gc` set to any value. We fix the
bug in this PR.

Implementation strategy:

1. Move code responsible for producing the schema
   of a secondary index to the file that handles
   `CREATE INDEX`.
2. Set the property when creating the view.
3. Add reproducer tests.

Fixes scylladb/scylladb#26542

Backport: we can discuss it.

Closes scylladb/scylladb#26543

* github.com:scylladb/scylladb:
  index: Set tombstone_gc when creating secondary index
  index: Make `create_view_for_index` method of `create_index_statement`
  index: Move code for creating MV of secondary index to cql3
  db, cql3: Move creation of underlying MV for index
2025-10-28 14:42:42 +02:00
Nadav Har'El
c3593462a4 alternator: improve protection against oversized requests
Following DynamoDB, Alternator also places a 16 MB limit on the size of
a request. Such a limit is necessary to avoid running out of memory -
because the AWS message authentication protocol requires reading the
entire request into memory before its signature can be verified.

Our implementation for this limit used Seastar's HTTP server's
content_length_limit feature. However, this Seastar feature is
incomplete - it only works when the request uses the Content-Length
header, and doesn't do anything if the request doesn't have a
Content-Length (it may use chunked encoding, or have no length at all).
So malicious users can cause Scylla to OOM by sending a huge request
without a Content-Length.

So in this patch we stop using the incomplete Seastar feature, and
implement the length limit in Scylla in a way that works correctly with
or without Content-Length: We read from the input stream and if we go
over 16MB, we generate an error.

Because we dropped Seastar's protection against a long Content-Length,
we also need to fix a piece of code which used Content-Length to reserve
some semaphore units to prevent reading many large requests in parallel.
We fix two problems in the code:
1. If Content-Length is over the limit, we shouldn't attempt to reserve
   semaphore units - this should just be a Payload Too Large error.
2. If Content-Length is missing, the existing code did nothing and had
   a TODO that we should. In this patch we implement what was suggested
   in that TODO: We temporarily reserve the whole 16 MB limit, and
   after reading the actual request, we return part of the reservation
   according to the real request size.

That last fix is important, because typically the largest requests will be
BatchWriteItem where a well-written client would want to use chunked
encoding, not Content-Length, to avoid materializing the entire request
up-front. For such clients, the memory use semaphore did nothing, and
now it does the right thing.

Note that this patch does *not* solve the problem #12166 that existed
with Seastar's length-limiting implementation but still exists in the
new in-Scylla length-limiting implementation: The fact we send an
error response in the middle of the request and then close the
connection, while the client continues to send the request, can lead
to an RST being sent by the server kernel. Usually this will be fine -
well-written client libraries will be able to read the response before
the RST. But even with a well-written library in some rare timings
the client may get the RST before the response, and will miss the
response, and get an empty or partial response or "connection reset
by peer". This issue existed before this patch, and still exists, but
is probably of minor impact.

Fixes #8196

Signed-off-by: Nadav Har'El <nyh@scylladb.com>

Closes scylladb/scylladb#23434
2025-10-28 15:24:46 +03:00
Ferenc Szili
10f07fb95a load_balancer: load_stats reconcile after tablet migration and table resize
This change adds the ability to move tablets sizes in load_stats after a
tablet migration or table resize (split/merge). This is needed because
the size based load balancer needs to have tablet size data which is as
accurate as possible, in order to issue migrations which improve
load balance.
2025-10-28 12:12:09 +01:00
Radosław Cybulski
ea6b22f461 Add max trace size output configuration variable
In #24031 users complained, that trace message is truncated, namely it's
no longer json parsable and table name might not be part of the output.
This path enables users to configure maximum size of trace message.
In case user wanted `table` name, but didn't care about message size,
 #26634 will help.

- add configuration varable `alternator_max_users_query_size_in_trace_output`
   with default value of 4096 (4 times old default value).
- modify `truncated_content_view` function to use new configuration
  variable for truncation limit
- update `truncated_content_view` to consistently truncate at given
  size, previously trunctation would also happen when data arrived in
  more than one chunk
- update `truncated_content_view` to better handle truncated value
  (limit number of copies)
- fix `scylla_config_read` call - call to `query` for a configuration
  name that is not existing will return `Items` array empty
  (but present) - this would raise array access exception few lines
  below.
- add test

Refs #26634
Refs #24031

Closes scylladb/scylladb#26618
2025-10-28 13:29:15 +03:00
Pavel Emelyanov
ac1d709709 Merge 'tablet_sstable_streamer: defer SSTable unlinking until fully streamed' from Taras Veretilnyk
When streaming SSTables across tablets, a single SSTable may be streamed to multiple tablets. The previous implementation unlinked SSTables immediately after streaming them for the first tablet, potentially making them partially unavailable for subsequent tablets. This patch replaces unlink() with mark_for_deletion() deferring actual unlinking till sstable::close_files.

test_tablets2::test_tablet_load_and_stream  was enhanced to also verify that SSTables are removed after being streamed.

Fixes #26606

Backport is not required, although it is a bug fix, but it isn't visible. This is more of a preparatory fix for https://github.com/scylladb/scylladb/pull/26444.

Closes scylladb/scylladb#26622

* github.com:scylladb/scylladb:
  test_tablets2: verify SSTable cleanup after tablet load and stream
  tablet_sstable_streamer: replace unlink() call with mark_for_deletion()
2025-10-28 13:25:40 +03:00
Patryk Jędrzejczak
5321720853 test: test_raft_recovery_stuck: reconnect driver after rolling restarts
It turns out that #21477 wasn't sufficient to fix the issue. The driver
may still decide to reconnect the connection after `rolling_restart`
returns. One possible explanation is that the driver sometimes handles
the DOWN notification after all nodes consider each other UP.

Reconnecting the driver after restarting nodes seems to be a reliable
workaround that many tests use. We also use it here.

Fixes #19959

Closes scylladb/scylladb#26638
2025-10-28 13:24:23 +03:00
Pavel Emelyanov
54a117b19d Merge 'retry_strategy: Switch to using seastar's retry_strategy (take two)' from Ernest Zaslavsky
With the recent introduction of retry_strategy to Seastar, the pure virtual class previously defined in ScyllaDB is now redundant. This change allows us to streamline our codebase by directly inheriting from Seastar’s implementation, eliminating duplication in ScyllaDB.

Despite this update is purely a refactoring effort and does not introduce functional changes it should be ported back to 2025.3 and 2025.4 otherwise it will make future backports of bugfixes/improvements related to `s3_client` near to impossible

ref: https://github.com/scylladb/seastar/issues/2803

depends on: https://github.com/scylladb/seastar/pull/2960

Closes scylladb/scylladb#25801

* github.com:scylladb/scylladb:
  s3_client: remove unnecessary `co_await` in `make_request`
  s3 cleanup: remove obsolete retry-related classes
  s3_client: remove unused `filler_exception`
  s3_client: fix indentation
  s3_client: simplify chunked download error handling using `make_request`
  s3_client: reformat `make_request` functions for readability
  s3_client: eliminate duplication in `make_request` by using overload
  s3_client: reformat `make_request` function declarations for readability
  s3_client: reorder `make_request` and helper declarations
  s3_client: add `make_request` override with custom retry and error handler
  s3_client: migrate s3_client to Seastar HTTP client
  s3_client: fix crash in `copy_s3_object` due to dangling stream
  s3_client: coroutinize `copy_s3_object` response callback
  aws_error: handle missing `unexpected_status_error` case
  s3_creds: use Seastar HTTP client with retry strategy
  retry_strategy: add exponential backoff to `default_aws_retry_strategy`
  retry_strategy: introduce Seastar-based retry strategy
  retry_strategy: update CMake and configure.py for new strategy
  retry_strategy: rename `default_retry_strategy` to `default_aws_retry_strategy`
  retry_strategy: fix include
  retry_strategy: Copied utils/s3/retry_strategy.hh to utils/s3/default_aws_retry_strategy.hh
  retry_strategy: Copied utils/s3/retry_strategy.cc to utils/s3/default_aws_retry_strategy.cc
2025-10-28 13:08:42 +03:00
Taras Veretilnyk
1361ae7a0a test_tablets2: verify SSTable cleanup after tablet load and stream
Modify existing test_tablet_load_and_stream testcase to verify
that SSTable files are properly deleted from the upload
directory after streaming.
2025-10-27 16:36:08 +01:00
Piotr Dulikowski
fd966ec10d Merge 'cdc: garbage collect CDC streams for tablets' from Michael Litvak
introduce helper functions that can be used for garbage collecting old
cdc streams for tablets-based keyspaces.

add a background fiber to the topology coordinator that runs
periodically and checks for old CDC streams for tablets keyspaces that
can be garbage collected.

the garbage collection works by finding the newest cdc timestamp that has been
closed for more than the configured cdc TTL, and removing all information from
the cdc internal tables about cdc timestamps and streams up to this timestamp.

in general it should be safe to remove information about these streams because
they are closed for more than TTL, therefore all rows that were written to these streams
with the configured TTL should be dead.
the exception is if the TTL is altered to a smaller value, and then we may remove information
about streams that still have live rows that were written with the longer ttl.

Fixes https://github.com/scylladb/scylladb/issues/26669

Closes scylladb/scylladb#26410

* github.com:scylladb/scylladb:
  cdc: garbage collect CDC streams periodically
  cdc: helpers for garbage collecting old streams for tablets
2025-10-27 16:16:55 +01:00
Michał Hudobski
541b52cdbf cql: fail with a better error when null vector is passed to ann query
Currently when a null vector is passed to an ANN query we fail with a
quite confusing error ("NoHostAvailable: ('Unable to complete the
operation against any hosts', {<Host: 127.0.0.1:9042 datacenter1>:
<Error from server: code=0000 [Server error] message="to_bytes() called
on raw value that is null">})").

This patch fixes that by throwing an InvalidRequestException with an
appropriate message instead.
We also add a test case that validates this behavior.

Fixes: VECTOR-257

Closes scylladb/scylladb#26510
2025-10-27 16:09:08 +02:00
Botond Dénes
417270b726 Merge 'Port dtest EAR tests to test.py/pytest in scylla CI' from Calle Wilund
Fixes #26641

* Adds shared abstraction for dockerized mock services for out pytests (not using python docker, due to both library and podman)
* Adds test fixtures for our key providers (except GCS KMS, for which we have no mock server) to do local testing
* Ports (and prunes and sharpens) the test cases from dtest::encryption_at_rest_test to our pytest.
* Shared KMIP mock between boost test and pytest and speeds up boost test shutdown.

When merged, the dtest counterpart can be decommissioned.

Closes scylladb/scylladb#26642

* github.com:scylladb/scylladb:
  test::cluster::object_store::conftest: Make GS proxy use shared docker mock server wrapper
  test::cluster::test_encryption: Port dtest EAR tests
  test::cluster::conftest: Add key_provider fixture
  test::pylib::encryption_provider: Port dtest encryption provider classes
  test::pylib::dockerized_service: Add helper for running docker/podman
  test::pylib::kmip_wrapper: Modify to be usable by pytest fixtures
  test::boost::kmip_wrapper: Move python script for PyKMIP to pylib
2025-10-27 15:42:52 +02:00
Patryk Jędrzejczak
e1c3f666c9 Merge 'vnode cleanup: add missing barriers and fix race conditions' from Petr Gusev
Problems addressed by this PR

* Missing barrier before cleanup: If a node was bootstrapped before cleanup, some request coordinators could still be in `write_both_read_new` and send stale requests to replicas being cleaned up.
* Sessions not drained before cleanup: We lacked protection against stale streaming or repair operations.
* `sstable_vnodes_cleanup_fiber()` calling `flush_all_tables()` under group0 lock: This caused SCT test failures (see [this comment](https://github.com/scylladb/scylladb/issues/25333#issuecomment-3298859046) for details).
* Issues with `storage_proxy::start_write()` used by `sstable_vnodes_cleanup_fiber`:
  * The result of `start_write()` was not held during `abstract_write_response_handler::apply_locally`, so coordinator-local writes were not properly awaited.
  * Synchronization was racy — `start_write()` was not atomic with the fence check, allowing stale writes to sneak in if `fence_version` changed in between.
  * It waited for all writes, including local tables and tablet-based tables, which is redundant because `sstable_vnodes_cleanup_fiber` does not apply to them.
  * It also waited for writes with versions greater than the current `fence_version`, which is unnecessary.

Fixes scylladb/scylladb#26150

backport: this PR fixes several issues with the vnodes cleanup procedure, but it doesn't seem they are critical enough to deserve backporting

Closes scylladb/scylladb#26315

* https://github.com/scylladb/scylladb:
  test_automatic_cleanup: add test_cleanup_waits_for_stale_writes
  test_fencing: fix due to new version increment
  test_automatic_cleanup: clean it up
  storage_proxy: wait for closing sessions in sstable cleanup fiber
  storage_proxy: rename await_pending_writes -> await_stale_pending_writes
  storage_proxy: use run_fenceable_write
  storage_proxy: abstract_write_response_handler: apply_locally: extract post fence check
  storage_proxy: introduce run_fenceable_write
  storage_proxy: move update_fence_version from shared_token_metadata
  storage_proxy: fix start_write() operation scope in apply_locally
  storage_proxy: move post fence check into handle_write
  storage_proxy: move fencing into mutate_counter_on_leader_and_replicate
  storage_proxy::handle_read: add fence check before get_schema
  storage_service: rebrand cleanup_fiber to vnodes_cleanup_fiber
  sstable_cleanup_fiber: use coroutine::parallel_for_each
  storage_service: sstable_cleanup_fiber: move flush_all_tables out of the group0 lock
  topology_coordinator: barrier before cleanup
  topology_coordinator: small start_cleanup refactoring
  global_token_metadata_barrier: add fenced flag
2025-10-27 12:35:13 +01:00
Michael Litvak
6109cb66be cdc: garbage collect CDC streams periodically
add a background fiber to the topology coordinator that runs
periodically and checks for old CDC streams for tablets keyspaces that
can be garbage collected.
2025-10-26 11:01:20 +01:00
Michael Litvak
440caeabcb cdc: helpers for garbage collecting old streams for tablets
introduce helper functions that can be used for garbage collecting old
cdc streams for tablets-based keyspaces.

- get_new_base_for_gc: finds a new base timestamp given a TTL, such that
  all older timestamps and streams can be removed.
- get_cdc_stream_gc_mutations: given new base timestamp and streams,
  builds mutations that update the internal cdc tables and remove the
  older streams.
- garbage_collect_cdc_streams_for_table: combines the two functions
  above to find a new base and build mutations to update it for a
  specific table
- garbage_collect_cdc_streams: builds gc mutations for all cdc tables
2025-10-26 11:01:20 +01:00
Avi Kivity
b843d8bc8b Merge 'scylla-sstable: add cql support to write operation' from Botond Dénes
In theory, scylla-sstable write is an awesome and flexible tool to generate sstables with arbitrary content. This is convenient for tests and could come clutch in a disaster scenario, where certain system table's content need to be manually re-created, system tables that are not writable directly via CQL.
In practice, in its current form this operation is so convoluted to use that even its own author shuns it. This is because the JSON specification of the sstable content is the same as that of the scylla-sstable dump-data: containing every single piece of information on the mutation content. Where this is an advantage for dump-data, allowing users to inspect the data in its entirety -- it is a huge disadvantage for write, because of all these details have to be filled in, down to the last timestamp, to generate an sstable. On top of that, the tool doesn't even support any of the more advanced data types, like collections, UDF and counters.
This PR proposes a new way of generating sstables: based on the success of scylla-sstable query, it introduces CQL support for scylla-sstable write. The content of the sstable can now be specified via standard INSERT, UPDATE and DELETE statements, which are applied to a memtable, then flushed into the sstable.
To avoid boundless memory consumption, the memtable is flushed every time it reaches 1MiB in size, consequently the command can generate multiple output sstables.

The new CQL input-format is made default, this is safe as nobody is using this command anyway. Hopefully this PR will change that.

Fixes: https://github.com/scylladb/scylladb/issues/26506

New feature, no backport.

Closes scylladb/scylladb#26515

* github.com:scylladb/scylladb:
  test/cqlpy/test_tools.py: add test for scylla-sstable write --input-format=cql
  replica/mutation_dump: add support for virtual tables
  tools/scylla-sstable: print_query_results_json(): handle empty value buffer
  tools/scylla-sstable: add cql support to write operation
  tools/scylla-sstable: write_operation(): fix indentation
  tools/scylla-sstable: write_operation(): prepare for a new input-format
  tools/scylla-sstable: generalize query_operation_validate_query()
  tools/scylla-sstable: move query_operation_validate_query()
  tools/scylla-sstable: extract schema transformation from query operation
  replica/table: add virtual write hook to the other apply() overload too
2025-10-24 23:32:40 +03:00
Avi Kivity
997b52440e Merge 'replica/mutation_dump: include empty/dead partitions in the scan results' from Botond Dénes
`select * from mutation_fragment()` queries don't return partitions which are completely empty or only contain tombstones which are all garbage collectible. This is because the underlying `mutation_dump` mechanism has a separate query to discover partitions for scans. This query is a regular mutation scan, which is subject to query compaction and garbage collection. Disable the query compaction for mutation queries executed on behalf of mutation fragment queries, so *all* data is visible in the result, even that which is fully garbage collectible.

Fixes scylladb/scylladb#23707.

Scans for mutation-fragment are very rare, so a backport is not necessary. We can backport on-demand.

Closes scylladb/scylladb#26227

* github.com:scylladb/scylladb:
  replica/mutation_dump: multi_range_partition_generator: disable garbage-collection
  replica: add tombstone_gc_enabled parameter to mutation query methods
  mutation/mutation_compactor: remove _can_gc member
  tombstone_gc: add tombstone_gc_state factory methods for gc_all and no_gc
2025-10-24 23:26:16 +03:00
Patryk Jędrzejczak
5ae1aba107 test: unskip test_raft_recovery_entry_loss
The issue has been fixed in #26612.

Closes scylladb/scylladb#26614
2025-10-24 21:23:41 +03:00
Andrzej Jackowski
8642629e8e test: add test_anonymous_user to test_raft_service_levels
The primary goal of this test is to reproduce scylladb/scylladb#26040
so the fix (278019c328) can be backported
to older branches.

Scenario: connect via CQL as an anonymous user and verify that the
`sl:default` scheduling group is used. Before the fix for #26040
`main` scheduling group was incorrectly used instead of `sl:default`.

Control connections may legitimately use `sl:driver`, so the test
accepts those occurrences while still asserting that regular anonymous
queries use `sl:default`.

This adds explicit coverage on master. After scylladb#24411 was
implemented, some other tests started to fail when scylladb#26040
was unfixed. However, none of the tests asserted this exact behavior.

Refs: scylladb/scylladb#26040
Refs: scylladb/scylladb#26581

Closes scylladb/scylladb#26589
2025-10-24 12:23:34 +02:00
Ernest Zaslavsky
47704deb1e s3_client: simplify chunked download error handling using make_request
Refactor `chunked_download_source` to eliminate redundant exception
handling by leveraging the new `make_request` override with custom
retry strategy. This streamlines the download fiber logic, improving
readability and maintainability.
2025-10-23 15:58:11 +03:00
Ernest Zaslavsky
bdb3979456 s3_client: migrate s3_client to Seastar HTTP client
Eliminate use of `retryable_http_client` in `s3_client` and adopt
Seastar's native HTTP client.
2025-10-23 15:58:10 +03:00
Aleksandra Martyniuk
1935268a87 test: add reproducer for data resurrection
Add a reproducer to check that the repair_time isn't updated
if the batchlog replay fails.

If repair_time was updated, tombstones could be GC'd before the
batchlog is replayed. The replay could later cause the data
resurrection.
2025-10-23 10:39:43 +02:00
Aleksandra Martyniuk
7f20b66eff db: repair: throw if replay fails
Return a flag determining whether all the batches were sent successfully in
batchlog_manager::replay_all_failed_batches (batches skipped due to being
too fresh are not counted). Throw in repair_flush_hints_batchlog_handler
if not all batches were replayed, to ensure that repair_time isn't updated.
2025-10-23 10:38:31 +02:00
Botond Dénes
f8b0142983 Merge 'Add --drop-unfixable-sstables flag for scrub in segregate mode' from Taras Veretilnyk
This PR introduces support for a new scrub option: `--drop-unfixable-sstables`, which enables the dropping of corrupted SSTables during scrub only in segregate mode. The patch includes implementation, validation, and  set of tests to ensure correct behavior and error handling.

Fixes #19060

Backport is not required, it is a new feature

Closes scylladb/scylladb#26579

* github.com:scylladb/scylladb:
  sstable_compaction_test: add segregate mode tests for drop-unfixable-sstables option
  test/nodetool: add scrub drop-unfixable-sstables option testcase
  scrub: add support for dropping unfixable sstables in segregate mode
2025-10-23 11:06:19 +03:00
Tomasz Grabiec
564cebd0e6 Merge 'tablet_metadata_guard: fix split/merge handling' from Petr Gusev
The guard should stop refreshing the ERM when the number of tablets changes. Tablet splits or merges invalidate the `tablet_id` field (`_tablet`), which means the guard can no longer correctly protect ongoing operations from tablet migrations.

The problem is specific to LWT, since `tablet_metadata_guard` is used mostly for heavy topology operations, which exclude with split and merge. The guard was used for LWT as an optimization -- we don't need to block topology operations or migrations of unrelated tablets. In the future, we could use the guard for regular reads/writes as well (via the `token_metadata_guard` wrapper).

Fixes [scylladb/scylladb#26437](https://github.com/scylladb/scylladb/issues/26437)

backports: need to backport to 2025.4 since the bug is relevant to LWT over tablets.

Closes scylladb/scylladb#26619

* github.com:scylladb/scylladb:
  test_tablets_lwt: add test_tablets_merge_waits_for_lwt
  test.py: add universalasync_typed_wrap
  tablet_metadata_guard: fix split/merge handling
  tablet_metadata_guard: add debug logs
  paxos_state: shards_for_writes: improve the error message
  storage_service: barrier_and_drain – change log level to info
  topology_coordinator: fix log message
2025-10-22 20:56:21 +02:00
Taras Veretilnyk
60334c6481 sstable_compaction_test: add segregate mode tests for drop-unfixable-sstables option
Added a new test case, sstable_scrub_segregate_mode_drop_unfixable_sstables_test,
which verifies that when the drop-unfixable-sstables flag is enabled in segregate
mode, corrupted SSTables are correctly dropped.
2025-10-22 17:16:55 +02:00
Taras Veretilnyk
11874755a3 test/nodetool: add scrub drop-unfixable-sstables option testcase
This patches introduces the test_scrub_drop_unfixable_sstables_option testcase,
which verifies that correct request is generated when the --drop-unfixable-sstables flag is used.
It also validates that an error is thrown if the drop-unfixable-sstables
flag is enabled and mode is not set to SEGREGATE.

This patch introduces test_scrub_drop_unfixable_sstables_option, which test
2025-10-22 17:16:55 +02:00
Taras Veretilnyk
42da7f1eb6 scrub: add support for dropping unfixable sstables in segregate mode
This patch adds a new flag `drop-unfixable-sstables` to the scrub operation
in segregate mode, allowing to automatically drop SSTables that
cannot be fixed during scrub. It also includes API support of the 'drop_unfixable_sstables'
paramater and validation to ensure this flag is not enabled in other modes rather than segragate.
2025-10-22 17:16:49 +02:00
Petr Gusev
22271b9fe7 test_automatic_cleanup: add test_cleanup_waits_for_stale_writes 2025-10-22 16:31:43 +02:00
Petr Gusev
d1fc111dd7 test_fencing: fix due to new version increment
Topology version is now bumped when a node finishes bootstrapping.
As a result, fence_version == version - 1, and decrementing version
in the test no longer triggers a stale topology exception.

Fix: run cleanup_all to invoke the global barrier, which synchronizes
fence_version := version on all nodes.
2025-10-22 16:31:43 +02:00
Petr Gusev
5bdeb4ec66 test_automatic_cleanup: clean it up
Remove redundant imports and variables. Extract cleanup_all
function. Add logs. Remove pytest.mark.prepare_3_racks_cluster --
the test doesn't actually need a 3 node cluster, one initial
node is enough.
2025-10-22 16:31:43 +02:00
Petr Gusev
263cbef68e storage_service: rebrand cleanup_fiber to vnodes_cleanup_fiber
The function applies only to vnode-based tables. Rename it to
vnodes_cleanup_fiber for clarity.
2025-10-22 16:31:42 +02:00
Calle Wilund
c4427f6d4f test::cluster::object_store::conftest: Make GS proxy use shared docker mock server wrapper
Use the shared logic of DockerizedServer to provide the fake-gcs-server
docker helper.
2025-10-22 14:06:30 +00:00
Calle Wilund
1aa8014f8f test::cluster::test_encryption: Port dtest EAR tests
Moves the tests, reexamined and simplified, to unit tests
instead of dtest.
2025-10-22 14:06:30 +00:00
Botond Dénes
1c7f1f16c8 Merge 'raft topology: fix group0 tombstone GC in the Raft-based recovery procedure' from Patryk Jędrzejczak
Group0 tombstone GC considers only the current group 0 members
while computing the group 0 tombstone GC time. It's not enough
because in the Raft-based recovery procedure, there can be nodes
that haven't joined the current group 0 yet, but they have belonged
to a different group 0 and thus have a non-empty group 0 state ID.
The current code can cause a data resurrection in group 0 tables.

We fix this issue in this PR and add a regression test.

This issue was uncovered by `test_raft_recovery_entry_loss`, which
became flaky recently. We skipped this test for now. We will unskip
it in a following PR because it's skipped only on master, while we
want to backport this PR.

Fixes #26534

This PR contains an important bugfix, so we should backport it
to all branches with the Raft-based recovery procedure (2025.2
and newer).

Closes scylladb/scylladb#26612

* github.com:scylladb/scylladb:
  test: test group0 tombstone GC in the Raft-based recovery procedure
  group0_state_id_handler: remove unused group0_server_accessor
  group0_state_id_handler: consider state IDs of all non-ignored topology members
2025-10-22 16:40:11 +03:00
Ernest Zaslavsky
a09ec56e3d cmake: fix s3_test linkage
Fix missing `s3_test` executable linkage with `scylla_encryption`

Closes scylladb/scylladb#26655
2025-10-22 14:14:43 +03:00
Calle Wilund
93e335f861 test::cluster::conftest: Add key_provider fixture
Iterates test functions across all mockable providers and
provides a key provider instance handling EAR setup.
2025-10-22 10:53:02 +00:00
Calle Wilund
6406879092 test::pylib::encryption_provider: Port dtest encryption provider classes
Adds virtual interface for running scylla with EAR and various providers
we can do mock for. Note: GCP KMS not implemented.
2025-10-22 10:53:02 +00:00
Petr Gusev
03d6829783 test_tablets_lwt: add test_tablets_merge_waits_for_lwt 2025-10-22 11:33:20 +02:00