Change the type from future<executor::request_return_type> to
executor::request_return_type, because the method isn't async and one
out of two callers unwraps the future immediately. This simplifies the
code a little and probably saves a few instructions, since we suspect
that moving a future<X> is more expensive than just moving X.
With this flag enabled, Alternator Streams produces more accurate event
types:
- nop operations (i.e. replacing an item with an identical one, deleting
a nonexistent item) don't produce an event,
- updates of an existing item produce a MODIFY event, instead of INSERT,
- etc.
This flag affects the internal behaviour of some operations, i.e.
Alternator may select a preimage and propagate it to CDC (in contrary to
CDC making the request), or do extra item comparisons (i.e. compare the
existing item with the new one). These operations may be costly, and
users that don't use Streams won't need them.
This flag is live-updatable. An operation reads this flag once, and uses
its value for the entire operation.
CDC log table records a mutation as a sequence of log rows that record
an atomic change (i.e. a row marker, tombstones, etc.), whereas a
mutation in Alternator Streams always appears as a single log row. The
type of operation is determined based on the type of the last log row in
CDC.
As a result, updates that create a row always appeared to Alternator
Streams as an update (row marker + data), rather than an insert. This
commit makes them a single log row. Its operation type is insert if it
contains a row marker, and an update otherwise, which gives results
consistent with DynamoDB Streams.
In commit a3ec6c7d1d we supposedly
implemented the feature of telling TTL experation events from regular
user-sent deletions. However, that implementation did not actually work
at all... It had two bugs:
1. It created an null rjson::value() instead of an empty dictionary
rjson::empty_object(), so GetRecords failed every time such a
TTL expiration event was generated.
2. In the output, it used lowercase field names "type" and "principalId"
instead of the uppercase "Type" and "PrincipalId". This is not the
correct capitalization, and when boto3 recieves such incorrect
fields it silently deletes them and never passes them to the user's
get_records() call.
This patch fixes those two bugs, and importantly - enables a test for
this feature. We did already have such a test but it was marked as
"veryslow" so doesn't run in CI and apparently not even run once to
check the new feature. This test is not actually very long on Alternator
when the TTL period is set very low (as we do in our tests), so I replaced
the "veryslow" marker by "waits_for_expiration". The latter marker means
that the test is still very slow - as much as half an hour - on DynamoDB -
but runs quickly on Scylla in our test setup, and enabled in CI by
default.
The enabled test failed badly before this patch (a server error during
GetRecords), and passes with this patch.
Also, the aforementioned commit forgot to remove the paragraph in
Alternator's compatibility.md that claims we don't have that feature yet.
So we do it now.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#26633
Until this patch, CDC haven't fetched a preimage for mutations
containing only a partition tombstone. Therefore, single-row deletions
in a table witout a clustering key didn't include a preimage, which was
inconsistent with single-row clustered deletions. This commit addresses
this inconsistency.
Second reason is compatibility with DynamoDB Streams, which doesn't
support entire-partition deletes. Alternator uses partition tombstones
for single-row deletions, though, and in these cases the 'OldImage' was
missing from REMOVE records.
Fixes https://github.com/scylladb/scylladb/issues/26382Closesscylladb/scylladb#26578
cql3: Refactor vector search select impl into a dedicated class
The motivation for this change is crash fixed in https://github.com/scylladb/scylladb/pull/25500.
This commit refactors how ANN ordered select statements are handled to prevent a potential null pointer dereference and improve code organization.
Previously, vector search selects were managed by `indexed_table_select_statement`, which unconditionally dereferenced a `view_ptr`. This assumption is invalid for vector search indexes where no view exists, creating a risk of crashes.
To address this, the refactoring introduces the following changes:
- A new `vector_indexed_table_select_statement` class is created to specifically handle ANN-ordered selects. This class operates without a view_ptr, resolving the null pointer risk.
- The `indexed_table_select_statement` is renamed to `view_indexed_table_select_statement` to more accurately reflect its function with view-based indexes.
- An assertion has been added to `indexed_table_select_statement` constructor to ensure view_ptr is not null, preventing similar issues in the future.
Fixes: VECTOR-162
No backport is needed, as this is refactoring.
Closesscylladb/scylladb#25798
* github.com:scylladb/scylladb:
cql3: Rename indexed_table_select_statement
cql3: Move vector search select to dedicated class
when loading CDC streams metadata for tablets from the tables, read only
new entries from the history table instead of reading all entries. This
improves the CDC metadata reloading, making it more efficient and
predictable.
the CDC metadata is loaded as part of group0 reload whenever the
internal CDC tables are modified. on tablet split / merge, we create a
new CDC timestamp and streams by writing them to the cdc_streams_history
table by group0 operation, and when it's applied we reload the in-memory
CDC streams map by reading from the tables and constructing the updated map.
Previously, on every update, we would read the entire
cdc_streams_history entries for the changed table, constructing all its
streams and creating a new map from scratch.
We improve this now by reading only new entries from cdc_streams_history
and append them to the existing map. we can do this because we only
append new entries to cdc_streams_history with higher timestamp than all
previous entries.
This makes this reloading more efficient and predictable, because
previously we would read a number of entries that depends on the number
of tablets splits and merges, which increases over time and is
unbounded, whereas now we read only a single stream set on each update.
Fixes https://github.com/scylladb/scylladb/issues/26732
backport to 2025.4 where cdc with tablets is introduced
Closesscylladb/scylladb#26160
* github.com:scylladb/scylladb:
test: cdc: extend cdc with tablets tests
cdc: improve cdc metadata loading
We currently allow creating multiple vector indexes on one column.
This doesn't make much sense as we do not support picking one when
making ann queries.
To make this less confusing and to make our behavior similar
to Cassandra we disallow the creation of multiple vector indexes
on one column.
We also add a test that checks this behavior.
Fixes: VECTOR-254
Fixes: #26672Closesscylladb/scylladb#26508
This is a follow-up for https://github.com/scylladb/scylladb/pull/26315. Fixes several review comments that were left unresolved in the original PR.
backport: not needed, this PR contains only renames and code comment fixes
Closesscylladb/scylladb#26745
* https://github.com/scylladb/scylladb:
test_automatic_cleanup: fix comment
storage_proxy: remove stale comment
storage_proxy: improve run_fenceable_write comment
topology_coordinator: rename start_cleanup_on_dirty_nodes -> start_vnodes_cleanup_on_dirty_nodes
storage_service: rename is_cleanup_allowed -> is_vnodes_cleanup_allowed
storage_service: rename do_cluster_cleanup -> do_clusterwide_vnodes_cleanup
We migrate `tablets_test.py::TestTablets::test_moving_tablets_replica_on_node`
from dtests to the repository of Scylla. We divide the test into two
steps to make testing easier and even possible with RF-rack-valid
keyspaces being enforced.
Closesscylladb/scylladb#26285
To align with `vector_indexed_table_select_statement`, this commit renames
`indexed_table_select_statement` to `view_indexed_table_select_statement`
to clarify its usage with materialized views.
The execution of SELECT statements with ANN ordering (vector search) was
previously implemented within `indexed_table_select_statement`. This was
not ideal, as vector search logic is independent of secondary index selects.
This resulted in unnecessary complexity because vector search queries don't
use features like aggregates or paging. More importantly,
`indexed_table_select_statement` assumed a non-null `view_schema` pointer,
which doesn't hold for vector indexes (where `view_ptr` is null).
This caused null pointer dereferences during ANN ordered selects, leading
to crashes (VECTOR-179). Other parts of the class still dereference
`view_schema` without null checks.
Moving the vector search select logic out of
`indexed_table_select_statement` simplifies the code and prevents these
null pointer dereferences.
– Workload: N workers perform CAS updates
UPDATE … SET s{i}=new WHERE pk=? IF (∀j≠i: s{j}>=guard_j) AND s{i}=prev
at CL=LOCAL_QUORUM / SERIAL=LOCAL_SERIAL. Non-apply without timeout is treated
as contention; “uncertainty” timeouts are resolved via LOCAL_SERIAL read.
- Enable balancing and increase min_tablet_count to force split,
flush and lower min_tablet_count to merge.
- “Uncertainty” timeouts (write timeout due to uncertainty) are resolved via a
LOCAL_SERIAL read to determine whether the CAS actually applied.
- Invariants: after the run, for every pk and column s{i}, the stored value
equals the number of confirmed CAS by worker i (no lost or phantom updates)
despite ongoing tablet moves.
Closesscylladb/scylladb#26113
extend and improve the tests of virtual tables for cdc with tablets.
split the existing virtual tables test to one test that validates the
virtual tables against the internal cdc tables, and triggering some
tablet splits in order to create entries in the cdc_streams_history
table, and add another test with basic validation of the virtual tables
when there are multiple cdc tables.
The directory_lister uses utils::lister under the hood which accepts a
callback to put directory_entry-s in. The directory_lister's callback
then puts the entries into a queue and its .get() method pops up entries
from there to return to caller.
This patch simplifies this code by switching the directory_lister to use
experimental generator lister from seastar. With it, the entries to be
returned from .get() are simply co_await-ed from calling the generator
object (wich co_yield-s them).
As a result the directory_lister becomes smaller and drops the need for
utils::lister. Since directory_lister was created as a replacement for
that callback-based lister, the latter can be eventually removed.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#26586
Recently (#26231) there was added a test to check that several API
endpoints, that return tokens and corresponding replica nodes, are
consistent with tablet map. This patch adds one more API endpoint to the
validation -- the /storage_service/tokens_endpoint one.
The extention is pretty straightforward, but the new endpoint returns
back a single (primary) replica for a token, so the test check is
slightly modified to account for that.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#26580
This patch adds reproducing tests in test/alternator for issue #23438,
which is about missing checks for the length of headers and the URL
in Alternator requests. These should be limited, because Seastar's
HTTP server, which Scylla uses, reads them into memory so they can OOM
Scylla.
The tests demonstrate that DynamoDB enforces a 16 KB limit on the
headers and the URL of the request, but Scylla doesn't (a code
inspection suggests it does not in fact have any limit).
The two tests pass on DynamoDB and currently xfail on Alternator.
Refs #23438.
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#23442
Sometimes file::list_directory() returns entries without type set. In
thase case lister calls file_type() on the entry name to get it. In case
the call returns disengated type, the code assumes that some error
occurred and resolves into exception.
That's not correct. The file_type() method returns disengated type only
if the file being inspected is missing (i.e. on ENOENT errno). But this
can validly happen if a file is removed bettween readdir and stat. In
that case it's not "some error happened", but a enry should be just
skipped. In "some error happened", then file_type() would resolve into
exceptional future on its own.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#26595
Currently, batchlog replay is considered successful even if all batches fail
to be sent (they are replayed later). However, repair requires all batches
to be sent successfully. Currently, if batchlog isn't cleared, the repair never
learns and updates the repair_time. If GC mode is set to "repair", this means
that the tombstones written before the repair_time (minus propagation_delay)
can be GC'd while not all batches were replied.
Consider a scenario:
- Table t has a row with (pk=1, v=0);
- There is an entry in the batchlog that sets (pk=1, v=1) in table t;
- The row with pk=1 is deleted from table t;
- Table t is repaired:
- batchlog reply fails;
- repair_time is updated;
- propagation_delay seconds passes and the tombstone of pk=1 is GC'd;
- batchlog is replayed and (pk=1, v=1) inserted - data resurrection!
Do not update repair_time if sending any batch fails. The data is still repaired.
For tablet repair the repair runs, but at the end the exception is passed
to topology coordinator. Thanks to that the repair_time isn't updated.
The repair request isn't removed as well, due to which the repair will need
to rerun.
Apart from that, a batch is removed from the batchlog if its version is invalid
or unknown. The condition on which we consider a batch too fresh to replay
is updated to consider propagation_delay.
Fixes: https://github.com/scylladb/scylladb/issues/24415
Data resurrection fix; needs backport to all versions
Closesscylladb/scylladb#26319
* github.com:scylladb/scylladb:
db: fix indentation
test: add reproducer for data resurrection
repair: fail tablet repair if any batch wasn't sent successfully
db/batchlog_manager: fix making decision to skip batch replay
db: repair: throw if replay fails
db/batchlog_manager: delete batch with incorrect or unknown version
db/batchlog_manager: coroutinize replay_all_failed_batches
Before this commit, when the underlying materialized view was created,
it didn't have the property `tombstone_gc` set to any value. We fix the
bug in this PR.
Implementation strategy:
1. Move code responsible for producing the schema
of a secondary index to the file that handles
`CREATE INDEX`.
2. Set the property when creating the view.
3. Add reproducer tests.
Fixesscylladb/scylladb#26542
Backport: we can discuss it.
Closesscylladb/scylladb#26543
* github.com:scylladb/scylladb:
index: Set tombstone_gc when creating secondary index
index: Make `create_view_for_index` method of `create_index_statement`
index: Move code for creating MV of secondary index to cql3
db, cql3: Move creation of underlying MV for index
Following DynamoDB, Alternator also places a 16 MB limit on the size of
a request. Such a limit is necessary to avoid running out of memory -
because the AWS message authentication protocol requires reading the
entire request into memory before its signature can be verified.
Our implementation for this limit used Seastar's HTTP server's
content_length_limit feature. However, this Seastar feature is
incomplete - it only works when the request uses the Content-Length
header, and doesn't do anything if the request doesn't have a
Content-Length (it may use chunked encoding, or have no length at all).
So malicious users can cause Scylla to OOM by sending a huge request
without a Content-Length.
So in this patch we stop using the incomplete Seastar feature, and
implement the length limit in Scylla in a way that works correctly with
or without Content-Length: We read from the input stream and if we go
over 16MB, we generate an error.
Because we dropped Seastar's protection against a long Content-Length,
we also need to fix a piece of code which used Content-Length to reserve
some semaphore units to prevent reading many large requests in parallel.
We fix two problems in the code:
1. If Content-Length is over the limit, we shouldn't attempt to reserve
semaphore units - this should just be a Payload Too Large error.
2. If Content-Length is missing, the existing code did nothing and had
a TODO that we should. In this patch we implement what was suggested
in that TODO: We temporarily reserve the whole 16 MB limit, and
after reading the actual request, we return part of the reservation
according to the real request size.
That last fix is important, because typically the largest requests will be
BatchWriteItem where a well-written client would want to use chunked
encoding, not Content-Length, to avoid materializing the entire request
up-front. For such clients, the memory use semaphore did nothing, and
now it does the right thing.
Note that this patch does *not* solve the problem #12166 that existed
with Seastar's length-limiting implementation but still exists in the
new in-Scylla length-limiting implementation: The fact we send an
error response in the middle of the request and then close the
connection, while the client continues to send the request, can lead
to an RST being sent by the server kernel. Usually this will be fine -
well-written client libraries will be able to read the response before
the RST. But even with a well-written library in some rare timings
the client may get the RST before the response, and will miss the
response, and get an empty or partial response or "connection reset
by peer". This issue existed before this patch, and still exists, but
is probably of minor impact.
Fixes#8196
Signed-off-by: Nadav Har'El <nyh@scylladb.com>
Closesscylladb/scylladb#23434
The utils library requires OpenSSL's libcrypto for cryptographic
operations and without linking libcrypto, builds fail with undefined
symbol errors. Fix that by linking `crypto` to `utils` library when
compiled with cmake. The build files generated with configure.py already
have `crypto` lib linked, so they do not have this issue.
Fix#26705
Signed-off-by: Lakshmi Narayanan Sreethar <lakshmi.sreethar@scylladb.com>
Closesscylladb/scylladb#26707
In #24031 users complained, that trace message is truncated, namely it's
no longer json parsable and table name might not be part of the output.
This path enables users to configure maximum size of trace message.
In case user wanted `table` name, but didn't care about message size,
#26634 will help.
- add configuration varable `alternator_max_users_query_size_in_trace_output`
with default value of 4096 (4 times old default value).
- modify `truncated_content_view` function to use new configuration
variable for truncation limit
- update `truncated_content_view` to consistently truncate at given
size, previously trunctation would also happen when data arrived in
more than one chunk
- update `truncated_content_view` to better handle truncated value
(limit number of copies)
- fix `scylla_config_read` call - call to `query` for a configuration
name that is not existing will return `Items` array empty
(but present) - this would raise array access exception few lines
below.
- add test
Refs #26634
Refs #24031Closesscylladb/scylladb#26618
When streaming SSTables across tablets, a single SSTable may be streamed to multiple tablets. The previous implementation unlinked SSTables immediately after streaming them for the first tablet, potentially making them partially unavailable for subsequent tablets. This patch replaces unlink() with mark_for_deletion() deferring actual unlinking till sstable::close_files.
test_tablets2::test_tablet_load_and_stream was enhanced to also verify that SSTables are removed after being streamed.
Fixes#26606
Backport is not required, although it is a bug fix, but it isn't visible. This is more of a preparatory fix for https://github.com/scylladb/scylladb/pull/26444.
Closesscylladb/scylladb#26622
* github.com:scylladb/scylladb:
test_tablets2: verify SSTable cleanup after tablet load and stream
tablet_sstable_streamer: replace unlink() call with mark_for_deletion()
It turns out that #21477 wasn't sufficient to fix the issue. The driver
may still decide to reconnect the connection after `rolling_restart`
returns. One possible explanation is that the driver sometimes handles
the DOWN notification after all nodes consider each other UP.
Reconnecting the driver after restarting nodes seems to be a reliable
workaround that many tests use. We also use it here.
Fixes#19959Closesscylladb/scylladb#26638
With the recent introduction of retry_strategy to Seastar, the pure virtual class previously defined in ScyllaDB is now redundant. This change allows us to streamline our codebase by directly inheriting from Seastar’s implementation, eliminating duplication in ScyllaDB.
Despite this update is purely a refactoring effort and does not introduce functional changes it should be ported back to 2025.3 and 2025.4 otherwise it will make future backports of bugfixes/improvements related to `s3_client` near to impossible
ref: https://github.com/scylladb/seastar/issues/2803
depends on: https://github.com/scylladb/seastar/pull/2960Closesscylladb/scylladb#25801
* github.com:scylladb/scylladb:
s3_client: remove unnecessary `co_await` in `make_request`
s3 cleanup: remove obsolete retry-related classes
s3_client: remove unused `filler_exception`
s3_client: fix indentation
s3_client: simplify chunked download error handling using `make_request`
s3_client: reformat `make_request` functions for readability
s3_client: eliminate duplication in `make_request` by using overload
s3_client: reformat `make_request` function declarations for readability
s3_client: reorder `make_request` and helper declarations
s3_client: add `make_request` override with custom retry and error handler
s3_client: migrate s3_client to Seastar HTTP client
s3_client: fix crash in `copy_s3_object` due to dangling stream
s3_client: coroutinize `copy_s3_object` response callback
aws_error: handle missing `unexpected_status_error` case
s3_creds: use Seastar HTTP client with retry strategy
retry_strategy: add exponential backoff to `default_aws_retry_strategy`
retry_strategy: introduce Seastar-based retry strategy
retry_strategy: update CMake and configure.py for new strategy
retry_strategy: rename `default_retry_strategy` to `default_aws_retry_strategy`
retry_strategy: fix include
retry_strategy: Copied utils/s3/retry_strategy.hh to utils/s3/default_aws_retry_strategy.hh
retry_strategy: Copied utils/s3/retry_strategy.cc to utils/s3/default_aws_retry_strategy.cc
When a tablet is migrated between shards on the same node, during the write_both_read_new state we begin switching reads to the new shard. Until the corresponding global barrier completes, some requests may still use write_both_read_old erm, while others already use the write_both_read_new erm. To ensure mutual exclusion between these two types of requests, we must acquire locks on both the old and new shards. Once the global barrier completes, no requests remain on the old shard, so we can safely switch to acquiring locks only on the new shard.
The idea came from the similar locking problem in the [counters for tablets PR](https://github.com/scylladb/scylladb/pull/26636#discussion_r2463932395).
Fixesscylladb/scylladb#26727
backport: need to backport to 2025.4
Closesscylladb/scylladb#26719
* https://github.com/scylladb/scylladb:
paxos_state: use shards_ready_for_reads
paxos_state: inline shards_for_writes into get_replica_lock
Before this patch, when a base table has many materialized views,
each write to this table can start up to 128 view updates in parallel.
With high client write concurrency, the actual concurrency of writes
executed on the node may grow unexpectedly, which can lead to higher
latency and higher memory usage compared to a sequential approach.
In this patch we add a per-shard, per-service-level semaphore which
limits the number of concurrent view updates processed on the shard
in this service level to a constant value. We take one unit from the
semaphore for each local view update write, and releasing it when it
finishes. The remote view updates do not take units from the semaphore
because they don't consume nearly as much processing power and they
are limited by another semaphore based on their memory usage.
Fixes https://github.com/scylladb/scylladb/issues/25341Closesscylladb/scylladb#25456
* github.com:scylladb/scylladb:
mv: limit concurrent view updates from all sources
database: rename _view_update_concurrency_sem to _view_update_memory_sem
Currently, tablet_sstable_streamer::get_primary_endpoints is out of
sync with tablet_map::get_primary_replica. The get_primary_replica
optimizes the choice of the replica so that the work is fairly
distributes among nodes. Meanwhile, get_primary_endpoints always
chooses the first replica.
Use get_primary_replica for get_primary_endpoints.
Fixes: https://github.com/scylladb/scylladb/issues/21883.
Closesscylladb/scylladb#26385
when loading CDC streams metadata for tablets from the tables, read only
new entries from the history table instead of reading all entries. This
improves the CDC metadata reloading, making it more efficient and
predictable.
the CDC metadata is loaded as part of group0 reload whenever the
internal CDC tables are modified. on tablet split / merge, we create a
new CDC timestamp and streams by writing them to the cdc_streams_history
table by group0 operation, and when it's applied we reload the in-memory
CDC streams map by reading from the tables and constructing the updated map.
Previously, on every update, we would read the entire
cdc_streams_history entries for the changed table, constructing all its
streams and creating a new map from scratch.
We improve this now by reading only new entries from cdc_streams_history
and append them to the existing map. we can do this because we only
append new entries to cdc_streams_history with higher timestamp than all
previous entries.
This makes this reloading more efficient and predictable, because
previously we would read a number of entries that depends on the number
of tablets splits and merges, which increases over time and is
unbounded, whereas now we read only a single stream set on each update.
Fixesscylladb/scylladb#26732
Before this patch, when a base table has many materialized views,
each write to this table can start up to 128 view updates in parallel.
With high client write concurrency, the actual concurrency of writes
executed on the node may grow unexpectedly, which can lead to higher
latency and higher memory usage compared to a sequential approach.
In this patch we add a per-shard, per-service-level semaphore which
limits the number of concurrent view updates processed on the shard
in this service level to a constant value. We take one unit from the
semaphore for each local view update write, and releasing it when it
finishes. The remote view updates do not take units from the semaphore
because they don't consume nearly as much processing power and they
are limited by another semaphore based on their memory usage.
The effect of this patch can also be observed when writing to a base
table with a large number of materialized views, like in the
materialized_views_test.py::TestMaterializedViews::test_many_mv_concurrent
dtest. In that test, if we perform a full scan in parallel to a write
workload with a concurrency of 100 to a table with 100 views, the scan
would sometimes timeout because it would effectively get 1/10000 of cpu.
With this patch, the cpu concurrency of view updates was limited to 128
(we ran both writes and scan in the same service level), and the scan
no longer timed out.
Fixes https://github.com/scylladb/scylladb/issues/25341
The std::optional<T> inject_parameter(...) method is a template, and in
dev/debug modes this parameter is defaulted to std::string_view, but for
release mode it's not. This patch makes it symmetrical.
Signed-off-by: Pavel Emelyanov <xemul@scylladb.com>
Closesscylladb/scylladb#26706
When streaming SSTables across tablets, a single SSTable may be streamed to multiple tablets.
The previous implementation unlinked SSTables immediately after streaming them for the first tablet,
potentially making them partially unavailable for subsequent tablets.
This patches replaces unlink() call with mark_for_deletion()
Acquiring locks on both shards for the entire tablet migration period
is redundant. In most cases, locking only the old shard or only the new
shard is sufficient. Using shards_ready_for_reads reduces the
situations in which we need to lock both shards to:
* intra-node migrations only
* only during the write_both_read_new state
Once the global barrier completes in the write_both_read_new state, no
requests remain on the old shard, so we can safely acquire locks
only on the new shard.
Fixesscylladb/scylladb#26727
introduce helper functions that can be used for garbage collecting old
cdc streams for tablets-based keyspaces.
add a background fiber to the topology coordinator that runs
periodically and checks for old CDC streams for tablets keyspaces that
can be garbage collected.
the garbage collection works by finding the newest cdc timestamp that has been
closed for more than the configured cdc TTL, and removing all information from
the cdc internal tables about cdc timestamps and streams up to this timestamp.
in general it should be safe to remove information about these streams because
they are closed for more than TTL, therefore all rows that were written to these streams
with the configured TTL should be dead.
the exception is if the TTL is altered to a smaller value, and then we may remove information
about streams that still have live rows that were written with the longer ttl.
Fixes https://github.com/scylladb/scylladb/issues/26669Closesscylladb/scylladb#26410
* github.com:scylladb/scylladb:
cdc: garbage collect CDC streams periodically
cdc: helpers for garbage collecting old streams for tablets
Currently when a null vector is passed to an ANN query we fail with a
quite confusing error ("NoHostAvailable: ('Unable to complete the
operation against any hosts', {<Host: 127.0.0.1:9042 datacenter1>:
<Error from server: code=0000 [Server error] message="to_bytes() called
on raw value that is null">})").
This patch fixes that by throwing an InvalidRequestException with an
appropriate message instead.
We also add a test case that validates this behavior.
Fixes: VECTOR-257
Closesscylladb/scylladb#26510
Fixes#26641
* Adds shared abstraction for dockerized mock services for out pytests (not using python docker, due to both library and podman)
* Adds test fixtures for our key providers (except GCS KMS, for which we have no mock server) to do local testing
* Ports (and prunes and sharpens) the test cases from dtest::encryption_at_rest_test to our pytest.
* Shared KMIP mock between boost test and pytest and speeds up boost test shutdown.
When merged, the dtest counterpart can be decommissioned.
Closesscylladb/scylladb#26642
* github.com:scylladb/scylladb:
test::cluster::object_store::conftest: Make GS proxy use shared docker mock server wrapper
test::cluster::test_encryption: Port dtest EAR tests
test::cluster::conftest: Add key_provider fixture
test::pylib::encryption_provider: Port dtest encryption provider classes
test::pylib::dockerized_service: Add helper for running docker/podman
test::pylib::kmip_wrapper: Modify to be usable by pytest fixtures
test::boost::kmip_wrapper: Move python script for PyKMIP to pylib
Problems addressed by this PR
* Missing barrier before cleanup: If a node was bootstrapped before cleanup, some request coordinators could still be in `write_both_read_new` and send stale requests to replicas being cleaned up.
* Sessions not drained before cleanup: We lacked protection against stale streaming or repair operations.
* `sstable_vnodes_cleanup_fiber()` calling `flush_all_tables()` under group0 lock: This caused SCT test failures (see [this comment](https://github.com/scylladb/scylladb/issues/25333#issuecomment-3298859046) for details).
* Issues with `storage_proxy::start_write()` used by `sstable_vnodes_cleanup_fiber`:
* The result of `start_write()` was not held during `abstract_write_response_handler::apply_locally`, so coordinator-local writes were not properly awaited.
* Synchronization was racy — `start_write()` was not atomic with the fence check, allowing stale writes to sneak in if `fence_version` changed in between.
* It waited for all writes, including local tables and tablet-based tables, which is redundant because `sstable_vnodes_cleanup_fiber` does not apply to them.
* It also waited for writes with versions greater than the current `fence_version`, which is unnecessary.
Fixesscylladb/scylladb#26150
backport: this PR fixes several issues with the vnodes cleanup procedure, but it doesn't seem they are critical enough to deserve backporting
Closesscylladb/scylladb#26315
* https://github.com/scylladb/scylladb:
test_automatic_cleanup: add test_cleanup_waits_for_stale_writes
test_fencing: fix due to new version increment
test_automatic_cleanup: clean it up
storage_proxy: wait for closing sessions in sstable cleanup fiber
storage_proxy: rename await_pending_writes -> await_stale_pending_writes
storage_proxy: use run_fenceable_write
storage_proxy: abstract_write_response_handler: apply_locally: extract post fence check
storage_proxy: introduce run_fenceable_write
storage_proxy: move update_fence_version from shared_token_metadata
storage_proxy: fix start_write() operation scope in apply_locally
storage_proxy: move post fence check into handle_write
storage_proxy: move fencing into mutate_counter_on_leader_and_replicate
storage_proxy::handle_read: add fence check before get_schema
storage_service: rebrand cleanup_fiber to vnodes_cleanup_fiber
sstable_cleanup_fiber: use coroutine::parallel_for_each
storage_service: sstable_cleanup_fiber: move flush_all_tables out of the group0 lock
topology_coordinator: barrier before cleanup
topology_coordinator: small start_cleanup refactoring
global_token_metadata_barrier: add fenced flag
No need to have two functions since both callers of get_replica_lock()
use shards_for_writes() to compute the shards where the locks
must be acquired.
Also while at it, inline the acquire() lambda in get_replica_lock()
and replace it with a loop over shards. This makes the code
more strightforward.
add a background fiber to the topology coordinator that runs
periodically and checks for old CDC streams for tablets keyspaces that
can be garbage collected.